Tuesday, March 24, 2026

T'OMIM (Tanakh Observable Matches of Intertextual Mimesis, from Hebrew תאומים meaning "twins") is an open-access dataset of labeled parallel passages in the Hebrew Bible, compiled for computational and literary research on inner-biblical intertextuality. The archive brings together two distinct corpora of known parallels: 554 narrative verse pairs drawn from the Chronicles synoptic tradition, cataloged by Bendavid (2013) and Endres et al. (1998), and 256 poetic half-verse pairs identified by Berlin (2008), Fokkelman (2001), Kugel (1981), Watson (1994), and Tsumura (2023). Each corpus is provided at two levels of granularity. Verse-level tables contain the paired Hebrew texts with their source citations. Word-level tables expand each passage into its constituent tokens, preserving the full morphological annotation of the ETCBC Biblia Hebraica Stuttgartensia Amstelodamensis (van Peursen, Sikkel, and Roorda 2015): part of speech, verbal stem and tense, gender, number, person, lexeme, English gloss, and hierarchical syntactic structure. The four resulting tables are distributed as Apache Parquet files under a CC-BY-4.0 license and are suitable for training and evaluating models for semantic similarity, text reuse detection, and intertextual retrieval in Biblical Hebrew.
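To make the word-level granularity concrete, here is a minimal sketch of one simple text-reuse baseline: lexeme overlap (Jaccard similarity) between the two sides of each labeled pair. This is not a method from the dataset itself. The row layout `(pair_id, side, lexeme)` and the placeholder lexeme strings are assumptions made for illustration; the actual column names and values in the Parquet files may differ and should be checked against the published schema.

```python
from collections import defaultdict

# Hypothetical word-level rows: (pair_id, side, lexeme).
# The lexeme strings are placeholders, not real dataset values, and the
# three-field layout is an assumption about how the tables could be mapped.
toy_word_rows = [
    (1, "a", "lex1"), (1, "a", "lex2"), (1, "a", "lex3"),
    (1, "b", "lex1"), (1, "b", "lex2"), (1, "b", "lex4"),
    (2, "a", "lex5"), (2, "a", "lex6"),
    (2, "b", "lex5"), (2, "b", "lex6"),
]


def lexeme_sets(rows):
    """Group lexemes into one set per side ("a" or "b") of each pair."""
    grouped = defaultdict(lambda: {"a": set(), "b": set()})
    for pair_id, side, lexeme in rows:
        grouped[pair_id][side].add(lexeme)
    return grouped


def jaccard(left, right):
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    union = left | right
    if not union:
        return 0.0
    return len(left & right) / len(union)


def pair_overlap(rows):
    """Return {pair_id: Jaccard overlap of the lexemes on the two sides}."""
    return {
        pair_id: jaccard(sides["a"], sides["b"])
        for pair_id, sides in lexeme_sets(rows).items()
    }


scores = pair_overlap(toy_word_rows)
for pair_id, score in sorted(scores.items()):
    print(pair_id, score)
```

With the real data, the rows would come from one of the word-level Parquet tables, which can be read with `pandas.read_parquet` and then mapped onto these three fields. Scoring on lexemes rather than surface forms is one reasonable choice here, since the morphological annotation lets inflected variants of the same word count as a match.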
 
