Wednesday, June 21, 2023

Corpus of the Epigraphy of the Italian Peninsula in the 1st Millennium BCE


The Corpus of the Epigraphy of the Italian Peninsula in the 1st Millennium BCE, or CEIPoM, is a linguistic database focusing on the Italian peninsula in the first millennium BCE. Currently, it covers Messapic, Venetic, the Sabellic languages and epigraphic Latin up to about 100 BCE.

The acronym CEIPoM represents the genitive plural of an archaic form of the Latin cippus, which can be reconstructed (De Vaan p. 115) and may also be attested (see Token_ID 418500). With a little interpretative licence, the form may be translated as “pertaining to inscriptions”, and thus succinctly expresses the focus of this database.

This database is a work in progress!

Purpose of this corpus

Ancient Italy in the first millennium BCE presents a unique trove of linguistic information. Rarely, if ever, in the ancient world is such a diversity of languages preserved in so small a region, and their study is of great potential interest to Indo-Europeanists, historical linguists and typologists alike. Unfortunately, the accessibility of this data is currently limited: most of the languages of ancient Italy are documented only in printed corpora, sometimes without any kind of linguistic analysis. Where digital corpora do exist, they are either incomplete or lack the level of annotation required for anything more than relatively superficial linguistic research.

This project aims to fill this lacuna by providing a linguistically oriented and publicly available digital research corpus for the languages of Ancient Italy. In its present version, the corpus is (almost) complete for the Sabellic, Messapic and Venetic languages, as well as Latin epigraphy before 100 BCE.

This corpus was created in the context of a research project on language contact in Ancient Italy. It is, therefore, strongly tailored to the needs of linguistic research (whether synchronic or diachronic in nature), and focuses on providing high-resolution and intercomparable linguistic information for each attested token in these ancient Indo-European languages. Although there is less emphasis on the archaeological and epigraphical context of these texts, the Trismegistos link provided for each inscription can be used to track bibliography and metadata, and link the texts in this corpus to other projects, such as EDCS or EDR.

All files are published here as .csv files (UTF-16 encoded). The data can be analysed with Python or R, or opened with spreadsheet software such as LibreOffice Calc.
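Because the files are UTF-16 rather than the more common UTF-8, the encoding has to be stated explicitly when reading them. The following is a minimal Python sketch using only the standard library. The file name, the comma delimiter and the placeholder values are assumptions made for illustration, as is the column "Form"; only the column name Token_ID appears in the description above. The sketch writes a small stand-in file first so that it runs without the real data.

```python
import csv
import os
import tempfile


def read_utf16_csv(path, delimiter=","):
    """Read a UTF-16 encoded CSV file into a list of dicts (one per row).

    The delimiter is an assumption; adjust it if the real files differ.
    """
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding="utf-16", newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


# Placeholder rows standing in for real corpus data (not actual tokens).
sample_rows = [
    {"Token_ID": "1", "Form": "form_a"},
    {"Token_ID": "2", "Form": "form_b"},
]

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this demonstration.
    sample_path = os.path.join(tmp, "tokens_sample.csv")
    with open(sample_path, "w", encoding="utf-16", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["Token_ID", "Form"])
        writer.writeheader()
        writer.writerows(sample_rows)

    rows = read_utf16_csv(sample_path)

for row in rows:
    print(row["Token_ID"], row["Form"])
```

With a downloaded copy of the corpus, the same function could be pointed at the real file path instead of the temporary stand-in.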


The corpus is structured as a relational database, with four levels of description, each of which stands in a one-to-many relationship with the immediately subordinate level, as follows:

  • Texts contains information pertaining to individual inscriptions as a whole, such as their dating and provenance.
  • Sentences contains information on the individual syntactic units of which an inscription is comprised, including a basic transcription of the text. Many inscriptions comprise only a single sentence.
  • Tokens contains information about the tokens (words and clitics) in a specific sentence, such as their form and their syntactic relations to other tokens.
  • Analysis provides linguistic analysis of each token. This information is provided on a level subordinate to “Tokens” because, particularly in a fragmentary or poorly understood corpus, a token may have more than one possible interpretation. This table offers detailed annotation, ranging from POS-tagging and lemmatisation to semantic categories. It also contains the raw data used for the analysis of the Sabellic TAM system in Pitts (2020).
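The one-to-many chain from Texts down to Analysis can be sketched in Python as follows. This is an illustration under stated assumptions, not the database's actual schema: the key names Text_ID, Sentence_ID and Analysis_ID, the POS column and all values are hypothetical placeholders (only Token_ID is named in this post). It shows how rows at each level can be grouped under their parent, and how one token can carry more than one analysis.

```python
from collections import defaultdict


def group_by(rows, key):
    """Index child rows by the value of their parent key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return grouped


# Placeholder rows; key names other than Token_ID are assumptions.
texts = [{"Text_ID": "T1"}]
sentences = [{"Sentence_ID": "S1", "Text_ID": "T1"}]
tokens = [
    {"Token_ID": "K1", "Sentence_ID": "S1"},
    {"Token_ID": "K2", "Sentence_ID": "S1"},
]
analyses = [
    # Token K1 has two competing interpretations.
    {"Analysis_ID": "A1", "Token_ID": "K1", "POS": "noun"},
    {"Analysis_ID": "A2", "Token_ID": "K1", "POS": "verb"},
    {"Analysis_ID": "A3", "Token_ID": "K2", "POS": "noun"},
]

sentences_by_text = group_by(sentences, "Text_ID")
tokens_by_sentence = group_by(tokens, "Sentence_ID")
analyses_by_token = group_by(analyses, "Token_ID")


def analyses_for_text(text_id):
    """Walk Texts -> Sentences -> Tokens -> Analysis for one inscription."""
    result = []
    for sentence in sentences_by_text.get(text_id, []):
        for token in tokens_by_sentence.get(sentence["Sentence_ID"], []):
            for analysis in analyses_by_token.get(token["Token_ID"], []):
                result.append(
                    (sentence["Sentence_ID"], token["Token_ID"], analysis["Analysis_ID"])
                )
    return result


for text in texts:
    for triple in analyses_for_text(text["Text_ID"]):
        print(text["Text_ID"], *triple)
```

Keeping the analyses in their own table means that an ambiguous token simply yields several rows at the lowest level, without duplicating the token itself.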

In addition, the file “links” allows the texts in the database to be linked to extensive metadata and bibliography via their Trismegistos ID.

A detailed discussion and usage guide for all the fields in the database can be found here.

 View on GitHub Vademecum
