Thursday, June 11, 2020

The Index Thomisticus Treebank Project

 [First posted in AWOL 15 May 2019, updates 11 June 2020]

The Index Thomisticus Treebank Project
dependency tree structure
The Index Thomisticus is considered to be one of the pioneer projects in the research areas that today are known as Computational Linguistics, Humanities Computing and Digital Humanities. Started by father Roberto Busa SJ in the second half of the 40s, the Index Thomisticus is a corpus containing the opera omnia (in Latin) of Thomas Aquinas (118 texts) as well as 61 texts by other authors related to Thomas, for a total of approximately 11 million words, each morphologically tagged and lemmatized by hand.
Early in the 70s, Busa started to plan a project aimed at both the morphosyntactic disambiguation of the Index Thomisticus lemmatization and the syntactic annotation of its sentences. Nowadays, these tasks are performed by the Index Thomisticus Treebank project. Started in 2006 at Università Cattolica del Sacro Cuore (Milan, Italy), the Index Thomisticus Treebank is a dependency-based syntactically annotated corpus built upon the texts of Thomas Aquinas provided by the Index Thomisticus corpus. The annotation style of the treebank is based on the guidelines developed in Prague for the so-called 'analytical' layer of annotation of the Prague Dependency Treebank for Czech. Furthermore, specific guidelines for the syntactic annotation of Latin texts were built together with the Latin Dependency Treebank, developed by the Perseus Digital Library at Tufts University in Boston, MA (USA).

Recently, the project has started to perform also the semantic and pragmatic annotation of the already available syntactically annotated data of the Index Thomisticus. This new level of annotation resembles the so called 'tectogrammatical' layer of the Prague Dependency Treebank; it features semantic role labelling, ellipsis resolution, coreference analysis and sentential information structure.
A more general aim of the Index Thomisticus Treebank project is to develop and make available different kinds of advanced language resources for Latin. Beyond the Index Thomisticus Treebank, the project includes also the following resources:
  • a semantically/pragmatically annotated portion of the Latin Dependency Treebank (with the same annotation style used for the Index Thomisticus Treebank), which features texts of authors of the Classical era;
  • a syntactically-based valency lexicon (IT-VaLex) automatically induced from the syntactic layer of annotation of the Index Thomisticus Treebank;
  • a semantically-based valency lexicon (VALLEX) built in close connection with the semantic/pragmatic annotation of both the Latin Dependency Treebank and the Index Thomisticus Treebank.

No comments:

Post a Comment