Tuesday, September 20, 2016

Ancient Greek and Latin Dependency Treebank 2.0

Responsible for the project
Giuseppe G. A. Celano (celano at informatik.uni-leipzig.de) & Gregory Crane (crane at informatik.uni-leipzig.de)

Advisory board
Joakim Nivre
Jonathan Robie

Treebanking is the activity of annotating texts syntactically. It belongs to a relatively new field of research that explores the potential of linguistic annotation for a wide variety of purposes, ranging from natural language processing tasks, such as machine translation or summarization, to linguistic research, where the computational treatment of data has significantly affected both methods and results.

Continuing the pioneering work of the Perseus Project, where the first texts were treebanked (Ancient Greek and Latin Dependency Treebank 1.0), the Humboldt Chair for Digital Humanities promotes the building of the Ancient Greek and Latin Dependency Treebank 2.0 within the project Treebanking: building a linguistic corpus for Ancient Greek and Latin, which started in 2015.

The aim of the project is twofold: (1) to produce new treebanked data following a new specification and (2) to develop annotation and conversion tools, so that annotation can be as automatic as possible and data can be converted into different formats. Conversion is particularly relevant because the newly produced data will also be released as part of the Universal Dependencies project.
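To make the idea of format conversion concrete, here is a minimal sketch of writing one annotated token as a line in CoNLL-U, the ten-column, tab-separated format used by Universal Dependencies. The function name to_conllu_line and the LABEL_MAP table are assumptions made for this illustration; they are not the project's conversion tools, and the three-entry label mapping is deliberately simplistic and lossy.

```python
# Illustrative mapping from treebank relation labels to Universal Dependencies
# relations. This table is an assumption of the sketch: a real conversion needs
# far more context than a one-to-one lookup.
LABEL_MAP = {"PRED": "root", "SBJ": "nsubj", "OBJ": "obj"}


def to_conllu_line(token_id, form, head, relation):
    """Serialize one token as a CoNLL-U line (10 tab-separated columns).

    Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
    Columns this sketch does not fill are written as "_".
    """
    # Fall back to the generic UD relation "dep" for unmapped labels.
    deprel = LABEL_MAP.get(relation, "dep")
    fields = [str(token_id), form, "_", "_", "_", "_", str(head), deprel, "_", "_"]
    return "\t".join(fields)


if __name__ == "__main__":
    print(to_conllu_line(1, "puella", 3, "SBJ"))
```

Falling back to "dep" keeps the output well-formed when a label has no entry in the table, at the cost of losing information.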

Annotation can currently be performed online through the Perseids platform: users are freely granted access to Arethusa, a new annotation environment that supports three layers of linguistic annotation: the morphological layer, the syntactic layer, and the advanced syntax (or semantic) layer.

The Morpheus PoS tagger allows semi-automatic annotation of morphology. The annotator is offered a set of candidate morphological analyses for each word and can either choose one of them or add a new one if the right one is missing.
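The choose-or-add workflow can be sketched as follows. This is only an illustration: the Analysis class and the choose_analysis function are hypothetical names, and the candidate list is written by hand for the Latin form "rosae" rather than taken from real Morpheus output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One candidate morphological analysis (hypothetical structure)."""
    lemma: str
    pos: str
    features: str  # free-text description of case, number, etc.


def choose_analysis(candidates, choice=None, new=None):
    """Return the annotator's decision.

    Either `choice` is an index into `candidates`, or `new` is an analysis
    supplied by the annotator because the right one is missing.
    """
    if new is not None:
        return new
    if choice is None:
        raise ValueError("either choice or new is required")
    return candidates[choice]


# Hand-written candidates for the ambiguous form "rosae".
candidates = [
    Analysis("rosa", "noun", "genitive singular"),
    Analysis("rosa", "noun", "dative singular"),
    Analysis("rosa", "noun", "nominative plural"),
]

if __name__ == "__main__":
    print(choose_analysis(candidates, choice=0))
```

Passing `new=` models the case where none of the proposed analyses fits and the annotator has to supply one.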

The syntactic annotation consists in building syntactic trees according to a dependency grammar model and assigning a grammatical relation label, such as SBJ or OBJ, to each node of a tree on the basis of its relationship with the governor node. The currently implemented model builds on the one developed for the Prague Dependency Treebank 2.0.
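As a rough illustration of what such an annotation encodes, the sketch below stores each word with the id of its governor and a relation label, then checks that the result is a well-formed tree. The Token class and the helper functions are assumptions made for this example and do not reflect Arethusa's internal data model. SBJ and OBJ are the labels mentioned above, while PRED for the main verb is assumed here.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int        # 1-based position in the sentence
    form: str      # surface form of the word
    head: int      # id of the governor node; 0 marks the root
    relation: str  # grammatical relation to the governor


def children(tokens, head_id):
    """Return the tokens directly governed by the token with id head_id."""
    return [t for t in tokens if t.head == head_id]


def is_tree(tokens):
    """Check for exactly one root and no cycles or dangling heads."""
    by_id = {t.id: t for t in tokens}
    if sum(1 for t in tokens if t.head == 0) != 1:
        return False
    for t in tokens:
        seen = set()
        current = t
        # Walk up towards the root; revisiting a node means a cycle.
        while current.head != 0:
            if current.id in seen or current.head not in by_id:
                return False
            seen.add(current.id)
            current = by_id[current.head]
    return True


# Latin example: "puella rosam amat" ("the girl loves the rose").
sentence = [
    Token(1, "puella", 3, "SBJ"),
    Token(2, "rosam", 3, "OBJ"),
    Token(3, "amat", 0, "PRED"),
]

if __name__ == "__main__":
    print(is_tree(sentence), [t.form for t in children(sentence, 3)])
```

Storing only the governor's id on each token keeps the structure flat and easy to serialize; the tree is recovered by following head links.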

Ancient Greek can also be annotated for semantics. The advanced syntax (or semantic) layer allows annotation of the categories identified in Smyth’s grammar (where the term “syntax” is used in a broader sense that also covers semantic roles). Starting from the morphosyntactic annotation of a word, the annotator is guided algorithmically, step by step, to the identification of a relevant semantic role (e.g., genitive > genitive proper > genitive of possession).
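One way to picture this guided selection is as a walk down a category tree, where each choice narrows the options offered next. The sketch below is hypothetical: ROLE_TREE, options, and is_complete are names invented for the illustration, and the tree contains only the path given above plus one sibling category added as an example, not the full inventory used in Arethusa.

```python
# A tiny, partial category tree. Only the path
# genitive > genitive proper > genitive of possession comes from the text;
# "partitive genitive" is an illustrative sibling added for this sketch.
ROLE_TREE = {
    "genitive": {
        "genitive proper": {
            "genitive of possession": {},
            "partitive genitive": {},
        },
    },
}


def options(path):
    """Return the categories the annotator can pick after following `path`."""
    node = ROLE_TREE
    for step in path:
        node = node[step]  # raises KeyError for a step that is not offered
    return sorted(node)


def is_complete(path):
    """A path is complete when it ends on a category with no subcategories."""
    return len(path) > 0 and options(path) == []


if __name__ == "__main__":
    print(options(["genitive", "genitive proper"]))
```

In a real tool the first step would be determined by the word's morphosyntactic annotation (here, its case) rather than chosen by hand.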

Currently, a selection of Aesop’s fables, passages from the Bibliotheca (Pseudo-Apollodorus), and the fables of Phaedrus are being annotated. The creation of the corpus is documented on GitHub:

guidelines 2.0 for the annotation

inter-coder agreement for the Greek and Latin texts (work in progress)

repository for the treebank, both AGDT 1.0 and 2.0 (work in progress)

Annotation platform:

Arethusa through Perseids Platform

A few videos on how to use Arethusa to annotate
