New computational tools for Ancient Greek corpus linguistics

Tuesday, March 24, 2020

Posted by Alek Keersmaekers on DIGITALCLASSICIST

I'm excited to announce some new computational tools for Ancient Greek corpus linguistics:

- First of all, the Duke papyrus texts (https://github.com/alekkeersmaekers/duke-nlp) are now automatically annotated not only for lemmas and morphology but also for syntax and semantic roles, making this the largest diachronic treebank for Ancient Greek so far (about 4.5 million tokens). The accuracy for syntax and semantics (about 85-90% and 81% respectively for letters) is lower than for morphology and lemmatization, but still high enough to be usable in linguistic research.

- DendroSearch (https://github.com/alekkeersmaekers/dendrosearch), a user-friendly query tool for Greek treebanks, including all treebank material that is available to date (if your treebank is still missing, please let me know!)

- An automatic semantic role labeler (https://github.com/alekkeersmaekers/PRL), using the roles of the Pedalion grammar created at the University of Leuven (http://en.pedalion.org/). It also includes an animacy lexicon, partly based on the animacy lexicon of the PROIEL project (many thanks to Dag Haug!), as well as distributional word vectors for Greek lemmas.
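
To give a rough idea of what these annotation layers make possible, here is a minimal Python sketch of a query over one annotated sentence. It is only an illustration: the `Token` class, its field names, the relation and role labels, and the `dependents_with_role` helper are all hypothetical, and do not reflect the actual file format of the Duke treebank, the DendroSearch query syntax, or the output of the role labeler.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One annotated token. All field names are illustrative assumptions."""
    idx: int                    # 1-based position in the sentence
    form: str                   # word form as it appears in the text
    lemma: str                  # dictionary form
    pos: str                    # part of speech
    head: int                   # idx of the syntactic head, 0 for the root
    relation: str               # syntactic relation to the head
    role: Optional[str] = None  # semantic role, if any (labels are made up)


# Toy sentence: "I sent you a letter".
sentence = [
    Token(1, "ἔπεμψά", "πέμπω", "verb", 0, "PRED"),
    Token(2, "σοι", "σύ", "pronoun", 1, "OBJ", "recipient"),
    Token(3, "ἐπιστολήν", "ἐπιστολή", "noun", 1, "OBJ", "patient"),
]


def dependents_with_role(tokens: List[Token], head_lemma: str, role: str) -> List[Token]:
    """Return the tokens that depend on a head with the given lemma
    and carry the given semantic role."""
    head_ids = {t.idx for t in tokens if t.lemma == head_lemma}
    return [t for t in tokens if t.head in head_ids and t.role == role]


# Which word expresses the recipient of the sending event?
matches = dependents_with_role(sentence, "πέμπω", "recipient")
print([t.form for t in matches])
```

A query like this combines three of the layers mentioned above (lemma, syntactic head, semantic role), which is the kind of search that becomes possible once syntax and semantic roles are annotated alongside morphology.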

None of this would be possible without the painstaking work of the Ancient Greek treebanking community, so many thanks to the people of the PROIEL, AGDT and Sematia projects, Vanessa Gorman, J.M. Harrington and his team, Polina Yordanova, and the job students involved in the Pedalion treebanks!