Saturday, September 23, 2017

Syntacticus: A treebank of early Indo-European languages

Syntacticus: A treebank of early Indo-European languages
Syntacticus provides easy access to around a million morphosyntactically annotated sentences from a range of early Indo-European languages.
Syntacticus is an umbrella project for the PROIEL Treebank, the TOROT Treebank and the ISWOC Treebank, which all use the same annotation system and share similar linguistic priorities. In total, Syntacticus contains 80,138 sentences or 936,874 tokens in 10 languages.
We are constantly adding new material to Syntacticus. The ultimate goal is to have a representative sample of different text types from each branch of early Indo-European. We maintain lists of texts we are working on at the moment, which you can find on the PROIEL Treebank and the TOROT Treebank pages, but this is extremely time-consuming work so please be patient!
The focus for Syntacticus at the moment is to consolidate and edit our documentation so that it is easier to approach. We are very aware that the current documentation is inadequate! But new features and better integration with our development toolchain are also on the horizon in the near future.
Language Size
Ancient Greek 250,449 tokens
Latin 202,140 tokens
Classical Armenian 23,513 tokens
Gothic 57,211 tokens
Portuguese 36,595 tokens
Spanish 54,661 tokens
Old English 29,406 tokens
Old French 2,340 tokens
Old Russian 209,334 tokens
Old Church Slavonic 71,225 tokens

No comments:

Post a Comment