Syntacticus provides easy access to around 80,000 morphosyntactically annotated sentences, close to a million tokens, from a range of early Indo-European languages.
Syntacticus is an umbrella project for the PROIEL Treebank, the TOROT Treebank and the ISWOC Treebank, which all use the same annotation system and share similar linguistic priorities. In total, Syntacticus contains 80,138 sentences (936,874 tokens) in 10 languages.
We are constantly adding new material to Syntacticus. The ultimate goal is to have a representative sample of different text types from each branch of early Indo-European. We maintain lists of the texts we are currently working on, which you can find on the PROIEL Treebank and the TOROT Treebank pages. This is extremely time-consuming work, so please be patient!
The current focus for Syntacticus is to consolidate and edit our documentation so that it is easier to approach. We are very aware that the documentation as it stands is inadequate! New features and better integration with our development toolchain are also on the horizon.
Language               Size
Ancient Greek          250,449 tokens
Latin                  202,140 tokens
Classical Armenian      23,513 tokens
Gothic                  57,211 tokens
Portuguese              36,595 tokens
Spanish                 54,661 tokens
Old English             29,406 tokens
Old French               2,340 tokens
Old Russian            209,334 tokens
Old Church Slavonic     71,225 tokens
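As a quick sanity check, the per-language figures above can be added up and compared with the stated total of 936,874 tokens. The short Python sketch below does only that arithmetic. The names `TOKENS` and `summarize` are illustrative and are not part of any Syntacticus tool or API; the counts are copied by hand from the figures above.

```python
# Token counts per language, copied by hand from the figures above.
# TOKENS and summarize are illustrative names, not a Syntacticus API.
TOKENS = {
    "Ancient Greek": 250_449,
    "Latin": 202_140,
    "Classical Armenian": 23_513,
    "Gothic": 57_211,
    "Portuguese": 36_595,
    "Spanish": 54_661,
    "Old English": 29_406,
    "Old French": 2_340,
    "Old Russian": 209_334,
    "Old Church Slavonic": 71_225,
}


def summarize(counts):
    """Return the total token count and each language's share of it."""
    total = sum(counts.values())
    shares = {lang: n / total for lang, n in counts.items()}
    return total, shares


total, shares = summarize(TOKENS)
print(f"{len(TOKENS)} languages, {total:,} tokens")

# List languages from largest to smallest with their share of the total.
for lang, n in sorted(TOKENS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<22}{n:>9,}  {shares[lang]:6.1%}")
```

Running it lists the languages from largest to smallest, which makes it easy to see how unevenly the material is currently distributed across languages.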