This repository contains the largest open-access, scalable collection of Latin texts, Opera Latina Adnotata (316 files, 6,755,191 tokens, and 411,329 sentences), together with related resources for analyzing Latin (http://ola.informatik.uni-leipzig.de/it/index.html). 🏋️❤️😃
The original Latin texts have been tokenized, split into sentences, and morphologically and syntactically annotated using a standoff format (Paula XML), which allows the corpus to be expanded smoothly by adding further annotation layers.
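The idea behind standoff annotation can be illustrated with a minimal Python sketch. It is not the Paula XML schema and does not use Verbator: the names Token, tokenize, surface, and view, the token ids, and the POS and lemma values are all hypothetical and chosen only for illustration. The point it tries to show is that the base text is never modified, tokens are just character offsets into it, and each annotation layer is a separate mapping that refers to token ids, so a new layer can be added without touching the existing ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A token defined only by character offsets into the base text."""
    id: str
    start: int  # inclusive offset
    end: int    # exclusive offset


def tokenize(text):
    """Naive whitespace tokenizer (a stand-in for illustration, not Verbator)."""
    tokens = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append(Token(f"tok_{len(tokens) + 1}", start, end))
        pos = end
    return tokens


# The base text stays untouched; everything else points into it.
base_text = "Gallia est omnis divisa in partes tres"
tokens = tokenize(base_text)

# Each annotation layer is stored separately and keyed by token id.
# The values below are illustrative, not taken from the corpus.
pos_layer = {"tok_1": "NOUN", "tok_2": "AUX"}
lemma_layer = {"tok_1": "Gallia", "tok_2": "sum"}


def surface(token):
    """Recover a token's surface form from its offsets."""
    return base_text[token.start:token.end]


def view(tokens, **layers):
    """Merge any number of layers into one row per token."""
    rows = []
    for token in tokens:
        row = {"id": token.id, "form": surface(token)}
        for name, layer in layers.items():
            row[name] = layer.get(token.id)  # None if the layer has no value
        rows.append(row)
    return rows


for row in view(tokens, pos=pos_layer, lemma=lemma_layer):
    print(row)
```

Adding a further layer, for example a syntactic one, would only mean passing one more mapping to view; the base text and the existing layers stay as they are.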
The repository is organized as follows (further details can be found within each directory):
texts: contains the annotated texts (look in here if you are just interested in the annotated texts)
original-Latin-files: contains the original Perseus Digital Library texts used as base texts in the texts directory
tokenize: contains Verbator, the tokenizer and sentence splitter used for the texts in the texts directory (also accessible via a REST API: https://git.informatik.uni-leipzig.de/celano/latinnlp/-/tree/master/tokenize/tokenizer)
abbreviations: contains a list of Latin abbreviations used for tokenization
normalization: contains files useful for normalization of Latin tokens
guidelines: contains documentation for the annotation of Latin (in fieri)
scripts: contains scripts used to create the present data
case-study: contains a case study documenting standoff annotation for Latin
paula: contains the texts in the texts directory in Paula XML 1.1 format
relannis: contains the relannis version of the files in the paula directory (http://pcai049.informatik.uni-leipzig.de:52480/annis3/)
combo: contains all the files annotated by the COMBO parser (https://git.informatik.uni-leipzig.de/celano/COMBO_for_Latin)
annotation-lists: contains some lists to document how tokens should be annotated (in fieri)
webanno: contains files related to an annotation example in Webanno

The repository is a work in progress (the aim is to add and improve annotations).