This repository contains the largest open-access, scalable collection of Latin texts, Opera Latina Adnotata (316 files, 6,755,191 tokens, and 411,329 sentences), together with related resources for analyzing Latin (http://ola.informatik.uni-leipzig.de/it/index.html). 🏋️❤️😃
The original Latin texts have been tokenized, split into sentences, and annotated morphologically and syntactically in a standoff format (Paula XML), which allows the corpus to be extended smoothly by adding further annotation layers.
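To make the idea of standoff annotation concrete, here is a minimal Python sketch. It is not the Paula XML schema and not the Verbator tokenizer: the `Token` class, the whitespace `tokenize` function, the token ids, and the annotation labels are all hypothetical and chosen only for illustration. The point it shows is that the primary text is never modified; tokens point into it by character offsets, and each annotation layer is a separate structure keyed by token id, so new layers can be added without touching existing ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A token defined only by its id and character span in the primary text."""
    id: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def tokenize(text: str) -> list[Token]:
    """Naive whitespace tokenizer, for illustration only (not Verbator)."""
    tokens = []
    pos = 0
    for i, word in enumerate(text.split(), start=1):
        start = text.index(word, pos)  # locate the word at or after pos
        end = start + len(word)
        tokens.append(Token(f"tok_{i}", start, end))
        pos = end
    return tokens


def surface(token: Token, text: str) -> str:
    """Recover the token's surface form from the untouched primary text."""
    return text[token.start:token.end]


primary_text = "Gallia est omnis divisa in partes tres"
tokens = tokenize(primary_text)

# Each annotation layer lives apart from the text and from other layers.
# The labels below are illustrative, not taken from the corpus.
layers: dict[str, dict[str, str]] = {
    "pos": {"tok_1": "NOUN", "tok_2": "VERB"},
}

# Adding a new layer later does not alter the text or the existing layers.
layers["lemma"] = {"tok_1": "Gallia", "tok_2": "sum"}

for tok in tokens[:2]:
    annotations = {name: layer.get(tok.id) for name, layer in layers.items()}
    print(tok.id, surface(tok, primary_text), annotations)
```

In an actual standoff corpus the layers would typically be separate files that reference token ids, which is what makes it possible to add or replace one annotation layer independently of the others.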
The repository is organized thus (further details within each directory):
- texts: contains the annotated texts (look here if you are only interested in the annotated texts)
- original-Latin-files: contains the original Perseus Digital Library texts used as base texts in texts
- tokenize: contains Verbator, the tokenizer and sentence splitter used for the texts in texts (also accessible via a REST API: https://git.informatik.uni-leipzig.de/celano/latinnlp/-/tree/master/tokenize/tokenizer)
- abbreviations: contains a list of Latin abbreviations used for tokenization
- normalization: contains files useful for normalization of Latin tokens
- guidelines: contains documentation for the annotation of Latin (in fieri)
- scripts: contains scripts used to create the present data
- case-study: contains a case study documenting standoff annotation for Latin
- paula: contains the texts in texts in Paula XML 1.1 format
- relannis: contains the relannis version of the files in paula (http://pcai049.informatik.uni-leipzig.de:52480/annis3/)
- combo: contains all the files annotated by the COMBO parser (https://git.informatik.uni-leipzig.de/celano/COMBO_for_Latin)
- annotation-lists: contains some lists to document how tokens should be annotated (in fieri)
- webanno: contains files related to an annotation example in Webanno

The repository is a work in progress (the aim is to add and improve annotations).
Tuesday, December 5, 2023
Opera Latina Adnotata