This repository contains the largest open-access, scalable collection of Latin texts, Opera Latina Adnotata (316 files, 6,755,191 tokens, and 411,329 sentences), plus related resources for analyzing Latin (http://ola.informatik.uni-leipzig.de/it/index.html). 🏋️‍❤️😃

The original Latin texts have been tokenized, sentence-split, and morphologically and syntactically annotated using a standoff format (Paula XML), which allows the corpus to be extended smoothly by adding further annotation layers.
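
To illustrate the idea behind standoff annotation, here is a minimal Python sketch. It does not reproduce the Paula XML schema or the scripts in this repository: the names `Token`, `tokenize`, `surface`, and `annotations`, the whitespace tokenizer, and the tag values are all assumptions made for illustration. The point is only that the base text is never modified, tokens are character-offset spans into it, and each annotation layer is a separate structure keyed by token id.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A span over the base text (hypothetical structure, not Paula XML)."""
    id: str
    start: int  # inclusive character offset into the base text
    end: int    # exclusive character offset


def tokenize(text: str) -> list[Token]:
    """Naive whitespace tokenizer that records offsets instead of copying strings."""
    tokens = []
    pos = 0
    for i, word in enumerate(text.split(), start=1):
        start = text.index(word, pos)  # search from the end of the previous token
        end = start + len(word)
        tokens.append(Token(f"tok_{i}", start, end))
        pos = end
    return tokens


def surface(token: Token, text: str) -> str:
    """Recover the token string from the untouched base text."""
    return text[token.start:token.end]


def annotations(token_id: str, layers: dict[str, dict[str, str]]) -> dict[str, str]:
    """Collect the values that every layer holds for one token."""
    return {name: layer[token_id] for name, layer in layers.items() if token_id in layer}


base_text = "Gallia est omnis divisa in partes tres"
tokens = tokenize(base_text)

# Each layer lives in its own mapping; the tag values are illustrative only.
layers = {
    "pos": {"tok_1": "NOUN", "tok_2": "AUX"},
    "lemma": {"tok_1": "Gallia", "tok_2": "sum"},
}

# A new layer can be added later without touching the text or the other layers.
layers["case"] = {"tok_1": "nominative"}
```

Because layers only point at token ids, adding a new kind of annotation never requires rewriting existing files, which is what makes the incremental expansion described above practical.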

The repository is organized as follows (further details are given within each directory):

  1. texts contains the annotated texts (look here if you are only interested in the annotated texts)
  2. original-Latin-files contains the original Perseus Digital Library texts used as base texts in texts
  3. tokenize contains Verbator, the tokenizer and sentence splitter used for the texts in texts (also accessible via a REST API: https://git.informatik.uni-leipzig.de/celano/latinnlp/-/tree/master/tokenize/tokenizer)
  4. abbreviations contains a list of Latin abbreviations used for tokenization
  5. normalization contains files useful for normalization of Latin tokens
  6. guidelines contains documentation for the annotation of Latin (in fieri)
  7. scripts contains scripts used to create the present data
  8. case-study contains a case study documenting standoff annotation for Latin
  9. paula contains the texts in texts in Paula XML 1.1 format
  10. relannis contains the relannis version of the files in paula (http://pcai049.informatik.uni-leipzig.de:52480/annis3/)
  11. combo contains all the files annotated by the COMBO parser (https://git.informatik.uni-leipzig.de/celano/COMBO_for_Latin)
  12. annotation-lists contains lists documenting how tokens should be annotated (in fieri)
  13. webanno contains files related to an annotation example in Webanno

The repository is a work in progress: annotations are still being added and improved.

