Monday, October 2, 2023

The LASLA Latin corpus has been published Open Access under a CC-BY-NC-SA 4.0 license

From the LASLA and LiLa teams

Dear all,

We are happy to announce that the LASLA Latin corpus has been published Open Access under a CC-BY-NC-SA 4.0 license. The published portion of the LASLA corpus comprises ca. 1.7 million tokens of works from the Classical period, manually annotated with the following information: lemma, Part-of-Speech, morphological features, partial syntactic information, and metadata. The LASLA has ongoing annotation projects, whose results will be uploaded to the Dataverses when they are finalised. We hope to provide a service to the community working on Latin linguistics and Latin literary studies, as well as to serve the most recent NLP trends.

The corpus can be accessed in three Dataverses, each containing one specific format. We recommend using the “Tree View” to get an idea of which files can be found in each Dataverse.

  • DAT and APN files (resp. https://doi.org/10.58119/ULG/27VZID and https://doi.org/10.58119/ULG/QJJ0SA) are published with detailed documentation on the codes used and on all the annotation choices implemented by the LASLA over the years. We hope that this documentation will help external researchers make the best use of the data.
  • BPN files (https://doi.org/10.58119/ULG/49UQNU) were previously shared with external partners under Data Transfer Agreements. Beyond documentation purposes, this Dataverse also provides the original version on which the CoNLL-U format was based (see below).

The LASLA files can be exploited via two free online interfaces. Opera Latina enables structured searches through the files; an account can be requested by contacting Lauren Simon (email L.Simon@uliege.be). HyperbaseWeb (Latin bases) does not require an account and allows users to carry out complex statistical queries; you can find documentation for it here and here.

Following the Data Transfer Agreement for BPNs, an intense collaboration with the LiLa ERC project started. The output of this collaboration is the following:

  • The LASLA corpus is linked to the LiLa Knowledge Base and can be queried, jointly with all the other resources linked, via the LiLa Interactive Search Platform and SPARQL endpoint. The triples of the linking are published openly here.
  • The LiLa team has converted the BPN files into CoNLL-U files, enriching the annotation with the URIs of tokens and lemmas as they are found in the LiLa Knowledge Base. This version of the corpus can be found on Zenodo and GitHub.
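
For readers who want to work with the CoNLL-U version directly, here is a minimal Python sketch of how such a file could be read. It relies only on the general CoNLL-U layout (ten tab-separated columns per token, comment lines starting with "#", a blank line between sentences). The sample sentence, the function names read_conllu and parse_key_values, and the ExampleURI key in the last column are made up for illustration; please check the actual files and their documentation for the real keys and values.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_key_values(field: str) -> Dict[str, str]:
    """Split a 'Key=Value|Key=Value' field into a dict; '_' means empty."""
    if field == "_":
        return {}
    pairs = (item.split("=", 1) for item in field.split("|") if "=" in item)
    return {key: value for key, value in pairs}


def read_conllu(lines: Iterable[str]) -> Iterator[List[dict]]:
    """Yield one sentence at a time as a list of token dicts."""
    sentence: List[dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment, skipped in this sketch.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            # Skip malformed lines rather than guessing their layout.
            continue
        token = dict(zip(CONLLU_FIELDS, columns))
        token["feats"] = parse_key_values(token["feats"])
        token["misc"] = parse_key_values(token["misc"])
        sentence.append(token)
    if sentence:
        yield sentence


# Invented sample, NOT taken from the LASLA corpus; "ExampleURI" is a
# hypothetical key standing in for whatever the real files use.
SAMPLE = (
    "# sent_id = example-1\n"
    "1\tarma\tarma\tNOUN\t_\tCase=Acc|Number=Plur\t0\troot\t_\t"
    "ExampleURI=http://example.org/token/1\n"
    "2\tuirum\tuir\tNOUN\t_\tCase=Acc|Number=Sing\t1\tconj\t_\t_\n"
    "\n"
)

for sent in read_conllu(SAMPLE.splitlines()):
    for tok in sent:
        print(tok["form"], tok["lemma"], tok["upos"], tok["feats"], tok["misc"])
```

To read a downloaded file instead of the sample, you can pass an open file handle to read_conllu, since it accepts any iterable of lines.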

We hope that this collaboration will trigger many others, with other partners enriching and providing new exploitation pathways for the LASLA corpus.

For the moment, have fun!

With kind regards,

The LASLA and LiLa teams

 
