Sunday, March 29, 2020

Amharic Corpus

[First posted in AWOL 4 October 2018, updated 29 March 2020]

The page you are currently viewing is a web interface for the pilot version of the Amharic corpus. The corpus currently contains about 23 million tokens. The texts have been automatically annotated with a part-of-speech analyzer. The corpus is disambiguated, i.e. each token is annotated with a single appropriate analysis.

Most tokens in the current version of the corpus come from news texts. The remaining texts are blogs and nonfiction (Wikipedia articles and essays). Eventually we intend to increase the number and diversity of texts and to add fiction to the corpus.

The latest update
May 25th, 2016.

Created by
Maria Obedkova, under the guidance of Boris Orekhov, as part of a project of the HSE School of Linguistics

Web interface
This corpus uses the search platform of the Eastern Armenian National Corpus (EANC). You can read about making search queries on the EANC help page.
