Monday, May 16, 2022

Multilingual Parallel Bible Corpus

[First posted in AWOL 21 February 2019, updated (new host) 16 May 2022]

This corpus is now a GitHub project. It is now much easier to submit corrections to the data or add new translations to the corpus.

Here you can find a multilingual parallel corpus created from translations of the Bible. This is an effort to create a parallel corpus containing as many languages as possible that could be used for a number of NLP tasks. Using the Book, Chapter and Verse indices, the corpus is aligned (almost) at the sentence level. (There are cases where two verses in one language are translated as one in another.)

Following a similar effort by Philip Resnik and Mari Broman Olsen at the University of Maryland, I have encoded the text of each language in XML files using the Corpus Encoding Standard. Refer to the following paper for more details about the creation of the corpus:

    A massively parallel corpus: the Bible in 100 languages, Christos Christodoulopoulos and Mark Steedman, Language Resources and Evaluation, 49 (2)
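
To illustrate the verse-index alignment described above, here is a minimal Python sketch. It assumes each verse is stored in a `seg` element with `type="verse"` and an `id` built from book, chapter and verse (for example `b.GEN.1.1`); the element name, attributes, id scheme, and the two tiny inline documents are assumptions made for illustration and should be checked against the actual files. Verses are paired only when the same id occurs in both languages, which is one simple way to cope with verses that are merged or missing in a translation.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature documents. The "seg" element, the "type" attribute
# and the "b.GEN.1.1" id scheme are assumptions, not a confirmed schema.
ENGLISH = """<text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">In the beginning God created the heaven and the earth.</seg>
<seg id="b.GEN.1.2" type="verse">And the earth was without form, and void.</seg>
</div></div></body></text>"""

LATIN = """<text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">In principio creavit Deus caelum et terram.</seg>
</div></div></body></text>"""


def verses_by_id(xml_text):
    """Map each verse id to its text, in document order."""
    root = ET.fromstring(xml_text)
    return {
        seg.get("id"): (seg.text or "").strip()
        for seg in root.iter("seg")
        if seg.get("type") == "verse"
    }


def align(source, target):
    """Pair verses whose ids occur in both translations."""
    return [(vid, source[vid], target[vid]) for vid in source if vid in target]


pairs = align(verses_by_id(ENGLISH), verses_by_id(LATIN))
for verse_id, source_text, target_text in pairs:
    print(verse_id, source_text, target_text, sep=" | ")
```

With real files, `ET.parse(path).getroot()` would replace `ET.fromstring`. Verses present in only one translation are skipped rather than guessed at, so the output stays strictly parallel.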

The following table contains the XML Bibles in 100 languages (all the languages for which an electronic version was freely available online), along with information about each language from Ethnologue.
