This is a multilingual parallel corpus created from translations of the Bible compiled by Christos Christodoulopoulos and Mark Steedman.
102 languages, 5,148 bitexts
total number of files: 107
total number of tokens: 56.43M
total number of sentence fragments: 2.84M
Please cite OPUS and A massively parallel corpus: the Bible in 100 languages, Christos Christodoulopoulos and Mark Steedman, *Language Resources and Evaluation*, 49 (2)
Download
Below you can download data files for all language pairs in different formats and with different kinds of annotation (if available). Click the links as explained below. In addition to the files shown on this page, OPUS also provides pre-compiled word alignments, phrase tables, bilingual dictionaries, and frequency counts; these can be found through the resources search form on the top-level OPUS website.
License: CC0 1.0
Bottom-left triangle: download files
- ces = sentence alignments in XCES format
- leftmost column language IDs = tokenized corpus files in XML
- TMX and plain text files (Moses): see "Statistics" below
- lower row language IDs = parsed corpus files (if they exist)
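The "ces" sentence alignment files listed above can be read with any XML parser. The following is a minimal sketch in Python, assuming the alignment file uses `link` elements whose `xtargets` attribute holds the source sentence IDs and the target sentence IDs separated by a semicolon. The function name `parse_xces_links`, the sample document, and the file names inside it are hypothetical and serve only as illustration; check a downloaded file for the layout it actually uses.

```python
import xml.etree.ElementTree as ET


def parse_xces_links(xml_text):
    """Return a list of (source_ids, target_ids) pairs, one per <link> element.

    Assumes each link carries an 'xtargets' attribute of the form
    "src1 src2;tgt1", i.e. space-separated sentence IDs on each side
    of a single semicolon. An empty side means no aligned sentence.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for link in root.iter("link"):
        source, _, target = link.get("xtargets", "").partition(";")
        pairs.append((source.split(), target.split()))
    return pairs


# Hypothetical sample, not taken from the corpus.
SAMPLE = """<cesAlign version="1.0">
 <linkGrp targType="s" fromDoc="en/sample.xml.gz" toDoc="de/sample.xml.gz">
  <link id="SL1" xtargets="s1;s1" />
  <link id="SL2" xtargets="s2 s3;s2" />
  <link id="SL3" xtargets=";s3" />
 </linkGrp>
</cesAlign>
"""

if __name__ == "__main__":
    for source_ids, target_ids in parse_xces_links(SAMPLE):
        print(source_ids, "<->", target_ids)
```

The sentence IDs returned here refer to sentences in the tokenized XML corpus files, so resolving a pair to text requires loading the two documents named in the enclosing `linkGrp` element.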
Upper-right triangle: sample files
- view = bilingual XML file samples
- upper row language IDs = monolingual XML file samples
- rightmost column language IDs = untokenized corpus files