Multilingual Parallel Bible Corpus


A multilingual parallel corpus created from translations of the Bible aligned at the verse level. This is an effort to create a parallel corpus containing as many languages as possible that could be used for a number of NLP and scholarly tasks.

Characteristics of the Corpus in TextGrid Repository

The full corpus in the TextGrid Repository (TGR) contains 103 translations in 102 languages from around the world, including two translations in English. Most of the languages are non-Indo-European, and 39 are spoken by fewer than one million people. 57 of the translations contain all 66 books of the Bible, while 45 contain only partial texts. Many of these (37) contain the 27 books of the New Testament, while seven contain even fewer books. TGR's facets offer a good overview of the number of books published per language.

The texts are split into books or works of the Bible, such as Genesis, Acts, the First Epistle of John, the Song of Solomon, the Psalms, etc. This gives users greater flexibility, enabling them to select different facets (language, author, book, genre, etc.) when querying, searching or downloading.

Conversion and Metadata Enrichment within the Text+ Project

As part of the Text+ Research Data Infrastructure Project, which belongs to the German National Research Data Infrastructure (NFDI), the TGR team prioritised integrating existing resources and improving the quality of their data and, consequently, their FAIR status, with a particular focus on highly multilingual resources.

This corpus has previously been published on GitHub. Several aspects of the metadata and encoding have been improved for the publication of the corpus in TGR:

  • The original encoding in the Corpus Encoding Standard (CES) has been replaced by the more widely used TEI. All the metadata contained in the CES version has been mapped to TEI elements.

  • Specific translation metadata was researched and integrated into the TEI file.

  • The works of the Bible can now be identified using two types of identifiers: Wikidata and the German-speaking area's authority file system (GND).

  • Moreover, each work was associated with groups of works in the Bible at two different levels. First, it was determined whether the work belonged to the New Testament or the Old Testament. Second, each work was assigned to a group of works, such as Epistles, Prophecy, History, Gospels and Acts, Pentateuch, Wisdom and Apocalypse. GND entities have been used for all these concepts.

  • We have also added information about the author traditionally assigned to each work. Although the authorship of biblical books is in many cases unclear and remains a subject of scholarly research, we believe that users would benefit from these names as facets. For reference, we use the information presented in the GND. While Old Testament works do not contain any information about the author, New Testament books do. Authors have also been identified through Wikidata and GND identifiers.

  • The languages are identified using the ISO 639-3 standard.

  • The works have been annotated using the simple TGR schema of genres, with verse applied to the books of Psalms, Lamentations, and the Song of Solomon, and prose applied to the rest.

  • Each TEI file is annotated with two classes from the Basic Classification library classification system: class 11.31 for biblical texts, plus a class from the 18 main classes for the language or group of languages.

  • Each TEI file records, in 'measure' elements, the number of chapters, verses, characters and tokens it contains, as well as its size in bytes.

  • Further metadata, including information on the language family, genus, subgenus and number of speakers, has now been integrated into the TEI file. This information was previously contained in a separate spreadsheet.

  • The corpus contributors have been identified using their ORCID IDs.

For several of these steps, we benefited from the valuable collaboration of the Theology and Religious Studies subject specialist at the State and University Library, Göttingen.
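
The counts stored in 'measure' elements can be read programmatically. The following is a minimal sketch, not the repository's documented interface: it assumes that each 'measure' element carries the standard TEI attributes `unit` and `quantity`, and the sample fragment, its unit names and its values are hypothetical. The actual TGR files may place or label these elements differently.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; the element layout below is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, incomplete TEI fragment for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <extent>
        <measure unit="chapters" quantity="5"/>
        <measure unit="verses" quantity="105"/>
        <measure unit="tokens" quantity="2523"/>
      </extent>
    </fileDesc>
  </teiHeader>
</TEI>"""


def read_measures(xml_text):
    """Return a dict mapping each measure unit to its integer quantity."""
    root = ET.fromstring(xml_text)
    counts = {}
    for measure in root.iterfind(".//tei:measure", TEI_NS):
        unit = measure.get("unit")
        quantity = measure.get("quantity")
        # Skip measure elements that lack either attribute.
        if unit is not None and quantity is not None:
            counts[unit] = int(quantity)
    return counts


measures = read_measures(SAMPLE)
print(measures)
```

For a downloaded file, the same function could be applied to the file's text content, after checking which attributes the file actually uses.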

Sources

The four main sources used here were the Bible Database, the Unbound Bible, GospelGo and the Bible Gateway websites. Each one offered the Bible in different formats, some containing HTML and others plain text. When multiple versions of the Bible were available for the same language, the oldest available one was usually chosen (e.g. the King James Version for English).

Alignment

While the corpus is verse-aligned, not all canonical verses (i.e. verses that appear in the original Greek, Hebrew and Aramaic) are present, even in the official translations. In most cases, the content of a missing verse is contained in the verse that comes before or after it. This usually happens because some languages cannot easily follow the sentence structure of the original text (e.g. when a sentence is split across two verses). The most commonly missing verse in the New Testament is 2 Corinthians 13:14, which is missing from 33 languages (compared with a median of 2) and reflects a known versification difference.
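
Figures of this kind can be obtained by counting, for each canonical verse, the number of translations that lack it. The following is a minimal sketch under stated assumptions: the verse identifiers, the language names and the toy data are hypothetical, and this is not necessarily how the corpus authors computed their statistics.

```python
from collections import Counter
from statistics import median


def missing_counts(canonical, translations):
    """Count, for each canonical verse ID, how many translations lack it."""
    counts = Counter({verse: 0 for verse in canonical})
    for verses in translations.values():
        for verse in canonical:
            if verse not in verses:
                counts[verse] += 1
    return counts


# Hypothetical verse IDs and toy data, for illustration only.
canonical = ["2CO.13.12", "2CO.13.13", "2CO.13.14"]
translations = {
    "lang_a": {"2CO.13.12", "2CO.13.13", "2CO.13.14"},
    "lang_b": {"2CO.13.12", "2CO.13.13"},
    "lang_c": {"2CO.13.12", "2CO.13.13"},
}

counts = missing_counts(canonical, translations)
most_missing, n_missing = counts.most_common(1)[0]
print(most_missing, n_missing, median(counts.values()))
```

A verse whose count lies far above the median, as in this toy data, is a candidate for a versification difference and not a genuine omission.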

Citation Suggestion

Here is a suggestion for how to cite this corpus in TGR:

TextGrid Repository (2025). Multilingual Parallel Bible Corpus. Christos Christodoulopoulos. https://textgridrep.org/project/TGPR-d862e14d-4df7-052b-00fe-661cb242231c

For further information about the original version (published on GitHub), you can refer to the following paper: A massively parallel corpus: the Bible in 100 languages, Christos Christodoulopoulos and Mark Steedman, Language Resources and Evaluation, 49 (2).