Monday, August 1, 2016

SEDRA: The Syriac Electronic Data Research Archive

Beth Mardutho


The Syriac Electronic Data Research Archive (SEDRA) is a linguistic and literary database of the Syriac language and literature. Its acronym derives from the Syriac word ܣܕܪܐ sedrā, whose meanings include 'array', 'series', 'order' and 'rank', all terms associated with database theory.

Project History

SEDRA was established in 1988 by Alaph Beth Computer Systems, a one-person firm based in Los Angeles and founded by George A. Kiraz, which developed, among other things, Syriac fonts. An early brochure of the company stated that SEDRA "will come on floppy disks in ASCII format."

Kiraz wanted to "crowd source," analog style, the creation of a linguistic database of the Syriac language. He sent letters to clients who used his Syriac fonts with the word processor Multi-Lingual Scholar and asked them to volunteer to type the lexica of Margoliouth, Payne Smith and Brockelmann with specific ASCII tagging. A new entry began with a caret (^), and English glosses from Margoliouth's dictionary were delimited by a percent sign (%). A letter dated March 22, 1990 reports the status of the project and promises to look into the possibilities of using SEDRA "in Artificial Intelligence applications, especially Natural Language Processing." In addition, Kiraz signed an agreement on March 2, 1988 with the Ancient Biblical Manuscript Center for Preservation and Research and obtained permission to use the Peshitta New Testament Electronic Database, originally developed by The Way International.
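
To make the tagging scheme concrete, here is a minimal Python sketch of how such volunteer-typed text might be split into entries. Only the caret and the percent sign come from the description above; the exact layout (headword first, then percent-separated glosses), the function name parse_entries and the transliterated sample are assumptions made for illustration, not the original SEDRA format.

```python
def parse_entries(text):
    """Split caret-delimited text into (headword, glosses) pairs."""
    entries = []
    # Anything before the first caret is treated as preamble and dropped.
    for chunk in text.split("^")[1:]:
        # Assumed layout: headword first, then %-separated English glosses.
        headword, *glosses = (part.strip() for part in chunk.split("%"))
        # Skip empty pieces, e.g. those left by a trailing percent sign.
        entries.append((headword, [g for g in glosses if g]))
    return entries


# Hypothetical sample in ASCII transliteration, not taken from the real files.
sample = "^byt' %house %household\n^ktb' %book %writing%\n"

for headword, glosses in parse_entries(sample):
    print(headword, glosses)
```

A single-character delimiter scheme like this can be typed in any word processor and parsed with plain string splitting, which may be why it suited a volunteer effort.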

SEDRA went through three incarnations. SEDRA I (1989) derived its data from the flat-file database supplied by the Ancient Biblical Manuscript Center. The data was converted to db_VISTA, a database management system with a programmable interface in the C programming language for writing database applications.

SEDRA II (1990) contained additional tables and fields necessary for the generation of Kiraz's Concordance to the Syriac New Testament (1993). Moreover, the entire text of the New Testament was vocalized and pointed, punctuation and accent marks were added, and the text was normalized to represent the BFBS edition of the Syriac New Testament as the text used by The Way was based on other manuscripts, primarily from the British Library. To accomplish the vocalization and pointing process, a program was written that skipped over words which had been vocalized before. Hence, the word ܒܝܬܐ 'house,' which appears 201 times in the corpus, is vocalized only once as ܒ݁ܰܝܬ݁ܳܐ. Initial bgdkpt letters were always marked with a quššāyā point; an algorithm was written to convert the quššāyā into rukkākhā if the preceding word, if any, ended in a vowel and was not followed by a punctuation mark. The dot on the feminine object pronominal suffix ܗ̇ was not included in the pointing, and was added later on by another algorithm based on morphological data.
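
The two procedures just described can be sketched in Python as follows. This is a hedged illustration, not the original program: the names vocalize_corpus, Token and soften_initials are hypothetical, the flags ends_in_vowel and before_punct are assumed to be supplied by other data, and the rule is read as "the preceding word ends in a vowel and is not itself followed by a punctuation mark." The sketch also assumes that the initial point is stored immediately after the first letter, and that each unvocalized form has a single vocalization.

```python
from dataclasses import dataclass

QUSHSHAYA = "\u0741"  # Unicode SYRIAC QUSHSHAYA (hard pronunciation)
RUKKAKHA = "\u0742"   # Unicode SYRIAC RUKKAKHA (soft pronunciation)


def vocalize_corpus(words, vocalize):
    """Vocalize each distinct consonantal form once and reuse the result."""
    seen = {}
    result = []
    for word in words:
        if word not in seen:
            # Only forms not met before are passed to the vocalizer.
            seen[word] = vocalize(word)
        result.append(seen[word])
    return result


@dataclass
class Token:
    text: str                   # pointed form; an initial bgdkpt letter has a qushshaya
    ends_in_vowel: bool         # assumed to be known from other data
    before_punct: bool = False  # True if a punctuation mark follows this word


def soften_initials(tokens):
    """Swap an initial qushshaya for a rukkakha after a vowel-final word."""
    output = []
    previous = None
    for token in tokens:
        text = token.text
        if (
            previous is not None
            and previous.ends_in_vowel
            and not previous.before_punct
            and text[1:2] == QUSHSHAYA
        ):
            text = text[0] + RUKKAKHA + text[2:]
        output.append(text)
        previous = token
    return output
```

With a cache like the one in vocalize_corpus, a form that occurs 201 times reaches the vocalizer only on its first occurrence.
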

The next incarnation of the project was SEDRA III (1991). The first change was the move from a relational database model to a network model in which ordered, one-to-many parent-child relations simplified the process of concordance generation. In this model, a parent record holds a pointer to its first child record in another table; that child record holds a pointer to the next child, and so on. Because laptops at the time had small hard drives of about 10 or 20 MB, SEDRA III converted its fields into bit fields. For instance, two bits were sufficient to indicate person (00 for 1st person, 01 for 2nd, 10 for 3rd). SEDRA III contained 2,050 roots, 3,559 lexemes, 31,079 word forms and 6,337 English meanings (particular to the context of the New Testament). It was published in 1993 on the web site of the University of Cambridge, and later on Beth Mardutho's site hosted by The Catholic University of America's Semitics Department, as a non-commercial open source database. A number of developers downloaded SEDRA III and used it for lexical and concordance applications. One such developer was James W. Bennett, who used SEDRA III as the data underlying the BFBS Peshitta in his online Syriac Library Browser and General Syriac Tools.
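
The two space- and time-saving ideas behind SEDRA III, pointer chains and bit fields, can be illustrated with a short Python sketch. Only the two-bit person codes come from the text. The second field (number), its codes, the table layouts and every name below are hypothetical and serve only to show the technique; they do not reproduce the actual SEDRA III record format.

```python
# Two-bit person codes as given in the text.
PERSON_BITS = {"1st": 0b00, "2nd": 0b01, "3rd": 0b10}
PERSON_NAMES = {bits: name for name, bits in PERSON_BITS.items()}

# Hypothetical second field, included only to show several fields sharing a byte.
NUMBER_BITS = {"singular": 0b00, "plural": 0b01}
NUMBER_NAMES = {bits: name for name, bits in NUMBER_BITS.items()}


def pack(person, number):
    """Store person in bits 0-1 and number in bits 2-3 of one integer."""
    return PERSON_BITS[person] | (NUMBER_BITS[number] << 2)


def unpack(value):
    """Recover the textual labels from a packed integer."""
    return PERSON_NAMES[value & 0b11], NUMBER_NAMES[(value >> 2) & 0b11]


# Network-model sketch: the parent points to its first child, and each child
# points to the next one. Indexes into the list stand in for record pointers.
lexemes = [{"name": "lexeme A", "first_word": 0}]
words = [
    {"form": "form 1", "next": 2},
    {"form": "unrelated form", "next": None},
    {"form": "form 2", "next": None},
]


def children(parent, table):
    """Follow the pointer chain from a parent record through its children."""
    index = parent["first_word"]
    while index is not None:
        yield table[index]["form"]
        index = table[index]["next"]


print(unpack(pack("3rd", "plural")))
print(list(children(lexemes[0], words)))
```

Following the chain visits the children of a record in a fixed order without searching the whole table, which is the property the text credits with simplifying concordance generation.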

In February 2013, George Kiraz and James Bennett teamed up to develop SEDRA IV (this web site). Starting from SEDRA III, the database was converted back into a relational database, the binary fields were expanded (person is now a numeric field with textual references), and additional tables and fields were added. The lexemes table was expanded to include all the words in the Brock-Kiraz dictionary (ca. 15,000 words). Source data from printed lexica were imported in either image or text format. More importantly, a morphological generation component was added to the system with a grant from the International Balzan Prize Foundation under the direction of Peter Brown (Princeton University) and in collaboration with

SEDRA IV was launched in March 2015 at the Fourth Hugoye Symposium on Syriac and the Digital Humanities (Beth Mardutho and Rutgers University), and a crowdsourcing call went out asking scholars to link images of scanned lexical entries to the lexemes of the database. It is expected that SEDRA IV will expand as a crowdsourced project.

As of today (August 1, 2016), SEDRA contains 3,228 roots, 17,905 lexemes, and 44,270 words.
