Monday, September 9, 2024

Specialised POS Tagged Syriac Corpus for State Morphology

Contributors

Data curator:

Description

Overview

This corpus comprises twelve .TXT files, each representing a Syriac text that has been transcribed and tagged for part-of-speech (POS). It forms part of a PhD research project on the historical syntax of Aramaic (Syriac) at The Australian National University (2020 to present) in Canberra, Australia. The project examines noun state morphology, among other topics, and the POS scheme for this corpus reflects that focus.

Method

A detailed summary of this methodology is provided in El-Khaissi (data paper in review with the Journal of Open Humanities Data).

  • Transcriptions are sourced from the Digital Syriac Corpus.
  • POS tags are based on word matches using the SEDRA IV API (v1.0.0).
  • Syriac texts were selected to minimise external influence on Syriac grammar and to maximise coverage of key periods of the Syriac language, from the 2nd to the 13th century AD.

POS Format & Abbreviations 

POS tags in the text files use the following format:

<syntax-category>-<state>_<syriac_word>

Thus, an underscore '_' marks the beginning of a tag sequence, while the tag values within it are separated by hyphen(s) '-'. For example (note that the right-to-left direction of Syriac script affects how the sequence is displayed):

ܒܘܪܟܬܐ_EMP-N
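
The format above can be read programmatically. The following is a minimal Python sketch, not part of the corpus tooling: the function name parse_token is hypothetical, and the sketch assumes that the Syriac word is whichever side of the underscore contains characters from the Unicode Syriac block (U+0700 to U+074F). This sidesteps the display-order question, since the word may come before or after the tags in the stored character sequence. The tag values are returned in the order they are stored, without assigning them a meaning.

```python
import re

# Assumption: Syriac words use characters from the Unicode Syriac block.
SYRIAC = re.compile(r"[\u0700-\u074F]")


def parse_token(token):
    """Split one tagged token into (syriac_word, list_of_tag_values).

    Hypothetical helper for illustration only.
    """
    left, sep, right = token.partition("_")
    if not sep:
        raise ValueError(f"no underscore found in token: {token!r}")

    left_is_syriac = bool(SYRIAC.search(left))
    right_is_syriac = bool(SYRIAC.search(right))

    # Exactly one side should hold the Syriac word; the other holds the tags.
    if left_is_syriac and not right_is_syriac:
        word, tags = left, right
    elif right_is_syriac and not left_is_syriac:
        word, tags = right, left
    else:
        raise ValueError(f"cannot tell word from tags in token: {token!r}")

    # Tag values are separated by hyphens.
    return word, tags.split("-")


if __name__ == "__main__":
    word, tags = parse_token("ܒܘܪܟܬܐ_EMP-N")
    print(tags)  # ['EMP', 'N']
```

Detecting the word by script rather than by position is a design choice made here so that the same function handles a file regardless of which side of the underscore the word was stored on.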
 

 
