Tuesday, January 20, 2015

Help sought with Metadata for the Open Patrologia Graeca Online

[draft -- January 19, 2015]
Gregory Crane
Perseus Project and the Open Philology Project
The University of Leipzig and Tufts University

We are looking for help in preparing metadata for the Patrologia Graeca (PG) component of what we are calling the Open Migne Project, an attempt to make the most useful possible transcripts of the full Patrologia Graeca and Patrologia Latina freely available. Help can consist of proofreading, additional tagging, and checking the volume/column references to the actual PG. In particular, we would welcome seeing this data converted into a dynamic index into online copies of the PG in Archive.org, the HathiTrust, Google Books, or Europeana. For now, we make the working XML metadata document available on an as-is basis.

Nick White (normally at Durham in the UK but at the time at Tufts) trained and ran the Tesseract OCR engine, and Bruce Robertson (Mount Allison University, Canada) the OCRopus OCR engine, on scans of multiple copies of each volume of the Patrologia Graeca. Taken together, the resulting OCR runs contain a very high percentage of the correct readings, and we will be able to support very useful searching, as well as text mining, with this corpus.

To support this larger effort, we are working on metadata for the collection. We have OCR’d and begun editing the core index at columns 13-114 of Cavallera’s 1912 index to the PG (which Roger Pearse cites at http://www.roger-pearse.com/weblog/patrologia-graeca-pg-pdfs/). A working TEI XML transcription, which has begun capturing the data within the print source, is available for inspection at: https://www.dropbox.com/s/mldhu4okpq4i7r8/pg_index2.xml.

This XML file will be used to expand the Perseus Catalog data (https://github.com/PerseusDL/catalog_data) and, with it, the coverage (as well as the rapidly evolving functionality) of the Perseus Catalog itself (http://catalog.perseus.org/).

All figures are preliminary and subject to modification (that is one motivation for posting this call for help), but we do not expect that the figures will change much at this point. At present, we have identified 658 authors and 4,287 works. The PG contains extensive introductions, essays, indices etc., and we have tried to separate these out by scanning for keywords (e.g., praefatio, monitum, notitia, index). We estimate that there are 204,129 columns of source text and 21,369 columns of secondary sources, representing roughly 90% and 10% of the total respectively. Since a column in Migne contains about 500 words, the source text amounts to roughly 100 million words; since the Greek texts (almost) always have accompanying Latin translations, the PG contains up to 50 million words of Greek text. Many authors, however, have extensive Latin notes and in some cases no Greek text at all. We will have better figures on the amount of Greek when we have analyzed the OCR-generated text for various copies of the PG.
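The arithmetic behind these estimates can be reproduced in a few lines. This is only a sketch of the back-of-the-envelope calculation: the column counts and the 500-words-per-column figure come from the paragraph above, while the even split between Greek and Latin is an assumption made here to arrive at an upper bound.

```python
# Preliminary figures quoted in the post.
source_columns = 204_129
secondary_columns = 21_369
words_per_column = 500  # "about 500 words" per column in Migne

total_columns = source_columns + secondary_columns
source_share = source_columns / total_columns
secondary_share = secondary_columns / total_columns

# All words in the primary-source columns, Greek and Latin together.
total_source_words = source_columns * words_per_column

# ASSUMPTION: Greek text and its Latin translation take up equal space,
# so at most half of the source words are Greek.
greek_upper_bound = total_source_words // 2

print(f"source share:      {source_share:.1%}")
print(f"secondary share:   {secondary_share:.1%}")
print(f"source words:      {total_source_words:,}")
print(f"Greek upper bound: {greek_upper_bound:,}")
```

The result is an upper bound only, since authors with extensive Latin notes or no Greek text push the real figure lower.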

The PG was a vast enterprise that built upon centuries of work and cites c. 180 different editions, with the work of Andrea Gallandi (1709-1779), Angelo Mai (1782-1854), François Combefis (1605-1679), Giovanni Domenico Mansi (1692-1769), and Leo Allatius (1586-1669) as the most frequently cited sources.

We have focused on the following tasks. For details of the TEI XML tagging, you can consult the XML file:
  1. We have provided VIAF IDs for many ancient authors and modern editors.
  2. Each author has been given its own ID in the form of MPG + a 3-digit number.
  3. We have tried to distinguish primary sources from the many secondary sources (commentaries, essays, introductions) that the PG contains.
  4. Each primary source has a work ID of the form MPG + a 4-digit number. Together with the author ID, these allow us to define URNs according to the Canonical Text Services/CITE Architecture.
  5. Secondary sources have an ID of the form SSID + 4 digits. Secondary sources are numbered sequentially through the PG index. The main goal here is to distinguish them from the original source texts.
  6. The 56 editors whose names show up three or more times surrounded by parentheses have each been given a VIAF number and tagged.
  7. The headers for each new author normally contain a date, most commonly the century of main activity recorded as a Roman numeral but occasionally as a date range in Arabic numerals. We have tried to capture and encode these.
  8. We have scanned for references in the form of “volume, column1-column2.” Where the index leaves out the volume number, we assume that we are still using the last recognized volume.
  9. The index commonly truncates number ranges: e.g., 1235-54 instead of 1235-1254. We have tried to expand the truncated numbers and have checked that the resulting ranges are increasing (i.e., column2 does not occur before column1).
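Steps 8 and 9 can be sketched as follows. This is a minimal illustration, not the code used to produce the XML file: the regular expression, the function names `expand_end` and `parse_refs`, and the output tuple layout are all assumptions introduced here, and real index entries will contain surface forms that this pattern does not cover.

```python
import re

# ASSUMED surface forms: "volume, column1-column2" or a bare "column1-column2".
REF = re.compile(r"(?:(\d+),\s*)?(\d+)-(\d+)")

def expand_end(start: str, end: str) -> int:
    """Expand a truncated end column, e.g. ("1235", "54") -> 1254."""
    if len(end) < len(start):
        # Borrow the missing leading digits from the start column.
        end = start[: len(start) - len(end)] + end
    return int(end)

def parse_refs(text: str):
    """Return (volume, column1, column2, ok) tuples for each reference.

    ok is False when the expanded range is not increasing, so that
    suspicious entries can be flagged for manual checking.
    """
    refs = []
    volume = None  # last recognized volume, carried forward (step 8)
    for m in REF.finditer(text):
        if m.group(1) is not None:
            volume = int(m.group(1))
        col1 = int(m.group(2))
        col2 = expand_end(m.group(2), m.group(3))  # step 9
        refs.append((volume, col1, col2, col2 >= col1))
    return refs
```

For instance, `parse_refs("46, 1235-54; 1260-1301")` yields `(46, 1235, 1254, True)` followed by `(46, 1260, 1301, True)`: the second reference has no volume number and inherits volume 46. Ranges that fail the increasing check are returned with `ok` set to `False` rather than silently dropped, which suits a proofreading workflow.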
