Saturday, April 20, 2019

Lace: Greek OCR: 1351 volumes: High-quality OCR of polytonic, or 'ancient', Greek

 [First posted in AWOL 13 December 2013, updated 20 Apr 2019]


Lace: Greek OCR

Overview

This site catalogues the results of our ongoing campaign to produce high-quality OCR of polytonic, or 'ancient', Greek texts in an HPC environment. It comprises 1351 volumes, principally from archive.org, but also from original scans and other resources. There are over 12 million pages of OCR output in total, including experimental and rejected results.
Results are presented in a hierarchical organization, beginning with the volume identifier. Each volume is associated with one or more 'runs', or attempts at OCRing that volume. A run has a date stamp and is associated with a classifier and an aggregate best b-score (roughly indicating the quality of the Greek output). Each run produces various kinds of output. The most important of these are:
  1. raw hocr output: the data generated by our OCR process, usually with multiple copies for each page, rendered at a range of binarization thresholds
  2. selected hocr output: a filtered version of the data in (1), with each page image represented by a single, best, output page. Output based on an older process also provides the following steps:
  3. blended hocr output: the data in (2), except that a word is replaced with the corresponding word from the raw output in (1) whenever the selected page's word is not a dictionary word and one of the raw pages offers a dictionary word in its place.
  4. selected hocr output spellchecked: the data in (3) processed through a weighted Levenshtein distance spellchecking algorithm that is meant to correct simple OCR errors
  5. combined hocr output: where archive.org provides OCR output for Latin script (not Greek), this final step pieces together the data in (4) with archive.org's output, preferring archive.org's output where our output suggests that the data is Latin. If archive.org provides Greek output, this step is no different from (4).
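
Steps (3) and (4) above can be illustrated with a minimal Python sketch. It is not the project's code: the function names, the confusion-cost table, the unit insertion and deletion costs, and the acceptance threshold are all assumptions invented for this illustration, and the sample words are left unaccented for simplicity.

```python
from typing import Iterable, Set

# Hypothetical confusion costs: character pairs that an OCR engine might
# plausibly mix up are cheaper to substitute. The values are invented.
CONFUSION_COST = {
    frozenset(("ο", "σ")): 0.4,
    frozenset(("ν", "υ")): 0.4,
}


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing character a with b (0 if identical, 1 by default)."""
    if a == b:
        return 0.0
    return CONFUSION_COST.get(frozenset((a, b)), 1.0)


def weighted_levenshtein(source: str, target: str) -> float:
    """Edit distance with unit insert/delete costs and weighted substitutions."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1.0,                                    # deletion
                current[j - 1] + 1.0,                                 # insertion
                previous[j - 1] + substitution_cost(s_char, t_char),  # substitution
            ))
        previous = current
    return previous[-1]


def blend_word(selected: str, raw_candidates: Iterable[str], dictionary: Set[str]) -> str:
    """Step (3): keep the selected word unless only a raw variant is in the dictionary."""
    if selected in dictionary:
        return selected
    for candidate in raw_candidates:
        if candidate in dictionary:
            return candidate
    return selected


def spellcheck(word: str, dictionary: Set[str], max_cost: float = 1.0) -> str:
    """Step (4): replace a non-dictionary word with its nearest entry if close enough."""
    if not dictionary or word in dictionary:
        return word
    best = min(dictionary, key=lambda entry: weighted_levenshtein(word, entry))
    return best if weighted_levenshtein(word, best) <= max_cost else word
```

For example, with the toy dictionary `{"λογος", "νομος"}`, `spellcheck("λογσς", ...)` returns `"λογος"`, because the only difference is a substitution listed in the hypothetical confusion table. A scan over the whole dictionary is used here for brevity; a real lexicon would need an index of some kind.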

Code

These data were generated with two different OCR processes. All results since 2014 employed the Ciaconna Greek OCR process. This is based on the Ocropus open source engine, with custom classifiers, image preprocessing and spell-check routines written in Python. Ciaconna's high-level scripts are integrated with Compute Canada's Sharcnet scheduling software, since that facility's resources were used to generate these results.
The earlier process, used from 2012 to 2014, is named 'Rigaudon' and is based on the Gamera image processing library. All code and classifiers for Rigaudon are posted in a GitHub repository. This holds the modified Gamera source code, ancillary Python scripts such as the spellcheck engine, and the bash scripts that coordinate the process in an HPC environment through Sun Grid Engine.
Details of Rigaudon's operation are outlined in a white paper.
Our July 2013 presentation at the London Digital Classicist seminar series is available online from the Institute of Classical Studies.

Web Editing Software

The Lace editing and visualizing software you are now using is available as a package for eXist-db in a GitHub repository. A previous version of Lace, which used Python Flask, is also archived on GitHub.

Context

This is a continuation of efforts begun through the Digging Into Data Round I project Toward Dynamic Variorum Editions, in which, as the project white paper notes, we discovered both the tantalizing potential of Greek OCR and the poor results that OCR engines at that time produced when operating at scale.
In order to bootstrap that process, we adapted the most extensible and successful of the frameworks to that date, the Gamera Greek OCR engine by Dalitz and Brandt. Using the AceNET HPC environment we analyzed a sample of the Google Greek and Latin corpus with twenty classifiers composed by Canadian undergraduate students. From this, we produced a quantitative report on the efficacy of our modified OCR code.
On the basis of this work, we received a 2012/2013 Humanities Computing Grant from Compute Canada, making this large-scale processing possible.
