18thConnect and Open Access Full-Text–Final Report

Laura Mandell
Dir., Initiative for Digital Humanities, Media, and Culture
Texas A&M University
Reference Number: 31000125

Submitted 16 December 2011

Final Report to the Mellon Foundation:
18thConnect and Open Access Full-Text

In working on the development of the OCR engine Gamera for eighteenth-century texts, an engine that could mechanically type the 182,000 texts in the ECCO collection, for an earlier Mellon Officer’s Grant, we discovered that training our OCR engine to recognize a particular, highly used typeface is not enough. When Gamera is excellent, it is indeed excellent, but when it is bad, it is horrid. In this report, I will discuss the TypeWright correction tool (available for download and use at https://github.com/collex) and how it can be used not only for crowd-sourced correction but also for improving OCR engines and optimizing human-computer interaction.

Here is an image from the developer’s view of TypeWright:

Figure 1:

William Paley, Advice Addressed to the Young Clergy (TypeWright p. 2)
Book ID Number 0210201700 (also keyed by TCP).

This image shows an instance of Gamera reading a line in which it is doing a very good job, despite not having been trained specifically in italic fonts. There are only two mistakes in this line, an “l” for a “t” and a period for an end-of-line hyphen. Gamera successfully sorted the “f” and “s” in the phrase “four Gospels,” which demonstrates how well Mike Behrens trained it for recognizing the difference. And indeed, one can see instances in which Gamera by itself outdoes the excellent OCR produced by combining the best of the three top commercial engines:

Gamera got the s’s in “depression,” though it mistook an “i” for an “l,” and, while it got “preserve” in the next line, it missed “respect,” giving us “refpect” instead. (p. 3)

In contrast, the three Gale engines missed both “depression” and “respect” (p. 3):

Figure 2
Gamera vs. Three Commercial Engines + Postprocessing

Though “dignity” and “years” are much better in the Gale version, it must be remembered that the Gale OCR has gone through post-processing, whereas the Gamera output has not: it would have benefited from a dictionary look-up and some rules about impossible n-grams, “dlg” being one of them.
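The two checks mentioned here, a dictionary look-up and a rule about impossible n-grams, can be sketched in a few lines. This is only an illustration under stated assumptions, not the post-processor used by 18thConnect: the word list and the set of banned trigrams are tiny hypothetical stand-ins, the function names are invented for this example, and the misreadings in the usage note are made up for the demonstration.

```python
# Hypothetical stand-ins: a real run would use a period dictionary and a
# much longer list of letter sequences that never occur in English words.
DICTIONARY = {"dignity", "years", "respect", "preserve", "depression"}
IMPOSSIBLE_TRIGRAMS = {"dlg"}


def has_impossible_trigram(word):
    """Return True if any three-letter window of the word is in the banned set."""
    lowered = word.lower()
    return any(lowered[i:i + 3] in IMPOSSIBLE_TRIGRAMS
               for i in range(len(lowered) - 2))


def flag_suspect_words(words):
    """Return the words that fail the dictionary look-up or contain a banned trigram."""
    return [w for w in words
            if w.lower() not in DICTIONARY or has_impossible_trigram(w)]
```

Given the invented readings "dlgnity", "years", and "refpect", flag_suspect_words would return the first and third, which a later step could then try to repair or pass to a human corrector.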

For sheer raw data, Gamera can be very good; it can also be horrid. Look back at Figure 1. While Gamera did a beautiful job with the “Sacred Interpreter” line, the preceding and subsequent lines look terrible. One can see a reason for the subsequent line: the image of the last line in the snippet of page-image, “phrase; and for candidates to Priests orders, carefully to,” is unclear due to quite a bit of bleed-through from ink on the other side of the page. But what happened in the line preceding “Sacred Interpreter”? That transcription isn’t even close. When we scroll back, we can see why:

Figure 3: Gobbledy Gook

As can be seen from the red box, which is drawn from the coordinates that Gamera supplies for one “line,” Gamera has misidentified what counts as a line. Because it is trying to read two lines as one, it has produced random letters instead of anything close to words.

The following images explain how faulty line segmentation is responsible for the major problems that we are encountering.

Figure 4: Gamera’s TEI-A output

Here what you see at first seems comprehensible: “distance when matters of state were to . . . ,” but then what? It’s important to know that our post-processing engine will in fact catch most of the errors in those first words of this line. But the remainder of the sentence makes so little sense that it is impossible, for a human at least, to guess what the remaining words might be. However, if you look at the page image, the reason for this incomprehensibility becomes clear:

This document, like most journal pages, has columns.  Getting an OCR program to recognize columns is difficult.

The line blurs toward the right end of the image, so the OCR trips up a bit more, catching “mended her charms” pretty well, but then reading “claimed” as two words so unrecognizable that no post-processing dictionary look-up could discern them. The last word of the line could perhaps be read well using post-processing algorithms that connect hyphenated words to the first word of the next line.

Unfortunately, that won’t be possible. There is more going wrong here. Notice that, after the unrecognizable attempt to render the first part of the word “familiarity,” the next line of the OCR output begins to give single letters as words. When you scan down the XML output page comparing it to an image of the printed page, you can see that from

– that entire chunk of text – is seen by Gamera as two lines of text, not the 27 lines that it actually is. This is clearly a line segmentation failure. The program cannot find lines, so it is trying to read whole swaths of text as if each were one line of words. Instead of seeing the rest of “familiarity” and then “after what had passed between,” it is trying to read the first marks in successive lines as if they were all one word. It cannot find the space breaking the word, so it defaults to spitting out what it sees character by character. It pulled out the semicolon and colon that it found. Then what it sees with an “l” on the top and an “o” in the middle or toward the bottom looks to the program like a “J,” and an “a” and “y” at the top, with other letters stacked beneath them, looks like a “Y”:

liarity
 them
   the
   turn
   son

Figure 5: Bits of lines made into a letter

Line segmentation problems cause the program to malfunction in irrecuperable ways; that is, no amount of post-processing could derive words from misrecognized lines.
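Although such output cannot be repaired, it might at least be detected, since a missegmented region tends to come out as a run of single characters. The following is a rough heuristic sketch, not a feature of Gamera or TypeWright: the function names are invented, and the 0.5 threshold is an arbitrary assumption that would need tuning against real pages.

```python
def single_letter_ratio(line):
    """Fraction of whitespace-separated tokens that are one character long."""
    tokens = line.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if len(t) == 1) / len(tokens)


def looks_missegmented(line, threshold=0.5):
    """Heuristic: flag a line when more than `threshold` of its tokens are single characters."""
    return single_letter_ratio(line) > threshold
```

A line such as "J ; : Y a l o" would be flagged, while "after what had passed between" would not. Lines flagged this way could be routed to human correction rather than to dictionary-based post-processing.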

Sometimes randomly output letters indicate that the OCR engine is attempting to read pictures as if they were text. Printer’s ornaments and maps are not only “read” as random letters; they also slow processing to a crawl. Were we to run Gamera over the 182,000 ECCO texts while it is having this much difficulty recognizing lines, without limiting the amount of time that it is allowed to process any given page, processing time could extend to as much as six months, and the output of random letters would prove that time to have been ill-spent.

Before turning to some ideas for improving Gamera before we run it, I would like to point out features of the TypeWright Developer’s interface that might be useful to us as we work together to improve early modern OCR. If you look at Figure 1 again, you can see that, toward the top of the screen, one can switch between views of different OCR outputs, here, “Gale” and “Gamera”:

Figure 6: Switching

We can load as many different OCR sources as we wish and switch among them, and may even be able to get the keyed text loaded as well, though locating what has been keyed on the page images might be a little bit tricky.[1]

Toward the bottom of Figure 1, you can also see some “debugging statistics”:

Figure 7: Word Statistics

TypeWright’s Developer’s view also allows us to look at word-count statistics in any given OCR rendition, helping us to formulate post-processing rules both automatically and via human intervention.
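To suggest how word-count statistics could feed rule-making, here is a minimal sketch that counts tokens in an OCR rendition and lists the most frequent ones missing from a dictionary. It does not reproduce how TypeWright computes its statistics: the function names and the dictionary argument are assumptions made for this example, and the tokenizer handles only unaccented ASCII letters.

```python
from collections import Counter
import re


def word_counts(text):
    """Count lower-cased tokens made of the ASCII letters a-z in an OCR rendition."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def frequent_unknown_words(text, dictionary, top_n=10):
    """Return the most frequent tokens absent from the dictionary, with their counts."""
    counts = word_counts(text)
    unknown = Counter({w: c for w, c in counts.items() if w not in dictionary})
    return unknown.most_common(top_n)
```

A non-word that recurs many times in one rendition is a candidate for an automatic substitution rule, while one that appears only once is probably better left to a human corrector.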

Proposed Solutions to Gamera’s Inadequacies:

The first major proposal is to add to TypeWright’s functionality the capacity to adjust the line segmentation using the red box. Right now, one can delete lines from, or add lines to, the original list of lines generated by the OCR engine, inserting above or below an existing line, but one cannot simply redraw the red box. That capacity becomes crucial when lines are misrecognized in the following kinds of ways:

A. Trying to read handwritten annotations

B. Reading individual letters as lines

B. (cont’d) Reading individual letters as lines

C. Including blotch as part of a line, throwing proportions of all letters completely off

D. Grabbing too many lines as if one

Figure 8: Types of Misrecognized Lines
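Some of these misrecognitions, case D in particular, might be flagged automatically so that users know which boxes to redraw first. The sketch below compares each line box’s height with the median height on the page. It rests on several assumptions: the (top, bottom) box format, the function name, and both ratio thresholds are hypothetical, and the median is only a useful yardstick when most boxes on the page are correct.

```python
from statistics import median


def flag_line_boxes(boxes, low=0.5, high=1.8):
    """Label each (top, bottom) box by comparing its height to the page median.

    Returns one label per box: "ok", "too_tall" (possibly several lines
    merged into one), or "too_short" (possibly a fragment read as a line).
    """
    heights = [bottom - top for top, bottom in boxes]
    if not heights:
        return []
    typical = median(heights)
    labels = []
    for h in heights:
        if h > high * typical:
            labels.append("too_tall")
        elif h < low * typical:
            labels.append("too_short")
        else:
            labels.append("ok")
    return labels
```

On a page where nearly every box is wrong, such as one where 27 lines are read as two, the median itself is misleading, so a check of this kind would supplement human redrawing rather than replace it.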

Aletheia, the line-segmentation tool developed for IMPACT by Apostolos Antonacopoulos of PRImA (Pattern Recognition and Image Analysis Research Lab), may help in one of two ways. It could serve as an alternative tool for crowd-sourcing line segmentation, with the crowd’s corrections used to revise line-segmentation algorithms. Or we may be able to incorporate some of Aletheia’s capacities into TypeWright itself, thereby enlisting users to correct texts and train our OCR engine to segment lines better all at the same time.

The second major proposal is to hook up post-processing tools developed by Loretta Auvil at SEASR (Software Environment for the Advancement of Scholarly Research) in such a way that crowd-sourced correction can generate and prioritize principles for correction. Too many principles will slow processing down to a crawl, and again, one would be looking at six months of processing time, even on a high-performance computing cluster such as Brazos here at Texas A&M. But the crowd can help us figure out not only principles for correction but also the rate at which such principles need to be invoked. Those that need to be invoked most often should be automated.
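One way to picture this prioritization is to count how often the crowd makes each correction and automate only the most frequent ones. The sketch below is an illustration of that idea, not SEASR’s tooling: the input format (pairs of OCR word and corrected word), the function names, and the min_count cutoff are all assumptions introduced here.

```python
from collections import Counter


def rank_corrections(pairs):
    """Count (ocr_word, corrected_word) pairs, most frequent first; skip unchanged words."""
    counts = Counter((ocr, fixed) for ocr, fixed in pairs if ocr != fixed)
    return counts.most_common()


def rules_to_automate(pairs, min_count=2):
    """Return a substitution table for corrections seen at least `min_count` times."""
    rules = {}
    for (ocr, fixed), n in rank_corrections(pairs):
        if n >= min_count:
            # Ranking is by descending count, so the first fix seen for an
            # OCR form is its most frequent one; keep it and ignore the rest.
            rules.setdefault(ocr, fixed)
    return rules
```

Raising min_count keeps the automated rule set small, which speaks to the concern that too many principles would slow processing to a crawl.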

Image enhancement, though intensely difficult given the constraints under which we labor (the early modern texts available from ProQuest and Gale Cengage Learning have been digitized from microfilm, and no grey-scale versions exist), is another possibility. And finally, more training on multiple fonts and in multiple languages could help tremendously.

Aids in working on Early Modern Texts:

Martin Mueller and Brian Pytlik Zillig have created a ground truth of 2,000 texts that are approaching 100% correct, using the files that have been keyed by the Text Creation Partnership. They have put these files in the same TEI-A that Brian developed for use in TypeWright, so we can load these texts into TypeWright if we can geo-locate the transcribed lines on the available page images (see note 1). This data set will also help us compare rates of correctness in output from various OCR engines, should we wish to use other OCR engines in place of or alongside Gamera. Martin and Brian have also created an enormous 18th-century dictionary containing alternate spellings for post-processing. Gale Cengage Learning may let us use their OCR as part of any voting technology we instigate, including merely replacing their problematic words with possibilities from other OCR engines that are better suited to that particular kind of recognition (Gamera’s long-s success, for instance).
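A very simple form of such voting can be sketched as follows. This is a toy under stated assumptions, not a design anyone has committed to: it presumes that the engines’ outputs have already been aligned word for word (itself a hard problem), the function names are invented, and the sample readings in the usage note are made up rather than taken from the figures.

```python
from collections import Counter


def vote_word(candidates, dictionary):
    """Pick one reading from several engines' readings of the same word.

    A reading found in the dictionary beats one that is not; among equally
    valid readings the most common wins, and any remaining tie goes to the
    engine listed first.
    """
    counts = Counter(candidates)
    return max(candidates,
               key=lambda w: (w.lower() in dictionary, counts[w]))


def vote_line(engine_outputs, dictionary):
    """Vote word by word across engines; assumes the outputs are already aligned."""
    return [vote_word(list(readings), dictionary)
            for readings in zip(*engine_outputs)]
```

If one engine reads "depression refpect dlgnity", a second "deprefsion refpect dignity", and a third "depreffion respect dignity", then with all three correct words in the dictionary the vote yields "depression respect dignity", taking the long-s word from the first engine and the others from its rivals.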

As we go back in time from the eighteenth century, particularly from 1720 down to 1450, OCRing early modern texts becomes more and more problematic. Blackletter as well as multiple Dutch fonts were used in England. Gamera’s creator, Ichiro Fujinaga, uses the engine to read musical notation that changed as rapidly as pre-1720 fonts, and he works in ten-year blocks. It might be possible to do the same with typeface, optimizing OCR engines to work with any given set of texts, grouped according to time-span, printer, city of printing, etc. For working with these earlier texts, Neal Audenaert is proposing to create not an engine but a process that will allow any scholar interested in a set of texts to train OCR engine(s) on those texts. We could set up and host this process, even helping the scholars create Ground Truth out of the 45,000 documents that will have been keyed by the Text Creation Partnership from the EEBO (Early English Books Online) catalogue. In other words, if we can establish a process for handling the 182,000 ECCO texts and producing 99%-correct results, we can open that process, and help others set it up, for early modern textual collections with similar typefaces and layouts, and even, ultimately, manuscripts: it might be worth it to adjust an OCR engine in order to make it “read” one manuscript.


[1] Two possibilities suggest themselves here, should actually looking at the keyed text in tandem with OCR outputs seem optimal: using Katrina Fenlon’s XSLTs for recognizing whitespace on pages (Katrina Fenlon, “Exploring the Viability of Semi-Automated Document Markup” http://www.ideals.illinois.edu/handle/2142/15411), and using Neal Audenaert’s Visual Page Tool.
