OCR Workshop Proposal

Proposal for a Workshop
OCR Summit Meeting: Preserving the Past for the Future
October 17-18, 2011

Narrative

Very few people first coming to new digital tools such as the Google n-gram viewer are aware of the degree to which the mechanically typed text that it searches is deeply flawed, so flawed in fact as to falsify the results of their queries.  And very few digital humanists are aware as they present their new tools to scholars that deeply flawed results based on deeply flawed data make their tools appear broken or useless.  A Mellon Officer’s grant gave us the capacity to train an OCR engine, Gamera, on a specific set of fonts most commonly used in the eighteenth century.  Though the training proved effective, Gamera does not yet work well enough to be run on the 182,000 page images that have been loaned to us for OCR research by Gale-Cengage Learning: to run them at this point would be a waste of high-performance computing resources.

Though expert at recognizing the difference between the long-‘s’ and ‘f,’ a problem that cannot be resolved by post-processing, Gamera has a great deal of trouble finding the lines it needs to read.  This slows processing and produces completely inaccurate mechanical transcriptions of the images.  It also cannot identify maps, images, or printer’s flourishes and decorations, and it wastes time trying to read them as text, again spoiling the output.[1]

Several groups around the world have been working on the problem of faulty line segmentation in OCR of early-modern and eighteenth-century texts, specifically from the page images that we have, owned primarily by Gale-Cengage Learning and ProQuest.  There has of course been research in computer science on this problem, but no workshop has yet brought those people together to determine collectively a way forward in actually accomplishing the task of getting accurate text out of EEBO (Early English Books Online), ECCO (Eighteenth-Century Collections Online), early nineteenth-century periodicals and texts produced by Gale-Cengage Learning and Adam Matthew Digital, the NCSE (Nineteenth-Century Serials Edition), the Burney Newspaper Collection, British Library-produced early nineteenth-century texts, and early modern texts in languages other than English.

The way forward may not be through machines alone but through determining optimum machine-human interaction.  The Mellon Officer’s grant enabled 18thConnect to build TypeWright, a crowd-sourced correction tool that many groups are now interested in using.  There are two ways that this tool can help us with the large scale of texts and problems that we collectively face:

  1. Multiple installations of the tool can be set up to communicate to one central place: corrections that people make to texts can be saved to a single authoritative version of these texts, and data collected within the TEI encoding of these texts can be used to research and generate:
    1. OCR training
    2. post-processing principles
    3. research into image analysis
  2. TypeWright needs to incorporate or be accompanied by a tool that allows humans to adjust the lines that Gamera thinks it has found and to tag images indicating to the engine that it should skip those images.  That tool has been built by the IMPACT (Improving Access to Text) group spearheaded by the National Library of the Netherlands: it is called Aletheia, and its creator in the UK, Apostolos Antonacopoulos, is willing and able to come to our summit.  As with TypeWright, Aletheia can be installed in various locations, and humans correcting the lines will generate information crucial to:
    1. OCR training
    2. research into image analysis

Here follows a description of the individuals whom we would like to invite to this summit and the broad outlines of a schedule for the event:

We would like to hold an OCR summit here at Texas A&M on October 17–18 in order to get the people working on these problems to see the extent to which we can share our work rather than duplicating it and thereby wasting precious resources in humanities and library computing.  All of these people and groups are interested in working together.  Here follows a table of the invitees and their specialization:

| Project | Name(s) | Location | Specialty |
| --- | --- | --- | --- |
| IMPACT / KB (Improving Access to Text) | Clemens Neudecker, Hildelies Balk | KB: National Library of the Netherlands | OCR in general, workflows for OCR, OCR tools |
| IMPACT / BL (British Library) | Niall Anderson | British Library | Image enhancement for OCR |
| IMPACT / PRImA (Pattern Recognition and Image Analysis Research Lab) | Apostolos Antonacopoulos | University of Salford, Manchester, UK | Aletheia (tool for crowd-sourced correction of line-segmentation); image enhancement |
| SEASR (Software Environment for the Advancement of Scholarly Research) | Loretta Auvil | School of Informatics, Illinois Univ. | Post-processing dictionary look-ups and n-gram analysis |
| Dynamic Variorum Editions (classical texts), with Greg Crane | Bruce Robertson, Greg Crane | Mount Allison, New Brunswick, CA, and Tufts University | Gamera Line Segmentation; Squeegee (a DVE Greek OCR Viewer) |
| ARTFL Project | Peter Leonard, Tim Allen | Univ. of Chicago | Gamera & OCRopus |
| OCRopodium | Tobias Blanke*, Mike Bryant* | King’s College London | OCRopus |
| 18thConnect / ARC (Applied Research Consortium) | Laura Mandell (contract: Kristin Jensen, Ed Zavada) | Texas A&M University | Gamera, TypeWright (Performant Software) |
| Early Modern ARC, Bamboo Corpora Space, ABBOTT | Martin Mueller*, Doug Downey*, Brian Pytlik Zillig* | Northwestern Univ.; Univ. of Nebraska | Textual data from the Text Creation Partnership / Bamboo; Human-Computer Interaction; ABBOTT |
| CSDL (Center for the Study of Digital Libraries) and Supercomputing Cluster | Rick Furuta, James Caverlee, Frank Shipman, Guy Almes | Texas A&M University | Human-Computer Interaction; Line segmentation; image analysis; data management systems at scale |
| JISC Collections | Michael Upshall, Caren Milloy | JISC Collections, London, UK | Distributors of EEBO/ECCO data platforms in UK Universities |
| Texas A&M University Libraries / Cushing Rare Books | Holly Mercer, J. Lawrence Mitchell, Eduardo Urbina | Texas A&M University | Post-processing, lexical analysis, collaboration |

*may have to attend virtually, via teleconference

We will invite Google, ProQuest, Gale-Cengage Learning, Adam Matthew Digital, and the Hathi Trust to send representatives if they wish, but of course will not offer to help fund their attendance from any Mellon award.

Each of these contributors and participants brings something important to the table in helping us figure out how best to process digital page images of texts published between 1600 and 1900, a task that will involve optimizing human and computer interaction. I will discuss each group in turn:

Improving Access to Text (IMPACT): The IMPACT group has been working for the last four years on adapting Optical Character Recognition programs to the unique problems of early modern texts. Directed by Hildelies Balk and Clemens Neudecker, the program has brought together a number of European experts in the field to work on particular aspects of the problem, including Apostolos Antonacopoulos of the Pattern Recognition and Image Analysis Lab (PRImA), who has developed Aletheia, a tool that enables human intervention in the otherwise automatic, and frequently faulty, process of line segmentation and image identification within texts (http://vimeo.com/22074310). Much as they would like to, IMPACT cannot simply give us the OCR engine they have developed because it is built upon ABBYY FineReader Engine 10, a proprietary piece of software. They have, however, developed open access tools in relation to this work, such as Aletheia. Another focus of our discussion will be how to transfer training libraries to other OCR engines. European-based projects work heavily in Fraktur and Dutch types from the seventeenth century, all of which were in widespread use in England before 1720. Finally, should we get a grant, we will invite Niall Anderson from the British Library to demonstrate image-enhancement tools (de-skewing and border removal).[2]

Software Environment for the Advancement of Scholarly Research (SEASR): The SEASR group at the University of Illinois, now under the leadership of John Unsworth at the Illinois Informatics Institute, created the Meandre Workbench which was used by MONK. They are working closely with Ted Underwood who is performing data-mining operations on early-modern and nineteenth-century digitized texts, helping him to clean up the OCR problems in a way that is sufficient for such tasks. Their post-processing dictionary look-up has been used by 18thConnect to improve data run through Gamera and performs extremely well even given the messiest of data.
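To make the idea of a post-processing dictionary look-up concrete, here is a minimal sketch in Python. It is not SEASR's implementation: the function names, the toy lexicon, and the similarity cutoff are all assumptions chosen for illustration, and a real run would load a period-appropriate word list.

```python
import difflib

# Hypothetical toy lexicon, for illustration only.
LEXICON = {"the", "first", "house", "of", "is", "such"}

def correct_token(token, lexicon=LEXICON, cutoff=0.7):
    """Return the token if it is a known word; otherwise return the closest
    lexicon entry above the similarity cutoff, or the token unchanged."""
    word = token.lower()
    if word in lexicon:
        return token
    # sorted() only makes the candidate order deterministic.
    matches = difflib.get_close_matches(word, sorted(lexicon), n=1, cutoff=cutoff)
    return matches[0] if matches else token

def correct_line(line):
    """Apply the look-up to each whitespace-separated token of an OCR line."""
    return " ".join(correct_token(tok) for tok in line.split())
```

For example, `correct_line("the firft houfe")` maps the misread tokens back to lexicon words. A look-up of this kind cannot help when the misreading is itself a valid word (long-s "same" read as "fame"), which is why that distinction has to be made by the engine.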

The Dynamic Variorum Edition (DVE) was funded by a Digging into Data Challenge Grant and built a prototype of the DVE Greek OCR viewer that may, if we can work together, incorporate into it some TypeWright-like, crowd-sourced correction features. This group has been working on chaining the Gamera segmentation algorithms, using something like smearing to get the entire columns’ rectangle, and then treating each of these blocks as a new image with its own segmentation algorithm, usually macmillan. After some false starts with python seg. faults and so forth (!), my student informs me today that by saving the first stage’s results to a disk, we’re getting proper results, and since our super computing facility provides each node with a small scratch disk, we should be able to apply this. I expect this approach would be of interest to the people you are thinking of gathering. (Email from Bruce Robertson to Laura Mandell 7/25/2011)
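The chained approach described in that email can be sketched roughly as follows. This is a simplified illustration in plain Python, not Gamera's API: the page is a small binary grid, "smearing" is approximated by bridging small gaps in a projection profile, and every function name is hypothetical. The sketch keeps the first stage's results in memory rather than saving them to disk.

```python
def runs(profile, max_gap=0):
    """Return (start, end) ranges where profile > 0, merging ranges that are
    separated by at most max_gap empty positions (a crude 'smearing')."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    merged = []
    for span in spans:
        if merged and span[0] - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], span[1])
        else:
            merged.append(span)
    return merged

def find_columns(image, max_gap=1):
    """Stage 1: smear horizontally and return the x-range of each column."""
    width = len(image[0])
    profile = [sum(row[x] for row in image) for x in range(width)]
    return runs(profile, max_gap)

def find_lines(image, x0, x1):
    """Stage 2: treat one column as its own image; return text-line y-ranges."""
    profile = [sum(row[x0:x1]) for row in image]
    return runs(profile)

def segment(image, max_gap=1):
    """Chain the two stages: columns first, then lines within each column."""
    return [((x0, x1), find_lines(image, x0, x1))
            for x0, x1 in find_columns(image, max_gap)]

# Toy page: two columns whose text lines sit at different heights.
ROWS = ["110110000000", "000000001111", "110110000000", "000000001111"]
PAGE = [[int(c) for c in row] for row in ROWS]
```

On this toy page, looking for lines across the full width (`find_lines(PAGE, 0, 12)`) yields a single merged band because the two columns' lines interleave, whereas `segment(PAGE)` recovers two separate lines in each column.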

ARTFL: ARTFL of the University of Chicago has begun working on OCR’ing the Encyclopedia, an amazing resource once digitized. It will be completely open access. Their group has tried OCRopus without good results and is currently working on Gamera.

The OCRopodium project, directed by Tobias Blanke at King’s College London, by contrast, is based on OCRopus, and we would like to get these groups together to discuss the various merits of each engine, with an eye to ultimately using them both and developing some kind of voting technology to determine the most accurate renditions.
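One simple form such voting could take is a per-token majority vote across engine outputs. The sketch below is only an illustration under strong assumptions: the outputs are already aligned token by token (real OCR outputs would first need an alignment step), the function name is hypothetical, and ties fall back to the earliest-listed engine. With only two engines every disagreement is a tie, so a useful majority needs a third reading, for instance from another engine or a human correction.

```python
from collections import Counter

def vote(readings):
    """Pick the most common token at each position.

    readings: list of token lists, one per engine, assumed pre-aligned.
    """
    if len({len(r) for r in readings}) != 1:
        raise ValueError("readings must be aligned to the same length")
    result = []
    for candidates in zip(*readings):
        counts = Counter(candidates)
        best = max(counts.values())
        # On a tie, keep the reading from the earliest-listed engine.
        winner = next(tok for tok in candidates if counts[tok] == best)
        result.append(winner)
    return result
```

For instance, three aligned readings of the same line, each with a different single error, can be reconciled with `vote([engine_a, engine_b, engine_c])`, where the three variable names stand for token lists from whichever engines are being compared.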

18thConnect/ARC (Applied Research Consortium): Miami University was funded by the Mellon Foundation and granted access to the ECCO Collections page images in order to run Gamera on them. It has fully trained Gamera to read Caslon and Baskerville types, but cannot run the program on these images until its line segmentation algorithms are much improved. I hope to offer Neal Audenaert, a graduate of the CSDL (#8 below), a postdoc or consulting position to begin working on those algorithms, and would benefit greatly from consulting with the Dynamic Variorum Edition (#3 above).

Bamboo and Early Modern ARC: Martin Mueller has been a primary researcher in the field of text correction, automated and manual, holding conferences, developing tools (Annolex), and working on major big-text projects, from MONK to Bamboo. As part of the ABBOTT project, Martin Mueller and Brian Pytlik Zillig have been transforming into TEI-A the texts that were double-keyed by the Text Creation Partnership. Out of the 2,200 texts, they have found 2000 to be of very high integrity, and they assure me that approximately 2000 can be used as a Ground Truth for training the OCR engine or engines we use.

Center for the Study of Digital Libraries: The center is located at Texas A&M and directed by Rick Furuta. Its participating professors specialize in human-computer interaction, specifically for the sake of optimizing performance (James Caverlee, Frank Shipman), and in text recognition issues. Guy Almes, Director of the High Performance Computing Center at Texas A&M, is interested in conducting research into data management of the sort that would allow us to coordinate corrections coming into a central repository from multiple sites in order to produce researchable data, principles for training and post-processing, and ultimately, of course, our goal: clean text.

JISC Collections: Laura Mandell met with JISC Collections this summer in order to discuss the use of TypeWright for the instances of EEBO and ECCO that they have purchased. We would like to continue the discussion as to how we might install TypeWright in the British system and share corrections.

Texas A&M University Libraries / Cushing Memorial Library and Archives: Holly Mercer is Head of Digital Services and Scholarly Communication at Texas A&M. J. Lawrence (Larry) Mitchell is director of the A&M rare books room and an English professor who teaches history of the book; his research interests include early modern and eighteenth-century dictionaries.

The reason for the October 17–18 date is that the leaders of IMPACT will be in the US October 9–13 attending the ASIS&T conference in order to present their OCR work (they have invited Laura Mandell to the panel). Bruce Robertson can come on those dates, and ARTFL can send one or two people at that time. Apostolos Antonacopoulos of PRImA is available. King’s College participants cannot come physically but are willing to participate via teleconference (they are not included in the budget figures). Others will be invited.

Tentative Schedule:
Day 1, October 17:

  1. Present images
  2. Identify issues
  3. Breakout discussions on specific issues
  4. Common discussion

Afternoon:

  1. Present OCR engines (Tesseract, OCRopus, Gamera, ABBYY engine)
  2. Identify strengths and weaknesses
  3. Breakout discussion on possibilities for engine fortification
  4. Common discussion

Day 2, October 18:

  1. Present Aletheia
  2. Present TypeWright
  3. Present Image Enhancement Tools
  4. Present WikiSource

Afternoon:

  1. Data Management
  2. Way forward

The way forward will include collaborating; we agree on that in advance.  Some of the work we wish to do may be interesting to the National Science Foundation and JISC (the funding agency, not JISC Collections), and we will figure out how to break what we are doing into distinct phases, work groups, and work packages so that the agencies most committed to preserving our cultural heritage can efficiently combine their efforts with the agencies committed to doing research in the fields of image manipulation as well as data flow and management.

This OCR Summit meeting resembles a meeting that was held 12 July 2011 at the British Library (http://impactocr.wordpress.com/), and our summit meeting will ideally include the members of IMPACT who attended that meeting.  I will make certain, however, that we all leave with concrete tasks and proposals for how to accomplish them in a way that benefits all collaborators: Bamboo, IMPACT, JISC Collections, and ARC.

[1] See Final Report to the Mellon Foundation for Mellon Officer’s Grant “18thConnect and Open Access Full-Text” (No. 31000125), by Laura Mandell, forthcoming August 25, 2011.

[2] http://impactocr.wordpress.com/2011/07/12/image-enhancement-talk-from-niall-anderson/
