OCR Process
Several of our data sources were processed directly from downloaded PDFs, which often required converting those files to plain text via optical character recognition (OCR). This page documents how we have performed OCR, both in the past and at present. We hope this information helps users understand the rates and types of OCR error in our data.
Current OCR Pipeline
We begin by converting PDF files to 600-DPI greyscale TIFF images, then run OCR on those images using the Tesseract OCR system. In our testing, this produces plain-text files of a quality comparable to other top-end commercial and non-commercial OCR systems (including ABBYY FineReader, our previous OCR solution, for which we no longer have a license). We are currently using Tesseract version 5.0-alpha-20210401, with the English-language training data from the Google-trained tessdata_best collection.
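As a rough illustration of the two steps above, the following Python sketch builds the command lines for a single PDF. It assumes Poppler's pdftoppm for the rasterization step, which this page does not specify; the function names, file paths, and tessdata directory are hypothetical and are not taken from our actual scripts.

```python
import subprocess
from pathlib import Path


def pdf_to_tiff_cmd(pdf, out_prefix, dpi=600):
    """Build a pdftoppm command that renders each page as a greyscale TIFF.

    Assumption: pdftoppm (Poppler) is the rasterizer; any tool that emits
    600-DPI greyscale TIFFs would serve the same purpose.
    """
    return ["pdftoppm", "-r", str(dpi), "-gray", "-tiff", str(pdf), str(out_prefix)]


def tesseract_cmd(tiff, out_base, tessdata_dir, lang="eng"):
    """Build a Tesseract command; the text is written to <out_base>.txt."""
    return [
        "tesseract", str(tiff), str(out_base),
        "--tessdata-dir", str(tessdata_dir),  # directory holding tessdata_best
        "-l", lang,
    ]


def ocr_pdf(pdf, work_dir, tessdata_dir):
    """Run both steps for one PDF (requires both tools to be installed)."""
    work_dir = Path(work_dir)
    prefix = work_dir / Path(pdf).stem
    subprocess.run(pdf_to_tiff_cmd(pdf, prefix), check=True)
    # pdftoppm writes one file per page, named <prefix>-<page>.tif
    for tiff in sorted(work_dir.glob(prefix.name + "-*.tif")):
        subprocess.run(
            tesseract_cmd(tiff, tiff.with_suffix(""), tessdata_dir), check=True
        )
```

A call such as `ocr_pdf("paper.pdf", "work", "/path/to/tessdata_best")` would leave one text file per page in the hypothetical `work` directory.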
We then post-process the resulting text files to fix known problems that Tesseract introduces. The full details and source code of those post-processing scripts can be found in our scripts repository. In brief (there is currently only one such script):
- Rejoin words that were hyphenated across a line break, which Tesseract leaves split. This process also takes into account the possibility that a genuinely hyphenated compound word falls across a line break.
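The general idea of this fix can be sketched in Python as follows. This is only a minimal illustration and not the script from our repository: it assumes a hypothetical `known_words` set, and it decides between merging and keeping the hyphen by checking whether the merged form is a known word.

```python
import re

# A word fragment, a hyphen at end of line, the continuation on the next line,
# any punctuation attached to the continuation, and one optional space.
_BROKEN_WORD = re.compile(r"([A-Za-z]+)-\n([A-Za-z]+)([^\sA-Za-z]*)[ \t]?")


def join_hyphenated_line_breaks(text, known_words):
    """Rejoin words split across line breaks.

    Assumption: `known_words` is a lowercase word list supplied by the caller.
    If head+tail is a known word, the hyphen is dropped; otherwise the pair is
    treated as a true compound and the hyphen is kept.
    """
    def repl(match):
        head, tail, trailing = match.groups()
        merged = head + tail
        if merged.lower() in known_words:
            word = merged
        else:
            word = head + "-" + tail
        # The rejoined word stays on the first line; the break moves after it.
        return word + trailing + "\n"

    return _BROKEN_WORD.sub(repl, text)


sample = "the conver-\nsion of well-\nknown files.\n"
print(join_hyphenated_line_breaks(sample, {"conversion"}))
```

With this sample input, "conver-" and "sion" are merged into "conversion", while "well-" and "known" are kept as the compound "well-known" because "wellknown" is not in the word list.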