OCR and Scene Text

I have been involved in OCR, text recognition, document analysis, and language modeling since the early 1990s and have built a series of systems. These include a neural-network-based handwriting recognition system for the US Census Bureau in 1995 and various HMM and probabilistic finite state transducer based systems in the late 1990s. Beginning in the early 2000s, my group pioneered the application of recurrent neural networks and LSTMs to large-scale, high-performance document analysis.

Early work required implementing LSTMs and other neural network layers in Python (OCRopus OCR System in Python) and C++ (Small C++ Implementation of LSTM Networks and CTC).

We also introduced a widely used HTML-based microformat for representing document analysis output (Python tools for working with the hOCR format).
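For readers who have not seen hOCR, the following is a minimal sketch of how word-level output in that format can be read using only the Python standard library. It is not taken from the hOCR tools mentioned above; the names `HocrWordParser`, `parse_bbox`, `extract_words`, and the `SAMPLE` document are hypothetical and chosen for illustration. It assumes that words are marked up as `span` elements with class `ocrx_word`, that the bounding box is stored as a `bbox x0 y0 x1 y1` property in the `title` attribute, and that word spans contain no nested spans.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None if absent."""
    # Properties in the title attribute are separated by semicolons,
    # e.g. "bbox 36 92 96 116; x_wconf 95".
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from ocrx_word spans (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans are not nested, so the next closing span ends the word.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    """Parse an hOCR string and return a list of (text, bbox) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# A small hand-written hOCR fragment, used only for illustration.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 640 480'>
  <span class='ocr_line' title='bbox 36 92 210 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 91'>world</span>
  </span>
</div>
"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

Because hOCR is ordinary HTML with layout information carried in `class` and `title` attributes, any HTML parser can consume it; a production tool would more likely use a full DOM or XPath library than this event-based sketch.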

LSTMs and hOCR have become de facto standards for OCR engines, and widely used open source OCR engines are either direct derivatives or reimplementations of the original source code we released.

OCR and document analysis still play important roles in business processes, both in developed and in developing countries. Scene text recognition is a closely related problem.

Surprisingly, OCR is still not a “solved problem”. While error rates for recognizing clean English text are comparable to, or even lower than, human error rates, human-like transcription quality for documents with complex layouts, common forms of noise, and similar difficulties has yet to be achieved, in particular at the throughputs and costs at which commercial systems operate.

OCR is also an interesting theoretical test case for many ideas in pattern recognition and deep learning; unlike mainstream “benchmark tasks”, for OCR, correct answers can be determined unambiguously, sources of errors are well understood, demands for recognition accuracy are very high, and decision theoretic cost and risk analysis is necessary for effective practical use. My group and others have pioneered ideas in areas such as self-supervised learning, semi-supervised learning, and style and context modeling that are only now becoming mainstream in other domains.

My group and I have developed the following series of open source OCR engines over the years:

  • ocropus: a C++-based OCR engine using a combination of deep learning, clever geometric algorithms, and statistical language modeling
  • ocropy (aka ocropus2): an LSTM-based OCR engine using a custom deep learning library implemented in Python and C++
  • ocropus3: an LSTM-based OCR and trainable layout analysis system implemented on top of PyTorch and capable of using GPUs
  • ocropus4: the next generation of ocropus3, using more complex models and offering support for scene text and large scale self-supervised training (currently in development)