Release of the Tesseract 4.1 text recognition system

Tesseract 4.1, an optical character recognition (OCR) system, has been released. It supports recognition of UTF-8 characters and text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. The result can be saved as plain text or in the HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally created in 1985-1995 at a Hewlett Packard laboratory; in 2005 the code was opened under the Apache license, and it has since been developed with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.
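
As a rough illustration of the output formats listed above, the following Python sketch assembles a command line for the tesseract console utility. The helper name build_command, the file names and the language code are hypothetical; the trailing words txt, hocr, alto, pdf and tsv are the names of config files shipped with Tesseract, and the sketch assumes they are installed together with the rus language data.

```python
import shlex

# Config names shipped with Tesseract that select the output format.
OUTPUT_CONFIGS = ("txt", "hocr", "alto", "pdf", "tsv")


def build_command(image, output_base, lang="eng", formats=("txt",)):
    """Return an argv list for the tesseract CLI (hypothetical helper)."""
    unknown = [f for f in formats if f not in OUTPUT_CONFIGS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    # General form: tesseract IMAGE OUTPUTBASE [options...] [configfile...]
    return ["tesseract", image, output_base, "-l", lang, *formats]


# Example: request hOCR and PDF output for a Russian-language scan.
cmd = build_command("scan.png", "result", lang="rus", formats=("hocr", "pdf"))
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where Tesseract is installed; the sketch itself only builds and prints the command.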

Tesseract includes a console utility and the libtesseract library for embedding OCR functionality in other applications. Third-party graphical interfaces that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are offered: a classic one that recognizes text at the level of individual character patterns, and a new one based on a machine learning system that uses an LSTM recurrent neural network, optimized for recognizing entire lines and allowing a significant increase in accuracy. Ready-made trained models are published for 123 languages. To optimize performance, modules are offered that use OpenMP and the AVX2, AVX or SSE4.1 SIMD instructions.
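
The choice between the two engines can be sketched with the console utility's --oem (OCR engine mode) option. The enum and helper names below are hypothetical, as are the file names; the numeric values follow the modes listed in the tesseract command-line help. Note, as an assumption about the installed data, that the classic engine only works with language models that include the legacy data.

```python
from enum import IntEnum


class EngineMode(IntEnum):
    """Values accepted by the tesseract --oem option."""
    LEGACY = 0            # classic character-pattern engine only
    LSTM = 1              # neural network (LSTM) engine only
    LEGACY_AND_LSTM = 2   # both engines combined
    DEFAULT = 3           # whatever the installed language data supports


def engine_args(mode):
    """Return the CLI arguments that select the given engine (hypothetical helper)."""
    return ["--oem", str(int(mode))]


# Example: force the LSTM engine for a single image.
cmd = ["tesseract", "scan.png", "result", *engine_args(EngineMode.LSTM)]
print(" ".join(cmd))
```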

Main improvements in Tesseract 4.1:

  • Added the ability to output in the ALTO (Analyzed Layout and Text Object) XML format. To use this format, run the application as "tesseract image_name output_dir alto";
  • Added new rendering modules LSTMBox and WordStrBox, which make it easier to train the engine;
  • Added support for character boxes in hOCR (HTML) output;
  • Added alternative scripts written in Python to train the engine based on machine learning;
  • Extended optimizations using AVX, AVX2 and SSE instructions;
  • OpenMP support is disabled by default because of performance problems;
  • Added support for whitelists and blacklists in the LSTM engine;
  • Improved build scripts based on CMake.
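
Two of the items above, ALTO output and character whitelists and blacklists in the LSTM engine, can be combined in one invocation. The Python sketch below only builds the command line; the helper name and file names are hypothetical, while tessedit_char_whitelist and tessedit_char_blacklist are existing Tesseract configuration variables passed with the -c option. The second positional argument is treated by the utility as the base name of the output file, so the ALTO result should end up in a file such as receipt.xml.

```python
def restricted_alto_command(image, output_base, whitelist=None, blacklist=None):
    """Build a tesseract command for ALTO output with character filters.

    Hypothetical helper: it only assembles arguments and runs nothing.
    """
    cmd = ["tesseract", image, output_base, "--oem", "1"]  # 1 = LSTM engine
    if whitelist:
        # Only these characters may appear in the recognized text.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    if blacklist:
        # These characters are never produced.
        cmd += ["-c", f"tessedit_char_blacklist={blacklist}"]
    cmd.append("alto")  # config name that enables the ALTO renderer
    return cmd


# Example: recognize digits only and write the result in ALTO format.
digits_cmd = restricted_alto_command("receipt.png", "receipt", whitelist="0123456789")
print(" ".join(digits_cmd))
```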

Source: opennet.ru
