Release of the Tesseract 5.2 text recognition system

Tesseract 5.2, an optical character recognition (OCR) system, has been released. It supports recognition of UTF-8 characters and text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. Results can be saved as plain text or in the HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally created in 1985-1995 at a Hewlett Packard laboratory; in 2005 the code was opened under the Apache license, and it has since been developed with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.

Tesseract includes a console utility and the libtesseract library for embedding OCR functionality in other applications. Third-party GUIs that support Tesseract include gImageReader, VietOCR, and YAGF. Two recognition engines are offered: a classic one that recognizes text at the level of individual character patterns, and a newer one based on an LSTM recurrent neural network, which is optimized for recognizing entire lines of text and provides a significant increase in accuracy. Pre-trained models have been published for 123 languages. To improve performance, modules using OpenMP and the SIMD instruction sets AVX2, AVX, AVX512F, NEON or SSE4.1 are provided.
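As an illustration of the console utility, the engine choice and the output formats described above, here is a minimal Python sketch that only assembles a tesseract command line. The helper name build_command, the two dictionaries and the file names are hypothetical; the sketch assumes the standard command-line form (image, output base, then -l, --oem and a bundled output config name) and that --oem 0 selects the classic engine while --oem 1 selects the LSTM engine.

```python
# Output formats named in the article, mapped to Tesseract's bundled config names.
OUTPUT_CONFIGS = {"text": "txt", "hocr": "hocr", "alto": "alto", "pdf": "pdf", "tsv": "tsv"}

# --oem values: 0 = classic character-pattern engine, 1 = LSTM engine.
ENGINE_MODES = {"legacy": 0, "lstm": 1}


def build_command(image, output_base, lang="eng", fmt="text", engine="lstm"):
    """Return the argument list for a single tesseract invocation.

    Hypothetical helper: it builds the list but does not run anything.
    """
    return [
        "tesseract", image, output_base,
        "-l", lang,                          # one or more languages, e.g. "rus+eng"
        "--oem", str(ENGINE_MODES[engine]),  # which recognition engine to use
        OUTPUT_CONFIGS[fmt],                 # config name that selects the output format
    ]
```

The resulting list can be passed to subprocess.run() on a machine where tesseract and the required language data are installed; for example, build_command("scan.png", "scan", lang="rus+eng", fmt="hocr") describes a run that writes hOCR output for a Russian and English page.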

Key improvements in Tesseract 5.2:

  • Added optimizations implemented using Intel AVX512F instructions.
  • The C API adds a function for initializing Tesseract with a machine learning model loaded from memory.
  • Added the invert_threshold parameter, which sets the threshold for inverting lines of text. The default value is 0.7. To disable inversion, set the value to 0.
  • Improved handling of very large documents on 32-bit hosts.
  • Replaced the use of std::regex functions with std::string operations.
  • Improved build scripts for Autotools, CMake and continuous integration systems.
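
The invert_threshold parameter from the list above can be set from the command line with the -c VAR=VALUE option of the console utility. The following Python sketch builds that option; the helper name invert_threshold_args and the 0 to 1 range check are assumptions made for illustration, not something stated in the release notes.

```python
def invert_threshold_args(threshold=0.7):
    """Build the -c option that sets invert_threshold.

    Hypothetical helper. 0.7 is the default named in the release notes and
    0 disables inversion; the 0..1 range check is an assumption.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold is expected to lie between 0 and 1")
    # The "g" format drops trailing zeros: 0.7 -> "0.7", 0 -> "0".
    return ["-c", f"invert_threshold={threshold:g}"]
```

For example, ["tesseract", "page.png", "page", *invert_threshold_args(0)] describes a run with inversion disabled; page.png and page are placeholder names.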

    Source: opennet.ru
