Release of Tesseract 5.1 text recognition system

Tesseract 5.1, an optical character recognition (OCR) system, has been released. It supports recognition of UTF-8 characters and text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. Results can be saved as plain text or in HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally developed at a Hewlett-Packard laboratory between 1985 and 1995; in 2005 the code was opened under the Apache license, and development continued with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.

Tesseract includes a console utility and the libtesseract library for embedding OCR functionality into other applications. Third-party graphical interfaces that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are offered: a classic one that recognizes text at the level of individual character patterns, and a newer one based on an LSTM recurrent neural network, which is optimized for recognizing entire lines of text and significantly improves accuracy. Pre-trained models have been published for 123 languages. To improve performance, modules are provided that use OpenMP and the AVX2, AVX, NEON or SSE4.1 SIMD instructions.
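As a rough illustration of the console utility, the following Python sketch assembles a command line that selects one of the two engines and requests several output formats in a single run. The helper name build_tesseract_command, the file names and the language choice are hypothetical. The -l and --oem options and the txt, hocr, alto, pdf and tsv config names are those of the tesseract command-line tool. Actually running the resulting command assumes that tesseract and the matching language data are installed.

```python
# Output formats named in the text, mapped to tesseract config names.
OUTPUT_CONFIGS = {
    "text": "txt",
    "hocr": "hocr",
    "alto": "alto",
    "pdf": "pdf",
    "tsv": "tsv",
}


def build_tesseract_command(image, output_base, lang="eng",
                            formats=("text",), lstm=True):
    """Return the argument list for one tesseract run (hypothetical helper)."""
    unknown = [f for f in formats if f not in OUTPUT_CONFIGS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    # --oem 1 selects the LSTM engine, --oem 0 the classic (legacy) one.
    cmd = ["tesseract", image, output_base,
           "-l", lang, "--oem", "1" if lstm else "0"]
    # Each config name adds one output file next to output_base.
    cmd += [OUTPUT_CONFIGS[f] for f in formats]
    return cmd


# Illustrative file names; several languages are joined with "+".
cmd = build_tesseract_command("scan.png", "scan", lang="rus+ukr",
                              formats=("text", "hocr", "pdf"))
print(" ".join(cmd))
# prints: tesseract scan.png scan -l rus+ukr --oem 1 txt hocr pdf
```

The returned list can be passed to subprocess.run as is. Building the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.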

Key improvements in Tesseract 5.1:

  • Image and line regions are now handled when producing output in ALTO, hOCR and text formats.
  • Added a new parameter, curl_timeout, for curl_easy_setopt.
  • The build system has been improved.
  • Unused code has been removed.
  • Fixed crashes caused by incorrect handling of null pointers in the PageIterator::Orientation method.
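
To illustrate the curl_timeout item in the list above: tesseract config variables can be set on the command line with -c NAME=VALUE. The sketch below rests on several assumptions that are not stated in the release notes: that tesseract was built with libcurl support, so that the image argument may be a URL, and that the value is interpreted as a number of seconds. The helper name and the URL are hypothetical.

```python
def build_url_ocr_command(url, output_base, timeout_seconds, lang="eng"):
    """Build a tesseract command that reads its image from a URL.

    Hypothetical helper; assumes a libcurl-enabled tesseract build and
    that curl_timeout is given in seconds.
    """
    if timeout_seconds < 0:
        raise ValueError("timeout must be non-negative")
    return [
        "tesseract", url, output_base,
        "-l", lang,
        # -c sets a config variable; curl_timeout is the parameter added in 5.1.
        "-c", f"curl_timeout={int(timeout_seconds)}",
    ]


# Illustrative URL and output name only.
print(" ".join(build_url_ocr_command("https://example.com/page.png", "page", 10)))
# prints: tesseract https://example.com/page.png page -l eng -c curl_timeout=10
```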

Source: opennet.ru
