Release of Tesseract 5.0 text recognition system

Tesseract 5.0, an optical character recognition (OCR) system, has been released. It supports recognition of UTF-8 characters and texts in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. The result can be saved as plain text or in HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally created in 1985-1995 at a Hewlett Packard laboratory; in 2005 the code was opened under the Apache license and further developed with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.

Tesseract includes a console utility and the libtesseract library for embedding OCR functionality into other applications. Third-party graphical front ends that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are offered: a classic one that recognizes text at the level of individual character patterns, and a newer one based on an LSTM recurrent neural network, which is optimized for recognizing entire lines of text and significantly increases accuracy. Pre-trained models have been published for 123 languages. To improve performance, modules using OpenMP and the SIMD instruction sets AVX2, AVX, NEON or SSE4.1 are offered.
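
As a rough illustration of the console utility, the following Python sketch assembles a tesseract command line that requests some of the output formats mentioned above. The helper names build_tesseract_command and run_ocr and the file names are hypothetical. The sketch assumes that the tesseract binary is on PATH, that the standard output config files (txt, hocr, alto, pdf, tsv) ship with the installation, and that the trained data for the requested languages is installed.

```python
import shlex
import subprocess

# Output formats, selected on the tesseract command line as trailing config names.
KNOWN_FORMATS = ("txt", "hocr", "alto", "pdf", "tsv")


def build_tesseract_command(image, output_base, lang="eng", formats=("txt",)):
    """Return the argument list for one tesseract run (hypothetical helper)."""
    unknown = [f for f in formats if f not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    # General usage pattern: tesseract IMAGE OUTPUTBASE -l LANG [CONFIGFILE...]
    return ["tesseract", image, output_base, "-l", lang, *formats]


def run_ocr(image, output_base, lang="eng", formats=("txt",)):
    """Run tesseract; needs the binary and language data to be installed."""
    cmd = build_tesseract_command(image, output_base, lang, formats)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the command here, so the sketch runs without tesseract installed.
    cmd = build_tesseract_command(
        "scan.png", "scan", lang="rus+ukr", formats=("hocr", "pdf")
    )
    print(shlex.join(cmd))
```

Keeping command construction separate from execution makes it possible to inspect the command without having Tesseract installed; run_ocr would then write files such as scan.hocr and scan.pdf next to the chosen output base.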

Key improvements in Tesseract 5.0:

  • The major version bump reflects API changes that break compatibility. In particular, the public libtesseract API no longer depends on the project's own GenericVector and STRING data types; std::vector and std::string are used instead.
  • The source tree has been reorganized. The public header files have been moved to the include/tesseract directory.
  • Memory management has been reworked: all calls to malloc and free have been replaced with C++ code. The code base has been modernized in general.
  • Added optimizations for ARM and ARM64 architectures; ARM NEON instructions are used to speed up calculations. General performance optimizations have been made for all architectures.
  • New modes for model training and text recognition based on floating point calculations have been implemented. The new modes offer higher performance and lower memory consumption. In the LSTM engine, the fast float32 mode is enabled by default.
  • Unicode normalization now uses the NFC form (Normalization Form C, canonical composition).
  • Added an option to configure the logging level (--loglevel).
  • The Autotools-based build system has been reworked and now builds in non-recursive mode.
  • The 'master' branch in Git has been renamed to 'main'.
  • Added support for new releases of macOS and for Apple systems based on the M1 chip.
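
The NFC item in the list above can be illustrated without Tesseract itself. The sketch below uses Python's standard unicodedata module to show what normalization to NFC does to a Cyrillic letter that can be encoded in two ways. It only illustrates the Unicode form and is not Tesseract's own code; the helper name to_nfc is hypothetical.

```python
import unicodedata


def to_nfc(text):
    """Normalize text to Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# Cyrillic "й" written two ways: one precomposed code point (U+0439),
# and the letter "и" (U+0438) followed by a combining breve (U+0306).
precomposed = "\u0439"
decomposed = "\u0438\u0306"

# The two strings look the same on screen but compare as different.
print(precomposed == decomposed)
# After NFC normalization both are the single precomposed code point.
print(to_nfc(precomposed) == to_nfc(decomposed))
print(len(decomposed), len(to_nfc(decomposed)))
```

Normalizing both training text and recognized text to one form avoids treating visually identical strings as different, which is the usual motivation for picking a single normalization form.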

    Source: opennet.ru
