Tesseract 5.5.0, an optical character recognition system, has been released. It supports Unicode and recognizes text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. Results can be saved as plain text or in HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally developed at the Hewlett-Packard laboratory between 1985 and 1995; in 2005 the code was opened under the Apache license, and development continued with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.
Tesseract includes a command-line utility and the libtesseract library for embedding text recognition into other applications. Third-party graphical interfaces that support Tesseract include gImageReader, VietOCR, and YAGF. Two recognition engines are offered: a classic one that recognizes text at the level of individual character patterns, and a newer one built on an LSTM recurrent neural network, which is optimized for recognizing entire lines and significantly improves accuracy. Pre-trained models are published for 123 languages. To improve performance, modules are provided that use OpenMP and the SIMD instruction sets AVX2, AVX, AVX512F, NEON, or SSE4.1.
Main improvements:
- Added support for the RISC-V vector extensions, which are used for assembly-level optimizations on systems with RISC-V processors.
- When writing results in hOCR format, the ocrp_dir and ocrp_lang parameters are now set in the generated file.
- Modernized the code for detecting available language models.
- Improved the code for generating hOCR files and removed file name conversion on Windows.
- The "--oem" and "--psm" options now accept symbolic values.
- In the code, calls to the access and _access functions have been replaced with the std::filesystem::exists() function, and the tprintf functions have been replaced with the tesserr stream.
- Removed support for the TensorFlow machine learning platform, which had been implemented at one point but was never used to run AI recognition models.
- Improved the installer for Windows.
- The googletest submodule has been updated to version 1.15.2.
Source: opennet.ru
