Release of the Tesseract 5.1 text recognition system

The Tesseract 5.1 optical text recognition system has been released. It supports recognition of UTF-8 characters and of text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. Results can be saved as plain text or in the HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally developed in 1985-1995 at a Hewlett Packard laboratory; in 2005 the code was opened under the Apache license and development continued with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.
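As a rough illustration of how the languages and output formats above are chosen on the command line, the following Python sketch assembles an argument list for the tesseract utility. The helper name build_ocr_command and the file names are hypothetical; the sketch assumes the stock output configs (txt, hocr, alto, pdf, tsv) and the usual traineddata codes such as rus and ukr are installed. It only builds the command and does not run it.

```python
from typing import List, Sequence

# Output config names assumed to ship with a standard Tesseract install.
KNOWN_OUTPUTS = ("txt", "hocr", "alto", "pdf", "tsv")


def build_ocr_command(image: str, output_base: str,
                      langs: Sequence[str] = ("eng",),
                      outputs: Sequence[str] = ("txt",)) -> List[str]:
    """Return an argv list for the tesseract CLI (hypothetical helper)."""
    unknown = [o for o in outputs if o not in KNOWN_OUTPUTS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    # Several languages are joined with '+', e.g. "rus+ukr".
    return ["tesseract", image, output_base, "-l", "+".join(langs), *outputs]


if __name__ == "__main__":
    cmd = build_ocr_command("scan.png", "result", ["rus", "ukr"], ["hocr", "pdf"])
    print(" ".join(cmd))  # tesseract scan.png result -l rus+ukr hocr pdf
```

The resulting list can be passed to subprocess.run on a machine where tesseract is installed; each requested format is written next to the given output base name.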

Tesseract includes a console utility and the libtesseract library for embedding OCR functionality in other applications. Third-party GUIs that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are provided: a classic one that recognizes text at the level of individual character patterns, and a newer one based on machine learning with an LSTM recurrent neural network, which is optimized for recognizing whole lines and gives a significant increase in accuracy. Ready-made trained models have been published for 123 languages. To optimize performance, modules that use OpenMP and the AVX2, AVX, NEON or SSE4.1 SIMD instructions are provided.
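The choice between the two engines is exposed in the console utility through the --oem option. The sketch below maps the documented mode numbers to readable names; the EngineMode enum and the engine_args helper are illustrative names and not part of Tesseract itself.

```python
from enum import IntEnum
from typing import List


class EngineMode(IntEnum):
    """OCR engine modes accepted by `tesseract --oem N`."""
    LEGACY = 0           # classic character-pattern engine only
    LSTM = 1             # LSTM neural network engine only
    LEGACY_AND_LSTM = 2  # both engines combined
    DEFAULT = 3          # whatever the installed models support


def engine_args(mode: EngineMode) -> List[str]:
    """Return the CLI arguments that select the given engine (hypothetical helper)."""
    return ["--oem", str(int(mode))]


if __name__ == "__main__":
    print(engine_args(EngineMode.LSTM))  # ['--oem', '1']
```

These arguments would be appended to a tesseract command line after the image and output base names.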

Main changes in Tesseract 5.1:

  • Regions containing images and lines can now be processed when producing output in the ALTO, hOCR and text formats.
  • A new curl_timeout parameter has been added for curl_easy_setopt.
  • The build system has been improved.
  • Unused code has been cleaned up.
  • A crash caused by incorrect handling of null pointers in PageIterator::Orientation has been fixed.
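For the curl_timeout item, a minimal sketch of how the parameter might be passed is shown below. It rests on several assumptions: that tesseract was built with libcurl so that an image URL can be given instead of a file name, that the parameter can be set through the generic -c NAME=VALUE option, and that its value is a number of seconds. The helper name and the URL are placeholders.

```python
from typing import List


def build_url_ocr_command(url: str, output_base: str, timeout_s: int) -> List[str]:
    """Build a tesseract argv list for a remote image (hypothetical helper).

    Assumes a libcurl-enabled build and that curl_timeout is given in seconds.
    """
    if timeout_s <= 0:
        raise ValueError("timeout_s must be a positive integer")
    # -c NAME=VALUE sets a single Tesseract configuration variable.
    return ["tesseract", url, output_base, "-c", f"curl_timeout={timeout_s}"]


if __name__ == "__main__":
    cmd = build_url_ocr_command("https://example.org/page.png", "page", 10)
    print(" ".join(cmd))
```

The sketch only constructs the command; whether the download is actually limited to the given time depends on the build and should be checked against the installed version.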

Source: opennet.ru
