Release of the Tesseract 5.0 text recognition system

A release of the Tesseract 5.0 optical text recognition system has been published. It supports recognition of UTF-8 characters and of text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. Results can be saved as plain text or in the HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally developed in 1985-1995 at the Hewlett Packard laboratory; in 2005 the code was opened under the Apache license, and development continued with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.

Tesseract includes a console utility and the libtesseract library for embedding OCR functionality in other applications. Third-party GUIs that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are provided: a classic engine that recognizes text at the level of individual character patterns, and a newer engine based on a machine learning system built on an LSTM recurrent neural network, which is optimized for recognizing whole lines and allows a significant increase in accuracy. Ready-made trained models have been published for 123 languages. To improve performance, modules that use OpenMP and the AVX2, AVX, NEON or SSE4.1 SIMD instructions are provided.
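As a rough illustration of how the console utility and the output formats mentioned above fit together, here is a minimal Python sketch that only assembles a command line of the form "tesseract IMAGE OUTPUTBASE -l LANG CONFIG...". The helper name build_tesseract_command, the file names and the KNOWN_FORMATS list are assumptions made for this example; the sketch also assumes that the txt, hocr, alto, pdf and tsv config files are present in the local installation. It does not run Tesseract itself.

```python
from typing import List, Sequence

# Output config names assumed to be available in the local tessdata/configs.
KNOWN_FORMATS = ("txt", "hocr", "alto", "pdf", "tsv")


def build_tesseract_command(
    image: str,
    output_base: str,
    lang: str = "eng",
    formats: Sequence[str] = ("txt",),
) -> List[str]:
    """Return an argv list for the tesseract console utility.

    Hypothetical helper: it only builds the argument list and never
    launches the program.
    """
    unknown = [f for f in formats if f not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    # Layout: tesseract <image> <output base name> -l <language> <configs...>
    return ["tesseract", image, output_base, "-l", lang, *formats]


if __name__ == "__main__":
    # Example: request PDF and TSV output for a Russian-language scan.
    cmd = build_tesseract_command("scan.png", "result", "rus", ["pdf", "tsv"])
    print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run on a machine where Tesseract and the relevant language data are installed; Tesseract derives each output file name from the output base name.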

Major changes in Tesseract 5.0:

  • The major version number was bumped because of API changes that break compatibility. In particular, the public libtesseract API is no longer tied to the project's own GenericVector and STRING data types; std::string and std::vector are used instead.
  • The source tree has been reorganized. Public header files have been moved to the include/tesseract directory.
  • Memory management has been reworked: all malloc and free calls have been replaced with C++ code. The code base has been modernized in general.
  • Optimizations have been added for the ARM and ARM64 architectures; ARM NEON instructions are used to speed up computation. Performance optimizations common to all architectures have also been made.
  • New training and text recognition modes based on floating-point calculations have been implemented. The new modes provide higher performance and lower memory consumption. In the LSTM engine, the float32 fast mode is enabled by default.
  • Unicode normalization now uses the NFC form (Normalization Form C, canonical composition).
  • An option for configuring logging has been added (--loglevel).
  • The Autotools-based build system has been reworked and switched to building in non-recursive mode.
  • The "master" branch in Git has been renamed to "main".
  • Support has been added for new macOS releases and for Apple systems based on the M1 chip.
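To make the NFC item in the list above more concrete, here is a small Python sketch of what NFC normalization does to a string. It uses Python's standard unicodedata module and is not Tesseract's own implementation; the helper name to_nfc and the sample character are assumptions chosen for illustration.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (canonical composition).

    Hypothetical helper that wraps the standard library call.
    """
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    # Cyrillic "и" (U+0438) followed by a combining breve (U+0306):
    # two code points that render as the single letter "й".
    decomposed = "\u0438\u0306"
    composed = to_nfc(decomposed)

    # After NFC the pair is replaced by the precomposed letter U+0439.
    print(len(decomposed), len(composed))
    print(composed == "\u0439")
```

Text that is visually identical can therefore differ at the code point level until it is normalized, which matters when recognized text is compared against reference text.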

    Source: opennet.ru
