Tesseract 5.0, an optical text recognition system, has been released. It supports recognition of UTF-8 characters and of text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. The result can be saved as plain text or in HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally developed between 1985 and 1995 at a Hewlett Packard laboratory; in 2005 the code was opened under the Apache license and development continued with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.
Tesseract includes a console utility and the libtesseract library for embedding OCR functionality in other applications. Third-party GUI front ends that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are provided: a classic one that recognizes text at the level of individual character patterns, and a newer one based on a machine learning system built on an LSTM recurrent neural network, which is optimized for recognizing whole lines of text and gives a significant increase in accuracy. Ready-made trained models are published for 123 languages. To improve performance, modules are available that use OpenMP and the SIMD instruction sets AVX2, AVX, NEON or SSE4.1.
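As an illustration of the console utility described above, here is a minimal Python sketch that assembles a `tesseract` command line for a chosen language set, output format and engine. The helper name `build_tesseract_command` and the file names `scan.png` and `result` are hypothetical, and the sketch only builds the argument list; it assumes the standard `-l` and `--oem` options and the stock `txt`, `hocr`, `alto`, `pdf` and `tsv` output configs are available in the installed version.

```python
import shlex

# Output formats named in the article, mapped to Tesseract output config names.
OUTPUT_CONFIGS = {
    "text": "txt",
    "hocr": "hocr",
    "alto": "alto",
    "pdf": "pdf",
    "tsv": "tsv",
}

# OCR engine modes passed to --oem: 0 selects the classic engine, 1 the LSTM engine.
ENGINE_MODES = {"legacy": 0, "lstm": 1}


def build_tesseract_command(image, output_base, languages=("eng",),
                            output="text", engine="lstm"):
    """Return the argument list for one tesseract run (hypothetical helper)."""
    if output not in OUTPUT_CONFIGS:
        raise ValueError(f"unknown output format: {output}")
    if engine not in ENGINE_MODES:
        raise ValueError(f"unknown engine: {engine}")
    return [
        "tesseract", image, output_base,
        "-l", "+".join(languages),           # several languages are joined with '+'
        "--oem", str(ENGINE_MODES[engine]),
        OUTPUT_CONFIGS[output],              # output config comes last
    ]


if __name__ == "__main__":
    cmd = build_tesseract_command("scan.png", "result",
                                  languages=("rus", "ukr"), output="hocr")
    print(shlex.join(cmd))
    # prints: tesseract scan.png result -l rus+ukr --oem 1 hocr
```

The returned list can be passed to `subprocess.run` on a machine where Tesseract and the corresponding language models are installed.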
Major improvements in Tesseract 5.0:
- The major version number changed because of API changes that break compatibility. In particular, the public libtesseract API is no longer tied to the project-specific GenericVector and STRING data types, which have been replaced by std::string and std::vector.
- The source tree has been reorganized. Public header files have been moved to the include/tesseract directory.
- Memory management has been reworked: all malloc and free calls have been replaced with C++ code. The code base as a whole has been modernized.
- Optimizations have been added for the ARM and ARM64 architectures; ARM NEON instructions are used to speed up computations. General performance optimizations have been made for all architectures.
- New modes for model training and text recognition have been implemented, based on 32-bit floating-point (float32) calculations. The new modes deliver higher performance and lower memory consumption. In the LSTM engine, the fast float32 mode is enabled by default.
- Unicode normalization now uses the NFC form (Normalization Form C, canonical composition).
- An option to set the logging level (--loglevel) has been added.
- The Autotools-based build system has been reworked and switched to a non-recursive build mode.
- The "master" branch in Git has been renamed to "main".
- Support has been added for new macOS releases and for Apple systems based on the M1 chip.
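
The NFC normalization mentioned in the list above can be illustrated with a short Python sketch. It uses Python's standard `unicodedata` module, not Tesseract's own code, and the helper name `to_nfc` is hypothetical. It shows how a base letter followed by a combining mark is composed into a single code point, so that visually identical strings compare equal.

```python
import unicodedata


def to_nfc(text):
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    # Cyrillic 'и' (U+0438) followed by a combining breve (U+0306)
    decomposed = "\u0438\u0306"
    composed = to_nfc(decomposed)

    print(len(decomposed), len(composed))  # prints: 2 1
    print(composed == "\u0439")            # prints: True ('й' as a single code point)
```

Text that is already in NFC, such as plain ASCII, passes through unchanged.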
Source: opennet.ru