Release of the Tesseract 5.0 text recognition system

The Tesseract 5.0 optical text recognition system has been released. It supports recognition of UTF-8 characters and text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. Results can be saved as plain text or in the HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally created in 1985-1995 at a Hewlett Packard laboratory; in 2005 the code was opened under the Apache license and was developed further with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.

Tesseract includes a console utility and the libtesseract library for embedding OCR functionality in other applications. Third-party GUI front ends that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are offered: a classic one that recognizes text at the level of individual character patterns, and a newer one based on a machine learning system built on a recurrent LSTM neural network, which is optimized for recognizing whole lines and gives a significant increase in accuracy. Ready-made trained models have been published for 123 languages. To improve performance, modules are provided that use OpenMP and the AVX2, AVX, NEON or SSE4.1 SIMD instructions.
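As a rough illustration of how the console utility, the output formats and the two engines fit together, the following minimal Python sketch assembles a command line for the `tesseract` utility. The helper name `build_tesseract_command` and the file names are hypothetical; the sketch assumes the usual `tesseract IMAGE OUTPUTBASE [options] [configfile]` calling convention, the `-l` language option, the `--oem` engine option (0 for the classic engine, 1 for LSTM) and the stock output config names `txt`, `hocr`, `alto`, `pdf` and `tsv`. It only builds the argument list and does not run anything.

```python
from typing import List, Optional

# Maps a friendly format name to the output config name passed to tesseract.
OUTPUT_CONFIGS = {
    "text": "txt",
    "hocr": "hocr",
    "alto": "alto",
    "pdf": "pdf",
    "tsv": "tsv",
}

# Values for --oem: 0 selects the classic engine, 1 the LSTM engine.
ENGINE_MODES = {"legacy": 0, "lstm": 1}


def build_tesseract_command(
    image: str,
    output_base: str,
    lang: str = "eng",
    output: str = "text",
    engine: Optional[str] = None,
) -> List[str]:
    """Return an argument list for the tesseract console utility (hypothetical helper)."""
    if output not in OUTPUT_CONFIGS:
        raise ValueError(f"unknown output format: {output}")
    if engine is not None and engine not in ENGINE_MODES:
        raise ValueError(f"unknown engine: {engine}")

    cmd = ["tesseract", image, output_base, "-l", lang]
    if engine is not None:
        cmd += ["--oem", str(ENGINE_MODES[engine])]
    # The output config name goes last, after all options.
    cmd.append(OUTPUT_CONFIGS[output])
    return cmd


if __name__ == "__main__":
    # Example: Russian text recognized with the LSTM engine, saved as hOCR.
    print(" ".join(build_tesseract_command("scan.png", "result", "rus", "hocr", "lstm")))
```

The resulting list can be passed to `subprocess.run` on a machine where Tesseract and the relevant language data are installed.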

Main improvements in Tesseract 5.0:

  • The major version number change is due to API changes that break compatibility. In particular, the public libtesseract API is no longer tied to the project-specific GenericVector and STRING data types, which have been dropped in favor of std::string and std::vector.
  • The source tree has been reorganized. Public header files have been moved to the include/tesseract directory.
  • Memory management has been reworked: all malloc and free calls have been replaced with C++ code. A general modernization of the code has been carried out.
  • Optimizations have been added for the ARM and ARM64 architectures; ARM NEON instructions are used to speed up computations. Performance optimizations common to all architectures have also been made.
  • New modes for model training and text recognition based on floating-point calculations have been implemented. The new modes offer higher performance and lower memory consumption. In the LSTM engine, the fast float32 mode is enabled by default.
  • Unicode normalization now uses the NFC form (Normalization Form C, canonical composition).
  • An option has been added to set the logging level (--loglevel).
  • The Autotools-based build system has been reworked and switched to building in non-recursive mode.
  • The "master" branch in Git has been renamed to "main".
  • Support has been added for new macOS releases and for Apple systems based on the M1 chip.
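To make the NFC item in the list above more concrete, here is a small Python sketch of what NFC normalization does to a string. It uses only the standard `unicodedata` module and is not Tesseract's own code; the helper name `to_nfc` is hypothetical. The Cyrillic letter "й" can be stored either as one code point (U+0439) or as "и" (U+0438) followed by a combining breve (U+0306); NFC maps both spellings to the single composed code point, so text that looks identical also compares equal.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# Two encodings of the same visible letter.
composed = "\u0439"          # single code point: CYRILLIC SMALL LETTER SHORT I
decomposed = "\u0438\u0306"  # base letter followed by COMBINING BREVE

if __name__ == "__main__":
    # Before normalization the strings differ; after NFC they are identical.
    print(composed == decomposed)
    print(to_nfc(composed) == to_nfc(decomposed))
    print(len(decomposed), len(to_nfc(decomposed)))
```

Applying the same normalization to both training text and recognized output is one way to keep comparisons consistent, though how Tesseract applies it internally is not described here.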

    source: opennet.ru
