Release of the Tesseract 5.1 text recognition system

Version 5.1 of the Tesseract optical character recognition (OCR) system has been released. It recognizes UTF-8 characters and text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. Results can be saved as plain text or in the hOCR (HTML), ALTO (XML), PDF and TSV formats. The system was originally developed in 1985-1995 at a Hewlett Packard laboratory; in 2005 the code was opened under the Apache license, and development continued with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.
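As an illustration of how the languages and output formats above are selected, here is a minimal Python sketch that assembles a command line for the tesseract console utility. The `-l` option and the output config names (`txt`, `hocr`, `alto`, `pdf`, `tsv`) are part of the standard command-line interface; the helper name `build_ocr_command` and the file names are hypothetical, and the language codes assume that the matching traineddata files are installed.

```python
from typing import List, Sequence

# Maps the formats named in the article to tesseract CLI config names.
OUTPUT_CONFIGS = {
    "text": "txt",
    "hocr": "hocr",
    "alto": "alto",
    "pdf": "pdf",
    "tsv": "tsv",
}


def build_ocr_command(
    image: str,
    out_base: str,
    languages: Sequence[str],
    formats: Sequence[str],
) -> List[str]:
    """Return an argv list such as: tesseract scan.png scan -l rus+ukr hocr pdf."""
    unknown = [f for f in formats if f not in OUTPUT_CONFIGS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    # Several languages are joined with '+', e.g. "rus+ukr".
    cmd = ["tesseract", image, out_base, "-l", "+".join(languages)]
    # Each config name asks tesseract to write one more output file.
    cmd += [OUTPUT_CONFIGS[f] for f in formats]
    return cmd


if __name__ == "__main__":
    # Hypothetical input file; prints the command instead of running it.
    print(" ".join(build_ocr_command("scan.png", "scan", ["rus", "ukr"], ["hocr", "pdf"])))
```

The resulting list can be passed to `subprocess.run` on a machine where tesseract is installed; the output files are named after `out_base` with a format-specific extension.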

Tesseract consists of a console utility and the libtesseract library, which lets other applications embed OCR functionality. Third-party graphical front ends that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are provided: the classic engine, which recognizes text at the level of individual character patterns, and a newer engine based on a machine learning system built on an LSTM recurrent neural network, which is optimized for recognizing whole lines and gives a significant increase in accuracy. Ready-made trained models have been published for 123 languages. To improve performance, modules are provided that use OpenMP and the AVX2, AVX, NEON or SSE4.1 SIMD instructions.
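The choice between the two engines is exposed on the command line through the `--oem` option. The sketch below encodes the four documented mode values in a Python enum; the names `EngineMode` and `engine_args` are hypothetical and used only for illustration.

```python
from enum import IntEnum
from typing import List


class EngineMode(IntEnum):
    """Values accepted by the tesseract CLI option --oem."""

    LEGACY = 0           # classic character-pattern engine only
    LSTM = 1             # LSTM neural network engine only
    LEGACY_AND_LSTM = 2  # both engines combined
    DEFAULT = 3          # whatever the installed traineddata supports


def engine_args(mode: EngineMode) -> List[str]:
    """Return the CLI arguments that select the given engine."""
    return ["--oem", str(int(mode))]


if __name__ == "__main__":
    # Hypothetical file names; prints the command instead of running it.
    print(" ".join(["tesseract", "page.png", "page", *engine_args(EngineMode.LSTM)]))
```

Note that the classic engine only works when the installed traineddata files contain a legacy model, so `--oem 0` may fail with models that ship LSTM data only.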

Main improvements in Tesseract 5.1:

  • Regions containing images and lines can now be processed when writing output in the ALTO, hOCR and text formats.
  • A new parameter, curl_timeout, has been added for curl_easy_setopt.
  • The build system has been improved.
  • Unused code has been removed.
  • Crashes caused by incorrect pointer handling in PageIterator::Orientation have been fixed.
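For the curl_timeout item above, Tesseract parameters are normally set from the command line with `-c NAME=VALUE`. The following Python sketch builds such a command for an image fetched over the network. It rests on several assumptions that the article does not confirm: that the tesseract binary was built with libcurl so that a URL is accepted as input, that the value is a whole number of seconds, and that the parameter can be set through `-c`. The helper name and the URL are hypothetical.

```python
from typing import List


def build_url_ocr_command(url: str, out_base: str, timeout_s: int) -> List[str]:
    """Return an argv list that sets curl_timeout for a URL input.

    Assumption: timeout_s is interpreted as seconds by tesseract.
    """
    if timeout_s < 0:
        raise ValueError("timeout must not be negative")
    # -c NAME=VALUE is the generic way to set a tesseract parameter.
    return ["tesseract", url, out_base, "-c", f"curl_timeout={timeout_s}"]


if __name__ == "__main__":
    # Hypothetical URL; prints the command instead of running it.
    print(" ".join(build_url_ocr_command("https://example.com/scan.png", "scan", 30)))
```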

Source: opennet.ru