Release of the Tesseract 5.1 text recognition system

The Tesseract 5.1 optical text recognition system has been released. It supports recognition of UTF-8 characters and of text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. Results can be saved as plain text or in the HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally developed in 1985-1995 at a Hewlett Packard laboratory; in 2005 the code was opened under the Apache license and has since been developed further with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.

Tesseract includes a console utility and the libtesseract library for embedding OCR functionality in other applications. Third-party GUI front ends that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are offered: the classic one, which recognizes text at the level of individual character patterns, and a newer one based on a machine learning system built on an LSTM recurrent neural network, optimized for recognizing whole lines and allowing a significant increase in accuracy. Ready-made trained models have been published for 123 languages. To optimize performance, modules are provided that use OpenMP and the SIMD instructions AVX2, AVX, NEON or SSE4.1.
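As a rough illustration of how the console utility might be driven from another program, the sketch below builds a `tesseract` command line for one of the output formats mentioned above. It assumes the `tesseract` binary and the requested language data are installed; the helper names `build_command` and `run_ocr` and the file names are hypothetical and not part of Tesseract.

```python
import shutil
import subprocess

# Config file names shipped with Tesseract that select the output format.
OUTPUT_CONFIGS = {
    "text": "txt",
    "hocr": "hocr",
    "alto": "alto",
    "pdf": "pdf",
    "tsv": "tsv",
}


def build_command(image, output_base, lang="eng", fmt="text"):
    """Return the argv list for one tesseract run (nothing is executed here)."""
    if fmt not in OUTPUT_CONFIGS:
        raise ValueError(f"unsupported output format: {fmt}")
    # General CLI shape: tesseract IMAGE OUTPUTBASE [options...] [configfile...]
    return ["tesseract", image, output_base, "-l", lang, OUTPUT_CONFIGS[fmt]]


def run_ocr(image, output_base, lang="eng", fmt="text"):
    """Run tesseract if it is installed; return False when the binary is missing."""
    if shutil.which("tesseract") is None:
        return False
    subprocess.run(build_command(image, output_base, lang, fmt), check=True)
    return True
```

For example, `run_ocr("scan.png", "result", lang="rus", fmt="hocr")` would ask Tesseract to write an hOCR file next to the given output base, provided the binary and the Russian model are available. Applications that need tighter integration would link against libtesseract instead of spawning a process.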

Main improvements in Tesseract 5.1:

  • Added the ability to handle regions containing images and lines when producing output in the ALTO, hOCR and text formats.
  • Added a new parameter, curl_timeout, for curl_easy_setopt.
  • Improved the build system.
  • Work has been done to remove unused code.
  • Fixed errors caused by incorrect pointer handling in PageIterator::Orientation.
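The new curl_timeout parameter can presumably be set like any other Tesseract variable, through the `-c VAR=VALUE` command-line option. The sketch below only builds such a command line. It assumes a Tesseract build with libcurl support that accepts a URL as the input image, and it assumes the value is a timeout in seconds; the helper name `build_url_command` and the URL are hypothetical.

```python
def build_url_command(url, output_base, timeout_s):
    """Return the argv list for OCR of a remote image with a curl timeout.

    Assumption: timeout_s is interpreted by Tesseract as seconds.
    """
    if timeout_s < 0:
        raise ValueError("timeout must not be negative")
    # -c VAR=VALUE sets a Tesseract parameter for this run only.
    return ["tesseract", url, output_base, "-c", f"curl_timeout={int(timeout_s)}"]


if __name__ == "__main__":
    # Hypothetical URL, printed rather than executed.
    print(" ".join(build_url_command("https://example.com/scan.png", "result", 10)))
```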

Source: opennet.ru
