Release of the Tesseract 5.3.4 text recognition system

Version 5.3.4 of the Tesseract optical text recognition system has been released. It supports recognition of UTF-8 characters and of text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. Results can be saved as plain text or in the HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally created in 1985-1995 at the Hewlett Packard laboratories; in 2005 the code was opened under the Apache license and was then developed further with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.
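
As a minimal sketch of how the output formats listed above map onto the command-line utility, the following Python helper only assembles an argument list; it does not run Tesseract. The function name `build_ocr_command` and the file names are hypothetical. The format names `txt`, `hocr`, `alto`, `pdf` and `tsv` are the config names the `tesseract` command accepts after the output base, and `-l` selects the languages.

```python
from typing import List, Sequence

# Output config names accepted by the tesseract CLI after the output base.
KNOWN_FORMATS = ("txt", "hocr", "alto", "pdf", "tsv")


def build_ocr_command(image: str, output_base: str,
                      languages: Sequence[str] = ("eng",),
                      formats: Sequence[str] = ("txt",)) -> List[str]:
    """Assemble (but do not run) a tesseract command line.

    Hypothetical helper, for illustration only.
    """
    if not languages:
        raise ValueError("at least one language is required")
    unknown = [f for f in formats if f not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    # Several languages are joined with '+', for example "rus+ukr".
    return ["tesseract", image, output_base,
            "-l", "+".join(languages), *formats]
```

With Tesseract and the relevant language data installed, the returned list could be passed to `subprocess.run`; for example, `build_ocr_command("scan.png", "scan", ["rus", "ukr"], ["hocr", "pdf"])` describes a run that should write `scan.hocr` and `scan.pdf`.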

Tesseract includes a console utility and the libtesseract library for embedding OCR functionality into other applications. Third-party GUIs that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are offered: a classic one that recognizes text at the level of individual character patterns, and a newer one based on a machine learning system built on an LSTM recurrent neural network, which is optimized for recognizing whole lines and gives a significant increase in accuracy. Trained models have been published for 123 languages. To improve performance, modules are provided that use OpenMP and the SIMD instruction sets AVX2, AVX, AVX512F, NEON or SSE4.1.
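
The choice between the two engines is exposed on the command line through the `--oem` (OCR engine mode) option. The sketch below only builds the argument list; the helper name `engine_command` and the mode labels in `OEM_MODES` are hypothetical, while the numeric values are the ones the `tesseract` command documents. It is assumed here that the installed language data contains a model for the requested engine; the classic engine in particular needs traineddata files that include the legacy model.

```python
from typing import List

# Values of the tesseract CLI option --oem (OCR engine mode).
OEM_MODES = {
    "legacy": 0,   # classic engine working on individual character patterns
    "lstm": 1,     # LSTM neural network engine working on whole lines
    "both": 2,     # legacy and LSTM engines combined
    "default": 3,  # let tesseract pick based on what is available
}


def engine_command(image: str, output_base: str,
                   engine: str = "lstm") -> List[str]:
    """Assemble (but do not run) a tesseract call for a given engine.

    Hypothetical helper, for illustration only.
    """
    if engine not in OEM_MODES:
        raise ValueError(f"unknown engine: {engine!r}")
    return ["tesseract", image, output_base,
            "--oem", str(OEM_MODES[engine])]
```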

Main improvements:

  • Improved loading of images by URL and downloading of files using the libcurl library. A User-Agent header is now set on requests. A new curl_cookiefile parameter has been added for using a cookie file.
  • The ScrollView server uses TCP as its preferred protocol.
  • When the "combine_tessdata -d" command is used, output is written to stdout instead of stderr.
  • Build problems when using autoconf and clang have been fixed.
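
Two of the changes above can be sketched from a script's point of view. The helper names, the URL and the file names below are hypothetical. The first helper assumes a Tesseract build with libcurl support and assumes that `curl_cookiefile` can be set like other configuration variables through the `-c` option. The second relies on the listing from `combine_tessdata -d` arriving on stdout, as in this release; with earlier versions the captured string would presumably be empty because the listing went to stderr.

```python
import subprocess
from typing import Callable, List


def url_ocr_command(image_url: str, output_base: str,
                    cookie_file: str) -> List[str]:
    """Assemble (but do not run) a tesseract call that reads an image by URL.

    Assumption: curl_cookiefile is settable via '-c name=value'.
    """
    return ["tesseract", image_url, output_base,
            "-c", f"curl_cookiefile={cookie_file}"]


def read_traineddata_listing(traineddata: str,
                             run: Callable = subprocess.run) -> str:
    """Return what 'combine_tessdata -d' prints on stdout.

    The 'run' parameter exists only so the call can be replaced in tests.
    """
    cmd = ["combine_tessdata", "-d", traineddata]
    result = run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Keeping the subprocess call injectable means the helper can be exercised without Tesseract installed; in normal use `read_traineddata_listing("eng.traineddata")` would invoke the real tool.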

Source: opennet.ru
