Release of the Tesseract 4.1 text recognition system

The Tesseract 4.1 optical text recognition system has been released. It supports recognition of UTF-8 characters and text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. Results can be saved as plain text or in HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally developed in 1985-1995 at Hewlett Packard's laboratories; in 2005 the code was opened under the Apache license and has since been developed further with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.
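As a rough illustration of how the output formats listed above are chosen, the sketch below only builds a command line and does not run it. It assumes the usual `tesseract IMAGE OUTPUTBASE [OPTIONS] [CONFIG...]` argument order and that the config names `txt`, `hocr`, `alto`, `pdf` and `tsv` are installed with the language data; the helper name `build_tesseract_command` and the file names are made up for the example.

```python
from typing import List, Sequence

# Output config names assumed to ship with a Tesseract 4.1 installation.
KNOWN_FORMATS = ("txt", "hocr", "alto", "pdf", "tsv")


def build_tesseract_command(image: str, output_base: str,
                            lang: str = "eng",
                            formats: Sequence[str] = ("txt",)) -> List[str]:
    """Return an argv list for the tesseract CLI (hypothetical helper).

    Nothing is executed here; pass the result to subprocess.run() if
    tesseract is actually installed.
    """
    unknown = [f for f in formats if f not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    # Languages are combined with '+', e.g. "rus+ukr".
    return ["tesseract", image, output_base, "-l", lang, *formats]


cmd = build_tesseract_command("page.png", "page",
                              lang="rus+ukr", formats=("hocr", "pdf"))
print(" ".join(cmd))
```

With these arguments Tesseract would be expected to write `page.hocr` and `page.pdf` next to each other, one file per requested format.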

Tesseract includes a console utility and the libtesseract library for embedding OCR functionality into other applications. Third-party GUI front ends that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are provided: the classic one, which recognizes text at the level of individual character patterns, and a new one based on a machine learning system built on an LSTM recurrent neural network, which is optimized for recognizing whole lines and gives a significant increase in accuracy. Ready-made trained models have been published for 123 languages. To improve performance, modules are provided that use OpenMP and the AVX2, AVX or SSE4.1 SIMD instructions.
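The choice between the two engines is made on the command line with the `--oem` option. The sketch below maps the mode numbers, as listed by `tesseract --help-oem`, to readable names; the `EngineMode` enum and `engine_args` helper are illustrative names, not part of Tesseract itself, and the classic engine is assumed to work only with language data that still contains the legacy model.

```python
from enum import IntEnum
from typing import List


class EngineMode(IntEnum):
    """OCR engine modes as numbered by the tesseract CLI."""
    LEGACY = 0           # classic character-pattern engine only
    LSTM = 1             # LSTM neural-network engine only
    LEGACY_AND_LSTM = 2  # both engines combined
    DEFAULT = 3          # whatever the installed language data supports


def engine_args(mode: EngineMode) -> List[str]:
    """Return the CLI arguments that select the given engine (hypothetical helper)."""
    return ["--oem", str(int(mode))]


# Example: ask for the LSTM engine when recognizing a (made-up) scan.
command = ["tesseract", "scan.png", "scan", *engine_args(EngineMode.LSTM)]
print(" ".join(command))
```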

Main improvements in Tesseract 4.1:

  • Added the ability to produce output in the ALTO XML format (Analyzed Layout and Text Object). To use this format, run the utility as "tesseract image_name output_base alto";
  • Added new LSTMBox and WordStrBox renderers, which simplify training of the engine;
  • Added support for pseudographics in hOCR (HTML) output;
  • Added alternative scripts written in Python for training the machine-learning-based engine;
  • Expanded optimizations that use AVX, AVX2 and SSE instructions;
  • OpenMP support is now disabled by default because of performance problems;
  • Added support for white and black lists in the LSTM engine;
  • Updated the CMake-based build scripts.
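To give an idea of what can be done with the new ALTO output, the following sketch reads recognized words and their bounding boxes from an ALTO document using only the Python standard library. The sample XML is hand-written for illustration and is not real Tesseract output; the namespace shown is an assumption, so the parser deliberately ignores namespaces. The `Word` type and `parse_alto_words` function are names invented for this example.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x: int
    y: int
    width: int
    height: int


def _local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def parse_alto_words(xml_text: str) -> List[Word]:
    """Collect every ALTO <String> element as a Word with its box."""
    root = ET.fromstring(xml_text)
    words = []
    for element in root.iter():
        if _local_name(element.tag) != "String":
            continue
        words.append(Word(
            text=element.get("CONTENT", ""),
            # Coordinates may be written as decimals, so go through float.
            x=int(float(element.get("HPOS", "0"))),
            y=int(float(element.get("VPOS", "0"))),
            width=int(float(element.get("WIDTH", "0"))),
            height=int(float(element.get("HEIGHT", "0"))),
        ))
    return words


# Hand-written sample; the namespace URI is an assumption.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="page_0" WIDTH="640" HEIGHT="480">
      <PrintSpace HPOS="0" VPOS="0" WIDTH="640" HEIGHT="480">
        <TextBlock ID="block_0" HPOS="36" VPOS="92" WIDTH="210" HEIGHT="30">
          <TextLine ID="line_0" HPOS="36" VPOS="92" WIDTH="210" HEIGHT="30">
            <String ID="string_0" HPOS="36" VPOS="92" WIDTH="96" HEIGHT="30" CONTENT="Hello"/>
            <SP HPOS="132" VPOS="92" WIDTH="12"/>
            <String ID="string_1" HPOS="144" VPOS="92" WIDTH="102" HEIGHT="30" CONTENT="world"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

words = parse_alto_words(SAMPLE_ALTO)
print(" ".join(word.text for word in words))
```

Matching on the local tag name rather than a fixed namespace keeps the sketch usable if a different ALTO schema version is emitted, at the cost of being less strict than a schema-aware parser.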

Source: opennet.ru
