Release of the Tesseract 4.1 text recognition system

The Tesseract 4.1 optical character recognition system has been released. It supports recognition of UTF-8 characters and text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. Results can be saved as plain text or in the HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally developed in 1985-1995 at the Hewlett Packard laboratory; in 2005 the code was opened under the Apache license, and development has continued since then with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.

Tesseract includes a console utility and the libtesseract library for embedding OCR functionality in other applications. Third-party GUI front ends that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are offered: the classic one, which recognizes text at the level of individual character patterns, and a newer one based on a machine learning system built on an LSTM recurrent neural network, which is optimized for recognizing whole lines and gives a significant increase in accuracy. Ready-made trained models are published for 123 languages. To improve performance, modules using OpenMP and the AVX2, AVX or SSE4.1 SIMD instructions are provided.
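
As an illustration of how the console utility can be driven from another program, here is a minimal Python sketch that assembles a command line choosing between the two engines. It relies on the utility's `-l` and `--oem` options (0 for the classic engine, 1 for the LSTM engine) and on output formats being requested by config names such as `hocr` or `pdf` after the output base. The function names `build_command` and `run_ocr` and the file names are hypothetical, chosen only for this example.

```python
import shutil
import subprocess

# Engine modes accepted by the --oem option of the tesseract console utility.
OEM_LEGACY = 0  # classic engine working on individual character patterns
OEM_LSTM = 1    # LSTM neural-network engine working on whole lines


def build_command(image, output_base, lang="eng", oem=OEM_LSTM, formats=("txt",)):
    """Return the argument list for one tesseract run (nothing is executed)."""
    cmd = ["tesseract", str(image), str(output_base), "-l", lang, "--oem", str(oem)]
    # Output formats are requested by listing config names after the options.
    cmd.extend(formats)
    return cmd


def run_ocr(image, output_base, **kwargs):
    """Run tesseract if it is installed; return None when the binary is missing."""
    if shutil.which("tesseract") is None:
        return None
    return subprocess.run(
        build_command(image, output_base, **kwargs),
        capture_output=True,
        text=True,
        check=False,
    )


# Hypothetical file names, used only to show the resulting command line.
print(" ".join(build_command("page.png", "out/page", lang="rus", formats=("hocr", "pdf"))))
```

Keeping command construction separate from execution makes the wrapper easy to check on a machine where Tesseract is not installed.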

Main improvements in Tesseract 4.1:

  • Added the ability to output in the ALTO XML format (Analyzed Layout and Text Object). To use this format, run the application as “tesseract image_name output_base alto”;
  • Added the new LSTMBox and WordStrBox output renderers, which simplify engine training;
  • Added support for pseudographics in hOCR (HTML) output;
  • Added alternative scripts written in Python for training the machine-learning-based engine;
  • Expanded optimizations using the AVX, AVX2 and SSE instructions;
  • OpenMP support is disabled by default because of performance problems;
  • Added support for whitelists and blacklists in the LSTM engine;
  • Improved the CMake-based build scripts.
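
To give an idea of what can be done with ALTO output, the following Python sketch pulls recognized words and their bounding boxes out of an ALTO document. The sample XML is hand-written for this example and is not real Tesseract output; the namespace URI in it is an assumption, so the code matches elements by local name only. It relies on the `String` element and its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes from the ALTO schema. The helper names `local_name` and `extract_words` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like an ALTO document; not real Tesseract output.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page WIDTH="640" HEIGHT="480" PHYSICAL_IMG_NR="0" ID="page_0">
      <PrintSpace HPOS="0" VPOS="0" WIDTH="640" HEIGHT="480">
        <TextBlock ID="block_0" HPOS="36" VPOS="92" WIDTH="220" HEIGHT="24">
          <TextLine ID="line_0" HPOS="36" VPOS="92" WIDTH="220" HEIGHT="24">
            <String ID="string_0" HPOS="36" VPOS="92" WIDTH="96" HEIGHT="24" CONTENT="Hello"/>
            <SP WIDTH="12" VPOS="92" HPOS="132"/>
            <String ID="string_1" HPOS="144" VPOS="92" WIDTH="112" HEIGHT="24" CONTENT="world"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml):
    """Return (text, hpos, vpos, width, height) tuples in document order."""
    root = ET.fromstring(alto_xml)
    words = []
    for elem in root.iter():
        if local_name(elem.tag) != "String":
            continue
        # ALTO coordinates may be written as floats; truncate them to integers.
        box = tuple(int(float(elem.get(name, "0")))
                    for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        words.append((elem.get("CONTENT", ""),) + box)
    return words


for word in extract_words(SAMPLE_ALTO):
    print(word)
```

Matching on the local tag name keeps the sketch independent of which ALTO schema version a given file declares.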

Source: opennet.ru
