Another bicycle: storing Unicode strings 30-60% more compactly than UTF-8

If you are a developer faced with choosing a text encoding, Unicode will almost always be the right answer. The specific representation depends on context, but most of the time there is a universal answer here too: UTF-8. Its strength is that it lets you use every Unicode character without spending too many bytes in most cases. Admittedly, for languages that use more than the Latin alphabet, "not too many" means at least two bytes per character. Can we do better without going back to prehistoric encodings that limit us to just 256 available characters?

Below I invite you to look at my attempt to answer this question and to implement a relatively simple algorithm that stores strings in most of the world's languages without the redundancy that UTF-8 carries.

Disclaimer. A few important reservations up front: the solution described here is not offered as a universal replacement for UTF-8. It is suitable only for a narrow set of cases (more on them below), and it should never be used to talk to third-party APIs, which have no idea it exists. For compact storage of large volumes of text, general-purpose compression algorithms (for example, deflate) are usually the right tool. Moreover, while I was already working on my solution, I found an existing standard within Unicode itself that solves the same problem. It is somewhat more complicated (and often performs worse), but it is still an accepted standard rather than something cobbled together in a hurry. I will tell you about it too.

About Unicode and UTF-8

First, a few words about what Unicode and UTF-8 are.

As you know, 8-bit encodings used to be popular. With them everything was simple: 256 characters can be numbered from 0 to 255, and numbers from 0 to 255 obviously fit in one byte. If we go back to the very beginning, ASCII is limited to 7 bits, so the most significant bit of each byte is zero, and most 8-bit encodings are compatible with it (they differ only in the "upper" half, where the most significant bit is one).

How does Unicode differ from those encodings, and why does it come with so many specific representations: UTF-8, UTF-16 (BE and LE), UTF-32? Let's take it in order.

The core Unicode standard describes only the correspondence between characters (and in some cases, individual components of characters) and their numbers. There are a lot of possible numbers in this standard: from 0x00 to 0x10FFFF (21 bits). If we wanted to store a number from that range in a variable, neither 1 nor 2 bytes would be enough. And since our processors are not really designed to work with three-byte numbers, we would be forced to use as many as 4 bytes per character! That is UTF-32, and this wastefulness is exactly why the format is not popular.

Fortunately, the order of characters within Unicode is not random. The whole set is divided into 17 "planes", each containing 65536 (0x10000) "code points". A code point here is simply the number Unicode assigns to a character. But, as mentioned above, Unicode numbers not only whole characters but also their components and service marks (and sometimes nothing corresponds to a number at all, perhaps only for now, but that matters little to us), so strictly speaking it is more correct to talk about the numbers themselves rather than about characters. In what follows, however, for brevity I will often say "character" when I mean "code point".

Figure: Unicode planes. As you can see, most of them (planes 4 through 13) are still unused.

Most notably, the bulk of the useful content lies in plane zero, called the "Basic Multilingual Plane". If a string contains text in one of the modern languages (including Chinese), you will not leave this plane. But you cannot cut off the rest of Unicode either: emoji, for example, are mostly located at the end of the next plane, the "Supplementary Multilingual Plane" (from 0x10000 to 0x1FFFF). So UTF-16 does the following: every character that falls within the Basic Multilingual Plane is encoded as is, with the corresponding two-byte number. However, some numbers in this range do not denote characters at all. Instead they signal that this pair of bytes must be followed by another pair, and by combining the values of these four bytes we get a number that covers the whole valid Unicode range. This scheme is called "surrogate pairs"; you may have heard of them.
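To make the surrogate-pair idea concrete, here is a minimal Python sketch. The function name `to_surrogates` is made up for this illustration; the bases 0xD800 and 0xDC00 are the ones UTF-16 reserves for high and low surrogates. The result is cross-checked against Python's built-in UTF-16 encoder.

```python
def to_surrogates(code_point):
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    v = code_point - 0x10000      # 20 significant bits remain
    high = 0xD800 | (v >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 | (v & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogates(0x1F600)  # an emoji from the supplementary plane
# Python's own encoder (big-endian, no BOM) should produce the same four bytes.
builtin = "\U0001F600".encode("utf-16-be")
print(hex(high), hex(low), builtin.hex())
```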

So UTF-16 needs two or (in very rare cases) four bytes per code point. That is better than always using four bytes, but Latin text (and other ASCII characters) encoded this way wastes half of the space on zeros. UTF-8 is designed to fix this: ASCII still takes a single byte, codes from 0x80 to 0x7FF take two bytes, codes from 0x800 to 0xFFFF take three, and codes from 0x10000 to 0x10FFFF take four. On the one hand, the Latin alphabet is happy: ASCII compatibility is back, and the cost is spread more evenly from 1 to 4 bytes. But alphabets other than Latin, alas, gain nothing compared with UTF-16, and many now need three bytes instead of two: the range covered by the two-byte form has shrunk by a factor of 32, from 0xFFFF to 0x7FF, and neither Chinese nor, say, Georgian fits into it. Cyrillic and five other alphabets got lucky (hurray): 2 bytes per character.
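The byte counts above are easy to check with Python's built-in codecs. The sample words below are arbitrary choices for this sketch, and `bytes_per_char` is a helper name invented here.

```python
samples = {
    "Latin": "hello",
    "Cyrillic": "привет",
    "Georgian": "გამარჯობა",
    "Chinese": "你好",
}


def bytes_per_char(text, encoding):
    """Average number of encoded bytes per code point."""
    return len(text.encode(encoding)) / len(text)


for name, text in samples.items():
    print(
        name,
        bytes_per_char(text, "utf-8"),
        bytes_per_char(text, "utf-16-le"),
        bytes_per_char(text, "utf-32-le"),
    )
```

Georgian and Chinese come out at three bytes per character in UTF-8 but two in UTF-16, which is exactly the regression described above.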

Why does this happen? Let's see how UTF-8 represents character codes:

0xxxxxxx - 1 byte
110xxxxx 10xxxxxx - 2 bytes
1110xxxx 10xxxxxx 10xxxxxx - 3 bytes
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - 4 bytes

The bits marked with x are the ones that actually represent the number. You can see that the two-byte form has only 11 such bits (out of 16). The leading bits serve only an auxiliary purpose. In the four-byte form, 21 bits out of 32 are given to the code point number. It would seem that three bytes (24 bits in total) should be enough, but the service markers eat up too much.

Is this bad? Not really. On the one hand, if we care a lot about space, we have compression algorithms that easily squeeze out all the extra redundancy. On the other hand, the goal of Unicode was to provide the most universal encoding possible. For example, we can hand a UTF-8 string to code that used to work only with ASCII and not worry that it will see an ASCII character that is not really there (in UTF-8, every byte that starts with a zero bit is exactly an ASCII character). And if we suddenly want to cut a small tail off a large string without decoding it from the very beginning (or recover part of the data after a damaged section), it is easy to find the offset where a character starts: it is enough to skip the bytes that have the bit prefix 10.
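The resynchronisation property mentioned above takes only a few lines to demonstrate. This is a small sketch; `next_char_start` is a name chosen for the example, not part of any library.

```python
def next_char_start(buf, pos):
    """Return the first index >= pos at which a UTF-8 character begins."""
    # Continuation bytes always look like 10xxxxxx, so we simply skip them.
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "наука".encode("utf-8")     # every letter here takes two bytes
start = next_char_start(data, 3)   # index 3 falls inside the second letter
tail = data[start:].decode("utf-8")
print(start, tail)
```

Cutting at a raw index 3 would split a character in half; skipping the continuation byte first gives a tail that decodes cleanly.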

So why invent something new?

At the same time, there are occasional situations where compression algorithms such as deflate are a poor fit, yet you still want compact string storage. Personally, I ran into this problem when thinking about building a compressed prefix tree for a large dictionary containing words in arbitrary languages. On the one hand, each word is very short, so compressing it is ineffective. On the other hand, the tree implementation I had in mind was designed so that each byte of a stored string produced a separate tree node, so minimizing the number of bytes was very useful. In my library Az.js (as in pymorphy2, on which it is based) a similar problem is solved simply: the strings packed into the DAWG dictionary are stored in good old CP1251. But, as is easy to see, this works well only for a limited alphabet. A Chinese string cannot be added to such a dictionary.

Separately, I would like to point out one more unpleasant nuance that arises when UTF-8 is used in such a data structure. As the UTF-8 layout shows, when a character is written as two bytes, the bits that make up its number are not contiguous but are separated by the pair of bits 10 in the middle: 110xxxxx 10xxxxxx. Because of this, when the lower 6 bits of the second byte overflow as the character code grows (that is, the transition 10111111 → 10000000 occurs), the first byte changes as well. It turns out that the letter "п" is encoded as the bytes 0xD0 0xBF, while the next letter, "р", is already 0xD1 0x80. In a prefix tree this splits the parent node in two, one for the prefix 0xD0 and another for 0xD1 (even though the whole Cyrillic alphabet could be encoded by the second byte alone).
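The split is easy to observe directly. The snippet below is only an illustration using Python's built-in UTF-8 codec; the alphabet string is the lowercase Russian alphabet without "ё".

```python
p = "п".encode("utf-8")   # U+043F
r = "р".encode("utf-8")   # U+0440, the very next code point
print(p.hex(), r.hex())

# Collect the distinct first bytes over the lowercase Russian alphabet:
# each distinct value is a separate branch under the root of a byte-wise trie.
alphabet = "абвгдежзийклмнопрстуфхцчшщъыьэюя"
lead_bytes = {ch.encode("utf-8")[0] for ch in alphabet}
print(sorted(hex(b) for b in lead_bytes))
```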

What I came up with

Faced with this problem, I decided to practice some bit twiddling and, at the same time, get better acquainted with the structure of Unicode as a whole. The result is the UTF-C encoding format ("C" for compact), which spends no more than 3 bytes per code point and very often gets away with just one extra byte for the whole encoded string. As a result, for many non-ASCII alphabets this encoding turns out to be 30-60% more compact than UTF-8.

I have published example implementations of the encoding and decoding algorithms as JavaScript and Go libraries, and you are free to use them in your code. But I will stress once more that in a sense this format remains a "bicycle" (a reinvented wheel), and I do not recommend using it without understanding why you need it. It is still more of an experiment than a serious "improvement on UTF-8". Nevertheless, the code is written neatly and concisely, with plenty of comments and test coverage.

Figure: Test results and comparison with UTF-8

I also made a demo page where you can evaluate how the algorithm performs. Next, I will tell you more about its principles and how it was developed.

Eliminating redundant bits

I took UTF-8 as the basis, of course. The first and most obvious thing to change is to reduce the number of service bits in each byte. For example, the first byte in UTF-8 always starts with either 0 or 11; the prefix 10 appears only on the following bytes. Let's replace the prefix 11 with 1, and drop the prefixes from the following bytes entirely. What do we get?

0xxxxxxx - 1 byte
10xxxxxx xxxxxxxx - 2 bytes
110xxxxx xxxxxxxx xxxxxxxx - 3 bytes

Wait, where is the four-byte form? It is no longer needed: with three bytes we now have 21 bits available, which is enough, with room to spare, for every number up to 0x10FFFF.
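Here is a minimal Python sketch of this intermediate scheme, before any state is introduced. It is not the final UTF-C format, and the names `encode_cp` and `decode_all` are invented for the example.

```python
def encode_cp(cp):
    """Encode one code point with the 0 / 10 / 110 prefixes shown above."""
    if cp < 0x80:                        # 7 payload bits
        return bytes([cp])
    if cp < 0x4000:                      # 14 payload bits
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    if cp <= 0x10FFFF:                   # 21 payload bits
        return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    raise ValueError("not a Unicode code point")


def decode_all(buf):
    """Decode a buffer; only the first byte of each record carries a prefix."""
    out, i = [], 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80:                     # 0xxxxxxx
            out.append(b)
            i += 1
        elif b < 0xC0:                   # 10xxxxxx xxxxxxxx
            out.append(((b & 0x3F) << 8) | buf[i + 1])
            i += 2
        else:                            # 110xxxxx xxxxxxxx xxxxxxxx
            out.append(((b & 0x1F) << 16) | (buf[i + 1] << 8) | buf[i + 2])
            i += 3
    return out


text = "Hi, привет, გამარჯობა, 你好 \U0001F600"
packed = b"".join(encode_cp(ord(ch)) for ch in text)
print(len(packed), len(text.encode("utf-8")))
```

Note that the decoder has to walk the buffer from the start: a byte in the middle of a record gives no hint about where the record began.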

What have we sacrificed here? Most importantly, the ability to detect character boundaries at an arbitrary place in the buffer. We can no longer point at an arbitrary byte and find the start of the next character from there. This is a limitation of our format, but in practice it is rarely needed. Usually we can afford to walk the buffer from the very beginning (especially when the strings are short).

Coverage of languages with 2 bytes has also improved: the two-byte form now gives a 14-bit range, which means codes up to 0x3FFF. The Chinese are out of luck (their characters mostly range from 0x4E00 to 0x9FFF), but Georgians and many other peoples have it better: their languages now also fit into 2 bytes per character.

Introducing encoder state

Now let's think about the properties of the strings themselves. A dictionary usually contains words written in characters of a single alphabet, and the same is true of many other texts. It would be nice to specify this alphabet once and then give only the number of the letter inside it. Let's see whether the layout of characters in the Unicode table helps us here.

As mentioned above, Unicode is divided into planes of 65536 codes each. But this is not a very useful division (as already said, most of the time we are in plane zero). The division into blocks is more interesting. These ranges no longer have a fixed length and are more meaningful: as a rule, each one groups characters of a single alphabet.

Figure: The block containing the characters of the Bengali alphabet. Unfortunately, for historical reasons, this is an example of rather loose packing: 96 characters are scattered chaotically across the block's 128 code points.

Block starts and block sizes are always multiples of 16, simply for convenience. In addition, many blocks start and end at values that are multiples of 128 or even 256. For example, the basic Cyrillic block takes up 256 code points, from 0x0400 to 0x04FF. This is quite convenient: if we store the prefix 0x04 once, any Cyrillic character can be written in a single byte. True, this way we lose the ability to get back to ASCII (or to any other characters at all). So we do the following:

  1. Two bytes 10yyyyyy yxxxxxxx not only denote the character with the number yyyyyy yxxxxxxx, but also change the current alphabet to yyyyyy y0000000 (that is, we remember all bits except the lowest 7);
  2. One byte 0xxxxxxx is a character of the current alphabet. It just needs to be added to the offset we remembered in step 1. As long as we have not changed the alphabet, the offset is zero, so we have kept compatibility with ASCII.

The same goes for codes that need 3 bytes:

  1. Three bytes 110yyyyy yxxxxxxx xxxxxxxx denote the character with the number yyyyy yxxxxxxx xxxxxxxx, change the current alphabet to yyyyy y0000000 00000000 (we remember everything except the lowest 15 bits), and set a flag saying that we are now in long mode (when we switch the alphabet back with a two-byte record, we reset this flag);
  2. Two bytes 0xxxxxxx xxxxxxxx in long mode are a character of the current alphabet. As before, we add it to the offset from step 1. The only difference is that we now read two bytes (because we have switched to this mode).

Sounds good: as long as we need to encode characters from the same 7-bit Unicode range, we spend 1 extra byte at the start and a total of one byte per character.
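The rules above can be sketched in Python as follows. This is an illustration of the intermediate stateful scheme only, not the finished UTF-C format; the names `encode`, `decode`, `offs` and `long_mode` are chosen for the example, and the encoder's policy (reuse the current alphabet whenever possible, otherwise switch) is my own assumption.

```python
def encode(text):
    offs, long_mode = 0, False           # current alphabet and long-mode flag
    out = bytearray()
    for cp in map(ord, text):
        if not long_mode and (cp & ~0x7F) == offs:
            out.append(cp & 0x7F)                          # 0xxxxxxx
        elif long_mode and (cp & ~0x7FFF) == offs:
            out += bytes([(cp >> 8) & 0x7F, cp & 0xFF])    # 0xxxxxxx xxxxxxxx
        elif cp < 0x4000:
            out += bytes([0x80 | (cp >> 8), cp & 0xFF])    # 10yyyyyy yxxxxxxx
            offs, long_mode = cp & ~0x7F, False
        else:                                              # 110yyyyy yxxxxxxx xxxxxxxx
            out += bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
            offs, long_mode = cp & ~0x7FFF, True
    return bytes(out)


def decode(buf):
    offs, long_mode = 0, False
    out, i = [], 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80 and not long_mode:   # one byte inside the current alphabet
            cp = offs + b
            i += 1
        elif b < 0x80:                   # two bytes inside the current alphabet
            cp = offs + ((b << 8) | buf[i + 1])
            i += 2
        elif b < 0xC0:                   # two-byte record that switches alphabet
            cp = ((b & 0x3F) << 8) | buf[i + 1]
            offs, long_mode = cp & ~0x7F, False
            i += 2
        else:                            # three-byte record, enters long mode
            cp = ((b & 0x1F) << 16) | (buf[i + 1] << 8) | buf[i + 2]
            offs, long_mode = cp & ~0x7FFF, True
            i += 3
        out.append(chr(cp))
    return "".join(out)


word = "привет"
print(len(encode(word)), len(word.encode("utf-8")))
```

For a six-letter Cyrillic word this gives one two-byte record followed by five single bytes, which matches the "one extra byte per string" estimate in the text.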

Figure: Results from one of the earlier versions. It already beats UTF-8 in many cases, but there is still room for improvement.

What got worse? First, we now have state, namely the current alphabet offset and the long-mode flag. This limits us further: the same characters can now be encoded differently in different contexts. Substring search, for example, has to take this into account rather than simply compare bytes. Second, as soon as we change the alphabet, encoding ASCII characters becomes expensive (and these are not only Latin letters but also basic punctuation, including spaces): they require switching the alphabet back to 0, which costs an extra byte again (and then one more to return to our main alphabet).

One alphabet is good, two are better

Let's try to change our bit prefixes a little and squeeze one more in alongside the three described above:

0xxxxxxx - 1 byte in normal mode, 2 in long mode
11xxxxxx - 1 byte
100xxxxx xxxxxxxx - 2 bytes
101xxxxx xxxxxxxx xxxxxxxx - 3 bytes

The two-byte record now has one bit fewer available, so it covers code points up to 0x1FFF rather than 0x3FFF. However, that is still noticeably more than the two-byte UTF-8 form, and most common languages still fit. The most noticeable loss is hiragana and katakana, so the Japanese are sad.

What is this new code 11xxxxxx? It is a small "stash" of 64 characters that complements our main alphabet, so I called it the auxiliary alphabet. When we switch the current alphabet, a piece of the old alphabet becomes the auxiliary one. For example, when we switch from ASCII to Cyrillic, the stash holds 64 characters comprising Latin letters, digits, the space and the comma (the most frequent insertions in non-ASCII texts). Switch back to ASCII, and the main part of the Cyrillic alphabet becomes the auxiliary alphabet.

With access to two alphabets, we can handle a large number of texts at minimal alphabet-switching cost (punctuation will most often cause a return to ASCII, but after that we can take many non-ASCII characters from the auxiliary alphabet without switching again).
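As a small illustration of the four prefixes at this stage of the design, the sketch below reads the record length off the first byte. The function name `record_length` is made up for the example, and the later refinements described in the article are deliberately left out.

```python
def record_length(first_byte, long_mode=False):
    """Number of bytes in a record, judged by its first byte alone."""
    if first_byte < 0x80:      # 0xxxxxxx: character of the current alphabet
        return 2 if long_mode else 1
    if first_byte >= 0xC0:     # 11xxxxxx: character of the auxiliary alphabet
        return 1
    if first_byte < 0xA0:      # 100xxxxx xxxxxxxx
        return 2
    return 3                   # 101xxxxx xxxxxxxx xxxxxxxx


# Largest code point that fits into the 13 payload bits of a two-byte record.
max_two_byte = (0x1F << 8) | 0xFF
print(hex(max_two_byte))
```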

Bonus: by giving the auxiliary alphabet the prefix 11xxxxxx and choosing 0xC0 as its initial offset, we get partial compatibility with CP1252. In other words, many (but not all) Western European texts encoded in CP1252 will look exactly the same in UTF-C.

Here, however, a difficulty arises: how do we derive the auxiliary alphabet from the main one? We could keep the same offset, but, alas, here the structure of Unicode works against us. Very often the main part of an alphabet is not at the start of its block (for example, the Russian capital "А" has the code 0x0410, although the Cyrillic block starts at 0x0400). So by taking the first 64 characters into the stash, we might lose access to the tail of the alphabet.

To fix this, I went through some of the blocks corresponding to different languages by hand and specified, for each, the offset of the auxiliary alphabet within the main one. The Latin alphabet, as an exception, was reordered altogether, along the lines of base64.

Final touches

Finally, let's think about where else something can be improved.

Note that the format 101xxxxx xxxxxxxx xxxxxxxx can encode numbers up to 0x1FFFFF, while Unicode ends earlier, at 0x10FFFF. In other words, the last code point is represented as 10110000 11111111 11111111. So we can say that if the first byte has the form 1011xxxx (where xxxx is greater than 0), it means something else. For example, we could add 15 more characters there that are always available in a single byte, but I decided to do it differently.
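The claim about spare first bytes can be checked with a few lines of Python. The helper `three_byte_record` is a name invented for this sketch; it only packs 21 bits behind the 101 prefix as described above.

```python
def three_byte_record(cp):
    """101xxxxx xxxxxxxx xxxxxxxx: 21 payload bits behind a 3-bit prefix."""
    return bytes([0xA0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])


last = three_byte_record(0x10FFFF)       # the highest Unicode code point
# First bytes of the form 1011xxxx that no valid code point can ever produce.
spare = [b for b in range(0xB0, 0xC0) if b > last[0]]
print(format(last[0], "08b"), len(spare))
```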

Let's look at the Unicode blocks that currently need three bytes. Mostly, as already mentioned, these are Chinese characters, but it is hard to do anything about them: there are 21 thousand of them. Hiragana and katakana have ended up there too, though, and there are not that many of them, fewer than two hundred. And, since we are talking about the Japanese, there are also emoji (in fact, they are scattered across many places in Unicode, but the main blocks lie in the range 0x1F300 - 0x1FBFF). If you consider that there are now emoji assembled from several code points at once (for example, one such emoji consists of as many as 7 code points!), spending three bytes on each of them becomes a real shame (7×3 = 21 bytes for a single icon, a nightmare).

So we pick a few selected ranges corresponding to emoji, hiragana and katakana, renumber them into one continuous list, and encode them in two bytes instead of three:

1011xxxx xxxxxxxx

Great: the emoji mentioned above, which consists of 7 code points, takes 25 bytes in UTF-8, and we fit it into 14 (exactly two bytes per code point). By the way, Habr refused to digest it (in both the old and the new editor), so I had to insert it as an image.

Let's try to fix one more problem. As we remember, the main alphabet is essentially the high 6 bits, which we keep in mind and glue onto the code of each subsequent decoded character. For Chinese characters, which live in the block 0x4E00 - 0x9FFF, this value is either 0 or 1. That is not very convenient: we would have to keep switching the alphabet between these two values (that is, spend three bytes). But note that in long mode we can subtract from the code itself the number of characters that we encode in short mode (after all the tricks described above, this is 10240). The range of Chinese characters then shifts to 0x2600 - 0x77FF, and across this whole range the most significant 6 bits (out of 21) are equal to 0. So sequences of Chinese characters use two bytes per character (which is optimal for such a large range) without causing any alphabet switches.
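The arithmetic behind this shift can be verified directly. The constant 10240 is the figure quoted in the text; the other names are invented for this sketch.

```python
SHORT_MODE_CODES = 10240                # figure quoted in the text (0x2800)
HAN_FIRST, HAN_LAST = 0x4E00, 0x9FFF    # block of Chinese characters


def high_bits(value):
    """The top 6 bits of a 21-bit number, i.e. the long-mode alphabet."""
    return value >> 15


before = {high_bits(cp) for cp in range(HAN_FIRST, HAN_LAST + 1)}
after = {high_bits(cp - SHORT_MODE_CODES) for cp in range(HAN_FIRST, HAN_LAST + 1)}
print(before, after)
print(hex(HAN_FIRST - SHORT_MODE_CODES), hex(HAN_LAST - SHORT_MODE_CODES))
```

Before the shift the block straddles two alphabets; after it, every character shares the same one.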

Alternative solutions: SCSU, BOCU-1

Unicode experts, having read no further than the title of this article, will most likely hurry to remind you that the Unicode standards themselves include the Standard Compression Scheme for Unicode (SCSU), which describes an encoding method very similar to the one described here.

I honestly admit it: I learned of its existence only when I was already deep into writing my own solution. Had I known about it from the start, I would probably have tried to write an implementation of it instead of coming up with my own approach.

Interestingly, SCSU uses ideas very similar to the ones I arrived at on my own (instead of "alphabets" it has "windows", and there are more of them available than I have). At the same time, this format has its drawbacks too: it is somewhat closer to a compression algorithm than to an encoding. In particular, the standard offers many ways to represent the same text but does not say how to choose the optimal one; for that, the encoder has to use some kind of heuristics. So an SCSU encoder that produces good packing will be more complex and more cumbersome than my algorithm.

For comparison, I ported a relatively simple SCSU implementation to JavaScript. In terms of code size it turned out comparable to my UTF-C, but in some cases its output was tens of percent worse (sometimes it may come out ahead, but not by much). For example, texts in Hebrew and Greek were encoded by UTF-C 60% better than by SCSU (probably thanks to their compact alphabets).

Separately, I will add that besides SCSU there is another way to represent Unicode compactly, BOCU-1, but it aims at MIME compatibility (which I did not need) and takes a somewhat different approach to encoding. I have not evaluated its efficiency, but it seems unlikely to me that it is higher than that of SCSU.

Possible improvements

The algorithm I have presented is not universal by design (this is probably where my goals diverge most from those of the Unicode Consortium). I have already mentioned that it was developed primarily for one task (storing a multilingual dictionary in a prefix tree), and some of its features may be a poor fit for other tasks. But the fact that it is not a standard can be an advantage: you can easily modify it to suit your needs.

For example, there is an obvious way to get rid of the state and make the encoding stateless: simply do not update the variables offs, auxOffs and is21Bit in the encoder and decoder. In this case it will no longer be possible to pack sequences of characters from the same alphabet efficiently, but you get a guarantee that the same character is always encoded with the same bytes, regardless of context.

In addition, you can tailor the encoder to a specific language by changing the default state. For example, if you are targeting Russian texts, set offs = 0x0400 and auxOffs = 0 at the start in both the encoder and the decoder. This makes particular sense in stateless mode. Overall, it is similar to using an old eight-bit encoding, but without giving up the ability to insert characters from all of Unicode as needed.

Another drawback mentioned earlier is that in a large text encoded in UTF-C there is no quick way to find the character boundary nearest to an arbitrary byte. If you cut off, say, the last 100 bytes of an encoded buffer, you risk getting garbage that you cannot do anything with. The encoding is not designed for storing multi-gigabyte logs, but in general this can be fixed. The byte 0xBF must never appear as the first byte of a record (though it can be the second or third). So when encoding, you can insert the sequence 0xBF 0xBF 0xBF every, say, 10 KB. Then, if you need to find a boundary, it is enough to scan the chosen piece until such a marker is found. Whatever follows the last 0xBF is guaranteed to be the start of a character. (When decoding, this three-byte sequence will, of course, have to be ignored.)
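A possible way to scan for such a marker is sketched below in Python. This is my own illustration of the idea, not code from the UTF-C libraries; the names `MARKER` and `find_sync` are invented, and the sketch relies on the stated property that 0xBF never starts a record.

```python
MARKER = b"\xBF\xBF\xBF"


def find_sync(buf, pos=0):
    """Index of the first character start after a marker found at or after pos.

    Returns None when the scanned piece contains no marker.
    """
    i = buf.find(MARKER, pos)
    if i < 0:
        return None
    i += len(MARKER)
    # The record just before the marker may itself end in one or two 0xBF
    # bytes, so the run can be longer than three: skip to the end of the run.
    while i < len(buf) and buf[i] == 0xBF:
        i += 1
    return i


# A three-byte record ending in 0xBF 0xBF, then the marker, then two ASCII bytes.
piece = b"\x41\xA1\xBF\xBF" + MARKER + b"\x42\x43"
print(find_sync(piece, 1))
```

Because a record contributes at most two trailing 0xBF bytes, any run of three or more must contain a marker, and the byte after the run starts a new record.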

Summing up

If you have read this far, congratulations! I hope that you, like me, have learned something new (or refreshed your memory) about the structure of Unicode.

Figure: The demo page. The Hebrew example shows the advantage over both UTF-8 and SCSU.

The research described above should not be seen as an attempt to encroach on standards. Still, I am generally satisfied with the results of my work, so I am happy to share them: for example, the minified JS library weighs only 1710 bytes (and has no dependencies, of course). As I mentioned above, you can see it in action on the demo page (which also has a set of texts for comparing it with UTF-8 and SCSU).

Finally, let me once again draw attention to the cases where UTF-C is not worth using:

  • If your strings are long enough (from 100-200 characters). In this case you should consider compression algorithms such as deflate.
  • If you need ASCII transparency, that is, if it matters to you that the encoded sequences contain no ASCII codes that were not in the source string. You can avoid this requirement if, when interacting with third-party APIs (for example, a database), you pass the encoding result as an opaque set of bytes rather than as strings. Otherwise you risk introducing unexpected vulnerabilities.
  • If you want to be able to find character boundaries quickly at an arbitrary offset (for example, when part of a string is damaged). This can be done, but only by scanning the string from the beginning (or by applying the modification described in the previous section).
  • If you need to perform operations on the contents of strings quickly (sort them, search for substrings in them, concatenate them). This requires decoding the strings first, so UTF-C will be slower than UTF-8 in these cases (but faster than compression algorithms). Since the same string is always encoded the same way, an exact equality comparison does not require decoding and can be done byte by byte.

Update: user Tyomitch, in the comments below, posted a graph highlighting the limits of UTF-C's applicability. It shows that UTF-C is more efficient than a general-purpose compression algorithm (a variation of LZW) as long as the packed string is shorter than about 140 characters (I note, however, that the comparison was made on a single text; for other languages the result may differ).

Source: www.habr.com
