If you are a developer and you face the task of choosing an encoding, Unicode will almost always be the right answer. The specific representation depends on the context, but most of the time there is a universal answer here too: UTF-8. Its advantage is that it lets you use every Unicode character without spending too many bytes in most cases. True, for languages that use more than just the Latin alphabet, "not too many" means at least two bytes per character. Can we do better without going back to prehistoric encodings that limit us to just 256 available characters?
Below I invite you to look at my attempt to answer this question, and to implement a relatively simple algorithm that stores strings in most of the world's languages without the redundancy that UTF-8 adds.
Disclaimer. I will make a few important reservations right away: the solution described here is not offered as a universal replacement for UTF-8. It is suitable only for a narrow list of cases (more on them below), and it should never be used to talk to third-party APIs (which know nothing about it). In most cases, general-purpose compression algorithms (for example, deflate) are the right tool for compact storage of large volumes of text. In addition, while I was already working on my solution, I found an existing standard within Unicode itself that solves the same problem. It is somewhat more complicated (and often performs worse), but it is still an accepted standard rather than something thrown together on my knee. I will tell you about it too.
About Unicode and UTF-8
First, a few words about what Unicode and UTF-8 are.
As you know, 8-bit encodings used to be popular. With them everything was simple: 256 characters can be numbered from 0 to 255, and numbers from 0 to 255 obviously fit in one byte. Going back to the very beginning, ASCII is limited to 7 bits, so the most significant bit of its byte representation is zero, and most 8-bit encodings are compatible with it (they differ only in the "upper" half, where the most significant bit is one).
How does Unicode differ from those encodings, and why are there so many specific representations associated with it: UTF-8, UTF-16 (BE and LE), UTF-32? Let's go through them in order.
The basic Unicode standard describes only the correspondence between characters (and in some cases, individual components of characters) and their numbers. There are a lot of possible numbers in this standard: from 0x00 to 0x10FFFF (1,114,112 of them). If we wanted to put a number from that range into a variable, neither 1 nor 2 bytes would be enough. And since our processors are not really designed to work with three-byte numbers, we would be forced to use as many as 4 bytes per character! This is UTF-32, and it is precisely because of this "wastefulness" that the format is not popular.
Fortunately, the order of characters within Unicode is not random. The whole set is divided into 17 "planes", each containing 65536 (0x10000) "code points". A "code point" here is simply the character number that Unicode assigns. But, as mentioned above, Unicode numbers not only individual characters but also their components and service marks (and sometimes nothing at all corresponds to a number, perhaps only for now, but that does not matter much for us), so it is more correct to always talk about the numbers themselves rather than about characters. In what follows, however, for brevity I will often use the word "character" to mean "code point".
Unicode planes. As you can see, most of them (planes 4 to 13) are still unused.
Most notably, all the main "pulp" lies in plane zero, called the "Basic Multilingual Plane". If a string contains text in one of the modern languages (including Chinese), you will not leave this plane. But you cannot cut off the rest of Unicode either: for example, emoji are mostly located at the end of the next plane, the "Supplementary Multilingual Plane" (it spans 0x10000 to 0x1FFFF). So UTF-16 does the following: all characters that fall within the Basic Multilingual Plane are encoded "as is", with the corresponding two-byte number. However, some of the numbers in this range do not denote specific characters at all. Instead, they signal that after this pair of bytes we must consider one more pair, and by combining the values of these four bytes we get a number that covers the whole valid Unicode range. This representation is called "surrogate pairs"; you may have heard of them.
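To make the surrogate-pair idea concrete, here is a minimal Python sketch of how a code point above 0xFFFF is split into two 16-bit units and put back together. The function names are mine and purely illustrative; the constants 0xD800 and 0xDC00 are the standard UTF-16 surrogate bases.

```python
def to_surrogate_pair(cp):
    """Split a code point above 0xFFFF into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    v = cp - 0x10000                # 20 significant bits remain
    high = 0xD800 | (v >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 | (v & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")
```

The result can be cross-checked against Python's own `"\U0001F600".encode("utf-16-be")`.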
So UTF-16 needs two or (in very rare cases) four bytes per code point. This is better than using four bytes all the time, but Latin (and other ASCII characters) encoded this way waste half the space on zeros. UTF-8 is designed to fix this: ASCII takes, as before, only one byte; codes from 0x80 to 0x7FF take two bytes; from 0x800 to 0xFFFF, three; and from 0x10000 to 0x10FFFF, four. On the one hand, the Latin alphabet is well off: compatibility with ASCII is back, and the distribution is "spread" more evenly from 1 to 4 bytes. But alphabets other than Latin, alas, gain nothing compared with UTF-16, and many now need three bytes instead of two: the range covered by a two-byte record has shrunk 32 times, from 0xFFFF to 0x7FF, and neither Chinese nor, for example, Georgian fits into it. Cyrillic and five other alphabets are the lucky ones - hurray - with 2 bytes per character.
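The sizes quoted above are easy to check with Python's built-in codecs. The sample characters below are my own picks for illustration, one per range.

```python
# One sample character per range discussed above (illustrative choices).
samples = {
    "Latin a (U+0061)": "\u0061",
    "Cyrillic ya (U+044F)": "\u044f",
    "Georgian an (U+10D0)": "\u10d0",
    "Chinese zhong (U+4E2D)": "\u4e2d",
    "Emoji (U+1F600)": "\U0001F600",
}


def byte_lengths(ch):
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for name, ch in samples.items():
    u8, u16 = byte_lengths(ch)
    print(f"{name}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

Georgian and Chinese are the cases where UTF-8 loses to UTF-16 (three bytes against two), while Cyrillic breaks even.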
Why does this happen? Let's see how UTF-8 represents character codes:
0xxxxxxx - 1 byte
110xxxxx 10xxxxxx - 2 bytes
1110xxxx 10xxxxxx 10xxxxxx - 3 bytes
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - 4 bytes
The bits marked with x are the ones that actually carry the number. You can see that a two-byte record has only 11 such bits (out of 16). The leading bits play a purely auxiliary role. In a four-byte record, 21 of the 32 bits are allocated to the code point number. It would seem that three bytes (24 bits in total) would be enough, but the service markers eat up too much.
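The bit packing can be sketched by hand in a few lines of Python. This is a simplified illustration (it does not reject surrogate code points, for instance), and `utf8_encode` and `bits` are names I made up; the real encoder is of course `str.encode("utf-8")`.

```python
def utf8_encode(cp):
    """Pack one code point into UTF-8 bytes, showing where the payload bits go."""
    if cp < 0x80:                                   # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                                  # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                                # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def bits(data):
    """Render bytes as space-separated binary octets."""
    return " ".join(f"{b:08b}" for b in data)


print(bits(utf8_encode(0x043F)))   # a two-byte Cyrillic letter
print(bits(utf8_encode(0x1F600)))  # a four-byte emoji
```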
Is this bad? Not really. On the one hand, if we care a lot about space, we have compression algorithms that can easily squeeze out all the extra redundancy. On the other hand, the goal of Unicode was to provide the most universal encoding possible. For example, we can hand a UTF-8 string to code that used to work only with ASCII and not worry that it will see a character from the ASCII range that is not really there (after all, in UTF-8 every byte that starts with a zero bit is exactly what it looks like in ASCII). And if we suddenly want to cut a small tail off a large string without decoding it from the very beginning (or to recover part of the data after a damaged section), it is easy to find the offset where a character starts: it is enough to skip the bytes that have the bit prefix 10.
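As an illustration of that last property, here is a small Python sketch that resynchronizes on a UTF-8 buffer from an arbitrary position. The helper name `next_char_start` is hypothetical.

```python
def next_char_start(buf, pos):
    """Return the first index >= pos that is not a UTF-8 continuation byte."""
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:  # 10xxxxxx: continuation
        pos += 1
    return pos


data = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("utf-8")  # a six-letter Cyrillic word
start = next_char_start(data, 3)        # index 3 points into the middle of a character
print(start, data[start:].decode("utf-8"))
```

No state and no look-behind are needed, which is exactly what the formats below give up.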
So why invent something new?
At the same time, there are occasional situations where compression algorithms like deflate are a poor fit, yet you still want compact storage of strings. Personally, I ran into this problem while thinking about building a prefix tree for a multilingual dictionary.
Separately, I would like to point out another unpleasant nuance that arises when UTF-8 is used in such a data structure. As the UTF-8 layout shows, when a character is written as two bytes, the bits that belong to its number do not come in a row but are separated by a pair of bits 10 in the middle: 110xxxxx 10xxxxxx. Because of this, when the lower 6 bits of the second byte overflow in the character code (that is, the transition 10111111 → 10000000 happens), the first byte changes too. It turns out that the letter "п" is represented by the bytes 0xD0 0xBF, while the next letter, "р", is already 0xD1 0x80. In a prefix tree this splits the parent node in two: one for the prefix 0xD0 and another for 0xD1 (even though the whole Cyrillic alphabet could be encoded by the second byte alone).
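The split is easy to reproduce. The following Python snippet prints the UTF-8 bits of the two neighbouring letters (U+043F and U+0440) and collects the distinct lead bytes used by the basic Russian letters in the range 0x0410 to 0x044F; the helper name `utf8_bits` is mine.

```python
def utf8_bits(ch):
    """UTF-8 encoding of one character as space-separated binary octets."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))


for ch in ("\u043f", "\u0440"):  # two adjacent Cyrillic letters
    print(f"U+{ord(ch):04X}: {utf8_bits(ch)}")

# Lead bytes needed for the basic Russian letters U+0410..U+044F
lead_bytes = {chr(cp).encode("utf-8")[0] for cp in range(0x0410, 0x0450)}
print(sorted(hex(b) for b in lead_bytes))
```

Two different lead bytes mean two separate parent nodes in a byte-per-node prefix tree.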
What I came up with
Faced with this problem, I decided to practice some bit games and, at the same time, get to know the structure of Unicode as a whole a little better. The result is the UTF-C encoding format ("C" for compact), which spends no more than 3 bytes per code point and very often costs only one extra byte for the whole encoded string. As a result, for many non-ASCII alphabets this encoding turns out to be 30-60% more compact than UTF-8.
I have published example implementations of the encoding and decoding algorithms.
Test results and comparison with UTF-8
I also made a demo page.
Getting rid of unnecessary bits
I took UTF-8 as the basis, of course. The first and most obvious thing that can be changed in it is to reduce the number of service bits in each byte. For example, the first byte in UTF-8 always starts with either 0 or 11; only the following bytes have the prefix 10. Let's replace the prefix 11 with 1, and drop the prefixes from the following bytes altogether. What do we get?
0xxxxxxx - 1 byte
10xxxxxx xxxxxxxx - 2 bytes
110xxxxx xxxxxxxx xxxxxxxx - 3 bytes
Wait, where is the four-byte record? It is no longer needed: when writing three bytes we now have 21 bits available, which is enough, with room to spare, for every number up to 0x10FFFF.
What have we sacrificed here? The most important thing is the detection of character boundaries at an arbitrary place in the buffer. We can no longer point at an arbitrary byte and find the start of the next character from it. This is a limitation of our format, but in practice it is rarely needed. Usually we are able to run through the buffer from the very beginning (especially when the strings are short).
The situation with covering languages in 2 bytes has also improved: the two-byte format now offers a 14-bit range, that is, codes up to 0x3FFF. The Chinese are out of luck (their characters mostly lie between 0x4E00 and 0x9FFF), but Georgians and many other peoples are happier: their languages also fit into 2 bytes per character.
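The three record shapes above can be sketched in Python as follows. This is my own illustration of this intermediate, stateless format, not the published UTF-C code; the names `encode_v1` and `decode_v1` are hypothetical.

```python
def encode_v1(cp):
    """Encode one code point with the reduced prefixes (no state yet)."""
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x4000:                     # 10xxxxxx xxxxxxxx (14 payload bits)
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    if cp <= 0x10FFFF:                  # 110xxxxx xxxxxxxx xxxxxxxx (21 payload bits)
        return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    raise ValueError("not a Unicode code point")


def decode_v1(buf):
    """Decode a buffer produced by encode_v1 into a list of code points."""
    out, i = [], 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80:                    # one byte
            out.append(b)
            i += 1
        elif b < 0xC0:                  # two bytes
            out.append(((b & 0x3F) << 8) | buf[i + 1])
            i += 2
        else:                           # three bytes
            out.append(((b & 0x1F) << 16) | (buf[i + 1] << 8) | buf[i + 2])
            i += 3
    return out


for cp in (0x41, 0x10D0, 0x4E2D, 0x1F600):
    print(f"U+{cp:04X}: {len(encode_v1(cp))} byte(s)")
```

Note that the decoder can only walk forward from the start: a trailing byte may hold any value, so it cannot be told apart from a leading byte.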
Enter the encoder state
Now let's think about the properties of the strings themselves. A dictionary most often contains words written in characters of a single alphabet, and the same is true of many other texts. It would be nice to specify this alphabet once and then give only the number of the letter within it. Let's see whether the layout of characters in the Unicode table can help us.
As mentioned above, Unicode is divided into planes of 65536 codes each. But this is not a very useful division (as already said, most of the time we are in plane zero). The division into blocks is more interesting. These ranges no longer have a fixed length and are more meaningful: as a rule, each one groups the characters of a single alphabet.
The block containing the characters of the Bengali alphabet. Unfortunately, for historical reasons, this is an example of not very dense packing: 96 characters are chaotically scattered across the 128 code points of the block.
The starts of blocks and their sizes are always multiples of 16; this is done simply for convenience. In addition, many blocks start and end on values that are multiples of 128 or even 256. For example, the basic Cyrillic block takes 256 code points, from 0x0400 to 0x04FF. This is quite convenient: if we store the prefix 0x04 once, then any Cyrillic character can be written in a single byte. True, this way we lose the ability to go back to ASCII (and to any other characters in general). So we do this:
- Two bytes 10yyyyyy yxxxxxxx not only denote the character with number yyyyyy yxxxxxxx, but also change the current alphabet to yyyyyy y0000000 (that is, we remember all bits except the lowest 7);
- One byte 0xxxxxxx is a character of the current alphabet. It only needs to be added to the offset remembered by the two-byte form above. Until we change the alphabet the offset is zero, so we have kept compatibility with ASCII.
Likewise for codes that need 3 bytes:
- Three bytes 110yyyyy yxxxxxxx xxxxxxxx denote the character with number yyyyy yxxxxxxx xxxxxxxx, change the current alphabet to yyyyy y0000000 00000000 (we remember everything except the lowest 15 bits), and set a flag saying that we are now in long mode (when the alphabet is changed back to a two-byte one, this flag is reset);
- Two bytes 0xxxxxxx xxxxxxxx in long mode are a character of the current alphabet. As before, we add it to the remembered alphabet offset. The only difference is that we now read two bytes (because we have switched to this mode).
Sounds good: now, as long as we need to encode characters from the same 7-bit Unicode range, we spend 1 extra byte at the start and one byte per character in total.
One of the early versions at work. It already often beats UTF-8, but there is still room for improvement.
What got worse? First, we now have state, namely the current alphabet offset and the long-mode flag. This limits us further: the same characters can now be encoded differently in different contexts. Substring search, for example, has to take this into account rather than just compare bytes. Second, as soon as we change the alphabet, encoding ASCII characters becomes costly (and that is not only the Latin alphabet but also basic punctuation, including spaces): they require switching the alphabet back to 0, which means an extra byte again (and then one more to return to our main alphabet).
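To see both the gain and the cost, here is a Python sketch of this intermediate stateful scheme as I read the rules above. It is an illustration under my own assumptions, not the published UTF-C implementation; `encode_stateful` and `decode_stateful` are hypothetical names.

```python
def encode_stateful(text):
    """Encode text, remembering the current alphabet offset and long-mode flag."""
    offs, long_mode = 0, False
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        rel = cp - offs
        if not long_mode and 0 <= rel < 0x80:        # 0xxxxxxx
            out.append(rel)
        elif long_mode and 0 <= rel < 0x8000:        # 0xxxxxxx xxxxxxxx
            out += bytes([rel >> 8, rel & 0xFF])
        elif cp < 0x4000:                            # 10yyyyyy yxxxxxxx: switch alphabet
            out += bytes([0x80 | (cp >> 8), cp & 0xFF])
            offs, long_mode = cp & ~0x7F, False
        else:                                        # 110yyyyy yxxxxxxx xxxxxxxx
            out += bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
            offs, long_mode = cp & ~0x7FFF, True
    return bytes(out)


def decode_stateful(buf):
    """Inverse of encode_stateful; must start from the beginning of the buffer."""
    offs, long_mode = 0, False
    out, i = [], 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80 and not long_mode:
            cp, i = offs + b, i + 1
        elif b < 0x80:
            cp, i = offs + ((b << 8) | buf[i + 1]), i + 2
        elif b < 0xC0:
            cp, i = ((b & 0x3F) << 8) | buf[i + 1], i + 2
            offs, long_mode = cp & ~0x7F, False
        else:
            cp, i = ((b & 0x1F) << 16) | (buf[i + 1] << 8) | buf[i + 2], i + 3
            offs, long_mode = cp & ~0x7FFF, True
        out.append(chr(cp))
    return "".join(out)


word = "\u043f\u0440\u0438\u0432\u0435\u0442"        # six Cyrillic letters
phrase = "\u0434\u0430, \u043d\u0435\u0442"          # Cyrillic with a comma and a space
for s in (word, phrase):
    print(len(encode_stateful(s)), "bytes vs", len(s.encode("utf-8")), "in UTF-8")
```

The single word costs 7 bytes instead of 12, but in the phrase the comma and the return to Cyrillic each force a two-byte alphabet switch, which eats most of the gain.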
One alphabet is good, two are better
Let's change our bit prefixes a little, squeezing one more in alongside the three described above:
0xxxxxxx - 1 byte in normal mode, 2 in long mode
11xxxxxx - 1 byte
100xxxxx xxxxxxxx - 2 bytes
101xxxxx xxxxxxxx xxxxxxxx - 3 bytes
Now the two-byte record has one bit less available: it holds codes up to 0x1FFF rather than 0x3FFF. Still, this is noticeably more than two-byte UTF-8 codes offer, and the most common languages still fit. The most noticeable loss is hiragana and katakana, which have dropped out of the two-byte range.
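The four prefixes listed above can be told apart from the first byte alone (plus the long-mode flag). A minimal Python sketch, with a hypothetical helper name and covering only the scheme as stated at this point:

```python
def sequence_length(first_byte, long_mode=False):
    """Number of bytes in a record, judged by its first byte and the mode flag."""
    if first_byte < 0x80:        # 0xxxxxxx: character of the current alphabet
        return 2 if long_mode else 1
    if first_byte >= 0xC0:       # 11xxxxxx: one byte
        return 1
    if first_byte < 0xA0:        # 100xxxxx xxxxxxxx
        return 2
    return 3                     # 101xxxxx xxxxxxxx xxxxxxxx


# The two-byte form now carries 5 + 8 = 13 payload bits.
print(hex((1 << 13) - 1))
```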
What is this new code, 11xxxxxx? It is a small "stash" of 64 characters that complements our main alphabet, so I called it the auxiliary alphabet. When we switch the current alphabet, a piece of the old alphabet becomes the auxiliary one. For example, we switch from ASCII to Cyrillic, and the stash now holds 64 characters: the Latin letters, the digits, the space and the comma (the most frequent inserts in non-ASCII texts). Switch back to ASCII, and the main part of the Cyrillic alphabet becomes the auxiliary alphabet.
Thanks to access to two alphabets, we can handle a large number of texts with minimal alphabet-switching costs (punctuation will most often cause a return to ASCII, but after that we can take many non-ASCII characters from the auxiliary alphabet without switching again).
Bonus: by giving the auxiliary alphabet the prefix 11xxxxxx and choosing its initial offset to be 0xC0, we get partial compatibility with CP1252. In other words, many (but not all) Western European texts encoded in CP1252 will look exactly the same in UTF-C.
Here, however, a difficulty arises: how do we derive the auxiliary alphabet from the main one? We could keep the same offset, but, alas, here the structure of Unicode plays against us. Very often the main part of an alphabet does not sit at the start of its block (for example, the Russian capital "А" has the code 0x0410, although the Cyrillic block starts at 0x0400). So, by putting the first 64 characters into the stash, we could lose access to the tail of the alphabet.
To fix this problem, I went through some of the blocks corresponding to various languages by hand and specified for each the offset of the auxiliary alphabet within the main one. The Latin alphabet, as an exception, was reordered entirely, in the manner of base64.
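The Cyrillic case can be checked with a little arithmetic. In the Python sketch below, the shifted offset 0x0410 is my own hypothetical choice for illustration; the per-block offsets actually chosen by hand are not listed in this article.

```python
def aux_window(start, size=64):
    """Code points reachable through a 64-entry auxiliary alphabet."""
    return range(start, start + size)


russian = [chr(cp) for cp in range(0x0410, 0x0450)]   # the 64 basic letters

naive = aux_window(0x0400)                 # same offset as the block start
missing = [ch for ch in russian if ord(ch) not in naive]
print(len(missing), "letters fall outside the naive window")

shifted = aux_window(0x0410)               # hypothetical hand-picked offset
print(all(ord(ch) in shifted for ch in russian))
```

With the naive window the last 16 lowercase letters become unreachable, while a window shifted by 16 covers all 64 letters.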
Final touches
Finally, let's think about where else something can be improved.
Note that the format 101xxxxx xxxxxxxx xxxxxxxx can encode numbers up to 0x1FFFFF, while Unicode ends earlier, at 0x10FFFF. In other words, the last code point is represented as 10110000 11111111 11111111. Therefore we can say that if the first byte has the form 1011xxxx (where xxxx is greater than 0), it means something else. For example, we could add 15 more characters there that are always available in a single byte, but I decided to do it differently.
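A quick Python check of that claim; the helper `three_byte_form` is just my illustration of the 101xxxxx record.

```python
def three_byte_form(cp):
    """Pack a code point as 101xxxxx xxxxxxxx xxxxxxxx."""
    return bytes([0xA0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])


last = three_byte_form(0x10FFFF)
print(" ".join(f"{b:08b}" for b in last))

# First bytes 0xB1..0xBF can never start a valid three-byte record.
free_first_bytes = range(last[0] + 1, 0xC0)
print(len(free_first_bytes), "first-byte values are free")
```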
Let's look at the Unicode blocks that now need three bytes. Mostly, as already said, these are Chinese characters, but it is hard to do anything about them: there are 21 thousand of them. But hiragana and katakana have ended up there too, and there are not that many of them, fewer than two hundred. And, since we have mentioned Japanese, there are also emoji (in fact they are scattered across many places in Unicode, but the main blocks lie in the range 0x1F300 - 0x1FBFF). Keep in mind that there are now emoji assembled from several code points at once.
So we pick a few selected ranges corresponding to emoji, hiragana and katakana, renumber them into one continuous list, and encode them as two bytes instead of three: 1011xxxx xxxxxxxx
Great: a multi-code-point emoji now costs two bytes instead of three for each of its code points that falls in these ranges.
Let's try to fix one more problem. As we remember, the base alphabet is essentially the upper 6 bits, which we keep in mind and glue onto the code of every subsequently decoded character. For Chinese characters in the block 0x4E00 - 0x9FFF this is either 0 or 1. That is not very convenient: we would have to keep switching the alphabet between these two values (that is, spend three bytes). But note that in long mode we can subtract from the code itself the number of characters that we encode in short mode (after all the tricks described above, that is 10240). Then the range of the hieroglyphs shifts to 0x2600 - 0x77FF, and across this whole range the upper 6 bits (out of 21) are equal to 0. So sequences of hieroglyphs use two bytes per hieroglyph (which is optimal for such a large range) without causing any alphabet switches.
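The arithmetic behind this shift can be verified in a few lines of Python. The constant name `SHORT_MODE_COUNT` is mine; the value 10240 is the one stated above.

```python
SHORT_MODE_COUNT = 10240          # characters already covered by the short forms


def top6(value):
    """Upper 6 bits of a 21-bit value (everything above the low 15 bits)."""
    return value >> 15


lo, hi = 0x4E00 - SHORT_MODE_COUNT, 0x9FFF - SHORT_MODE_COUNT
print(f"shifted range: 0x{lo:04X} - 0x{hi:04X}")
print("before:", top6(0x4E00), top6(0x9FFF))   # two different alphabets
print("after: ", top6(lo), top6(hi))           # a single alphabet
```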
Alternative solutions: SCSU, BOCU-1
Unicode experts, having barely read the title of this article, will most likely hurry to remind you that the Unicode standards themselves include the Standard Compression Scheme for Unicode (SCSU).
I confess honestly: I learned of its existence only after I was already deep into writing my own solution. Had I known about it from the start, I would probably have tried to write an implementation of it instead of coming up with my own approach.
Interestingly, SCSU uses ideas very similar to the ones I arrived at on my own (instead of "alphabets" it uses "windows", and it has more of them available than I do). At the same time, this format has its drawbacks: it is a little closer to compression algorithms than to encodings. In particular, the standard offers many ways of representing the same text but does not say how to choose the optimal one; for that, the encoder has to use some kind of heuristics. So an SCSU encoder that produces good packing will be more complex and more cumbersome than my algorithm.
For comparison, I ported a relatively simple implementation of SCSU to JavaScript. In terms of code size it turned out comparable to my UTF-C, but in some cases its result was tens of percent worse (sometimes it may come out ahead, but not by much). For example, texts in Hebrew and Greek were encoded by UTF-C 60% better than by SCSU (probably thanks to their compact alphabets).
Separately, I will add that besides SCSU there is another way to represent Unicode compactly: BOCU-1.
Possible improvements
The algorithm I have presented is not universal by design (this is probably where my goals diverge most from those of the Unicode Consortium). I have already mentioned that it was developed primarily for one task (storing a multilingual dictionary in a prefix tree), and some of its features may not suit other tasks well. But the fact that it is not a standard can be a plus: you can easily modify it to fit your needs.
For example, you can get rid of the state in an obvious way and make the encoding stateless: simply do not update the variables offs, auxOffs and is21Bit in the encoder and decoder. In that case it will not be possible to pack sequences of characters from the same alphabet efficiently, but there will be a guarantee that the same character is always encoded with the same bytes, regardless of context.
In addition, you can tailor the encoder to a specific language by changing the default state. For example, if you are targeting Russian texts, set offs = 0x0400 and auxOffs = 0 at the start in both the encoder and the decoder. This makes particular sense in stateless mode. On the whole, it will resemble using an old eight-bit encoding, but without losing the ability to insert characters from all of Unicode when needed.
Another drawback mentioned earlier is that in a large text encoded in UTF-C there is no quick way to find the character boundary closest to an arbitrary byte. If you cut off the last, say, 100 bytes of an encoded buffer, you risk getting garbage that you cannot do anything with. The encoding is not designed for storing multi-gigabyte logs, but in general this can be fixed. The byte 0xBF must never appear as the first byte (though it may be the second or third). So, when encoding, you can insert the sequence 0xBF 0xBF 0xBF every, say, 10 KB. Then, if you need to find a boundary, it is enough to scan the selected piece until such a marker is found. Whatever follows the last 0xBF is guaranteed to be the start of a character. (When decoding, this three-byte sequence will of course have to be ignored.)
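The marker scan could look roughly like this in Python. This is a sketch of the idea as described above, not part of the published implementation; `find_char_boundary` is a hypothetical helper, and restoring the decoder's alphabet state after such a jump is a separate question that this sketch does not address.

```python
MARKER = b"\xbf\xbf\xbf"


def find_char_boundary(buf, start=0):
    """Return the index just past the next marker run, or -1 if none is found."""
    i = buf.find(MARKER, start)
    if i < 0:
        return -1
    i += len(MARKER)
    # A character may end in up to two 0xBF bytes right before the marker,
    # so the run can be longer than three; skip to its very end.
    while i < len(buf) and buf[i] == 0xBF:
        i += 1
    return i


# Hypothetical buffer: a record ending in 0xBF, then the marker, then "A".
buf = b"\xa1\xbf" + MARKER + b"A"
print(find_char_boundary(buf))
```

Because a real record contributes at most two trailing 0xBF bytes and never starts with one, any run of three or more must contain a marker, and the byte after the run starts a character.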
Summing up
If you have read this far, congratulations! I hope that you, like me, have learned something new (or refreshed your memory) about the structure of Unicode.
Demo page. The Hebrew example shows the advantage over both UTF-8 and SCSU.
The research described above should not be taken as an encroachment on the standards. Still, I am generally satisfied with the results of my work, so I am glad to share them.
Finally, let me once again draw attention to the cases where UTF-C is not worth using:
- If your strings are long enough (from 100-200 characters). In this case you should consider compression algorithms such as deflate.
- If you need ASCII transparency, that is, it matters to you that the encoded sequences contain no ASCII codes that were not in the source string. This need can be avoided if, when talking to third-party APIs (for example, when working with a database), you pass the encoding result as an abstract set of bytes rather than as strings. Otherwise you risk unexpected vulnerabilities.
- If you want to be able to quickly find character boundaries at an arbitrary offset (for example, when part of a string is damaged). This can be done, but only by scanning the string from the very beginning (or by applying the modification described in the previous section).
- If you need to quickly perform operations on the contents of strings (sort them, search for substrings in them, concatenate them). These require the strings to be decoded first, so UTF-C will be slower than UTF-8 in these cases (but faster than compression algorithms). Since the same string is always encoded the same way, exact comparison does not require decoding and can be done byte by byte.
Source: www.habr.com