Another bicycle: storing Unicode strings 30-60% more compactly than UTF-8


If you are a developer and have to choose an encoding, Unicode will almost always be the right answer. The specific representation depends on the context, but most of the time there is a universal answer here too: UTF-8. Its strength is that it lets you use every Unicode character without spending too many bytes in most cases. Admittedly, for languages that use more than the Latin alphabet, "not too many" means at least two bytes per character. Can we do better without going back to prehistoric encodings that limit us to just 256 available characters?

Below I invite you to look at my attempt to answer this question and to implement a relatively simple algorithm that stores strings in most of the world's languages without the redundancy that UTF-8 adds.

Disclaimer. A few important caveats up front: the solution described here is not offered as a universal replacement for UTF-8. It is suitable only for a narrow set of cases (more on them below), and under no circumstances should it be used to talk to third-party APIs (which have never heard of it). In most cases, general-purpose compression algorithms (deflate, for example) are the right tool for compact storage of large volumes of text. In addition, while I was already building my solution, I found an existing standard within Unicode itself that solves the same problem. It is somewhat more complicated (and often performs worse), but it is still an accepted standard and not something thrown together on one's knee. I will tell you about it too.

About Unicode and UTF-8

To begin with, a few words about what Unicode and UTF-8 are.

As you know, 8-bit encodings used to be popular. Everything was simple with them: 256 characters can be numbered from 0 to 255, and numbers from 0 to 255 obviously fit in one byte. Going back to the very beginning, ASCII is limited to 7 bits, so the most significant bit of each byte is zero, and most 8-bit encodings are compatible with it (they differ only in the "upper" half, where the most significant bit is one).

How does Unicode differ from those encodings, and why are there so many specific representations associated with it: UTF-8, UTF-16 (BE and LE), UTF-32? Let's take it in order.

The core Unicode standard describes only the correspondence between characters (and, in some cases, individual components of characters) and their numbers. There are a lot of possible numbers in this standard: from 0x00 to 0x10FFFF (1,114,112 of them). If we want to store a number from that range in a variable, neither 1 nor 2 bytes will be enough. And since our processors are not really designed to work with three-byte numbers, we are forced to use as many as 4 bytes per character! This is UTF-32, and it is exactly because of this wastefulness that the format is not popular.
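The arithmetic in the paragraph above is easy to check. A minimal sketch (the variable names are mine and purely illustrative):

```javascript
// Number of code points in the Unicode range 0x00..0x10FFFF.
const codePointCount = 0x10FFFF + 1;

// Bits needed to number that many values, and the whole bytes that implies.
const bitsNeeded = Math.ceil(Math.log2(codePointCount));
const bytesNeeded = Math.ceil(bitsNeeded / 8);

console.log(codePointCount, bitsNeeded, bytesNeeded);
```

Three bytes would be enough in principle; UTF-32 rounds up to four because that is a size processors handle naturally.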

Fortunately, the order of characters within Unicode is not random. The whole set is divided into 17 "planes", each containing 65536 (0x10000) "code points". A "code point" here is simply the number that Unicode assigns to a character. But, as mentioned above, Unicode numbers not only individual characters but also their components and service marks (and sometimes nothing at all corresponds to a number, perhaps only for the time being, but that is not so important for us), so it is more correct to always speak about the numbers themselves rather than about characters. Still, for brevity, I will often use the word "character" below to mean "code point".

Unicode planes. As you can see, most of them (planes 4 through 13) are still unused.

What is most remarkable is that all the main "pulp" lies in plane zero, called the "Basic Multilingual Plane". If a string contains text in one of the modern languages (including Chinese), you will not go beyond this plane. But you cannot cut off the rest of Unicode either: emoji, for example, are mostly located at the end of the next plane, the "Supplementary Multilingual Plane" (it spans 0x10000 to 0x1FFFF). So UTF-16 does this: every character that falls within the Basic Multilingual Plane is encoded "as is" with the corresponding two-byte number. However, some numbers in this range do not denote specific characters at all; instead they signal that another pair of bytes must be considered after this one. Combining the values of these four bytes gives a number that covers the entire valid Unicode range. This representation is called "surrogate pairs"; you may have heard of them.
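To make the surrogate pair idea concrete, here is a small sketch of how a code point is split into UTF-16 code units. The helper name `toUtf16Units` is hypothetical; the constants are the standard UTF-16 ones.

```javascript
// Split a code point into UTF-16 code units: one unit, or a surrogate pair.
function toUtf16Units(codePoint) {
  if (codePoint < 0x10000) {
    return [codePoint]; // Basic Multilingual Plane: stored "as is"
  }
  const v = codePoint - 0x10000;    // 20 significant bits remain
  const high = 0xD800 + (v >> 10);  // upper 10 bits go into the high surrogate
  const low = 0xDC00 + (v & 0x3FF); // lower 10 bits go into the low surrogate
  return [high, low];
}

const units = toUtf16Units(0x1F600);
console.log(units.map((u) => u.toString(16)));
```

JavaScript strings are themselves sequences of UTF-16 code units, so the result can be compared directly against `String.fromCodePoint`.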

So UTF-16 needs two or (in very rare cases) four bytes per code point. This is better than using four bytes all the time, but Latin text (and other ASCII characters) encoded this way wastes half of its space on zeros. UTF-8 is designed to fix this: ASCII in it takes, as before, only one byte; codes from 0x80 to 0x7FF take two bytes; from 0x800 to 0xFFFF, three; and from 0x10000 to 0x10FFFF, four. On the one hand, the Latin alphabet is well off: ASCII compatibility is back, and the distribution is "spread" more evenly from 1 to 4 bytes. But alphabets other than Latin, alas, gain nothing compared to UTF-16, and many now need three bytes instead of two: the range covered by the two-byte form has shrunk 32 times, from 0xFFFF to 0x7FF, and neither Chinese nor, say, Georgian fits into it. Cyrillic and five other alphabets got lucky, hooray: 2 bytes per character.
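The ranges above can be turned into a quick comparison. This is only an illustrative sketch; `utf8Length` and `utf16Length` are names I made up, and the sample letters are my own choice.

```javascript
// Bytes per code point in UTF-8, following the ranges listed above.
function utf8Length(cp) {
  if (cp <= 0x7F) return 1;
  if (cp <= 0x7FF) return 2;
  if (cp <= 0xFFFF) return 3;
  return 4;
}

// Bytes per code point in UTF-16: two, or four for a surrogate pair.
function utf16Length(cp) {
  return cp <= 0xFFFF ? 2 : 4;
}

// Latin, Cyrillic, Georgian, Chinese and an emoji.
for (const ch of "AЖა中😀") {
  const cp = ch.codePointAt(0);
  console.log(ch, "U+" + cp.toString(16), utf8Length(cp), utf16Length(cp));
}

// How much smaller the two-byte range became: 0xFFFF versus 0x7FF.
const shrinkFactor = (0xFFFF + 1) / (0x7FF + 1);
console.log(shrinkFactor);
```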

Why does this happen? Let's see how UTF-8 represents character codes:

0xxxxxxx - 1 byte
110xxxxx 10xxxxxx - 2 bytes
1110xxxx 10xxxxxx 10xxxxxx - 3 bytes
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - 4 bytes
Here the bits marked with the symbol x are the ones that actually represent the number. You can see that the two-byte form has only 11 such bits (out of 16). The leading bits play only an auxiliary role. In the four-byte form, 21 of the 32 bits are allotted to the code point number. It would seem that three bytes (24 bits in total) should be enough, but the service markers eat too much.
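The layout described above can be written down as a short encoder. This is a sketch of standard UTF-8 for a single code point (the function name `utf8Encode` is mine); it does no validation of surrogates or out-of-range input.

```javascript
// Encode one code point as UTF-8 bytes: service bits first, payload bits after.
function utf8Encode(cp) {
  if (cp <= 0x7F) {
    return [cp]; // 0xxxxxxx: 7 payload bits
  }
  if (cp <= 0x7FF) {
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 11 payload bits
  }
  if (cp <= 0xFFFF) {
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]; // 16 bits
  }
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ]; // 21 payload bits
}

console.log(utf8Encode(0x043F).map((b) => b.toString(16)));
```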

Is this bad? Not really. On the one hand, if we care a lot about space, we have compression algorithms that easily remove all the extra entropy and redundancy. On the other hand, the goal of Unicode was to provide the most universal encoding possible. For example, we can hand a UTF-8 encoded string to code that used to work only with ASCII and not worry that it will see a character from the ASCII range that is not actually there (in UTF-8, every byte that starts with a zero bit is exactly ASCII and nothing else). And if we suddenly want to cut a small tail off a large string without decoding it from the very beginning (or to recover part of the information after a damaged section), it is easy to find the offset where a character starts: it is enough to skip the bytes that have the bit prefix 10.
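The resynchronization trick mentioned above takes only a few lines. A minimal sketch, with a hypothetical helper name, that skips forward over continuation bytes:

```javascript
// From an arbitrary offset, skip UTF-8 continuation bytes (10xxxxxx)
// until we stand on the first byte of a character.
function nextCharStart(bytes, offset) {
  let i = offset;
  while (i < bytes.length && (bytes[i] & 0xC0) === 0x80) {
    i++;
  }
  return i;
}

// "a" is 1 byte, "п" is 2 bytes, "€" is 3 bytes, "b" is 1 byte.
const sample = new TextEncoder().encode("aп€b");
console.log(nextCharStart(sample, 2)); // offset 2 is the second byte of "п"
```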

So why invent something new?

At the same time, there are occasional situations where compression algorithms like deflate are a poor fit, but you still want compact string storage. Personally, I ran into this problem while thinking about building a compressed prefix tree for a large dictionary containing words in arbitrary languages. On the one hand, each word is very short, so compressing it would be ineffective. On the other hand, the tree implementation I was considering was designed so that every byte of a stored string produced a separate tree node, so minimizing the number of bytes was very useful. In my library Az.js (as in pymorphy2, on which it is based) a similar problem is solved simply: the strings packed into the DAWG dictionary are stored in good old CP1251. But, as is easy to see, this works well only for a limited alphabet; a Chinese string cannot be added to such a dictionary.

Separately, I would like to note one more unpleasant nuance that arises when using UTF-8 in such a data structure. The layout above shows that when a character is written as two bytes, the bits that carry its number are not contiguous but are separated by the pair of bits 10 in the middle: 110xxxxx 10xxxxxx. Because of this, when the lower 6 bits of the second byte overflow (that is, the transition 10111111 → 10000000 occurs), the first byte changes too. So the letter "п" is encoded as the bytes 0xD0 0xBF, while the next letter, "р", is already 0xD1 0x80. In a prefix tree this splits the parent node in two: one for the prefix 0xD0 and another for 0xD1 (even though the whole Cyrillic alphabet could be encoded by the second byte alone).
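The effect on a byte-level prefix tree can be shown with a toy node counter. This is only an illustrative sketch, not the tree from Az.js; `countTrieNodes` and the "low byte only" encoding are hypothetical.

```javascript
// Count the nodes (excluding the root) of a byte-level prefix tree
// built from the given byte sequences.
function countTrieNodes(sequences) {
  const root = new Map();
  let count = 0;
  for (const seq of sequences) {
    let node = root;
    for (const byte of seq) {
      if (!node.has(byte)) {
        node.set(byte, new Map());
        count++;
      }
      node = node.get(byte);
    }
  }
  return count;
}

const encoder = new TextEncoder();
const letters = ["п", "р"]; // U+043F and U+0440, neighbours in the Cyrillic block

// UTF-8: D0 BF and D1 80, so the two letters share no prefix.
const utf8 = letters.map((s) => encoder.encode(s));
// Hypothetical one-byte encoding that keeps only the low byte of the code point.
const oneByte = letters.map((s) => [s.codePointAt(0) & 0xFF]);

console.log(countTrieNodes(utf8), countTrieNodes(oneByte));
```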

What came of it

Faced with this problem, I decided to practice playing games with bits and, at the same time, to get better acquainted with the structure of Unicode as a whole. The result is the UTF-C encoding format ("C" for compact), which spends no more than 3 bytes per code point and very often spends only one extra byte for the whole encoded string. As a result, for many non-ASCII alphabets this encoding turns out to be 30-60% more compact than UTF-8.

I have published example implementations of the encoding and decoding algorithms as JavaScript and Go libraries, and you are free to use them in your code. But I will stress once more that in a sense this format remains a "bicycle", and I do not recommend using it without understanding why you need it. It is still more of an experiment than a serious "improvement of UTF-8". That said, the code there is written cleanly and concisely, with plenty of comments and test coverage.

Test results and comparison with UTF-8

I also made a demo page where you can evaluate how the algorithm performs. Next I will tell you more about its principles and how it was developed.

Eliminating redundant bits

I took UTF-8 as the basis, of course. The first and most obvious thing to change in it is to reduce the number of service bits in each byte. For example, the first byte in UTF-8 always starts either with 0 or with 11; only the following bytes have the prefix 10. Let's replace the prefix 11 with 1 and drop the prefixes of the following bytes altogether. What do we get?

0xxxxxxx - 1 byte
10xxxxxx xxxxxxxx - 2 bytes
110xxxxx xxxxxxxx xxxxxxxx - 3 bytes

Wait, where is the four-byte form? It is no longer needed: when writing three bytes we now have 21 bits available, and that is enough for every number up to 0x10FFFF.
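The three forms above can be sketched as follows. Note that this is the intermediate, stateless scheme from this section, not the final UTF-C format, and the function names are my own.

```javascript
// Encode one code point with the simplified prefixes 0, 10 and 110.
function encodeCodePoint(cp) {
  if (cp <= 0x7F) return [cp];                              // 0xxxxxxx
  if (cp <= 0x3FFF) return [0x80 | (cp >> 8), cp & 0xFF];   // 10xxxxxx xxxxxxxx
  return [0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF];  // 110xxxxx xxxxxxxx xxxxxxxx
}

// Decode a whole buffer back into code points, reading from the start.
function decodeAll(bytes) {
  const out = [];
  let i = 0;
  while (i < bytes.length) {
    const b = bytes[i];
    if (b < 0x80) {
      out.push(b);
      i += 1;
    } else if (b < 0xC0) {
      out.push(((b & 0x3F) << 8) | bytes[i + 1]);
      i += 2;
    } else {
      out.push(((b & 0x1F) << 16) | (bytes[i + 1] << 8) | bytes[i + 2]);
      i += 3;
    }
  }
  return out;
}

console.log(encodeCodePoint(0x10FFFF).map((b) => b.toString(16)));
```

Only the first byte carries a prefix, which is exactly why a decoder has to start from the beginning of the buffer.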

What have we sacrificed here? Most importantly, the ability to detect character boundaries from an arbitrary place in the buffer. We cannot point at an arbitrary byte and find the start of the next character from there. This is a limitation of our format, but in practice it is rarely needed. We are usually able to run through the buffer from the very beginning (especially when it comes to short strings).

The situation with covering languages in 2 bytes has also improved: the two-byte form now gives a range of 14 bits, which means codes up to 0x3FFF. The Chinese are out of luck (their characters mostly lie between 0x4E00 and 0x9FFF), but Georgians and many other peoples are better off: their languages now also fit into 2 bytes per character.

Enter the encoder state

Now let's think about the properties of the strings themselves. A dictionary most often contains words written in characters of a single alphabet, and the same is true of many other texts. It would be nice to specify that alphabet once and then give only the number of each letter within it. Let's see whether the arrangement of characters in the Unicode table helps us.

As mentioned above, Unicode is divided into planes of 65536 codes each. But this is not a very useful division (as already said, most of the time we are in plane zero). The division into blocks is more interesting. These ranges no longer have a fixed length and are more meaningful: as a rule, each one groups the characters of a single alphabet.

The block containing the characters of the Bengali alphabet. Unfortunately, for historical reasons, this is an example of rather loose packing: 96 characters are scattered chaotically across the block's 128 code points.

The starts of blocks and their sizes are always multiples of 16; this is done simply for convenience. In addition, many blocks start and end at values that are multiples of 128 or even 256. For example, the basic Cyrillic block occupies 256 code points, from 0x0400 to 0x04FF. This is quite convenient: if we store the prefix 0x04 once, then any Cyrillic character can be written in one byte. True, this way we lose the ability to return to ASCII (and to any other characters at all). So we do this:
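As a tiny illustration of the "store the prefix 0x04 once" idea (the names below are hypothetical and this is not yet the scheme that follows):

```javascript
// The high byte of the basic Cyrillic block, remembered once.
const prefix = 0x04;

// Rebuild a full code point from a single stored low byte.
function fromLowByte(lowByte) {
  return (prefix << 8) | lowByte;
}

// Low bytes of the letters in "Привет": one byte per character.
const lowBytes = [0x1F, 0x40, 0x38, 0x32, 0x35, 0x42];
const word = lowBytes.map((b) => String.fromCodePoint(fromLowByte(b))).join("");
console.log(word);
```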

  1. Two bytes 10yyyyyy yxxxxxxx not only denote the character with the number yyyyyy yxxxxxxx, but also change the current alphabet to yyyyyy y0000000 (that is, we remember all the bits except the lowest 7 bits);
  2. One byte 0xxxxxxx is a character of the current alphabet. It only needs to be added to the offset we remembered in step 1. As long as we have not changed the alphabet, the offset is zero, so compatibility with ASCII is preserved.

Likewise for codes that need 3 bytes:

  1. Three bytes 110yyyyy yxxxxxxx xxxxxxxx denote the character with the number yyyyy yxxxxxxx xxxxxxxx, change the current alphabet to yyyyy y0000000 00000000 (we remember everything except the lowest 15 bits), and set a flag saying that we are now in long mode (when the alphabet is switched back to a two-byte one, the flag is reset);
  2. Two bytes 0xxxxxxx xxxxxxxx in long mode are a character of the current alphabet. As before, we add it to the offset from step 1. The only difference is that we now read two bytes (because we have switched to this mode).

Sounds good: as long as we need to encode characters from the same 7-bit Unicode range, we spend 1 extra byte at the start and then just one byte per character.
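The rules above can be sketched as a small stateful encoder and decoder. This is my own reading of the intermediate scheme described in this section, not the published UTF-C code; the names `encode`, `decode`, `offs` and `longMode` are illustrative.

```javascript
// Stateful intermediate scheme: current alphabet offset plus a long-mode flag.
function encode(text) {
  const out = [];
  let offs = 0;         // base of the current alphabet
  let longMode = false; // set by a three-byte switch
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (!longMode && (cp & ~0x7F) === offs) {
      out.push(cp - offs);                                       // 0xxxxxxx
    } else if (longMode && (cp & ~0x7FFF) === offs) {
      const rel = cp - offs;
      out.push(rel >> 8, rel & 0xFF);                            // 0xxxxxxx xxxxxxxx
    } else if (cp <= 0x3FFF) {
      out.push(0x80 | (cp >> 8), cp & 0xFF);                     // 10yyyyyy yxxxxxxx
      offs = cp & ~0x7F;
      longMode = false;
    } else {
      out.push(0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF);  // 110yyyyy yxxxxxxx xxxxxxxx
      offs = cp & ~0x7FFF;
      longMode = true;
    }
  }
  return out;
}

function decode(bytes) {
  let text = "";
  let offs = 0;
  let longMode = false;
  let i = 0;
  while (i < bytes.length) {
    const b = bytes[i++];
    let cp;
    if (b < 0x80) {
      // Character of the current alphabet: one byte, or two in long mode.
      cp = longMode ? offs + ((b << 8) | bytes[i++]) : offs + b;
    } else if (b < 0xC0) {
      cp = ((b & 0x3F) << 8) | bytes[i++];
      offs = cp & ~0x7F;
      longMode = false;
    } else {
      cp = ((b & 0x1F) << 16) | (bytes[i] << 8) | bytes[i + 1];
      i += 2;
      offs = cp & ~0x7FFF;
      longMode = true;
    }
    text += String.fromCodePoint(cp);
  }
  return text;
}

console.log(encode("Привет").length, new TextEncoder().encode("Привет").length);
```

For a six-letter Cyrillic word this gives one switching byte plus one byte per letter, against two bytes per letter in UTF-8.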

Performance of one of the earlier versions. It already beats UTF-8 in many cases, but there is still room for improvement.

What got worse? First, we now have state, namely the current alphabet offset and the long-mode flag. This limits us further: the same characters can now be encoded differently in different contexts. Substring search, for example, has to take this into account rather than just comparing bytes. Second, as soon as we switch the alphabet, encoding ASCII characters becomes expensive (and that means not only the Latin alphabet but also basic punctuation, including spaces): they require switching the alphabet back to 0, which costs an extra byte (and then one more to return to our main alphabet).

One alphabet is good, two are better

Let's change our bit prefixes slightly, squeezing one more in alongside the three described above:

0xxxxxxx - 1 byte in normal mode, 2 in long mode
11xxxxxx - 1 byte
100xxxxx xxxxxxxx - 2 bytes
101xxxxx xxxxxxxx xxxxxxxx - 3 bytes


Now the two-byte form has one bit less available: it holds code points up to 0x1FFF rather than 0x3FFF. Still, this is noticeably more than two-byte UTF-8 codes cover, and most common languages still fit. The most noticeable loss is that hiragana and katakana have dropped out; the Japanese are sad.
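A quick way to see which alphabets fit where is to compare a sample letter against the upper bound of each two-byte form. The sample letters and names below are my own illustrative choices.

```javascript
// Largest code point that fits in two bytes under each scheme.
const twoByteMax = {
  "UTF-8 (11 bits)": 0x7FF,
  "first draft (14 bits)": 0x3FFF,
  "revised prefixes (13 bits)": 0x1FFF,
};

const samples = { Cyrillic: "Ж", Georgian: "ა", Hiragana: "あ", Chinese: "中" };

function fitsInTwoBytes(ch, max) {
  return ch.codePointAt(0) <= max;
}

for (const [script, ch] of Object.entries(samples)) {
  const row = Object.entries(twoByteMax)
    .map(([name, max]) => name + ": " + fitsInTwoBytes(ch, max))
    .join(", ");
  console.log(script, row);
}
```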

What is this new code, 11xxxxxx? It is a small "stash" of 64 characters that complements our main alphabet, so I called it the auxiliary alphabet. When we switch the current alphabet, a piece of the old alphabet becomes the auxiliary one. For example, when we switch from ASCII to Cyrillic, the stash holds 64 characters containing the Latin alphabet, the digits, the space and the comma (the most frequent insertions in non-ASCII texts). Switch back to ASCII, and the main part of the Cyrillic alphabet becomes the auxiliary alphabet.

With access to two alphabets we can handle a large number of texts at minimal cost for switching alphabets (punctuation will usually cause a return to ASCII, but after that we can pick up many non-ASCII characters from the auxiliary alphabet without switching again).

A bonus: by giving the auxiliary alphabet the prefix 11xxxxxx and choosing 0xC0 as its initial offset, we get partial compatibility with CP1252. In other words, many (but not all) Western European texts encoded in CP1252 will look the same in UTF-C.
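Why the initial offset 0xC0 lines up with CP1252 can be seen from a sketch of the two one-byte forms. This is a simplified illustration under my own naming (`decodeOneByte`, `state`); it ignores alphabet switching and the multi-byte forms.

```javascript
// One-byte forms only: 0xxxxxxx is the main alphabet, 11xxxxxx the auxiliary one.
function decodeOneByte(byte, state) {
  if (byte < 0x80) return state.offs + byte;              // main alphabet
  if (byte >= 0xC0) return state.auxOffs + (byte & 0x3F); // auxiliary alphabet
  return null; // 10xxxxxx starts a multi-byte form, not handled in this sketch
}

// Initial state as described in the text: ASCII plus an auxiliary offset of 0xC0.
const initial = { offs: 0, auxOffs: 0xC0 };

// In CP1252 the bytes 0xC0..0xFF stand for U+00C0..U+00FF, so 0xE9 is "é".
console.log(String.fromCodePoint(decodeOneByte(0xE9, initial)));
```

With this initial state every byte in 0xC0..0xFF decodes to the code point with the same value, which is what CP1252 does in that range.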

Here, however, a difficulty arises: how do we derive the auxiliary alphabet from the main one? We could keep the same offset, but, alas, here the structure of Unicode works against us. Very often the main part of an alphabet is not at the start of its block (for example, the Russian capital "А" has the code 0x0410, although the Cyrillic block starts at 0x0400). So by taking the first 64 characters into the stash we may lose access to the tail of the alphabet.

To fix this, I went through some of the blocks corresponding to different languages by hand and specified for each the offset of the auxiliary alphabet within the main one. The Latin alphabet, as an exception, was reordered along the lines of base64.


Final touches

Finally, let's think about where else something can be improved.

Note that the form 101xxxxx xxxxxxxx xxxxxxxx can encode numbers up to 0x1FFFFF, while Unicode ends earlier, at 0x10FFFF. In other words, the last code point is represented as 10110000 11111111 11111111. We can therefore say that if the first byte has the form 1011xxxx (where xxxx is greater than 0), it means something else. For example, we could put another 15 characters there that are always available for one-byte encoding, but I decided to do it differently.
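The claim about the last code point can be verified directly. A minimal sketch of the three-byte form only (the helper name is hypothetical):

```javascript
// Three-byte form 101xxxxx xxxxxxxx xxxxxxxx: 21 payload bits after the prefix.
function encodeThreeBytes(cp) {
  return [0xA0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF];
}

const last = encodeThreeBytes(0x10FFFF);
console.log(last.map((b) => b.toString(2).padStart(8, "0")).join(" "));

// First-byte values of the form 1011xxxx with xxxx > 0 are never produced.
const unusedFirstBytes = [];
for (let b = 0xB1; b <= 0xBF; b++) unusedFirstBytes.push(b);
console.log(unusedFirstBytes.length);
```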

Let's look at the Unicode blocks that now need three bytes. Mostly, as already mentioned, these are Chinese characters, but it is hard to do anything about them: there are 21 thousand of them. But hiragana and katakana ended up there too, and there are not that many of those, fewer than two hundred. And, since we are on the subject of the Japanese, there are also emoji (in fact they are scattered across many places in Unicode, but the main blocks lie in the range 0x1F300 - 0x1FBFF). If you consider that there are now emoji assembled from several code points at once (for example, the emoji shown as an image in the original article consists of as many as 7 codes!), then spending three bytes on each becomes a real shame (7×3 = 21 bytes for a single icon, a nightmare).

So we pick a few ranges corresponding to emoji, hiragana and katakana, renumber them into one continuous list and encode them in two bytes instead of three:

1011xxxx xxxxxxxx

Great: the emoji mentioned above, which consists of 7 code points, takes 25 bytes in UTF-8, and we fit it into 14 (exactly two bytes per code point). By the way, Habr refused to digest it (in both the old and the new editor), so I had to insert it as an image.
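The byte counts can be reproduced with any emoji built from 7 code points. The sequence below (four people joined by three zero-width joiners) is my own illustrative choice and is not necessarily the emoji from the original article; the 14-byte figure simply applies the article's "two bytes per code point".

```javascript
// An illustrative 7-code-point emoji: 4 person code points joined by 3 ZWJ (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

const codePoints = Array.from(family).length;             // iterates by code point
const utf8Bytes = new TextEncoder().encode(family).length; // 4 x 4 bytes + 3 x 3 bytes
const twoBytesEach = codePoints * 2;                       // the figure quoted for UTF-C

console.log(codePoints, utf8Bytes, twoBytesEach);
```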

Let's try to fix one more problem. As we remember, the base alphabet is essentially the upper 6 bits, which we keep in mind and glue onto the code of each subsequent decoded character. For Chinese characters in the block 0x4E00 - 0x9FFF, this is either 0 or 1. That is not very convenient: we would have to keep switching the alphabet between these two values (that is, spend three bytes each time). But note that in long mode we can subtract from the code itself the number of characters that we encode in the short mode (after all the tricks described above, that is 10240). The range of the hieroglyphs then shifts to 0x2600 - 0x77FF, and across this whole range the upper 6 bits (out of 21) are equal to 0. So sequences of hieroglyphs will use two bytes per hieroglyph (which is optimal for such a large range) without causing alphabet switches.
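The shift described above is plain arithmetic and can be checked in a few lines. The constant 10240 is taken from the text; the names are illustrative.

```javascript
// Number of code points already covered by the shorter forms, as stated in the text.
const shortModeCount = 10240;

// The upper 6 bits of a 21-bit value are everything above the lowest 15 bits.
const upperSixBits = (value) => value >> 15;

const blockStart = 0x4E00;
const blockEnd = 0x9FFF;
const shiftedStart = blockStart - shortModeCount;
const shiftedEnd = blockEnd - shortModeCount;

console.log(shiftedStart.toString(16), shiftedEnd.toString(16));
console.log(upperSixBits(blockStart), upperSixBits(blockEnd));     // differ before the shift
console.log(upperSixBits(shiftedStart), upperSixBits(shiftedEnd)); // equal after the shift
```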

Alternative solutions: SCSU, BOCU-1

Unicode experts, having read no more than the title of this article, will most likely hurry to remind you that among the Unicode standards themselves there is the Standard Compression Scheme for Unicode (SCSU), which describes an encoding method very similar to the one described here.

I honestly admit: I learned of its existence only when I was already deep into writing my own solution. Had I known about it from the start, I probably would have tried to write an implementation of it instead of inventing my own approach.

What is interesting is that SCSU uses ideas very similar to the ones I arrived at on my own (instead of "alphabets" it speaks of "windows", and it has more of them available than I do). At the same time, this format has its drawbacks: it is somewhat closer to compression algorithms than to encodings. In particular, the standard offers many ways to represent the same text but does not say how to choose the optimal one; for that, the encoder has to use some kind of heuristics. So an SCSU encoder that produces good packing will be more complex and more cumbersome than my algorithm.

For comparison, I ported a relatively simple SCSU implementation to JavaScript. In terms of code size it turned out comparable to my UTF-C, but in some cases its output was tens of percent worse (sometimes it may come out ahead, but not by much). For example, texts in Hebrew and Greek were encoded by UTF-C 60% better than by SCSU (probably because of their compact alphabets).

Separately, I will add that besides SCSU there is another way to represent Unicode compactly, BOCU-1, but it aims at MIME compatibility (which I did not need) and takes a slightly different approach to encoding. I have not evaluated its efficiency, but it seems unlikely to me that it is higher than that of SCSU.

Possible improvements

The algorithm I have presented is not universal by design (this is probably where my goals diverge most from those of the Unicode Consortium). I have already mentioned that it was developed primarily for one task (storing a multilingual dictionary in a prefix tree), and some of its features may not suit other tasks well. But the fact that it is not a standard can be an advantage: you can easily modify it to fit your needs.

For example, you can get rid of the state in the obvious way and make the encoding stateless: simply do not update the variables offs, auxOffs and is21Bit in the encoder and decoder. In that case sequences of characters from the same alphabet can no longer be packed efficiently, but you get a guarantee that the same character is always encoded with the same bytes regardless of context.

In addition, you can tailor the encoder to a specific language by changing the default state. For example, if you are targeting Russian texts, initialize the encoder and decoder with offs = 0x0400 and auxOffs = 0. This makes particular sense in stateless mode. Overall, this is similar to using an old 8-bit encoding, but without giving up the ability to insert characters from all of Unicode when needed.

Another drawback mentioned earlier is that in a large text encoded in UTF-C there is no quick way to find the character boundary nearest to an arbitrary byte. If you cut off, say, the last 100 bytes of an encoded buffer, you risk getting garbage that you cannot do anything with. The encoding is not designed for storing multi-gigabyte logs, but in general this can be fixed. The byte 0xBF must never appear as a first byte (though it may be a second or third one). So, when encoding, you can insert the sequence 0xBF 0xBF 0xBF every, say, 10 KB. Then, to find a boundary, it is enough to scan the chosen fragment until such a marker is found. Whatever follows the last 0xBF is guaranteed to be the start of a character. (When decoding, this three-byte sequence will of course need to be ignored.)
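The marker scan described above might look like this. It is only a sketch of the proposed modification, which is not part of the published library as far as this article says; `findResyncPoint` is a hypothetical helper. It treats any run of at least three 0xBF bytes as a marker, since 0xBF can otherwise appear only as a second or third byte, which gives at most two in a row.

```javascript
const MARKER = 0xBF;

// Return the index just past the first run of at least three 0xBF bytes
// found at or after `from`, or -1 if there is no such run.
function findResyncPoint(bytes, from = 0) {
  let run = 0;
  for (let i = from; i < bytes.length; i++) {
    if (bytes[i] === MARKER) {
      run++;
    } else {
      if (run >= 3) return i; // first byte after the last 0xBF
      run = 0;
    }
  }
  return run >= 3 ? bytes.length : -1;
}

// Made-up buffer: payload bytes, a marker, then more payload.
const buffer = [0x84, 0x1F, 0x40, 0xBF, 0xBF, 0xBF, 0x38, 0x32];
console.log(findResyncPoint(buffer, 1));
```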

Summing up

If you have read this far, congratulations! I hope that, like me, you have learned something new (or refreshed your memory) about how Unicode is structured.

The demo page. The Hebrew example shows the advantage over both UTF-8 and SCSU.

The research described above should not be seen as an encroachment on the standards. Still, I am generally satisfied with the results of my work, so I am happy to share them: for example, the minified JS library weighs only 1710 bytes (and has no dependencies, of course). As I mentioned above, you can see it in action on the demo page (which also has a set of texts on which it can be compared with UTF-8 and SCSU).

Finally, I will once again draw attention to the cases in which UTF-C should not be used:

  • If your strings are long enough (from 100-200 characters). In this case you should think about using compression algorithms like deflate.
  • If you need ASCII transparency, that is, if it matters to you that the encoded sequences contain no ASCII codes that were not in the original string. You can avoid the need for this if, when interacting with third-party APIs (for example, working with a database), you pass the encoding result as an abstract set of bytes rather than as strings. Otherwise you risk running into unexpected vulnerabilities.
  • If you want to be able to quickly find character boundaries at an arbitrary offset (for example, when part of a string is damaged). This can be done, but only by scanning the string from the beginning (or by applying the modification described in the previous section).
  • If you need to quickly perform operations on the contents of strings (sort them, search for substrings in them, concatenate them). This requires the strings to be decoded first, so UTF-C will be slower than UTF-8 in these cases (but faster than compression algorithms). Since the same string is always encoded the same way, an exact comparison does not require decoding and can be done byte by byte.

Update: user Tyomitch posted a graph in the comments below that highlights the limits of UTF-C's applicability. It shows that UTF-C is more efficient than a general-purpose compression algorithm (a variation of LZW) as long as the packed string is shorter than ~140 characters (I note, however, that the comparison was carried out on a single text; for other languages the result may differ).

Source: www.habr.com
