Another bicycle: we store Unicode strings 30-60% more compactly than UTF-8

If you are a developer faced with the task of choosing an encoding, Unicode will almost always be the right answer. The specific representation depends on the context, but most often there is a universal answer here too: UTF-8. Its strength is that it lets you use every Unicode character without spending too many bytes in most cases. True, for languages that use more than just the Latin alphabet, "not too many" means at least two bytes per character. Can we do better without going back to prehistoric encodings that limit us to just 256 available characters?

Below I invite you to look at my attempt to answer this question and to implement a relatively simple algorithm that stores strings in most of the world's languages without the redundancy found in UTF-8.

Disclaimer. A few important caveats up front: the solution described here is not offered as a universal replacement for UTF-8, it is suitable only for a narrow list of cases (more on them below), and under no circumstances should it be used to talk to third-party APIs (which have never heard of it). In most cases, general-purpose compression algorithms (for example, deflate) are the right tool for compact storage of large volumes of text. Moreover, while I was already building my solution, I found an existing standard within Unicode itself that solves the same problem. It is somewhat more complicated (and often performs worse), but it is still an accepted standard and not something thrown together on the spot. I will cover it as well.

About Unicode and UTF-8

To begin with, a few words about what Unicode and UTF-8 are.

As you know, 8-bit encodings used to be popular. With them everything was simple: 256 characters can be numbered from 0 to 255, and a number from 0 to 255 can obviously be stored in one byte. If we go back to the very beginning, ASCII is limited to 7 bits, so the most significant bit of each byte is zero, and most 8-bit encodings are compatible with it (they differ only in the "upper" half, where the most significant bit is one).

How does Unicode differ from these encodings, and why does it come with so many concrete representations: UTF-8, UTF-16 (BE and LE), UTF-32? Let's take it step by step.

The core Unicode standard describes only the correspondence between characters (and in some cases, individual components of characters) and their numbers. There are a lot of possible numbers in this standard: from 0x00 to 0x10FFFF (1,114,112 values). If we wanted to store a number from such a range in a variable, neither 1 nor 2 bytes would be enough. And since our processors are not really designed to work with three-byte numbers, we would be forced to use as many as 4 bytes per character! This is UTF-32, and it is precisely because of this wastefulness that the format is not popular.
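As a quick check of the arithmetic above, here is a minimal Python sketch (the variable names are mine, chosen only for illustration). It counts the code points in the range and the bytes a fixed-width representation would need.

```python
# Size of the Unicode code space 0x00..0x10FFFF.
code_points = 0x10FFFF + 1

# Bits required to store the largest code point, rounded up to whole bytes.
bits_needed = (0x10FFFF).bit_length()
bytes_needed = (bits_needed + 7) // 8

print(code_points)   # 1114112
print(bits_needed)   # 21
print(bytes_needed)  # 3, which is why a fixed-width format rounds up to 4

# UTF-32 spends 4 bytes even on a plain ASCII letter.
utf32_len = len("A".encode("utf-32-be"))
print(utf32_len)     # 4
```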

Fortunately, the order of characters within Unicode is not random. The whole set is divided into 17 "planes", each containing 65536 (0x10000) "code points". A "code point" here is simply the number that Unicode assigns to a character. But, as mentioned above, Unicode numbers not only whole characters but also their components and service marks (and sometimes nothing at all corresponds to a number, perhaps only for now, but that does not matter to us), so strictly speaking it is more correct to talk about the numbers themselves rather than about symbols. Still, for brevity I will often use the word "symbol" below to mean "code point".

Unicode planes. As you can see, most of them (planes 4 to 13) are not yet used.

The most remarkable thing is that all the "pulp" lies in plane zero, called the "Basic Multilingual Plane". If a string contains text in one of the modern languages (including Chinese), you will not leave this plane. But you cannot simply cut off the rest of Unicode either: emoji, for example, are mostly located at the end of the next plane, the "Supplementary Multilingual Plane" (it spans 0x10000 to 0x1FFFF). So UTF-16 does the following: every character that falls within the Basic Multilingual Plane is encoded "as is", with the corresponding two-byte number. However, some of the numbers in this range do not denote characters at all. Instead they signal that after this pair of bytes we must read one more pair, and by combining the values of these four bytes we get a number that covers the whole valid Unicode range. This idea is called "surrogate pairs"; you may have heard of them.
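To make the surrogate mechanism concrete, here is a small sketch of the standard UTF-16 split of a code point above 0xFFFF into two 16-bit values. The function name `to_surrogate_pair` is hypothetical and used only for illustration.

```python
def to_surrogate_pair(cp):
    """Split a code point in 0x10000..0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    v = cp - 0x10000               # 20 significant bits remain
    high = 0xD800 | (v >> 10)      # upper 10 bits go into the first unit
    low = 0xDC00 | (v & 0x3FF)     # lower 10 bits go into the second unit
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# The built-in codec produces the same four bytes.
print(chr(0x1F600).encode("utf-16-be").hex())  # d83dde00
```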

So UTF-16 needs two or (in rare cases) four bytes per code point. That is better than using four bytes all the time, but Latin text (and other ASCII characters) encoded this way wastes half of its space on zeros. UTF-8 was designed to fix this: ASCII still takes a single byte, as before; codes from 0x80 to 0x7FF take two bytes; from 0x800 to 0xFFFF, three; and from 0x10000 to 0x10FFFF, four. On the one hand, the Latin alphabet is in good shape: ASCII compatibility is back, and the cost is spread more evenly from 1 to 4 bytes. But alphabets other than Latin, alas, gain nothing compared with UTF-16, and many now need three bytes instead of two: the range covered by a two-byte record has shrunk 32 times, from 0xFFFF to 0x7FF, and neither Chinese nor, say, Georgian fits into it. Cyrillic and five other alphabets got lucky (hooray): 2 bytes per character.
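The per-script costs described above are easy to confirm with Python's built-in codecs. This is only an illustrative check; the sample characters and the helper name `encoded_sizes` are my own choices.

```python
def encoded_sizes(ch):
    """Return (UTF-8 bytes, UTF-16 bytes) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


samples = {
    "Latin a": "a",
    "Cyrillic я": "я",
    "Georgian ა": "ა",
    "Chinese 中": "中",
    "Emoji": "\U0001F600",
}
for name, ch in samples.items():
    print(name, encoded_sizes(ch))

# How much the two-byte range shrank: 0x0000..0xFFFF versus 0x0000..0x7FF.
shrink_factor = (0xFFFF + 1) // (0x7FF + 1)
print(shrink_factor)  # 32
```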

Why does this happen? Let's see how UTF-8 represents character codes:
In the UTF-8 byte layouts, the bits marked x are the ones that directly represent the number. In a two-byte record (110xxxxx 10xxxxxx) there are only 11 such bits (out of 16). The leading bits play a purely auxiliary role. In a four-byte record, 21 bits out of 32 are allotted to the code point number. It might seem that three bytes (24 bits in total) would be enough, but the service markers eat up too much.
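The bit budget described above can be reproduced with a short sketch. It relies only on the standard UTF-8 layouts; the helper names `payload_bits` and `utf8_bits` are hypothetical.

```python
def payload_bits(n_bytes):
    """Number of x bits (actual code point bits) in an n-byte UTF-8 record."""
    if n_bytes == 1:
        return 7                                 # 0xxxxxxx
    # The lead byte spends n+1 bits on its prefix, each later byte spends 2.
    return (7 - n_bytes) + 6 * (n_bytes - 1)


def utf8_bits(ch):
    """Show the UTF-8 bytes of one character as binary strings."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))


for n in range(1, 5):
    print(n, "bytes:", payload_bits(n), "of", 8 * n, "bits carry the number")

print(utf8_bits("я"))  # 11010001 10001111
```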

Is this bad? Not really. On the one hand, if we care a lot about space, we have compression algorithms that easily remove all the extra entropy and redundancy. On the other hand, the goal of Unicode was to offer the most universal encoding possible. For example, we can hand a UTF-8 string to code that previously worked only with ASCII and not worry that it will see an ASCII character that is not really there (in UTF-8, every byte that starts with a zero bit is exactly an ASCII character). And if we suddenly want to cut a small tail off a large string without decoding it from the very beginning (or to recover part of the data after a damaged section), it is easy to find the offset where a character starts: it is enough to skip the bytes that have the bit prefix 10.
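The resynchronisation property mentioned above can be sketched in a few lines. The function name `next_char_start` is an assumption of mine; the only fact it relies on is that UTF-8 continuation bytes have the form 10xxxxxx.

```python
def next_char_start(buf, pos):
    """Advance pos past UTF-8 continuation bytes (10xxxxxx) to a character start."""
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "привет".encode("utf-8")      # every letter takes two bytes
start = next_char_start(data, 3)     # offset 3 falls in the middle of a letter
print(start)                         # 4
print(data[start:].decode("utf-8"))  # ивет
```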

So why invent something new?

At the same time, there are occasional situations where compression algorithms such as deflate are a poor fit, but you still want compact string storage. I ran into this problem myself while thinking about building a compressed prefix tree for a large dictionary containing words in arbitrary languages. On the one hand, each word is very short, so compressing it is ineffective. On the other hand, the tree implementation I was considering was designed so that every byte of a stored string produced a separate tree vertex, so minimizing the number of bytes was very useful. In my library Az.js (as in pymorphy2, on which it is based) a similar problem is solved simply: the strings packed into the DAWG dictionary are stored in good old CP1251. But, as is easy to see, this works well only for a limited alphabet. A Chinese string cannot be added to such a dictionary.

Separately, I would like to point out one more unpleasant nuance that arises when UTF-8 is used in such a data structure. The UTF-8 layout shows that when a character is written as two bytes, the bits of its number do not come in a row but are separated by the pair of bits 10 in the middle: 110xxxxx 10xxxxxx. Because of this, when the lower 6 bits of the second byte overflow (that is, the transition 10111111 → 10000000 happens), the first byte changes too. As a result, the letter "п" is encoded as the bytes 0xD0 0xBF, while the next letter "р" is already 0xD1 0x80. In a prefix tree this splits the parent node in two, one for the prefix 0xD0 and one for 0xD1 (even though the whole Cyrillic alphabet could have been encoded by the second byte alone).
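The split described above can be observed directly. A minimal sketch, assuming we look only at the lowercase Russian letters а..я (U+0430..U+044F):

```python
# Lowercase Russian letters а..я occupy U+0430..U+044F.
letters = [chr(cp) for cp in range(0x0430, 0x0450)]

# Collect the distinct first bytes of their UTF-8 encodings.
lead_bytes = sorted({ch.encode("utf-8")[0] for ch in letters})
print([hex(b) for b in lead_bytes])  # ['0xd0', '0xd1']

# The boundary sits exactly between "п" and "р".
print("п".encode("utf-8").hex())     # d0bf
print("р".encode("utf-8").hex())     # d180
```

In a byte-per-vertex prefix tree, those two lead bytes become two separate parent nodes for what is logically one alphabet.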

What I came up with

Faced with this problem, I decided to practice playing games with bits and, at the same time, to get better acquainted with the structure of Unicode as a whole. The result was the UTF-C encoding format ("C" for compact), which spends no more than 3 bytes per code point and very often needs only one extra byte for the entire encoded string. As a result, for many non-ASCII alphabets this encoding turns out to be 30-60% more compact than UTF-8.

I have published sample implementations of the encoding and decoding algorithms as JavaScript and Go libraries, and you are free to use them in your code. But I will stress once more that in a sense this format remains a "bicycle", and I do not recommend using it without understanding why you need it. It is still more of an experiment than a serious "improvement on UTF-8". That said, the code is written neatly and concisely, with plenty of comments and test coverage.

Test results and comparison with UTF-8

I also made a demo page where you can evaluate how the algorithm performs. Next I will describe its principles and how it was developed in more detail.

Removing redundant bits

I took UTF-8 as the basis, of course. The first and most obvious thing to change is to reduce the number of service bits in each byte. For example, the first byte in UTF-8 always starts either with 0 or with 11, and only the following bytes have the prefix 10. Let's replace the prefix 11 with 1 and drop the prefixes of the following bytes entirely. What do we get?

0xxxxxxx - 1 byte
10xxxxxx xxxxxxxx - 2 bytes
110xxxxx xxxxxxxx xxxxxxxx - 3 bytes

Wait, where is the four-byte record? It is no longer needed: when writing three bytes we now have 21 bits available, and that is enough for every number up to 0x10FFFF.

What have we sacrificed here? The most important loss is the ability to detect character boundaries from an arbitrary position in the buffer. We can no longer point at an arbitrary byte and find the start of the next character from it. This is a limitation of our format, but in practice it is rarely needed. We can usually walk through the buffer from the very beginning (especially when the strings are short).

The situation with languages covered by 2 bytes has also improved: the two-byte format now offers a range of 14 bits, that is, codes up to 0x3FFF. The Chinese are out of luck (their characters mostly lie between 0x4E00 and 0x9FFF), but Georgians and many other peoples are better off, because their languages now also fit into 2 bytes per character.
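The three layouts of this first step can be sketched as follows. This is only an illustration of the intermediate scheme described in this section, not the final UTF-C format and not the author's library code; the names `encode_v1` and `decode_v1` are hypothetical.

```python
def encode_v1(cp):
    """Encode one code point with the reduced prefixes 0 / 10 / 110."""
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x4000:                    # 10xxxxxx xxxxxxxx (14 bits)
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    if cp <= 0x10FFFF:                 # 110xxxxx xxxxxxxx xxxxxxxx (21 bits)
        return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    raise ValueError("code point out of Unicode range")


def decode_v1(data):
    """Decode a byte string produced by encode_v1 into a list of code points."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(b)
            i += 1
        elif b < 0xC0:
            out.append(((b & 0x3F) << 8) | data[i + 1])
            i += 2
        else:
            out.append(((b & 0x1F) << 16) | (data[i + 1] << 8) | data[i + 2])
            i += 3
    return out


text = "aяა中\U0001F600"
packed = b"".join(encode_v1(ord(ch)) for ch in text)
print(len(packed), len(text.encode("utf-8")))  # 11 13
```

Note that continuation bytes carry no prefix here, so the decoder can only be run from the start of the buffer, which is exactly the trade-off discussed above.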

Introducing encoder state

Now let's think about the properties of the strings themselves. A dictionary usually contains words written in characters of a single alphabet, and the same holds for many other texts. It would be nice to specify this alphabet once and then give only the number of each letter within it. Let's see whether the layout of characters in the Unicode table helps us.

As mentioned above, Unicode is divided into planes of 65536 codes each. But this is not a very useful division (as already said, most of the time we stay in plane zero). The division into blocks is more interesting. These ranges no longer have a fixed length and are more meaningful: as a rule, each one groups the characters of a single alphabet.

The block containing the characters of the Bengali alphabet. Unfortunately, for historical reasons, this is an example of not very dense packing: 96 characters are scattered chaotically across the block's 128 code points.

The starts and sizes of blocks are always multiples of 16, simply for convenience. In addition, many blocks start and end on values that are multiples of 128 or even 256. For example, the basic Cyrillic block takes up 256 code points, from 0x0400 to 0x04FF. This is quite handy: if we store the prefix 0x04 once, any Cyrillic character can then be written in a single byte. True, this way we lose the ability to return to ASCII (and to any other characters at all). So we do the following:

  1. Two bytes 10yyyyyy yxxxxxxx not only denote the character with number yyyyyy yxxxxxxx, but also change the current alphabet to yyyyyy y0000000 (that is, we remember all bits except the lowest 7);
  2. One byte 0xxxxxxx is a character of the current alphabet. We only need to add it to the offset remembered in step 1. As long as the alphabet has not been changed, the offset is zero, so compatibility with ASCII is preserved.

Likewise for codes that require 3 bytes:

  1. Three bytes 110yyyyy yxxxxxxx xxxxxxxx denote the character with number yyyyy yxxxxxxx xxxxxxxx, change the current alphabet to yyyyy y0000000 00000000 (we remember everything except the lowest 15 bits), and set a flag saying that we are now in long mode (when the alphabet is switched back to a two-byte one, this flag is reset);
  2. Two bytes 0xxxxxxx xxxxxxxx in long mode are a character of the current alphabet. As before, we add it to the offset from step 1. The only difference is that we now read two bytes (because we have switched to this mode).

Sounds good: as long as we are encoding characters from the same 7-bit Unicode range, we spend 1 extra byte at the start and just one byte per character.
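The rules above can be sketched as a small stateful encoder and decoder. This is my own illustrative reading of the steps in this section (an early version of the scheme, not the final UTF-C), and the names `encode_stateful` and `decode_stateful` are hypothetical.

```python
def encode_stateful(text):
    """Early-version sketch: alphabet offset plus a long-mode flag."""
    offs, long_mode, out = 0, False, bytearray()
    for cp in map(ord, text):
        d = cp - offs
        if not long_mode and 0 <= d < 0x80:
            out.append(d)                                  # 0xxxxxxx
        elif long_mode and 0 <= d < 0x8000:
            out += bytes([d >> 8, d & 0xFF])               # 0xxxxxxx xxxxxxxx
        elif cp < 0x4000:
            out += bytes([0x80 | (cp >> 8), cp & 0xFF])    # 10yyyyyy yxxxxxxx
            offs, long_mode = cp & ~0x7F, False            # keep all but 7 low bits
        else:                                              # 110yyyyy yxxxxxxx xxxxxxxx
            out += bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
            offs, long_mode = cp & ~0x7FFF, True           # keep all but 15 low bits
    return bytes(out)


def decode_stateful(data):
    """Mirror of encode_stateful; must start from the beginning of the buffer."""
    offs, long_mode, out, i = 0, False, [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80 and not long_mode:
            cp, n = offs + b, 1
        elif b < 0x80:
            cp, n = offs + ((b << 8) | data[i + 1]), 2
        elif b < 0xC0:
            cp, n = ((b & 0x3F) << 8) | data[i + 1], 2
            offs, long_mode = cp & ~0x7F, False
        else:
            cp, n = ((b & 0x1F) << 16) | (data[i + 1] << 8) | data[i + 2], 3
            offs, long_mode = cp & ~0x7FFF, True
        out.append(chr(cp))
        i += n
    return "".join(out)


word = "привет"
print(len(encode_stateful(word)), len(word.encode("utf-8")))  # 7 12
```

The six-letter word costs one two-byte record that switches the alphabet plus five single bytes, matching the "1 extra byte at the start" estimate.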

Results from one of the early versions. It already often beats UTF-8, but there is still room for improvement.

What got worse? First, we now have state, namely the current alphabet offset and the long-mode flag. This limits us further: the same characters can now be encoded differently in different contexts. Substring search, for example, has to take this into account and cannot simply compare bytes. Second, as soon as we switch the alphabet, encoding ASCII characters becomes expensive (and that means not only the Latin alphabet but also basic punctuation, including spaces): they require switching the alphabet back to 0, that is, an extra byte again (and then one more to return to our main alphabet).

One alphabet is good, two is better

Let's try to change our bit prefixes a little, squeezing in one more in addition to the three described above:

0xxxxxxx - 1 byte in normal mode, 2 in long mode
11xxxxxx - 1 byte
100xxxxx xxxxxxxx - 2 bytes
101xxxxx xxxxxxxx xxxxxxxx - 3 bytes

Now the two-byte record has one bit fewer available: it fits code points up to 0x1FFF rather than 0x3FFF. Still, that is noticeably more than two-byte UTF-8 codes cover, and most common languages still fit. The most noticeable loss is that hiragana and katakana have dropped out, so the Japanese are sad.

What is this new code 11xxxxxx? It is a small "stash" of 64 characters that complements our main alphabet, so I called it the auxiliary alphabet. When we switch the current alphabet, a piece of the old alphabet becomes the auxiliary one. For example, when we switch from ASCII to Cyrillic, the stash holds 64 characters containing the Latin alphabet, digits, the space and the comma (the most frequent insertions in non-ASCII texts). Switch back to ASCII, and the main part of the Cyrillic alphabet becomes the auxiliary alphabet.

Thanks to having access to two alphabets, we can handle a large number of texts with minimal alphabet-switching costs (punctuation will most often trigger a return to ASCII, but after that we can take many non-ASCII characters from the auxiliary alphabet without switching again).

Bonus: by giving the auxiliary alphabet the prefix 11xxxxxx and choosing 0xC0 as its initial offset, we get partial compatibility with CP1252. In other words, many (but not all) Western European texts encoded in CP1252 look exactly the same in UTF-C.
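Here is a deliberately simplified sketch of the four-prefix scheme with an auxiliary alphabet. It assumes that on every switch the auxiliary alphabet is simply the first 64 code points of the previous main alphabet, and that the initial auxiliary offset is 0xC0 as stated above. The real library uses hand-tuned offsets (discussed next), so treat `encode_aux` and `decode_aux` as hypothetical names for an illustration only.

```python
def encode_aux(text):
    """Sketch: main alphabet (0...), auxiliary stash (11...), switches (100/101)."""
    offs, aux, long_mode, out = 0, 0xC0, False, bytearray()
    for cp in map(ord, text):
        d = cp - offs
        if not long_mode and 0 <= d < 0x80:
            out.append(d)                                  # 0xxxxxxx
        elif long_mode and 0 <= d < 0x8000:
            out += bytes([d >> 8, d & 0xFF])               # 0xxxxxxx xxxxxxxx
        elif 0 <= cp - aux < 0x40:
            out.append(0xC0 | (cp - aux))                  # 11xxxxxx
        elif cp < 0x2000:
            out += bytes([0x80 | (cp >> 8), cp & 0xFF])    # 100xxxxx xxxxxxxx
            aux, offs, long_mode = offs, cp & ~0x7F, False
        else:                                              # 101xxxxx xxxxxxxx xxxxxxxx
            out += bytes([0xA0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
            aux, offs, long_mode = offs, cp & ~0x7FFF, True
    return bytes(out)


def decode_aux(data):
    """Mirror of encode_aux under the same simplifying assumptions."""
    offs, aux, long_mode, out, i = 0, 0xC0, False, [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80 and not long_mode:
            cp, n = offs + b, 1
        elif b < 0x80:
            cp, n = offs + ((b << 8) | data[i + 1]), 2
        elif b >= 0xC0:
            cp, n = aux + (b & 0x3F), 1
        elif b < 0xA0:
            cp, n = ((b & 0x1F) << 8) | data[i + 1], 2
            aux, offs, long_mode = offs, cp & ~0x7F, False
        else:
            cp, n = ((b & 0x1F) << 16) | (data[i + 1] << 8) | data[i + 2], 3
            aux, offs, long_mode = offs, cp & ~0x7FFF, True
        out.append(chr(cp))
        i += n
    return "".join(out)


phrase = "привет, мир"
print(len(encode_aux(phrase)), len(phrase.encode("utf-8")))  # 12 20
print(encode_aux("café") == "café".encode("cp1252"))         # True
```

In this sketch the comma and the space come from the stash, so the text never leaves the Cyrillic alphabet after the first switch.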

Here, however, a difficulty arises: how do we derive the auxiliary alphabet from the main one? We could keep the same offset, but alas, here the structure of Unicode works against us. Very often the main part of an alphabet is not at the start of its block (for example, the Russian capital "А" has code 0x0410, although the Cyrillic block starts at 0x0400). So if we take the first 64 characters into the stash, we may lose the tail of the alphabet.

To fix this, I went through a number of blocks for different languages by hand and specified, for each of them, the offset of the auxiliary alphabet within the main one. The Latin alphabet, as an exception, was reordered entirely, along the lines of base64.

Final touches

Let's think about where else we can improve something.

Note that the format 101xxxxx xxxxxxxx xxxxxxxx can encode numbers up to 0x1FFFFF, while Unicode ends earlier, at 0x10FFFF. In other words, the last code point is represented as 10110000 11111111 11111111. Therefore we can say that if the first byte has the form 1011xxxx (where xxxx is greater than 0), it means something else. For example, we could add 15 more characters there that are always available in a single byte, but I decided to do it differently.
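The claim about the last code point is easy to verify with a few lines of Python (an illustrative check only; the variable names are mine):

```python
cp = 0x10FFFF  # the last Unicode code point

# Pack it into the 101xxxxx xxxxxxxx xxxxxxxx layout.
record = bytes([0xA0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
bits = " ".join(f"{b:08b}" for b in record)
print(bits)  # 10110000 11111111 11111111

# First bytes 0xB1..0xBF (1011xxxx with xxxx > 0) are therefore never produced.
unused_first_bytes = [b for b in range(0xA0, 0xC0) if b > record[0]]
print(len(unused_first_bytes))  # 15
```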

Let's look at the Unicode blocks that currently require three bytes. As already mentioned, these are mostly Chinese characters, but it is hard to do anything with them: there are 21 thousand of them. Hiragana and katakana have ended up there too, though, and there are not that many of them, fewer than two hundred. And since we have remembered the Japanese, there are also emoji (in fact they are scattered across many places in Unicode, but the main blocks are in the range 0x1F300 - 0x1FBFF). If you consider that there are now emoji assembled from several code points at once (a single emoji can consist of as many as 7 codes!), it becomes a real shame to spend three bytes on each of them (7 × 3 = 21 bytes for one icon, a nightmare).

Therefore we pick a few ranges corresponding to emoji, hiragana and katakana, renumber them into one continuous list, and encode them as two bytes instead of three:

1011xxxx xxxxxxxx

Great: the emoji mentioned above, consisting of 7 code points, takes 25 bytes in UTF-8, and we have fitted it into 14 (exactly two bytes per code point). By the way, Habr refused to digest it (in both the old and the new editor), so I had to insert it as a picture.
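The byte counts above can be reproduced with any emoji built from four pictographs joined by three zero-width joiners. The original picture is not available here, so the family emoji below is an assumption chosen because it has exactly 7 code points.

```python
# Assumed example: man + ZWJ + woman + ZWJ + girl + ZWJ + boy.
emoji = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

code_point_count = len(emoji)             # Python strings count code points
utf8_size = len(emoji.encode("utf-8"))    # 4 bytes per pictograph, 3 per joiner
two_byte_size = 2 * code_point_count      # size if every code point took 2 bytes

print(code_point_count, utf8_size, two_byte_size)  # 7 25 14
```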

Let's try to fix one more problem. As we remember, the main alphabet is essentially the high 6 bits, which we keep in mind and attach to the code of every subsequently decoded character. For the Chinese characters in the block 0x4E00 - 0x9FFF, this is either bit 0 or bit 1. That is not very convenient: we would constantly have to switch the alphabet between these two values (that is, spend three bytes). But note that in long mode we can subtract from the code itself the number of characters that we encode in short mode (after all the tricks described above, this is 10240). Then the range of the hieroglyphs shifts to 0x2600 - 0x77FF, and across this whole range the most significant 6 bits (out of 21) are equal to 0. As a result, sequences of hieroglyphs use two bytes per hieroglyph (which is optimal for such a large range) without causing alphabet switches.
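The effect of the subtraction can be checked numerically. In this sketch the constant 10240 is taken from the text, while the variable names are mine; the "high bits" are whatever remains above the 15 low bits stored in a long-mode record.

```python
SHORT_MODE_CODES = 10240             # value quoted in the text (0x2800)

first, last = 0x4E00, 0x9FFF         # main block of Chinese characters

# Without the shift the block straddles two long-mode alphabets.
print(first >> 15, last >> 15)       # 0 1

shifted_first = first - SHORT_MODE_CODES
shifted_last = last - SHORT_MODE_CODES
print(hex(shifted_first), hex(shifted_last))   # 0x2600 0x77ff

# After the shift the whole block shares the same (zero) high bits.
print(shifted_first >> 15, shifted_last >> 15) # 0 0
```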

Other solutions: SCSU, BOCU-1

Unicode experts, having read no further than the title of this article, will most likely hurry to remind you that right among the Unicode standards there is the Standard Compression Scheme for Unicode (SCSU), which describes an encoding method very similar to the one in this article.

I admit it honestly: I learned of its existence only after I was already deep into writing my own solution. Had I known about it from the start, I would probably have tried to write an implementation of it instead of inventing my own approach.

Interestingly, SCSU uses ideas very similar to the ones I arrived at on my own (instead of "alphabets" it uses "windows", and it has more of them than I do). At the same time, the format has its drawbacks: it is somewhat closer to compression algorithms than to encodings. In particular, the standard offers many ways to represent the same text but does not say how to choose the optimal one, so the encoder has to apply some kind of heuristics. An SCSU encoder that produces good packing will therefore be more complex and more cumbersome than my algorithm.

For comparison, I ported a fairly simple SCSU implementation to JavaScript. In terms of code size it turned out comparable to my UTF-C, but in some cases its results were tens of percent worse (sometimes it can come out ahead, but not by much). For example, Hebrew and Greek texts were encoded by UTF-C 60% better than by SCSU (probably thanks to their compact alphabets).

Separately, I will add that besides SCSU there is another way to represent Unicode compactly: BOCU-1. However, it aims at MIME compatibility (which I did not need) and takes a somewhat different approach to encoding. I have not evaluated its efficiency, but it seems to me unlikely to be higher than that of SCSU.

Possible improvements

The algorithm I have presented is not universal by design (this is probably where my goals diverge most from those of the Unicode Consortium). I have already mentioned that it was developed primarily for one task (storing a multilingual dictionary in a prefix tree), and some of its features may not suit other tasks well. But the fact that it is not a standard can be an advantage: you can easily modify it to fit your needs.

For example, you can get rid of the state in an obvious way and make the encoding stateless: simply do not update the variables offs, auxOffs and is21Bit in the encoder and the decoder. In that case it will not be possible to pack runs of characters from the same alphabet efficiently, but you get a guarantee that the same character is always encoded with the same bytes, regardless of context.

In addition, you can tailor the encoder to a specific language by changing the default state. For example, to target Russian texts, set offs = 0x0400 and auxOffs = 0 at the start of both the encoder and the decoder. This makes particular sense in stateless mode. Overall this will resemble using an old eight-bit encoding, but without losing the ability to insert characters from the whole of Unicode as needed.

Another drawback mentioned earlier is that in a large text encoded in UTF-C there is no quick way to find the character boundary nearest to an arbitrary byte. If you cut off the last, say, 100 bytes of an encoded buffer, you risk getting garbage that you cannot do anything with. The encoding is not designed for storing multi-gigabyte logs, but in general this can be fixed. The byte 0xBF must never appear as a first byte (though it may be a second or third one). Therefore, when encoding, you can insert the sequence 0xBF 0xBF 0xBF every, say, 10 KB. Then, if you need to find a boundary, it is enough to scan the chosen fragment until such a marker is found. Whatever follows the last 0xBF is guaranteed to be the start of a character. (When decoding, this three-byte sequence must of course be ignored.)
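A scan for such a marker might look like the sketch below. It assumes, as argued above, that a run of three or more 0xBF bytes can only come from an inserted marker, since ordinary data can contain at most two of them in a row (as the second and third bytes of a record). The function name `find_boundary` is hypothetical and not part of the author's libraries.

```python
def find_boundary(buf, start):
    """Return the offset just past the first run of >= 3 bytes 0xBF at or
    after start, or -1 if no such run is found."""
    run = 0
    for i in range(start, len(buf)):
        if buf[i] == 0xBF:
            run += 1
        else:
            if run >= 3:
                return i          # first byte after the last 0xBF of the run
            run = 0
    # A marker sitting at the very end of the buffer is still a boundary.
    return len(buf) if run >= 3 else -1


MARKER = b"\xBF\xBF\xBF"
# Hypothetical buffer: some record bytes, a marker, then more record bytes.
buf = b"\x84\x3F\x40" + MARKER + b"\x38\x32"
print(find_boundary(buf, 1))  # 6
```

After resynchronising this way, a decoder would presumably also need a known state at the marker (for example the default one); the text does not specify this, so it is left out of the sketch.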

Summing up

If you have read this far, congratulations! I hope that you, like me, have learned something new (or refreshed your memory) about how Unicode is structured.

Demo page. The Hebrew example shows the advantage over both UTF-8 and SCSU.

The research described above should not be taken as an encroachment on standards. Still, I am generally satisfied with the results of my work, so I am happy to share them: for example, the minified JS library weighs only 1710 bytes (and has no dependencies, of course). As I mentioned above, you can see it at work on the demo page (there is also a set of texts on which it can be compared with UTF-8 and SCSU).

Finally, let me once again draw attention to the cases in which UTF-C is not worth using:

  • If your strings are long enough (from 100-200 characters). In this case you should consider compression algorithms such as deflate.
  • If you need ASCII transparency, that is, if it matters to you that the encoded sequences contain no ASCII codes that were not in the original string. You can avoid this requirement if, when interacting with third-party APIs (for example, when working with a database), you pass the encoding result as an abstract set of bytes rather than as a string. Otherwise you risk unexpected vulnerabilities.
  • If you need to quickly find character boundaries at an arbitrary offset (for example, when part of a string is damaged). This can be done, but only by scanning the string from the beginning (or by applying the modification described in the previous section).
  • If you need to quickly perform operations on the contents of strings (sort them, search for substrings, concatenate). These require the strings to be decoded first, so UTF-C will be slower than UTF-8 in these cases (but faster than compression algorithms). Since the same string is always encoded the same way, an exact comparison does not require decoding and can be done byte by byte.

Update: user Tyomitch posted a graph in the comments below that highlights the limits of UTF-C's applicability. It shows that UTF-C is more efficient than a general-purpose compression algorithm (a variant of LZW) as long as the packed string is shorter than ~140 characters (I note, however, that the comparison was made on a single text; for other languages the result may differ).

Source: www.habr.com