Another bicycle: storing Unicode strings 30-60% more compactly than UTF-8

If you are a developer faced with choosing an encoding, Unicode will almost always be the right answer. The specific representation depends on context, but most often there is a universal answer here too: UTF-8. Its appeal is that it lets you use all Unicode characters without spending too many bytes in most cases. True, for languages that use more than just the Latin alphabet, "not too many" means at least two bytes per character. Can we do better without going back to prehistoric encodings that limit us to just 256 available characters?

Below I invite you to look at my attempt to answer this question and at an implementation of a relatively simple algorithm that stores strings in most of the world's languages without the redundancy found in UTF-8.

Disclaimer. A few important caveats up front: the solution described is not offered as a universal replacement for UTF-8. It is suitable only for a narrow range of cases (more on them below), and in no case should it be used to interact with third-party APIs (which do not even know about it). Most often, general-purpose compression algorithms (for example, deflate) are a good fit for compact storage of large volumes of text. In addition, already while working on my solution, I found an existing standard within Unicode itself that solves the same problem. It is somewhat more complicated (and often worse), but it is still an accepted standard rather than something knocked together on one's knee. I will tell you about it as well.

About Unicode and UTF-8

To begin with, a few words about what Unicode and UTF-8 are.

As you know, 8-bit encodings used to be popular. With them everything was simple: 256 characters can be numbered from 0 to 255, and numbers from 0 to 255 can obviously be represented as one byte. Going back to the very beginning, the ASCII encoding is limited to 7 bits, so the most significant bit of its byte representation is zero, and most 8-bit encodings are compatible with it (they differ only in the "upper" half, where the most significant bit is one).

How does Unicode differ from those encodings, and why are so many specific representations associated with it: UTF-8, UTF-16 (BE and LE), UTF-32? Let's sort it out in order.

The Unicode standard itself describes only the correspondence between characters (and in some cases, individual components of characters) and their numbers. And there are a great many possible numbers in this standard: from 0x00 to 0x10FFFF (1,114,112 of them). If we wanted to store a number from such a range in a variable, neither 1 nor 2 bytes would be enough. And since our processors are not really designed to work with three-byte numbers, we would be forced to use as many as 4 bytes per character! This is UTF-32, and it is precisely because of this "wastefulness" that the format is not popular.
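The arithmetic behind this paragraph can be checked with a few lines of Python. This is only an illustrative sketch; the names `UNICODE_MAX`, `TOTAL_CODE_POINTS` and `bytes_needed` are made up for the example and do not come from any library.

```python
# Illustrative arithmetic only; these names are hypothetical.
UNICODE_MAX = 0x10FFFF
TOTAL_CODE_POINTS = UNICODE_MAX + 1


def bytes_needed(value: int) -> int:
    """Smallest whole number of bytes that can hold `value`."""
    return max(1, (value.bit_length() + 7) // 8)


print(TOTAL_CODE_POINTS)             # how many code points the range holds
print(UNICODE_MAX.bit_length())      # bits needed for the largest one
print(bytes_needed(UNICODE_MAX))     # whole bytes needed for the largest one
print(len("A".encode("utf-32-be")))  # bytes UTF-32 spends even on "A"
```

The largest code point needs 21 bits, so three bytes would be enough in principle; UTF-32 rounds this up to a machine-friendly four.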

Fortunately, the arrangement of characters in Unicode is not random. The whole set is divided into 17 "planes", each containing 65536 (0x10000) "code points". A "code point" here is simply a character number assigned by Unicode. But, as mentioned above, Unicode numbers not only individual characters but also their components and service marks (and sometimes nothing corresponds to a number at all, perhaps only for now, but for us this is not important), so it is more correct to always speak about the numbers themselves rather than about characters. Still, for brevity, I will often use the word "character" below, meaning "code point".

Unicode planes. As you can see, most of it (planes 4 to 13) is still unused.

What is most remarkable is that all the main "pulp" lies in the zero plane, called the "Basic Multilingual Plane". If a string contains text in one of the modern languages (including Chinese), you will not go beyond this plane. Beyond it lies the next plane, the "Supplementary Multilingual Plane" (it extends from 0x10000 to 0x1FFFF). So UTF-16 does this: all characters falling within the Basic Multilingual Plane are encoded "as is" with the corresponding two-byte number. However, some of the numbers in this range do not denote specific characters at all, but indicate that after this pair of bytes another pair must be considered; by combining the values of these four bytes we get a number that covers the entire valid Unicode range. This representation is called "surrogate pairs"; you may have heard of them.
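As a small illustration of how surrogate pairs combine two 16-bit units into one code point, here is a Python sketch of the standard UTF-16 arithmetic. The function names are hypothetical; real code would simply use the built-in `utf-16` codec.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    v = cp - 0x10000                      # 20 significant bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Combine a surrogate pair back into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))
print(hex(from_surrogates(high, low)))
```

Each surrogate carries 10 payload bits, which is why the two reserved ranges inside the Basic Multilingual Plane are enough to address the remaining 16 planes.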

So UTF-16 requires two or (in very rare cases) four bytes per "code point". This is better than using four bytes all the time, but Latin text (and other ASCII characters) encoded this way wastes half the space on zeros. UTF-8 is designed to fix this: ASCII in it occupies, as before, only one byte; codes from 0x80 to 0x7FF take two bytes; from 0x800 to 0xFFFF, three; and from 0x10000 to 0x10FFFF, four. On the one hand, Latin text has become good: compatibility with ASCII has returned, and the distribution is more evenly "spread" from 1 to 4 bytes. But alphabets other than Latin, alas, gain nothing compared to UTF-16, and many now require three bytes instead of two: the range covered by a two-byte record has shrunk 32 times, from 0xFFFF to 0x7FF, and neither Chinese nor, for example, Georgian fits into it. Cyrillic and five other alphabets (hurray) got lucky: 2 bytes per character.
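The byte counts above can be summarized in a short Python sketch. The helper names and the sample characters are chosen only for illustration.

```python
def utf8_len(cp: int) -> int:
    """Bytes UTF-8 spends on a code point, per the ranges described above."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def utf16_len(cp: int) -> int:
    """Bytes UTF-16 spends: one 16-bit unit in the BMP, two outside it."""
    return 2 if cp < 0x10000 else 4


samples = {
    "Latin A": 0x0041,
    "Cyrillic zhe": 0x0436,
    "Georgian an": 0x10D0,
    "CJK zhong": 0x4E2D,
    "emoji": 0x1F600,
}
for name, cp in samples.items():
    print(f"{name:14} UTF-8: {utf8_len(cp)}  UTF-16: {utf16_len(cp)}")

# Two-byte coverage: 0x10000 values in UTF-16 versus 0x800 in UTF-8.
print(0x10000 // 0x800)
```

Georgian and Chinese both land in the three-byte UTF-8 range, while UTF-16 stores them in two.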

Why does this happen? Let's see how UTF-8 represents character codes:

0xxxxxxx - 1 byte
110xxxxx 10xxxxxx - 2 bytes
1110xxxx 10xxxxxx 10xxxxxx - 3 bytes
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - 4 bytes

The bits marked x here are used directly to represent the number. You can see that a two-byte record has only 11 such bits (out of 16). The leading bits serve only an auxiliary function. In the four-byte case, 21 of 32 bits are allotted to the code point number; it would seem that three bytes (24 bits in total) would suffice, but the service markers eat up too much.

Is this bad? Not really. On the one hand, if we care a lot about space, we have compression algorithms that can easily remove the redundancy. On the other hand, the goal of Unicode was to provide the most universal encoding possible. For example, we can hand a UTF-8 encoded string to code that previously worked only with ASCII and not be afraid that it will see a character from the ASCII range that is not actually there (after all, in UTF-8 the bytes that start with a zero bit are exactly ASCII). And if we suddenly want to cut off a small tail from a large string without decoding it from the very beginning (or to recover part of the information after a damaged section), it is easy to find the offset where a character starts (it is enough to skip bytes that have the bit prefix 10).
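The boundary-finding trick mentioned above fits in a few lines. The following Python sketch (the function name is hypothetical) skips continuation bytes, that is, bytes whose top two bits are 10.

```python
def next_char_start(buf: bytes, pos: int = 0) -> int:
    """Return the first index >= pos that is not a UTF-8 continuation byte."""
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "привет, мир".encode("utf-8")
tail = data[-5:]                 # this cut lands in the middle of a character
start = next_char_start(tail)
print(start, tail[start:].decode("utf-8"))
```

Here the cut splits the letter "м" in half, so one stray continuation byte is skipped and the rest decodes cleanly.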

Why invent something new then?

At the same time, there are occasionally situations where compression algorithms like deflate are poorly applicable, but you still want compact string storage. Personally, I ran into this problem when thinking about building a compressed prefix tree for a large dictionary containing words in arbitrary languages. On the one hand, each word is very short, so compressing it would be ineffective. On the other hand, the tree implementation I was considering was designed so that each byte of a stored string produced a separate tree node, so minimizing their number was very useful. In my library Az.js (as in pymorphy2, on which it is based) a similar problem is solved simply: strings are packed into a DAWG dictionary and stored there in good old CP1251. But, as is easy to understand, this only works well for a limited alphabet; a string in Chinese cannot be added to such a dictionary.

Separately, I would like to note one more unpleasant nuance that arises when using UTF-8 in such a data structure. As the UTF-8 bit layout shows, when a character is written as two bytes, the bits belonging to its number do not come in a row but are separated by the pair of bits 10 in the middle: 110xxxxx 10xxxxxx. Because of this, when the lower 6 bits of the second byte overflow (i.e. the transition 10111111 → 10000000 occurs), the first byte changes too. It turns out that the letter "п" is encoded by the bytes 0xD0 0xBF, and the next one, "р", is already 0xD1 0x80. In a prefix tree this splits the parent node in two: one for the prefix 0xD0 and another for 0xD1 (although the whole Cyrillic alphabet could be encoded by the second byte alone).
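To see the effect on a byte-per-node prefix tree, here is a rough Python sketch that counts the nodes created for a few Russian words in UTF-8 and in CP1251. The word list and the `trie_nodes` helper are invented for the illustration; this is not the Az.js or DAWG data structure.

```python
def trie_nodes(keys: list[bytes]) -> int:
    """Count nodes created when each byte of each key adds one trie level."""
    root: dict = {}
    count = 0
    for key in keys:
        node = root
        for b in key:
            if b not in node:
                node[b] = {}
                count += 1
            node = node[b]
    return count


words = ["пар", "пир", "рис"]
utf8_keys = [w.encode("utf-8") for w in words]
cp1251_keys = [w.encode("cp1251") for w in words]

# Lead bytes of the lowercase Russian letters а..я in UTF-8.
lead_bytes = sorted({chr(c).encode("utf-8")[0] for c in range(0x0430, 0x0450)})

print([hex(b) for b in lead_bytes])
print("п".encode("utf-8").hex(), "р".encode("utf-8").hex())
print(trie_nodes(utf8_keys), trie_nodes(cp1251_keys))
```

The lowercase alphabet alone already splits across two lead bytes, so words starting with "п" and "р" share no prefix node, and the UTF-8 tree needs roughly twice as many nodes as the single-byte one.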

What I ended up with

Faced with this problem, I decided to play with bits, and at the same time get a somewhat better understanding of the overall structure of Unicode. The result was the UTF-C encoding format ("C" for compact), which spends no more than 3 bytes per code point, and very often lets you spend only one extra byte for the whole encoded string. As a result, on many non-ASCII alphabets this encoding turns out to be 30-60% more compact than UTF-8.

I have published example implementations of the encoding and decoding algorithms as JavaScript and Go libraries; you can use them freely in your code. But I will still stress that in a sense this format remains a "bicycle" (a reinvented wheel), and I do not recommend using it without understanding why you need it. It is still more of an experiment than a serious "improvement of UTF-8". Nevertheless, the code there is written neatly and concisely, with plenty of comments and test coverage.

Test results and comparison with UTF-8

I also made a demo page where you can evaluate how the algorithm performs; below I will tell you more about its principles and development process.

Eliminating redundant bits

I took UTF-8 as the basis, of course. The first and most obvious thing that can be changed in it is to reduce the number of service bits in each byte. For example, the first byte in UTF-8 always starts with either 0 or 11; the prefix 10 is found only on subsequent bytes. Let's replace the prefix 11 with 1, and remove the prefixes from subsequent bytes entirely. What happens?

0xxxxxxx - 1 byte
10xxxxxx xxxxxxxx - 2 bytes
110xxxxx xxxxxxxx xxxxxxxx - 3 bytes

Wait, where is the four-byte record? It is no longer needed: when writing in three bytes, we now have 21 bits available, and this is enough for all numbers up to 0x10FFFF.

What did we sacrifice here? Most importantly, detecting character boundaries from an arbitrary place in the buffer. We cannot point at an arbitrary byte and find the start of the next character from it. This is a limitation of our format, but in practice this is rarely needed. We can usually run through the buffer from the very beginning (especially when it comes to short strings).

The situation with covering languages with 2 bytes has also improved: the two-byte format now gives a range of 14 bits, which means codes up to 0x3FFF. The Chinese are out of luck (their characters mostly range from 0x4E00 to 0x9FFF), but Georgians and many other peoples are better off: their languages also fit into 2 bytes per character.
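The intermediate scheme from this section (prefixes 0, 10 and 110, with no prefixes on trailing bytes) can be sketched in Python as follows. This is only an illustration of the bit layout shown above, not the code of the published libraries, and the function names are made up.

```python
def encode_cp(cp: int) -> bytes:
    """Encode one code point with the 0 / 10 / 110 prefix scheme."""
    if cp < 0x80:
        return bytes([cp])                                # 0xxxxxxx
    if cp < 0x4000:
        return bytes([0x80 | (cp >> 8), cp & 0xFF])       # 10xxxxxx xxxxxxxx
    if cp <= 0x10FFFF:
        return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    raise ValueError("not a Unicode code point")


def decode_all(buf: bytes) -> list[int]:
    """Decode a buffer produced by encode_cp, reading from the start."""
    out, i = [], 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80:
            cp, i = b, i + 1
        elif b < 0xC0:
            cp, i = ((b & 0x3F) << 8) | buf[i + 1], i + 2
        else:
            cp, i = ((b & 0x1F) << 16) | (buf[i + 1] << 8) | buf[i + 2], i + 3
        out.append(cp)
    return out


points = [0x41, 0x0436, 0x10D0, 0x4E2D, 0x1F600, 0x10FFFF]
encoded = [encode_cp(cp) for cp in points]
print([len(e) for e in encoded])
print(decode_all(b"".join(encoded)) == points)
```

Note that the decoder can only work from the start of the buffer: a trailing byte may hold any value, so nothing distinguishes it from a leading one.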

Introducing encoder state

Let's now think about the properties of the strings themselves. A dictionary most often contains words written in characters of the same alphabet, and this is also true for many other texts. It would be nice to indicate this alphabet once, and then indicate only the number of the letter within it. Let's see whether the arrangement of characters in the Unicode table helps us.

As mentioned above, Unicode is divided into planes of 65536 codes each. But this is not a very useful division (as already said, most often we are in the zero plane). More interesting is the division into blocks. These ranges no longer have a fixed length and are more meaningful: as a rule, each combines characters from the same alphabet.

A block containing characters of the Bengali alphabet. Unfortunately, for historical reasons, this is an example of not very dense packing: 96 characters are scattered chaotically across the 128 code points of the block.

Block starts and their sizes are always multiples of 16; this is done simply for convenience. In addition, many blocks begin and end at values that are multiples of 128 or even 256. For example, the basic Cyrillic block occupies 256 code points from 0x0400 to 0x04FF. This is quite convenient: if we store the prefix 0x04 once, then any Cyrillic character can be written in one byte. True, this way we lose the ability to return to ASCII (and to any other characters in general). So we do this:

  1. Two bytes 10yyyyyy yxxxxxxx not only denote a character with the number yyyyyy yxxxxxxx, but also change the current alphabet to yyyyyy y0000000 (i.e. we remember all the bits except the lowest 7);
  2. One byte 0xxxxxxx is a character of the current alphabet. It only needs to be added to the offset that we remembered in step 1. As long as we have not changed the alphabet, the offset is zero, so we have kept compatibility with ASCII.

Likewise for codes that require 3 bytes:

  1. Three bytes 110yyyyy yxxxxxxx xxxxxxxx denote a character with the number yyyyy yxxxxxxx xxxxxxxx, change the current alphabet to yyyyy y0000000 00000000 (we remember everything except the lowest 15 bits), and set a flag saying that we are now in long mode (when the alphabet is changed back with a two-byte code, we reset this flag);
  2. Two bytes 0xxxxxxx xxxxxxxx in long mode are a character of the current alphabet. In the same way, we add it to the offset from step 1. The only difference is that we now read two bytes (because we have switched to this mode).

Sounds good: now, as long as we need to encode characters from the same 7-bit Unicode range, we spend 1 extra byte at the start and just one byte per character in total.
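The rules above can be turned into a compact round-trip sketch. The Python below is an illustrative reading of this early, stateful version only (it has no auxiliary alphabet and none of the later tweaks), and the class and function names are assumptions rather than the API of the published libraries.

```python
class State:
    """Encoder/decoder state: current alphabet offset and long-mode flag."""

    def __init__(self) -> None:
        self.offs = 0
        self.long = False


def encode(text: str) -> bytes:
    st, out = State(), bytearray()
    for cp in map(ord, text):
        if not st.long and (cp >> 7) == (st.offs >> 7):
            out.append(cp & 0x7F)                        # 0xxxxxxx
        elif st.long and (cp >> 15) == (st.offs >> 15):
            out += bytes([(cp >> 8) & 0x7F, cp & 0xFF])  # 0xxxxxxx xxxxxxxx
        elif cp < 0x4000:
            out += bytes([0x80 | (cp >> 8), cp & 0xFF])  # 10yyyyyy yxxxxxxx
            st.offs, st.long = cp & ~0x7F, False
        else:                                            # 110yyyyy y... (3 bytes)
            out += bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
            st.offs, st.long = cp & ~0x7FFF, True
    return bytes(out)


def decode(buf: bytes) -> str:
    st, out, i = State(), [], 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80 and not st.long:
            cp, i = st.offs | b, i + 1
        elif b < 0x80:
            cp, i = st.offs | (b << 8) | buf[i + 1], i + 2
        elif b < 0xC0:
            cp, i = ((b & 0x3F) << 8) | buf[i + 1], i + 2
            st.offs, st.long = cp & ~0x7F, False
        else:
            cp, i = ((b & 0x1F) << 16) | (buf[i + 1] << 8) | buf[i + 2], i + 3
            st.offs, st.long = cp & ~0x7FFF, True
        out.append(chr(cp))
    return "".join(out)


word = "Привет"
print(len(encode(word)), len(word.encode("utf-8")))
print(decode(encode("Привет, мир")))
```

A single Cyrillic word costs one extra byte for the switch and then one byte per letter. The mixed string shows the weakness: the comma and the space force a switch back to ASCII and then another switch to Cyrillic.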

Performance of one of the early versions. It already often beats UTF-8, but there is still room for improvement.

What got worse? First, we now have state, namely the current alphabet offset and the long mode flag. This limits us further: the same characters can now be encoded differently in different contexts. Substring search, for example, must take this into account rather than just comparing bytes. Second, as soon as we change the alphabet, encoding ASCII characters becomes expensive (and this is not only the Latin alphabet but also basic punctuation, including spaces): they require switching the alphabet back to 0, that is, an extra byte again (and then another one to return to our main alphabet).

One alphabet is good, two is better

Let's try to change our bit prefixes a little, squeezing in one more in addition to the three described above:

0xxxxxxx - 1 byte in normal mode, 2 in long mode
11xxxxxx - 1 byte
100xxxxx xxxxxxxx - 2 bytes
101xxxxx xxxxxxxx xxxxxxxx - 3 bytes

Now the two-byte record has one bit less available: it covers code points up to 0x1FFF rather than 0x3FFF. However, this is still noticeably more than in two-byte UTF-8 codes, and most common languages still fit. The most noticeable loss is hiragana and katakana, which have dropped out; the Japanese are sad.
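As a quick illustration of the four prefixes at this stage of the design, the following Python sketch classifies a leading byte and checks the two-byte capacity. The labels returned by `classify` are invented for the example.

```python
def classify(first_byte: int) -> str:
    """Name the record type that a leading byte announces."""
    if first_byte < 0x80:
        return "current"      # 0xxxxxxx: character of the current alphabet
    if first_byte < 0xA0:
        return "switch2"      # 100xxxxx xxxxxxxx
    if first_byte < 0xC0:
        return "switch3"      # 101xxxxx xxxxxxxx xxxxxxxx
    return "auxiliary"        # 11xxxxxx: character of the auxiliary alphabet


TWO_BYTE_MAX = (1 << 13) - 1        # 5 + 8 payload bits
UTF8_TWO_BYTE_MAX = (1 << 11) - 1   # 5 + 6 payload bits
HIRAGANA_START = 0x3040             # first code point of the Hiragana block

print(classify(0x41), classify(0x84), classify(0xA1), classify(0xC5))
print(hex(TWO_BYTE_MAX), hex(UTF8_TWO_BYTE_MAX))
print(HIRAGANA_START > TWO_BYTE_MAX)
```

The Hiragana block starts above 0x1FFF, which is why it no longer fits into the two-byte form.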

What is this new code, 11xxxxxx? It is a small "stash" of 64 characters that complements our main alphabet, so I called it the auxiliary alphabet. When we switch the current alphabet, a piece of the old alphabet becomes auxiliary. For example, when we switch from ASCII to Cyrillic, the stash holds 64 characters containing the Latin alphabet, digits, the space and the comma (the most frequent insertions in non-ASCII texts). Switch back to ASCII, and the main part of the Cyrillic alphabet becomes the auxiliary alphabet.

Thanks to access to two alphabets, we can handle a large number of texts with minimal alphabet-switching costs (punctuation will most often lead to a return to ASCII, but after that we will get many non-ASCII characters from the auxiliary alphabet, without switching again).

Bonus: by giving the auxiliary alphabet the prefix 11xxxxxx and choosing its initial offset to be 0xC0, we get partial compatibility with CP1252. In other words, many (but not all) Western European texts encoded in CP1252 will look the same in UTF-C.

Here, however, a difficulty arises: how do we obtain the auxiliary alphabet from the main one? We could keep the same offset, but, alas, here the structure of Unicode already plays against us. Very often the main part of the alphabet is not at the beginning of the block (for example, the Russian capital "А" has the code 0x0410, although the Cyrillic block starts at 0x0400). So, by taking the first 64 characters of the block into the stash, we may lose access to the tail part of the alphabet.

To fix this problem, I manually went through some blocks corresponding to different languages and specified for them the offset of the auxiliary alphabet within the main one. The Latin alphabet, as an exception, was generally reordered like base64.

Final touches

Finally, let's think about where else we can improve something.

Note that the format 101xxxxx xxxxxxxx xxxxxxxx allows encoding numbers up to 0x1FFFFF, while Unicode ends earlier, at 0x10FFFF. In other words, the last code point will be represented as 10110000 11111111 11111111. Therefore, we can say that if the first byte has the form 1011xxxx (where xxxx is greater than 0), it means something else. For example, you could add another 15 characters there that are constantly available for one-byte encoding, but I decided to do it differently.
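The claim about the last code point is easy to verify. This short Python sketch (with a made-up helper name) builds the three-byte form and lists the leading bytes that are left unused.

```python
def three_byte_form(cp: int) -> bytes:
    """Pack a code point as 101xxxxx xxxxxxxx xxxxxxxx."""
    return bytes([0xA0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])


last = three_byte_form(0x10FFFF)
print(" ".join(f"{b:08b}" for b in last))

# Leading bytes of the 101xxxxx family that no valid code point can produce.
free = [b for b in range(0xA0, 0xC0) if b > last[0]]
print(len(free), hex(free[0]), hex(free[-1]))
```

The free values are exactly the bytes of the form 1011xxxx with a nonzero xxxx.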

Let's look at those Unicode blocks that currently require three bytes. Basically, as already mentioned, these are Chinese characters, but it is hard to do anything with them, since there are 21 thousand of them. But hiragana and katakana also ended up there, and there are not that many of them, fewer than two hundred. And, since we have remembered the Japanese, there are also emoji (in fact, they are scattered over many places in Unicode, but the main blocks are in the range 0x1F300 - 0x1FBFF). If you consider that there are now emoji assembled from several code points at once (for example, there is an emoji consisting of as many as 7 code points!), then it becomes a complete shame to spend three bytes on each (7 × 3 = 21 bytes for a single icon, a nightmare).

Therefore, we select a few ranges corresponding to emoji, hiragana and katakana, renumber them into one continuous list and encode them as two bytes instead of three:

1011xxxx xxxxxxxx

Great: the aforementioned emoji, consisting of 7 code points, takes 25 bytes in UTF-8, and we fit it into 14 (exactly two bytes per code point). By the way, Habr refused to digest it (both in the old and in the new editor), so I had to insert it as a picture.
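The original picture of the emoji did not survive, so the exact sequence is unknown. As an assumed stand-in, the Python sketch below uses a family emoji built from four person emoji joined by three zero-width joiners, which is one sequence of 7 code points that reproduces the byte counts quoted above.

```python
# Assumed example: man + ZWJ + woman + ZWJ + girl + ZWJ + boy.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

code_points = len(family)
utf8_bytes = len(family.encode("utf-8"))
three_bytes_each = code_points * 3
two_bytes_each = code_points * 2   # the figure the article reports for UTF-C

print(code_points, utf8_bytes, three_bytes_each, two_bytes_each)
```

UTF-8 spends 4 bytes on each of the four emoji and 3 on each joiner, which adds up to 25; two bytes per code point gives 14.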

Let's try to fix one more problem. As we remember, the main alphabet is essentially the high 6 bits, which we keep in mind and glue to the code of each subsequent decoded character. In the case of Chinese characters, which are in the block 0x4E00 - 0x9FFF, these bits are either 0 or 1. This is not very convenient: we would need to constantly switch the alphabet between these two values (i.e. spend three bytes). But note that in long mode we can subtract from the code itself the number of characters that we encode using the short mode (after all the tricks described above, this is 10240); then the range of hieroglyphs shifts to 0x2600 - 0x77FF, and in this case, throughout this whole range, the most significant 6 bits (out of 21) will be equal to 0. Thus, sequences of hieroglyphs will use two bytes per hieroglyph (which is optimal for such a large range), without causing alphabet switches.
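The effect of the subtraction can be checked directly. In the Python sketch below, the constant 10240 is taken from the text, while the names are made up for the illustration.

```python
SHORT_MODE_COUNT = 10240            # characters already covered by short mode
CJK_FIRST, CJK_LAST = 0x4E00, 0x9FFF


def high_bits(cp: int) -> int:
    """The top 6 of 21 bits, i.e. the long-mode alphabet number."""
    return cp >> 15


before = {high_bits(CJK_FIRST), high_bits(CJK_LAST)}
shifted_first = CJK_FIRST - SHORT_MODE_COUNT
shifted_last = CJK_LAST - SHORT_MODE_COUNT
after = {high_bits(shifted_first), high_bits(shifted_last)}

print(sorted(before))
print(hex(shifted_first), hex(shifted_last), sorted(after))
```

Before the shift the block straddles two long-mode alphabets; after it the whole block falls into alphabet 0.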

Alternative solutions: SCSU, BOCU-1

Unicode experts, having just read the title of the article, will most likely hasten to remind you that directly among the Unicode standards there is the Standard Compression Scheme for Unicode (SCSU), which describes an encoding method very similar to the one described in this article.

I honestly admit: I learned about its existence only after I was deeply immersed in writing my own solution. Had I known about it from the start, I probably would have tried to write an implementation of it instead of coming up with my own approach.

Interestingly, SCSU uses ideas very similar to those I came up with on my own (instead of the concept of "alphabets" it uses "windows", and there are more of them available than I have). At the same time, this format also has drawbacks: it is somewhat closer to compression algorithms than to encodings. In particular, the standard provides many ways of representing the same text but does not say how to choose the optimal one; for this, the encoder must use some kind of heuristics. Thus, an SCSU encoder that produces good packing will be more complex and cumbersome than my algorithm.

For comparison, I ported a relatively simple implementation of SCSU to JavaScript. In terms of code volume it turned out to be comparable to my UTF-C, but in some cases its result was tens of percent worse (sometimes it may outperform UTF-C, but not by much). For example, texts in Hebrew and Greek were encoded by UTF-C 60% better than by SCSU (probably due to their compact alphabets).

Separately, I will add that besides SCSU there is another way to represent Unicode compactly, BOCU-1, but it aims at MIME compatibility (which I did not need) and takes a somewhat different approach to encoding. I did not evaluate its effectiveness, but it seems to me unlikely to be higher than that of SCSU.

Possible improvements

The algorithm I have presented is not universal by design (this is probably where my goals diverge most from those of the Unicode Consortium). I have already mentioned that it was developed primarily for one task (storing a multilingual dictionary in a prefix tree), and some of its features may not suit other tasks. But the fact that it is not a standard can be a plus: you can easily modify it to suit your needs.

For example, in an obvious way you can get rid of the state, making the encoding stateless: simply do not update the variables offs, auxOffs and is21Bit in the encoder and decoder. In this case it will not be possible to pack sequences of characters of the same alphabet efficiently, but there will be a guarantee that the same character is always encoded with the same bytes, regardless of context.

In addition, you can tailor the encoder to a specific language by changing the default state. For example, focusing on Russian texts, set offs = 0x0400 and auxOffs = 0 at the start of the encoder and decoder. This makes sense especially in the case of the stateless mode. In general, this will be similar to using an old eight-bit encoding, but without removing the ability to insert characters from all of Unicode as needed.

Another drawback mentioned earlier is that in a large text encoded in UTF-C there is no quick way to find the character boundary closest to an arbitrary byte. If you cut off the last, say, 100 bytes from the encoded buffer, you risk getting garbage that you cannot do anything with. The encoding was not designed for storing multi-gigabyte logs, but in general this can be fixed. The byte 0xBF must never appear as the first byte (but it may be the second or third). Therefore, when encoding, you can insert the sequence 0xBF 0xBF 0xBF every, say, 10 KB; then, if you need to find a boundary, it will be enough to scan the selected piece until such a marker is found. What follows the last 0xBF is guaranteed to be the start of a character. (When decoding, this three-byte sequence will, of course, need to be ignored.)
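The marker idea could look roughly like this in Python. Both helpers are hypothetical sketches of the scheme described above, not functions of the published libraries; the input to `with_markers` is assumed to be a list of already encoded characters.

```python
MARKER = b"\xbf\xbf\xbf"


def with_markers(encoded_chars: list[bytes], every: int) -> bytes:
    """Join encoded characters, adding a marker after about `every` bytes."""
    out, since_marker = bytearray(), 0
    for ch in encoded_chars:
        if since_marker >= every:
            out += MARKER               # always placed on a character boundary
            since_marker = 0
        out += ch
        since_marker += len(ch)
    return bytes(out)


def resync(buf: bytes, start: int = 0) -> int:
    """Index of the first character after the next marker, or -1 if none."""
    i = buf.find(MARKER, start)
    if i < 0:
        return -1
    i += len(MARKER)
    while i < len(buf) and buf[i] == 0xBF:   # move past the last 0xBF
        i += 1
    return i


chars = [b"\x41", b"\xa1\xbf\xbf", b"\x42", b"\x43"]
stream = with_markers(chars, every=4)
pos = resync(stream, start=2)        # start from the middle of a character
print(stream.hex(" "), pos, stream[pos:])
```

The example deliberately places a character ending in 0xBF 0xBF right before the marker: the run of 0xBF bytes then has length five, and skipping to the end of the run still lands on a character start.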

Summing up

If you have read this far, congratulations! I hope that you, like me, have learned something new (or refreshed your memory) about the structure of Unicode.

Demo page. The Hebrew example shows the advantages over both UTF-8 and SCSU.

The research described above should not be taken as an encroachment on standards. However, on the whole I am satisfied with the results of my work, so I am happy to share them: for example, the minified JS library weighs only 1710 bytes (and has no dependencies, of course). As I mentioned above, you can see it at work on the demo page (there is also a set of texts there on which it can be compared with UTF-8 and SCSU).

Finally, I will once again draw attention to the cases in which using UTF-C is not worth it:

  • If your strings are long enough (from 100-200 characters). In this case, you should think about using compression algorithms such as deflate.
  • If you need ASCII transparency, that is, it is important to you that the encoded sequences do not contain ASCII codes that were not in the original string. The need for this can be avoided if, when interacting with third-party APIs (for example, when working with a database), you pass the encoding result as an abstract set of bytes and not as a string. Otherwise, you risk getting unexpected vulnerabilities.
  • If you want to be able to quickly find character boundaries at an arbitrary offset (for example, when part of a string is damaged). This can be done, but only by scanning the string from the beginning (or by applying the modification described in the previous section).
  • If you need to quickly perform operations on the contents of strings (sort them, search for substrings in them, concatenate). This requires decoding the strings first, so UTF-C will be slower than UTF-8 in these cases (but faster than compression algorithms). Since the same string is always encoded the same way, exact comparison does not require decoding and can be done byte by byte.

Update: user Tyomitch posted a graph in the comments below that highlights the applicability limits of UTF-C. It shows that UTF-C is more efficient than a general-purpose compression algorithm (a variation of LZW) as long as the packed string is shorter than ~140 characters (however, I note that the comparison was made on a single text; for other languages the result may differ).

source: www.habr.com
