Another bicycle: storing Unicode strings 30-60% more compactly than UTF-8

If you are a developer faced with the task of choosing an encoding, Unicode is almost always the right answer. The specific representation depends on the context, but most often there is a universal answer here too: UTF-8. Its strength is that it lets you use every Unicode character without spending too many bytes in most cases. True, for languages that use more than the Latin alphabet, "not too many" means at least two bytes per character. Can we do better without going back to prehistoric encodings that limit us to just 256 available characters?

Below I invite you to look at my attempt to answer this question: a relatively simple algorithm that stores strings in most of the world's languages without the redundancy found in UTF-8.

Disclaimer. A few important reservations up front: the solution described here is not offered as a universal replacement for UTF-8, it is suitable only for a narrow list of cases (more on them below), and it must never be used to talk to third-party APIs (which do not even know it exists). For compact storage of large volumes of text, general-purpose compression algorithms (for example, deflate) are usually the right tool. In addition, while I was already building my solution, I found an existing standard within Unicode itself that solves the same problem. It is somewhat more complicated (and often performs worse), but it is still an accepted standard and not something cobbled together in a hurry. I will tell you about it too.

About Unicode and UTF-8

To begin with, a few words about what Unicode and UTF-8 are.

As you know, 8-bit encodings used to be popular. With them everything was simple: 256 characters can be numbered from 0 to 255, and numbers from 0 to 255 can obviously be represented as one byte. If we go back to the very beginning, ASCII is limited to 7 bits, so the most significant bit of its byte representation is zero, and most 8-bit encodings are compatible with it (they differ only in the "upper" half, where the most significant bit is one).

How does Unicode differ from those encodings, and why are so many specific representations associated with it: UTF-8, UTF-16 (BE and LE), UTF-32? Let's sort it out in order.

The basic Unicode standard only describes the correspondence between characters (and in some cases, individual components of characters) and their numbers. There are a lot of possible numbers in this standard: from 0x00 to 0x10FFFF (1,114,112 of them). If we wanted to store a number from this range in a variable, neither 1 nor 2 bytes would be enough. And since our processors are not really designed to work with three-byte numbers, we would be forced to use as many as 4 bytes per character! This is UTF-32, and it is precisely because of this "wastefulness" that the format is not popular.
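The figures above are easy to check. The following is a minimal Python sketch (the variable names are mine, chosen for illustration) that computes the size of the code space and the number of bits and whole bytes a raw code point number needs:

```python
# Size of the Unicode code space and the storage a raw number would need.
MAX_CODE_POINT = 0x10FFFF

count = MAX_CODE_POINT + 1            # codes 0x00..0x10FFFF inclusive
bits = MAX_CODE_POINT.bit_length()    # bits needed for the largest code
min_bytes = (bits + 7) // 8           # whole bytes needed for that many bits

# UTF-32 rounds this up to 4 bytes per character.
utf32_len = len("A".encode("utf-32-be"))

print(count, bits, min_bytes, utf32_len)  # 1114112 21 3 4
```

So three bytes would technically suffice, which is the slack the rest of the article exploits.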

Fortunately, the order of characters within Unicode is not random. The whole set is divided into 17 "planes", each containing 65536 (0x10000) "code points". A "code point" here is simply the number that Unicode assigns to a character. But, as mentioned above, Unicode numbers not only individual characters but also their components and service marks (and sometimes nothing at all corresponds to a number, perhaps only for the time being, but that is not important for us), so it is more correct to always talk about the numbers themselves rather than about characters. In what follows, however, for the sake of brevity I will often use the word "character" to mean "code point".

Unicode planes. As you can see, most of them (planes 4 to 13) are still unused.

The most remarkable thing is that all the main "pulp" lies in plane zero, called the "Basic Multilingual Plane". If a string contains text in one of the modern languages (including Chinese), you will not go beyond this plane. But you cannot cut off the rest of Unicode either: emoji, for example, are mostly located at the end of the next plane, the "Supplementary Multilingual Plane" (it spans 0x10000 to 0x1FFFF). So UTF-16 does the following: every character that falls within the Basic Multilingual Plane is encoded "as is" with the corresponding two-byte number. However, some of the numbers in this range do not denote characters at all; instead they signal that another pair of bytes must be read after this one. Combining the values of these four bytes gives a number that covers the entire valid Unicode range. This idea is called "surrogate pairs"; you may have heard of them.
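To make the surrogate-pair idea concrete, here is a small Python sketch of the standard UTF-16 split. The function name `to_surrogates` is hypothetical and used only for illustration:

```python
def to_surrogates(cp):
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a pair")
    v = cp - 0x10000              # remaining 20-bit value
    high = 0xD800 | (v >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 | (v & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```

The two 16-bit units together occupy four bytes, which matches what Python's own `utf-16-be` codec produces for the same code point.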

So UTF-16 needs two or (in very rare cases) four bytes per code point. That is better than always using four bytes, but Latin text (and other ASCII characters) encoded this way wastes half of its space on zeros. UTF-8 was designed to fix this: ASCII in it takes, as before, just one byte; codes from 0x80 to 0x7FF take two bytes; from 0x800 to 0xFFFF, three; and from 0x10000 to 0x10FFFF, four. On the one hand, things got better for the Latin alphabet: ASCII compatibility is back, and the cost is spread more evenly from 1 to 4 bytes. But alphabets other than Latin, alas, gain nothing compared to UTF-16, and many now need three bytes instead of two: the range covered by the two-byte form has shrunk 32 times, from 0xFFFF to 0x7FF, and neither Chinese nor, say, Georgian falls into it. Cyrillic and five other alphabets (hurray) got lucky: 2 bytes per character.

Why does this happen? Let's look at how UTF-8 represents character codes:
0xxxxxxx - 1 byte
110xxxxx 10xxxxxx - 2 bytes
1110xxxx 10xxxxxx 10xxxxxx - 3 bytes
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - 4 bytes

The bits marked x are the ones that actually represent the number. You can see that the two-byte form has only 11 such bits (out of 16). The leading bits have a purely auxiliary function. In the four-byte form, 21 bits out of 32 are given to the code point number. It would seem that three bytes (24 bits in total) should be enough, but the service markers eat up too much.
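The bit layout described above can be written down directly. The sketch below is a hand-rolled encoder for illustration only (the name `utf8_encode` is mine, and it does not reject surrogates or out-of-range values); Python's built-in codec is what you would use in practice:

```python
def utf8_encode(cp):
    """Encode one code point following the UTF-8 bit layout."""
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Latin, Cyrillic, Georgian, Chinese, emoji
for cp in (0x41, 0x0416, 0x10D0, 0x4E2D, 0x1F600):
    print(hex(cp), len(utf8_encode(cp)))
```

Running it shows the 1, 2, 3, 3 and 4 byte costs mentioned earlier: Georgian and Chinese both land in the three-byte form.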

Is this bad? Not really. On the one hand, if we care a lot about space, we have compression algorithms that easily remove the extra entropy and redundancy. On the other hand, the goal of Unicode was to provide the most universal encoding possible. For example, we can hand a UTF-8 string to code that previously worked only with ASCII without fear that it will see an ASCII-range character that is not really there (in UTF-8, all bytes that start with a zero bit are exactly what ASCII says they are). And if we suddenly want to cut a small tail off a large string without decoding it from the very beginning (or to recover part of the data after a damaged section), it is easy to find the offset where a character starts: it is enough to skip the bytes that begin with the bit prefix 10.

So why invent something new?

At the same time, there are occasional situations where compression algorithms like deflate are a poor fit but you still want compact string storage. Personally, I ran into this problem while thinking about building a compressed prefix tree for a large dictionary containing words in arbitrary languages. On the one hand, each word is very short, so compressing it would be ineffective. On the other hand, the tree implementation I was considering was designed so that each byte of a stored string produced a separate tree node, so minimizing their number was very useful. In my library Az.js (as in pymorphy2, on which it is based) a similar problem is solved simply: the strings packed into the DAWG dictionary are stored in good old CP1251. But, as is easy to see, this works well only for a limited alphabet; a Chinese string cannot be added to such a dictionary.

Separately, I would like to point out one more unpleasant nuance that arises when UTF-8 is used in such a data structure. The layout above shows that when a character is written as two bytes, the bits of its number are not contiguous but are separated by the two bits 10 in the middle: 110xxxxx 10xxxxxx. Because of this, when the lower 6 bits in the second byte overflow (that is, the transition 10111111 → 10000000 occurs), the first byte changes as well. So the letter "п" is represented by the bytes 0xD0 0xBF, while the next letter, "р", is already 0xD1 0x80. In a prefix tree this splits the parent node in two: one for the prefix 0xD0 and another for 0xD1 (even though the whole Cyrillic alphabet could be encoded by the second byte alone).
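The split is easy to observe with Python's built-in UTF-8 codec. The sketch below collects the first byte of every lowercase Russian letter (the variable names are mine):

```python
# Lowercase Russian letters а..я occupy U+0430..U+044F.
letters = [chr(cp) for cp in range(0x0430, 0x0450)]

# First (lead) byte of each letter's UTF-8 encoding.
lead_bytes = sorted({ch.encode("utf-8")[0] for ch in letters})

p_bytes = "п".encode("utf-8")   # U+043F
r_bytes = "р".encode("utf-8")   # U+0440

print(p_bytes.hex(), r_bytes.hex(), [hex(b) for b in lead_bytes])
```

A 32-letter run that would fit comfortably under a single prefix ends up under two different lead bytes, so a byte-per-node tree gets two parent nodes instead of one.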

What I came up with

Faced with this problem, I decided to take a break from playing games with bits and, at the same time, get better acquainted with the structure of Unicode as a whole. The result was the UTF-C encoding format ("C" for compact), which spends no more than 3 bytes per code point and very often lets you spend just one extra byte on the entire encoded string. As a result, for many non-ASCII alphabets this encoding turns out to be 30-60% more compact than UTF-8.

I have published example implementations of the encoding and decoding algorithms as JavaScript and Go libraries, and you are free to use them in your code. But I want to stress that in a sense this format remains a "bicycle" (a reinvented wheel), and I do not recommend using it without understanding why you need it. It is still more of an experiment than a serious "improvement of UTF-8". That said, the code is written neatly and concisely, with plenty of comments and test coverage.

Test results and comparison with UTF-8

I also made a demo page where you can see how the algorithm performs. Next, I will describe its principles and the development process in more detail.

Eliminating redundant bits

I took UTF-8 as the basis, of course. The first and most obvious thing that can be changed in it is to reduce the number of service bits in each byte. For example, the first byte in UTF-8 always starts with either 0 or 11; the prefix 10 appears only in the following bytes. Let's replace the prefix 11 with 1 and drop the prefixes from the following bytes altogether. What do we get?

0xxxxxxx - 1 byte
10xxxxxx xxxxxxxx - 2 bytes
110xxxxx xxxxxxxx xxxxxxxx - 3 bytes

Wait, where is the four-byte form? It is no longer needed: with three bytes we now have 21 bits available, which is enough for every number up to 0x10FFFF.

What have we sacrificed? The most important thing is the ability to detect character boundaries from an arbitrary position in the buffer. We can no longer point at an arbitrary byte and find the start of the next character from there. This is a limitation of our format, but in practice it is rarely needed. We can usually walk the buffer from the very beginning (especially when the strings are short).

Coverage of languages by the 2-byte form has also improved: the two-byte form now gives a range of 14 bits, which means codes up to 0x3FFF. The Chinese are out of luck (their characters mostly range from 0x4E00 to 0x9FFF), but Georgians and many other peoples are better off: their languages now fit into 2 bytes per character as well.
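As an illustration of this first, stateless step, here is a Python sketch of the three forms listed above. It is my own reconstruction for this article's intermediate design, not the code of the UTF-C library, and the names `encode_cp` and `decode` are hypothetical:

```python
def encode_cp(cp):
    """Encode one code point with the reduced prefixes 0, 10 and 110."""
    if cp < 0x80:                                   # 0xxxxxxx
        return bytes([cp])
    if cp < 0x4000:                                 # 10xxxxxx xxxxxxxx (14 bits)
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    # 110xxxxx xxxxxxxx xxxxxxxx (21 bits)
    return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])


def decode(buf):
    """Decode a buffer produced by encode_cp, scanning from the start."""
    out, i = [], 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80:
            cp, n = b, 1
        elif b < 0xC0:
            cp, n = ((b & 0x3F) << 8) | buf[i + 1], 2
        else:
            cp, n = ((b & 0x1F) << 16) | (buf[i + 1] << 8) | buf[i + 2], 3
        out.append(chr(cp))
        i += n
    return "".join(out)


text = "Aя ქ中😀"
packed = b"".join(encode_cp(ord(ch)) for ch in text)
print(len(packed), len(text.encode("utf-8")), decode(packed) == text)
```

Note that the decoder has to start at a known character boundary, since the second and third bytes can take any value.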

Enter the encoder state

Now let's think about the properties of the strings themselves. A dictionary most often contains words written in characters of a single alphabet, and the same is true of many other texts. It would be nice to specify this alphabet once and then give only the number of the letter within it. Let's see whether the arrangement of characters in the Unicode table helps us here.

As mentioned above, Unicode is divided into planes of 65536 codes each. But this division is not very useful (as already said, most of the time we are in plane zero). A more interesting division is into blocks. These ranges no longer have a fixed length and are more meaningful: as a rule, each one groups characters of the same alphabet.

The block containing the characters of the Bengali alphabet. Unfortunately, for historical reasons, this is an example of not very dense packing: 96 characters are scattered chaotically across the block's 128 code points.

Block starts and block sizes are always multiples of 16; this is done purely for convenience. In addition, many blocks start and end at values that are multiples of 128 or even 256. For example, the basic Cyrillic block occupies 256 code points from 0x0400 to 0x04FF. This is quite convenient: if we store the prefix 0x04 once, any Cyrillic character can be written in one byte. True, this way we lose the ability to return to ASCII (and to any other characters). So we do the following:

  1. Two bytes 10yyyyyy yxxxxxxx not only denote the character with the number yyyyyy yxxxxxxx, but also change the current alphabet to yyyyyy y0000000 (that is, we remember all bits except the lowest 7);
  2. One byte 0xxxxxxx is a character of the current alphabet. It just needs to be added to the offset we remembered in step 1. As long as we have not changed the alphabet, the offset is zero, so we keep compatibility with ASCII.

The same goes for codes that need 3 bytes:

  1. Three bytes 110yyyyy yxxxxxxx xxxxxxxx denote the character with the number yyyyy yxxxxxxx xxxxxxxx, change the current alphabet to yyyyy y0000000 00000000 (everything except the lowest 15 bits is remembered), and set a flag saying that we are now in long mode (when the alphabet is switched back to a two-byte one, this flag is cleared);
  2. Two bytes 0xxxxxxx xxxxxxxx in long mode are a character of the current alphabet. Likewise, we add it to the offset from step 1. The only difference is that we now read two bytes (because we have switched to this mode).

Sounds good: as long as we need to encode characters from the same 7-bit Unicode range, we spend 1 extra byte at the start and a total of one byte per character.
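The four rules above can be sketched as a small stateful encoder and decoder. This is my own illustrative reconstruction of this early version in Python, not the published library; the names `offs` and `long_mode` are assumptions chosen to mirror the description:

```python
def encode(text):
    """Stateful sketch: one current alphabet plus a long-mode flag."""
    out = bytearray()
    offs, long_mode = 0, False            # start in ASCII, short mode
    for ch in text:
        cp = ord(ch)
        rel = cp - offs                   # position inside current alphabet
        if not long_mode and 0 <= rel < 0x80:
            out.append(rel)                               # 0xxxxxxx
        elif long_mode and 0 <= rel < 0x8000:
            out += bytes([rel >> 8, rel & 0xFF])          # 0xxxxxxx xxxxxxxx
        elif cp < 0x4000:
            out += bytes([0x80 | (cp >> 8), cp & 0xFF])   # 10yyyyyy yxxxxxxx
            offs, long_mode = cp & ~0x7F, False           # keep all but low 7 bits
        else:
            out += bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
            offs, long_mode = cp & ~0x7FFF, True          # keep all but low 15 bits
    return bytes(out)


def decode(buf):
    """Inverse of encode; must mirror every state change exactly."""
    out = []
    offs, long_mode = 0, False
    i = 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80 and not long_mode:
            cp, n = offs + b, 1
        elif b < 0x80:
            cp, n = offs + ((b << 8) | buf[i + 1]), 2
        elif b < 0xC0:
            cp, n = ((b & 0x3F) << 8) | buf[i + 1], 2
            offs, long_mode = cp & ~0x7F, False
        else:
            cp, n = ((b & 0x1F) << 16) | (buf[i + 1] << 8) | buf[i + 2], 3
            offs, long_mode = cp & ~0x7FFF, True
        out.append(chr(cp))
        i += n
    return "".join(out)


word = "привет"
print(len(encode(word)), len(word.encode("utf-8")))  # 7 12
```

The first letter costs two bytes because it also switches the alphabet; the remaining five cost one byte each.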

Output of one of the earlier versions. It already beats UTF-8 in many cases, but there is room for improvement.

What got worse? First, we now have state, namely the offset of the current alphabet and the long-mode flag. This limits us further: the same characters can now be encoded differently in different contexts. Substring search, for example, has to take this into account instead of simply comparing bytes. Second, as soon as we change the alphabet, encoding ASCII characters becomes expensive (and that means not only Latin letters but also basic punctuation, including spaces): they require switching the alphabet back to 0, that is, another extra byte (and then one more to return to our main alphabet).

One alphabet is good, two is better

Let's try to change our bit prefixes a little and squeeze one more form in alongside the three described above:

0xxxxxxx - 1 byte in normal mode, 2 in long mode
11xxxxxx - 1 byte
100xxxxx xxxxxxxx - 2 bytes
101xxxxx xxxxxxxx xxxxxxxx - 3 bytes

The two-byte form now has one bit fewer available: it covers code points up to 0x1FFF rather than 0x3FFF. That is still noticeably more than two-byte UTF-8 codes cover, and most common languages still fit. The most noticeable loss is that hiragana and katakana have dropped out; the Japanese are sad.

What is this new code, 11xxxxxx? It is a small "stash" of 64 characters that complements our main alphabet, so I called it the auxiliary alphabet. When we switch the current alphabet, a piece of the old alphabet becomes the auxiliary one. For example, when we switch from ASCII to Cyrillic, the stash holds 64 characters containing the Latin letters, digits, space and comma (the most frequent insertions in non-ASCII texts). Switch back to ASCII, and the main part of the Cyrillic alphabet becomes the auxiliary alphabet.

With access to two alphabets, we can handle a large number of texts with minimal cost for switching alphabets (punctuation will most often cause a return to ASCII, but after that we get many non-ASCII characters from the auxiliary alphabet without switching again).

Bonus: by giving the auxiliary alphabet the prefix 11xxxxxx and choosing its initial offset to be 0xC0, we get partial compatibility with CP1252. In other words, many (but not all) Western European texts encoded in CP1252 will look the same in UTF-C.

Here, however, a difficulty arises: how do we derive the auxiliary alphabet from the main one? We could keep the same offset, but, alas, here the structure of Unicode works against us. Very often the main part of an alphabet is not at the start of its block (for example, the Russian capital "А" has the code 0x0410, even though the Cyrillic block starts at 0x0400). So, by taking the first 64 characters into the stash, we may lose access to the tail of the alphabet.

To fix this, I went through some of the blocks corresponding to different languages by hand and specified for each the offset of the auxiliary alphabet within the main one. The Latin alphabet, as an exception, was reordered entirely, along the lines of base64.
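Below is a simplified Python sketch of the two-alphabet scheme. It is not the real UTF-C implementation: as a stated simplification, the auxiliary alphabet here is just the first 64 codes at the previous main offset, without the hand-tuned per-block offsets or the base64-like reordering of Latin described above. The names `offs`, `aux_offs` and `long_mode` are my own.

```python
def encode(text):
    """Simplified sketch: main alphabet + 64-character auxiliary alphabet."""
    out = bytearray()
    offs, aux_offs, long_mode = 0, 0xC0, False   # initial aux offset 0xC0
    for ch in text:
        cp = ord(ch)
        rel = cp - offs
        if not long_mode and 0 <= rel < 0x80:
            out.append(rel)                              # 0xxxxxxx
        elif long_mode and 0 <= rel < 0x8000:
            out += bytes([rel >> 8, rel & 0xFF])         # 0xxxxxxx xxxxxxxx
        elif 0 <= cp - aux_offs < 0x40:
            out.append(0xC0 | (cp - aux_offs))           # 11xxxxxx
        elif cp < 0x2000:
            out += bytes([0x80 | (cp >> 8), cp & 0xFF])  # 100xxxxx xxxxxxxx
            aux_offs, offs, long_mode = offs, cp & ~0x7F, False
        else:                                            # 101xxxxx + 2 bytes
            out += bytes([0xA0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
            aux_offs, offs, long_mode = offs, cp & ~0x7FFF, True
    return bytes(out)


def decode(buf):
    """Inverse of encode, tracking the same three state variables."""
    out = []
    offs, aux_offs, long_mode = 0, 0xC0, False
    i = 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80 and not long_mode:
            cp, n = offs + b, 1
        elif b < 0x80:
            cp, n = offs + ((b << 8) | buf[i + 1]), 2
        elif b >= 0xC0:
            cp, n = aux_offs + (b & 0x3F), 1
        elif b < 0xA0:
            cp, n = ((b & 0x1F) << 8) | buf[i + 1], 2
            aux_offs, offs, long_mode = offs, cp & ~0x7F, False
        else:
            cp, n = ((b & 0x1F) << 16) | (buf[i + 1] << 8) | buf[i + 2], 3
            aux_offs, offs, long_mode = offs, cp & ~0x7FFF, True
        out.append(chr(cp))
        i += n
    return "".join(out)


phrase = "Привет, мир"
print(len(encode(phrase)), len(phrase.encode("utf-8")))  # 12 20
```

In this sketch the comma and the space come from the stash (the old ASCII offset), so the text never leaves the Cyrillic alphabet. With the simplified stash, Latin letters would not be reachable from Cyrillic without a switch, which is the gap the hand-tuned offsets are meant to close.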

Final touches

Finally, let's think about where else something can be improved.

Note that the form 101xxxxx xxxxxxxx xxxxxxxx can encode numbers up to 0x1FFFFF, while Unicode ends earlier, at 0x10FFFF. In other words, the last code point is represented as 10110000 11111111 11111111. So we can say that if the first byte has the form 1011xxxx (where xxxx is greater than 0), it means something else. For example, we could add another 15 characters that are always available as single bytes, but I decided to do it differently.
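The claim about the last code point can be verified with a couple of lines of Python. The helper name `three_byte_form` is hypothetical; it simply applies the 101xxxxx xxxxxxxx xxxxxxxx layout:

```python
def three_byte_form(cp):
    """Lay a code point out as 101xxxxx xxxxxxxx xxxxxxxx."""
    return bytes([0xA0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])


last = three_byte_form(0x10FFFF)
bits = " ".join(f"{b:08b}" for b in last)

# First-byte values 1011xxxx with xxxx > 0 are never produced by real code points.
spare_first_bytes = list(range(0xB1, 0xC0))

print(bits, len(spare_first_bytes))  # 10110000 11111111 11111111 15
```

Those 15 unused first-byte values are the room that the next step puts to work.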

Let's look at the Unicode blocks that currently need three bytes. Mostly, as already mentioned, these are Chinese characters, but it is hard to do anything with them: there are 21 thousand of them. However, hiragana and katakana also ended up there, and there are not many of them, fewer than two hundred. And, since we have remembered the Japanese, there are also emoji (in fact they are scattered across many places in Unicode, but the main blocks lie in the range 0x1F300 - 0x1FBFF). If you consider that there are now emoji assembled from several code points at once (for example, one such emoji, shown as an image in the original article, consists of as many as 7 codes!), then spending three bytes on each becomes a real shame (7 × 3 = 21 bytes for a single icon, a nightmare).

So we pick a few selected ranges corresponding to emoji, hiragana and katakana, renumber them into one continuous list and encode them as two bytes instead of three:

1011xxxx xxxxxxxx

Great: the aforementioned emoji, consisting of 7 code points, takes 25 bytes in UTF-8, and we fit it into 14 (exactly two bytes per code point). By the way, Habr refused to digest it (in both the old and the new editor), so I had to insert it as a picture.
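The original emoji did not survive as text, so the Python check below uses an assumed stand-in: a family emoji built as a zero-width-joiner sequence of 4 pictographs and 3 joiners, which also has 7 code points. The byte counts are computed with Python's built-in codecs; the 14-byte figure is simply two bytes per code point, as stated above.

```python
# Assumed example sequence (not necessarily the emoji from the article):
# man + ZWJ + woman + ZWJ + girl + ZWJ + boy
ZWJ = "\u200D"
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])

code_points = len(family)                 # Python strings count code points
utf8_bytes = len(family.encode("utf-8"))  # 4 x 4 bytes + 3 x 3 bytes
two_bytes_each = 2 * code_points          # the two-byte form from the text

print(code_points, utf8_bytes, two_bytes_each)  # 7 25 14
```

The stand-in happens to reproduce the article's 25-byte UTF-8 figure, since each pictograph costs four bytes and each joiner costs three.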

Let's try to fix one more problem. As we remember, the main alphabet is essentially the high 6 bits, which we keep in mind and glue onto the code of each subsequent decoded character. For the Chinese characters in the block 0x4E00 - 0x9FFF, this value is either 0 or 1. That is not very convenient: we would have to keep switching the alphabet between these two values (that is, spend three bytes each time). But note that in long mode we can subtract from the code itself the number of characters that we encode using short mode (after all the tricks described above, this is 10240). The range of hieroglyphs then shifts to 0x2600 - 0x77FF, and across this whole range the most significant 6 bits (out of 21) are equal to 0. As a result, sequences of hieroglyphs use two bytes per hieroglyph (which is optimal for such a large range) without causing alphabet switches.
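The arithmetic behind this shift is short enough to check directly. In the Python sketch below, the constant 10240 is taken from the text; the variable names are mine:

```python
SHORT_MODE_CODES = 10240            # codes reachable without long mode, per the text

lo = 0x4E00 - SHORT_MODE_CODES      # shifted start of the hieroglyph range
hi = 0x9FFF - SHORT_MODE_CODES      # shifted end of the hieroglyph range

# The "alphabet" in long mode is everything above the low 15 bits.
before = {0x4E00 >> 15, 0x9FFF >> 15}   # two different alphabets
after = {lo >> 15, hi >> 15}            # a single alphabet

print(hex(lo), hex(hi), sorted(before), sorted(after))  # 0x2600 0x77ff [0, 1] [0]
```

Before the shift the block straddles the 0x8000 boundary; after it, the whole block sits under one alphabet value.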

Alternative solutions: SCSU, BOCU-1

Unicode experts, having just read the title of this article, will most likely hurry to remind you that among the Unicode standards there is the Standard Compression Scheme for Unicode (SCSU), which describes an encoding method very similar to the one described here.

I honestly admit: I learned of its existence only when I was already deep into writing my own solution. Had I known about it from the start, I would probably have tried to write an implementation of it instead of coming up with my own approach.

What is interesting is that SCSU uses ideas very similar to the ones I arrived at on my own (instead of "alphabets" it has "windows", and there are more of them than I have). At the same time, this format has its own drawbacks: it is a little closer to compression algorithms than to encodings. In particular, the standard offers many ways to represent the same text but does not say how to choose the optimal one; for that, the encoder has to use some kind of heuristics. So an SCSU encoder that produces good packing will be more complex and cumbersome than my algorithm.

For comparison, I ported a relatively simple implementation of SCSU to JavaScript. In terms of code size it turned out comparable to my UTF-C, but in some cases its results were worse (sometimes it can beat UTF-C, but not by much). For example, texts in Hebrew and Greek were encoded by UTF-C 60% better than by SCSU (probably thanks to their compact alphabets).

Separately, I will add that besides SCSU there is another way to represent Unicode compactly, BOCU-1, but it aims at MIME compatibility (which I did not need) and takes a somewhat different approach to encoding. I have not evaluated its effectiveness, but it seems unlikely to me that it is higher than that of SCSU.

Possible improvements

The algorithm I have presented is not universal by design (this is probably where my goals diverge most from those of the Unicode Consortium). I have already mentioned that it was developed primarily for one task (storing a multilingual dictionary in a prefix tree), and some of its features may not suit other tasks well. But the fact that it is not a standard can be an advantage: you can easily modify it to fit your needs.

For example, you can get rid of the state in the obvious way and make the encoding stateless: simply do not update the variables offs, auxOffs and is21Bit in the encoder and decoder. In that case it will not be possible to pack sequences of characters of the same alphabet efficiently, but you get a guarantee that the same character is always encoded with the same bytes, regardless of context.

In addition, you can tailor the encoder to a specific language by changing the default state. For example, if you are targeting Russian texts, initialize the encoder and decoder with offs = 0x0400 and auxOffs = 0. This makes particular sense in the stateless variant. In general, this is similar to using an old eight-bit encoding, but without losing the ability to insert characters from all of Unicode when needed.

Another drawback mentioned earlier is that in a large text encoded in UTF-C there is no quick way to find the character boundary closest to an arbitrary byte. If you cut off the last, say, 100 bytes of an encoded buffer, you risk getting garbage that you can do nothing with. The encoding is not designed for storing multi-gigabyte data, but in general this can be fixed. The byte 0xBF must never appear as the first byte of a character (though it can be the second or third). Therefore, when encoding, you can insert the sequence 0xBF 0xBF 0xBF every, say, 10 KB. Then, if you need to find a boundary, it is enough to scan the chosen fragment until such a marker is found. Whatever follows the last 0xBF is guaranteed to be the start of a character. (When decoding, this three-byte sequence must of course be ignored.)
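The marker idea can be sketched as follows. This Python fragment is an illustration under stated assumptions, not part of the published library: the helper names `join_with_markers` and `resync` are hypothetical, and the input is assumed to be a list of already-encoded characters, none of which starts with 0xBF. The article does not say how the decoder state (current alphabet, long mode) is restored after a marker; resetting it at every marker would be one option.

```python
MARKER = b"\xBF\xBF\xBF"


def join_with_markers(encoded_chars, every=10 * 1024):
    """Concatenate encoded characters, inserting a marker about every `every` bytes."""
    out = bytearray()
    since_marker = 0
    for ch in encoded_chars:
        if since_marker >= every:      # only ever insert between characters
            out += MARKER
            since_marker = 0
        out += ch
        since_marker += len(ch)
    return bytes(out)


def resync(buf, pos):
    """Return the index of the first character start after the next marker, or -1."""
    i = buf.find(MARKER, pos)
    if i < 0:
        return -1
    i += len(MARKER)
    # A character may end in 0xBF right before the marker, so skip the whole run;
    # the byte after the last 0xBF starts a character.
    while i < len(buf) and buf[i] == 0xBF:
        i += 1
    return i


chars = [b"\x84\x1f"] + [b"\x40"] * 10
buf = join_with_markers(chars, every=4)
print(buf.hex(" "), resync(buf, 1))
```

Since a character has at most three bytes and never starts with 0xBF, three consecutive 0xBF bytes can only appear where a marker was inserted.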

Summing up

If you have read this far, congratulations! I hope that, like me, you have learned something new (or refreshed your memory) about how Unicode is structured.

Demo page. The Hebrew example shows the advantages over both UTF-8 and SCSU.

The research described above should not be seen as an encroachment on the standards. Still, I am generally satisfied with the results of my work, so I am happy to share them: for example, the minified JS library weighs only 1710 bytes (and has no dependencies, of course). As I mentioned above, you can see it in action on the demo page (which also has a set of texts for comparing it with UTF-8 and SCSU).

Finally, let me once again point out the cases in which using UTF-C is not worth it:

  • If your strings are long enough (from 100-200 characters). In this case you should consider using compression algorithms like deflate.
  • If you need ASCII transparency, that is, if it matters to you that the encoded sequences contain no ASCII codes that were not in the original string. You can avoid this requirement if, when talking to third-party APIs (for example, a database), you pass the encoding result as an opaque set of bytes rather than as strings. Otherwise you risk unexpected vulnerabilities.
  • If you want to be able to quickly find character boundaries at an arbitrary offset (for example, when part of a string is damaged). This can be done, but only by scanning the string from the beginning (or by applying the modification described in the previous section).
  • If you need to perform operations on string contents quickly (sorting, substring search, concatenation). These require the strings to be decoded first, so UTF-C will be slower than UTF-8 in these cases (but faster than compression algorithms). Since the same string is always encoded the same way, exact comparison does not require decoding and can be done byte by byte.

Update: user Tyomitch, in the comments below, posted a graph showing the limits of UTF-C's applicability. It shows that UTF-C is more efficient than a general-purpose compression algorithm (a variation of LZW) as long as the packed string is shorter than ~140 characters (I should note, however, that the comparison was made on a single text; for other languages the result may differ).

Source: www.habr.com
