Another bicycle: storing Unicode strings 30-60% more compactly than UTF-8


If you are a developer faced with choosing an encoding, Unicode will almost always be the right answer. The specific representation depends on the context, but most often there is a universal answer here too: UTF-8. Its strength is that it lets you use every Unicode character without spending too many bytes in most cases. True, for languages that use more than just the Latin alphabet, "not too many" means at least two bytes per character. Can we do better without going back to prehistoric encodings that limit us to just 256 available characters?

Below I describe my attempt to answer this question and to implement a relatively simple algorithm that stores strings in most of the world's languages without the redundancy that UTF-8 adds.

Disclaimer. A few important reservations up front: the solution described here is not offered as a universal replacement for UTF-8. It is suitable only for a narrow list of cases (more on them below), and it should never be used to talk to third-party APIs (which do not even know it exists). Most often, general-purpose compression algorithms (for example, deflate) are the right tool for compact storage of large volumes of text. In addition, while I was already working on my solution, I found an existing standard within Unicode itself that solves the same problem. It is somewhat more complicated (and often performs worse), but it is still an accepted standard rather than something knocked together on one's knee. I will tell you about it too.

About Unicode and UTF-8

To begin with, a few words about what Unicode and UTF-8 are.

As you know, 8-bit encodings used to be popular. With them everything was simple: 256 characters can be numbered from 0 to 255, and numbers from 0 to 255 obviously fit in a single byte. Going back to the very beginning, ASCII is limited to 7 bits, so the most significant bit in its byte representation is zero, and most 8-bit encodings are compatible with it (they differ only in the "upper" half, where the most significant bit is one).

How does Unicode differ from those encodings, and why are there so many specific representations associated with it: UTF-8, UTF-16 (BE and LE), UTF-32? Let's sort it out in order.

The Unicode standard itself describes only the correspondence between characters (and, in some cases, individual components of characters) and their numbers. There are a lot of possible numbers in this standard: from 0x00 to 0x10FFFF (1,114,112 of them). If we wanted to store a number from this range in a variable, neither 1 nor 2 bytes would be enough. And since our processors are not really designed for three-byte numbers, we would be forced to use as many as 4 bytes per character! This is UTF-32, and precisely because of this wastefulness the format is not popular.

Fortunately, the order of characters in Unicode is not random. The whole set is divided into 17 "planes", each containing 65536 (0x10000) "code points". A code point is simply the character number that Unicode assigns. However, as mentioned above, Unicode numbers not only individual characters but also their components and service marks (and sometimes nothing corresponds to a number at all, perhaps only for the time being, but that is not important for us), so strictly speaking one should always talk about the numbers themselves rather than about symbols. Nevertheless, for brevity I will often write "symbol" below when I mean "code point".
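The arithmetic behind these numbers is easy to check. Below is a minimal Python sketch; the names `PLANE_SIZE`, `MAX_CODE_POINT` and `plane_of` are made up for illustration and do not come from any library.

```python
PLANE_SIZE = 0x10000        # 65536 code points per plane
MAX_CODE_POINT = 0x10FFFF   # last valid Unicode code point


def plane_of(code_point: int) -> int:
    """Return the index (0..16) of the plane that contains the code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # same as code_point // PLANE_SIZE


total_code_points = MAX_CODE_POINT + 1
plane_count = total_code_points // PLANE_SIZE

print(total_code_points, plane_count)           # 1114112 17
print(plane_of(ord("A")), plane_of(0x1F600))    # 0 1
```

The plane index is just the part of the number above the low 16 bits, which is why 17 planes are enough to cover everything up to 0x10FFFF.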

[Figure: Unicode planes. Most of them (planes 4 through 13) are still unused.]

Most remarkably, all the main "pulp" lies in plane zero, which is called the "Basic Multilingual Plane". If a string contains text in one of the modern languages (including Chinese), you will not leave this plane. But you cannot cut off the rest of Unicode either: for example, emoji are mostly located at the end of the next plane, the "Supplementary Multilingual Plane" (it spans 0x10000 to 0x1FFFF). So UTF-16 does the following: every character that falls within the Basic Multilingual Plane is encoded "as is", with the corresponding two-byte number. However, some numbers in this range do not denote specific characters at all; instead they signal that the two bytes following them must also be taken into account. By combining the values of these four bytes we get a number that covers the entire valid Unicode range. This idea is called "surrogate pairs"; you may have heard of them.

So UTF-16 requires two or (in very rare cases) four bytes per code point. This is better than using four bytes all the time, but Latin text (and other ASCII characters) encoded this way wastes half the space on zeros. UTF-8 is designed to fix this: ASCII in it takes, as before, just one byte; codes from 0x80 to 0x7FF take two bytes; from 0x800 to 0xFFFF three; and from 0x10000 to 0x10FFFF four. On the one hand, the Latin alphabet has it good: compatibility with ASCII is back, and the spread from 1 to 4 bytes is more even. On the other hand, alphabets other than Latin, alas, gain nothing compared to UTF-16, and many now need three bytes instead of two: the range covered by a two-byte record has shrunk 32 times, from 0xFFFF to 0x7FF, and neither Chinese nor, for example, Georgian fits into it. Cyrillic and five other alphabets are the lucky ones (hurray): 2 bytes per character.
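The size rules above can be written down directly. The following Python sketch (the helper names `utf8_len` and `utf16_len` are mine, chosen for illustration) computes the byte counts from the code point ranges and cross-checks them against Python's built-in encoders for a few sample characters.

```python
def utf8_len(cp: int) -> int:
    """Bytes UTF-8 needs for one code point (surrogates are not considered)."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def utf16_len(cp: int) -> int:
    """Bytes UTF-16 needs: one 16-bit unit in the BMP, a surrogate pair above."""
    return 2 if cp < 0x10000 else 4


# Latin, Cyrillic, Georgian, Chinese, and an emoji from the supplementary plane.
for ch in ["A", "п", "ა", "中", "\U0001F600"]:
    cp = ord(ch)
    # Cross-check the range rules against Python's own encoders.
    assert utf8_len(cp) == len(ch.encode("utf-8"))
    assert utf16_len(cp) == len(ch.encode("utf-16-le"))
    print(f"U+{cp:04X}: UTF-8 {utf8_len(cp)} bytes, UTF-16 {utf16_len(cp)} bytes")
```

Georgian and Chinese both come out at three bytes in UTF-8 but two in UTF-16, which is exactly the loss described above.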

Why is that? Let's look at how UTF-8 represents character codes:

0xxxxxxx - 1 byte (codes up to 0x7F)
110xxxxx 10xxxxxx - 2 bytes (codes up to 0x7FF)
1110xxxx 10xxxxxx 10xxxxxx - 3 bytes (codes up to 0xFFFF)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - 4 bytes (codes up to 0x10FFFF)

Only the bits marked x are used to store the number itself. You can see that a two-byte record has just 11 such bits (out of 16). The leading bits serve only an auxiliary function. In a four-byte record, 21 of the 32 bits are allocated to the code point number. It would seem that three bytes (24 bits in total) should be enough, but the service markers eat up too much.

Is this bad? Not really. On the one hand, if we care a lot about space, we have compression algorithms that can easily remove all the extra redundancy. On the other hand, the goal of Unicode was to provide the most universal encoding possible. For example, we can hand a UTF-8 string to code that previously worked only with ASCII and not be afraid that it will see an ASCII character that is not really there (in UTF-8, every byte that starts with a zero bit is exactly an ASCII character). And if we suddenly want to cut a small tail off a large string without decoding it from the very beginning (or to recover part of the data after a damaged section), it is easy to find the offset where a character starts: it is enough to skip the bytes that have the bit prefix 10.
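The resynchronization property described above takes only a few lines to demonstrate. This is a small Python sketch; `next_char_start` is a hypothetical helper name, not a standard library function.

```python
def next_char_start(buf: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a UTF-8 continuation byte."""
    # Continuation bytes look like 10xxxxxx, i.e. (byte & 0b11000000) == 0b10000000.
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "привет, мир".encode("utf-8")   # 20 bytes
cut = len(data) - 5                    # an arbitrary offset, mid-character here
start = next_char_start(data, cut)
tail = data[start:].decode("utf-8")    # decodes cleanly from the boundary
print(cut, start, tail)
```

Here the cut at byte 15 lands on the second byte of "м", so the scan moves one byte forward and the tail decodes as "ир".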

Why invent something new?

At the same time, there are situations where compression algorithms like deflate are a poor fit, yet you still want compact string storage. Personally, I ran into this problem while thinking about building a compressed prefix tree for a large dictionary that includes words in arbitrary languages. On the one hand, each word is very short, so compressing it is ineffective. On the other hand, the tree implementation I had in mind assumed that each byte of a stored string produces a separate tree node, so minimizing their number was very useful. In my library Az.js (as in pymorphy2, on which it is based) a similar problem is solved simply: the strings packed into the DAWG dictionary are stored in good old CP1251. But, as is easy to understand, this only works for a limited alphabet; a Chinese string cannot be added to such a dictionary.

Separately, I would like to note one more unpleasant nuance that arises when UTF-8 is used in such a data structure. As the UTF-8 bit layout shows, when a character is written as two bytes, the bits of its number do not come in a row but are separated by the pair of bits 10 in the middle: 110xxxxx 10xxxxxx. Because of this, when the lower 6 bits of the second byte overflow in the character code (that is, the transition 10111111 → 10000000 occurs), the first byte changes as well. So the letter "п" is represented by the bytes 0xD0 0xBF, while the next one, "р", is already 0xD1 0x80. In a prefix tree this splits the parent node in two: one for the prefix 0xD0 and another for 0xD1 (even though the entire Cyrillic alphabet could be encoded by the second byte alone).
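A quick Python check makes the split visible. It collects the first UTF-8 byte of every lowercase Russian letter from "а" (U+0430) to "я" (U+044F); the variable names are illustrative only.

```python
lowercase = [chr(cp) for cp in range(0x0430, 0x0450)]   # а..я

for ch in "пр":
    print(f"{ch} U+{ord(ch):04X} -> {ch.encode('utf-8').hex(' ')}")

# Distinct lead bytes = distinct child nodes under the root of a byte-wise trie.
first_bytes = sorted({ch.encode("utf-8")[0] for ch in lowercase})
print([hex(b) for b in first_bytes])
```

Two different lead bytes appear (0xD0 for "а".."п", 0xD1 for "р".."я"), so a byte-wise prefix tree needs two parent nodes for a single 32-letter alphabet.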

What I ended up with

Faced with this problem, I decided to practice playing games with bits and, at the same time, to get a better feel for the structure of Unicode as a whole. The result is the UTF-C encoding format ("C" for compact), which spends no more than 3 bytes per code point and very often needs only one extra byte for the entire encoded string. As a result, for many non-ASCII alphabets this encoding turns out to be 30-60% more compact than UTF-8.

I have published example implementations of the encoding and decoding algorithms as JavaScript and Go libraries, and you are free to use them in your code. But I will stress again that in a sense this format remains a "bicycle" (a reinvented wheel), and I do not recommend using it without understanding why you need it. It is still more of an experiment than a serious "improvement on UTF-8". Nevertheless, the code there is written carefully, with plenty of comments and test coverage.

[Figure: Test results and comparison with UTF-8.]

I also made a demo page where you can evaluate how the algorithm performs. Next I will describe its principles and how it was developed in more detail.

Getting rid of redundant bits

I took UTF-8 as the basis, of course. The first and most obvious thing that can be changed in it is to reduce the number of service bits in each byte. For example, the first byte in UTF-8 always starts with either 0 or 11; the prefix 10 belongs only to the following bytes. Let's replace the prefix 11 with 1, and drop the prefixes of the following bytes altogether. What do we get?

0xxxxxxx - 1 byte
10xxxxxx xxxxxxxx - 2 bytes
110xxxxx xxxxxxxx xxxxxxxx - 3 bytes

Wait, where is the four-byte record? It is no longer needed: with three bytes we now have 21 bits available, which is enough for all numbers up to 0x10FFFF.

What have we sacrificed here? Most importantly, the ability to detect character boundaries from an arbitrary position in the buffer. We can no longer point at an arbitrary byte and find the start of the next character from it. This is a limitation of our format, but in practice it is rarely needed. We can usually afford to walk the buffer from the very beginning (especially when the strings are short).

Coverage of languages with 2 bytes has also improved: the two-byte format now provides 14 bits, which means codes up to 0x3FFF. The Chinese are out of luck (their characters mostly lie between 0x4E00 and 0x9FFF), but Georgians and many other peoples are better off: their languages now also fit into 2 bytes per character.
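To make this intermediate format concrete, here is a minimal Python sketch of the stateless three-pattern layout described above. It is an illustration of this one step only, not the final UTF-C format and not the code of my libraries; the function names are hypothetical.

```python
def encode_cp(cp: int) -> bytes:
    """Encode one code point with the reduced prefixes 0 / 10 / 110."""
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x4000:                     # 10xxxxxx xxxxxxxx (14 bits)
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    if cp <= 0x10FFFF:                  # 110xxxxx xxxxxxxx xxxxxxxx (21 bits)
        return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    raise ValueError("not a Unicode code point")


def encode(text: str) -> bytes:
    return b"".join(encode_cp(ord(ch)) for ch in text)


def decode(buf: bytes) -> str:
    chars, i = [], 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80:                    # one byte
            cp, size = b, 1
        elif b < 0xC0:                  # two bytes
            cp, size = ((b & 0x3F) << 8) | buf[i + 1], 2
        elif b < 0xE0:                  # three bytes
            cp, size = ((b & 0x1F) << 16) | (buf[i + 1] << 8) | buf[i + 2], 3
        else:
            raise ValueError("111xxxxx is not used in this sketch")
        chars.append(chr(cp))
        i += size
    return "".join(chars)


text = "Hi, გამარჯობა 中"
packed = encode(text)
print(len(packed), len(text.encode("utf-8")))
```

Georgian letters take two bytes each here instead of three in UTF-8, while the Chinese character still takes three. Note that the decoder has to start from the beginning of the buffer, as discussed above.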

Enter the encoder state

Let's now think about the properties of the strings themselves. A dictionary most often contains words written with characters of a single alphabet, and the same is true of many other texts. It would be nice to indicate this alphabet once and then give only the number of the letter within it. Let's see whether the arrangement of characters in the Unicode table helps us.

As mentioned above, Unicode is divided into planes of 65536 codes each. But this is not a very useful division (as already said, most of the time we are in plane zero). The division into blocks is more interesting. These ranges no longer have a fixed length and are more meaningful: as a rule, each block combines characters from one alphabet.

[Figure: A block containing the characters of the Bengali alphabet. Unfortunately, for historical reasons, this is an example of not very dense packing: 96 characters are scattered chaotically across the block's 128 code points.]

The starts of blocks and their sizes are always multiples of 16; this is done simply for convenience. In addition, many blocks start and end at values that are multiples of 128 or even 256. For example, the basic Cyrillic block takes up 256 code points, from 0x0400 to 0x04FF. This is quite convenient: if we store the prefix 0x04 once, any Cyrillic character can then be written in a single byte. True, this way we would lose the ability to return to ASCII (or to any other characters at all). So we do this:

  1. Two bytes 10yyyyyy yxxxxxxx not only denote the character with code yyyyyy yxxxxxxx, but also change the current alphabet to yyyyyy y0000000 (that is, we remember all the bits except the lowest 7);
  2. One byte 0xxxxxxx is a character of the current alphabet. It just needs to be added to the offset we remembered in step 1. As long as we have not changed the alphabet, the offset is zero, so compatibility with ASCII is preserved.

Likewise for codes that require 3 bytes:

  1. Three bytes 110yyyyy yxxxxxxx xxxxxxxx denote the character with code yyyyy yxxxxxxx xxxxxxxx, change the current alphabet to yyyyy y0000000 00000000 (we remember everything except the lowest 15 bits), and set a flag saying that we are now in long mode (when the alphabet is changed back to a two-byte one, this flag is reset);
  2. Two bytes 0xxxxxxx xxxxxxxx in long mode are a character of the current alphabet. As before, we add them to the offset from step 1. The only difference is that we now read two bytes (because we have switched to this mode).

Sounds good: as long as we need to encode characters from the same 7-bit Unicode range, we spend 1 extra byte at the start and then just one byte per character.
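The two lists above can be sketched as a small stateful encoder and decoder. This Python code is my simplified reading of this early version only (it has no auxiliary alphabet and none of the later tweaks), and the names `encode`, `decode`, `offs` and `long_mode` are chosen for illustration.

```python
def encode(text: str) -> bytes:
    out = bytearray()
    offs, long_mode = 0, False              # initial state: ASCII, short mode
    for cp in map(ord, text):
        span = 0x8000 if long_mode else 0x80
        if offs <= cp < offs + span:        # character of the current alphabet
            rel = cp - offs
            out += rel.to_bytes(2, "big") if long_mode else bytes([rel])
        elif cp < 0x4000:                   # 10yyyyyy yxxxxxxx: switch, short mode
            out += bytes([0x80 | (cp >> 8), cp & 0xFF])
            offs, long_mode = cp & ~0x7F, False
        else:                               # 110yyyyy yxxxxxxx xxxxxxxx: long mode
            out += bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
            offs, long_mode = cp & ~0x7FFF, True
    return bytes(out)


def decode(buf: bytes) -> str:
    chars = []
    offs, long_mode = 0, False
    i = 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80 and not long_mode:      # one byte in the current alphabet
            cp, i = offs + b, i + 1
        elif b < 0x80:                      # two bytes in the current alphabet
            cp, i = offs + ((b << 8) | buf[i + 1]), i + 2
        elif b < 0xC0:                      # two-byte code, also sets the alphabet
            cp, i = ((b & 0x3F) << 8) | buf[i + 1], i + 2
            offs, long_mode = cp & ~0x7F, False
        else:                               # three-byte code, enters long mode
            cp, i = ((b & 0x1F) << 16) | (buf[i + 1] << 8) | buf[i + 2], i + 3
            offs, long_mode = cp & ~0x7FFF, True
        chars.append(chr(cp))
    return "".join(chars)


word = "привет"
print(len(encode(word)), len(word.encode("utf-8")))   # 7 12
```

A six-letter Cyrillic word costs two bytes for the first letter and one for each of the rest. A phrase such as "привет, мир" does worse per character, because the comma forces a switch back to ASCII and the next word forces a switch to Cyrillic again.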

[Figure: Output of one of the earlier versions. It already beats UTF-8 often, but there is still room for improvement.]

What is worse? First, we now have state, namely the current alphabet offset and the long mode flag. This limits us further: the same characters can now be encoded differently in different contexts. Substring search, for example, has to take this into account rather than simply compare bytes. Second, as soon as we change the alphabet, encoding ASCII characters becomes expensive (and that is not only the Latin alphabet but also basic punctuation, including spaces): they require switching the alphabet back to 0, that is, an extra byte again (and then another one to return to our main alphabet).

One alphabet is good, two are better

Let's try to change our bit prefixes a little, squeezing one more in alongside the three described above:

0xxxxxxx - 1 byte in normal mode, 2 in long mode
11xxxxxx - 1 byte
100xxxxx xxxxxxxx - 2 bytes
101xxxxx xxxxxxxx xxxxxxxx - 3 bytes


Now a two-byte record has one bit fewer available: it covers code points up to 0x1FFF, not 0x3FFF. However, this is still noticeably more than two-byte UTF-8 codes cover, and most common languages still fit. The most noticeable loss is hiragana and katakana, which dropped out; the Japanese are sad.

What is this new code 11xxxxxx? It is a small "stash" of 64 characters that complements our main alphabet, so I called it the auxiliary alphabet. When we switch the current alphabet, a piece of the old alphabet becomes auxiliary. For example, when we switch from ASCII to Cyrillic, the stash holds 64 characters containing the Latin alphabet, digits, the space and the comma (the most frequent insertions in non-ASCII texts). Switch back to ASCII, and the main part of the Cyrillic alphabet becomes the auxiliary alphabet.
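With four patterns, the length of a record still follows from its first byte (plus the long mode flag). The Python sketch below only classifies first bytes according to the layout above; `record_length` and `payload_bits` are hypothetical helpers, and how the offsets of the two alphabets are chosen is deliberately left out.

```python
def record_length(first_byte: int, long_mode: bool = False) -> int:
    """Number of bytes in the record that starts with first_byte."""
    if first_byte < 0x80:        # 0xxxxxxx: character of the main alphabet
        return 2 if long_mode else 1
    if first_byte >= 0xC0:       # 11xxxxxx: character of the auxiliary alphabet
        return 1
    if first_byte < 0xA0:        # 100xxxxx: two-byte code
        return 2
    return 3                     # 101xxxxx: three-byte code


def payload_bits(first_byte: int, long_mode: bool = False) -> int:
    """Number of x bits carried by the record that starts with first_byte."""
    if first_byte < 0x80:
        return 15 if long_mode else 7
    if first_byte >= 0xC0:
        return 6                 # 64 characters in the stash
    return 13 if first_byte < 0xA0 else 21


print(hex((1 << payload_bits(0x80)) - 1))   # largest two-byte code: 0x1fff
```

The 6 payload bits of 11xxxxxx are what give the stash its 64 slots, and the 13 bits of the two-byte form give the 0x1FFF limit mentioned above.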

Thanks to having access to two alphabets, we can handle a large number of texts with minimal alphabet-switching costs (punctuation will most often cause a return to ASCII, but after that we can take many non-ASCII characters from the auxiliary alphabet without switching again).

Bonus: by giving the auxiliary alphabet the prefix 11xxxxxx and choosing 0xC0 as its initial offset, we get partial compatibility with CP1252. In other words, many (but not all) Western European texts encoded in CP1252 look exactly the same in UTF-C.

Here, however, a difficulty arises: how do we derive the auxiliary alphabet from the main one? We could keep the same offset, but, alas, here the structure of Unicode works against us. Very often the main part of an alphabet is not at the start of its block (for example, the Russian capital "А" has the code 0x0410, although the Cyrillic block starts at 0x0400). So by taking the first 64 characters into the stash we could lose access to the tail of the alphabet.

To fix this, I went through some of the blocks corresponding to different languages by hand and specified, for each, the offset of the auxiliary alphabet within the main one. The Latin alphabet, as an exception, was reordered altogether, in a base64-like fashion.


Final touches

Let's finally think about where else something can be improved.

Note that the format 101xxxxx xxxxxxxx xxxxxxxx can encode numbers up to 0x1FFFFF, while Unicode ends earlier, at 0x10FFFF. In other words, the last code point is represented as 10110000 11111111 11111111. So we can say that if the first byte has the form 1011xxxx (where xxxx is greater than 0), it means something else. For example, one could put another 15 characters there that are always available in a single byte, but I decided to do it differently.
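The unused space is easy to verify. This Python sketch packs the last code point into the 101xxxxx three-byte layout and lists the first-byte values that can never occur in a valid three-byte record; `three_byte_record` is an illustrative name.

```python
def three_byte_record(cp: int) -> bytes:
    """Pack a code point as 101xxxxx xxxxxxxx xxxxxxxx."""
    return bytes([0xA0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])


last = three_byte_record(0x10FFFF)
print(" ".join(f"{b:08b}" for b in last))    # 10110000 11111111 11111111

# First bytes of the form 101xxxxx that lie above the one used by 0x10FFFF.
unused_first_bytes = [b for b in range(0xA0, 0xC0) if b > last[0]]
print(len(unused_first_bytes), hex(unused_first_bytes[0]), hex(unused_first_bytes[-1]))
```

The 15 free values are 0xB1 through 0xBF, which matches the "another 15 characters" mentioned above.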

Let's look at the Unicode blocks that now require three bytes. Mostly, as already mentioned, these are Chinese characters, but it is hard to do anything about them: there are 21 thousand of them. Hiragana and katakana have ended up there too, though, and there are not that many of them, fewer than two hundred. And, since we are thinking of the Japanese, there are also emoji (in fact they are scattered across many places in Unicode, but the main blocks lie in the range 0x1F300 - 0x1FBFF). If you consider that there are now emoji assembled from several code points at once (for example, the emoji [image] consists of as many as 7 codes!), it becomes a real shame to spend three bytes on each one (7 × 3 = 21 bytes for a single icon, a nightmare).

So we pick a few selected ranges covering emoji, hiragana and katakana, renumber them into one continuous list, and encode them in two bytes instead of three:

1011xxxx xxxxxxxx

Great: the emoji [image] mentioned above, consisting of 7 code points, takes 25 bytes in UTF-8, and we fit it into 14 (exactly two bytes per code point). By the way, Habr refused to digest it (in both the old and the new editor), so I had to insert it as a picture.
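The byte counts can be reproduced with any emoji built from four pictographs joined by three zero-width joiners. The Python sketch below assumes a "family" sequence (man, woman, girl, boy); that particular choice is my assumption for illustration, since the emoji itself appears in the article only as an image.

```python
ZWJ = "\u200d"   # zero-width joiner, 3 bytes in UTF-8

# Assumed example: four people joined by three ZWJ characters.
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])

code_points = len(family)                   # Python strings count code points
utf8_size = len(family.encode("utf-8"))     # 4 * 4 bytes + 3 * 3 bytes
two_bytes_each = 2 * code_points            # size if every code point takes 2 bytes

print(code_points, utf8_size, two_bytes_each)   # 7 25 14
```

Each pictograph costs four bytes in UTF-8 and each joiner three, which is where 25 comes from; at two bytes per code point the same sequence takes 14.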

Let's try to fix one more problem. As we remember, in long mode the main alphabet is essentially the high 6 bits, which we keep in mind and glue onto the code of each subsequent decoded character. For Chinese characters, which lie in the block 0x4E00 - 0x9FFF, this is either 0 or 1. That is not very convenient: we would have to keep switching the alphabet between these two values (that is, spend three bytes each time). But note that in long mode we can subtract from the code itself the number of characters that we encode using short mode (after all the tricks described above, this is 10240). Then the range of hieroglyphs shifts to 0x2600 - 0x77FF, and throughout this whole range the most significant 6 bits (out of 21) are equal to 0. So sequences of hieroglyphs will use two bytes per hieroglyph (which is optimal for such a large range) without causing any alphabet switches.
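The effect of the shift is plain arithmetic, shown here as a short Python check. The constant 10240 is taken from the text above; the names are illustrative.

```python
SHORT_MODE_CODES = 10240            # number of codes handled by short mode (0x2800)
CJK_FIRST, CJK_LAST = 0x4E00, 0x9FFF


def high_bits(cp: int) -> int:
    """The top 6 bits of a 21-bit value, i.e. everything above the low 15 bits."""
    return cp >> 15


before = {high_bits(CJK_FIRST), high_bits(CJK_LAST)}
shifted = (CJK_FIRST - SHORT_MODE_CODES, CJK_LAST - SHORT_MODE_CODES)
after = {high_bits(shifted[0]), high_bits(shifted[1])}

print([hex(v) for v in shifted], before, after)
```

Without the shift the block straddles the 0x8000 boundary, so two different alphabets are needed; after it, the whole block shares the high bits 0 and fits one long-mode alphabet.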

Alternative solutions: SCSU, BOCU-1

Unicode experts, having read only the title of this article, will most likely hasten to remind you that among the Unicode standards themselves there is the Standard Compression Scheme for Unicode (SCSU), which describes an encoding method very similar to the one described here.

I honestly admit it: I learned of its existence only when I was already deep into writing my own solution. Had I known about it from the start, I would probably have tried to write an implementation of it instead of coming up with my own approach.

Interestingly, SCSU uses ideas very similar to the ones I arrived at on my own (instead of "alphabets" it uses "windows", and it has more of them available than I do). At the same time, this format has its own drawbacks: it is somewhat closer to compression algorithms than to encodings. In particular, the standard offers many ways to represent the same text but does not say how to choose the optimal one; for that, the encoder must apply some kind of heuristics. So an SCSU encoder that produces good packing will be more complex and more cumbersome than my algorithm.

For comparison, I ported a relatively simple implementation of SCSU to JavaScript. In terms of code size it turned out comparable to my UTF-C, but in some cases its result was tens of percent worse (sometimes it may do better, but not by much). For example, texts in Hebrew and Greek were encoded by UTF-C 60% better than by SCSU (probably thanks to their compact alphabets).

Separately, I will add that besides SCSU there is another way to represent Unicode compactly, BOCU-1, but it aims at MIME compatibility (which I did not need) and takes a somewhat different approach to encoding. I have not evaluated its effectiveness, but it seems to me that it is unlikely to be higher than that of SCSU.

Possible improvements

The algorithm I have presented is not universal by design (this is probably where my goals diverge most from those of the Unicode Consortium). I have already mentioned that it was developed primarily for one task (storing a multilingual dictionary in a prefix tree), and some of its features may be poorly suited to other tasks. But the fact that it is not a standard can also be a plus: you can easily modify it to suit your needs.

For example, you can get rid of the state in the obvious way and make the encoding stateless: simply do not update the variables offs, auxOffs and is21Bit in the encoder and decoder. In that case sequences of characters from the same alphabet can no longer be packed efficiently, but you get a guarantee that the same character is always encoded with the same bytes, regardless of context.

In addition, you can tailor the encoder to a specific language by changing the default state. For example, if you are targeting Russian texts, initialize the encoder and decoder with offs = 0x0400 and auxOffs = 0. This makes particular sense in stateless mode. In general it is similar to using an old eight-bit encoding, but without giving up the ability to insert characters from all of Unicode when needed.

Another drawback mentioned earlier is that in a large text encoded in UTF-C there is no quick way to find the character boundary closest to an arbitrary byte. If you cut off the last, say, 100 bytes of the encoded buffer, you risk getting garbage that you cannot do anything with. The encoding is not designed for storing multi-gigabyte logs, but in general this can be fixed. The byte 0xBF must never appear as a first byte (though it may be the second or third). So when encoding you can insert the sequence 0xBF 0xBF 0xBF every, say, 10 KB; then, if you need to find a boundary, it is enough to scan the chosen piece until such a marker is found. Whatever follows the last 0xBF is guaranteed to be the start of a character. (When decoding, this three-byte sequence must of course be skipped.)
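The marker idea from the last paragraph could look roughly like this. It is a Python sketch of the proposed modification, which my libraries do not implement; `MARKER` and `find_boundary` are hypothetical names, and the sample bytes are hand-made rather than produced by a real UTF-C encoder.

```python
MARKER = b"\xbf\xbf\xbf"


def find_boundary(buf: bytes, pos: int) -> int:
    """Return the index just past the first marker at or after pos, or -1."""
    i = buf.find(MARKER, pos)
    if i < 0:
        return -1
    i += len(MARKER)
    # A record may itself end in 0xBF right before the marker, so the run can
    # be longer than three bytes; the byte after the whole run starts a record.
    while i < len(buf) and buf[i] == 0xBF:
        i += 1
    return i


# A two-byte record 0x84 0xBF, then the marker, then two one-byte records.
sample = b"\x41\x84\xbf" + MARKER + b"\x42\x43"
start = find_boundary(sample, 1)
print(start, sample[start:])    # 6 b'BC'
```

Because 0xBF never starts a record, data alone can contain at most two consecutive 0xBF bytes, so a run of three or more always contains a marker. Note that finding the boundary is only half of the job: a decoder that resumes there also needs to know the state (offs, auxOffs, is21Bit), for example by resetting it at every marker.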

Summing up

If you have read this far, congratulations! I hope that you, like me, have learned something new (or refreshed your memory) about how Unicode is structured.

[Figure: The demo page. The Hebrew example shows the advantage over both UTF-8 and SCSU.]

The research described above should not be seen as an encroachment on the standards. Still, I am generally satisfied with the results of my work, so I am happy to share them: for example, the minified JS library weighs only 1710 bytes (and has no dependencies, of course). As I mentioned above, you can see it at work on the demo page (there is also a set of texts there on which it can be compared with UTF-8 and SCSU).

Finally, let me once again draw attention to the cases in which UTF-C should not be used:

  • If your strings are long enough (from 100-200 characters). In this case you should think about using compression algorithms such as deflate.
  • If you need ASCII transparency, that is, if it matters to you that the encoded sequences contain no ASCII codes that were not in the original string. You can avoid this requirement if, when interacting with third-party APIs (for example, working with a database), you pass the encoding result as an abstract set of bytes rather than as a string. Otherwise you risk unexpected vulnerabilities.
  • If you want to be able to find character boundaries quickly at an arbitrary offset (for example, when part of a string is damaged). This can be done, but only by scanning the string from the beginning (or by applying the modification described in the previous section).
  • If you need to perform operations on the contents of strings quickly (sort them, search for substrings in them, concatenate them). This requires decoding the strings first, so UTF-C will be slower than UTF-8 in these cases (but faster than compression algorithms). Since the same string is always encoded the same way, an exact comparison does not require decoding and can be done byte by byte.

Update: the user Tyomitch posted a graph in the comments below that shows the limits of applicability of UTF-C. It shows that UTF-C is more efficient than a general-purpose compression algorithm (a variation of LZW) as long as the packed string is shorter than ~140 characters (note, however, that the comparison was done on a single text; for other languages the result may differ).

Source: www.habr.com
