If you are a developer faced with choosing an encoding, Unicode is almost always the right answer. The specific representation depends on context, but there is usually a universal answer here too: UTF-8. Its strength is that it lets you use every Unicode character without spending too many bytes in most cases. Admittedly, for languages that use more than the Latin alphabet, "not too many" means at least two bytes per character. Can we do better without going back to prehistoric encodings that limit us to just 256 available characters?
Below I describe my attempt to answer that question and to implement a relatively simple algorithm that stores strings in most of the world's languages without the redundancy that UTF-8 adds.
Disclaimer. A few important caveats up front. The solution described here is not offered as a universal replacement for UTF-8. It suits only a narrow list of cases (more on those below), and under no circumstances should it be used to talk to third-party APIs (which have never heard of it). In most cases, general-purpose compression algorithms such as deflate are the right tool for storing large volumes of text compactly. Moreover, while I was already building my solution, I found an existing standard within Unicode itself that solves the same problem. It is somewhat more complex (and often worse), but it is an accepted standard rather than something knocked together on the spot. I will cover it too.
About Unicode and UTF-8
First, a few words about what Unicode and UTF-8 actually are.
As you know, 8-bit encodings used to be popular. With them everything was simple: 256 characters can be numbered from 0 to 255, and a number from 0 to 255 obviously fits in one byte. Going back to the very beginning, ASCII is limited to 7 bits, so the most significant bit of its byte representation is zero, and most 8-bit encodings are compatible with it (they differ only in the "upper" half, where the most significant bit is one).
How does Unicode differ from those encodings, and why does it have so many concrete representations (UTF-8, UTF-16 in BE and LE flavors, UTF-32)? Let's take it in order.
The core Unicode standard describes only the mapping between characters (and in some cases individual components of characters) and their numbers. The standard allows a lot of possible numbers: from 0x00 to 0x10FFFF (1,114,112 values). To store a number from that range in a variable, neither 1 nor 2 bytes would be enough. And since our processors are not really designed to work with three-byte numbers, we would have to use as many as 4 bytes per character. That is UTF-32, and this wastefulness is exactly why the format is not popular.
Fortunately, the order of characters within Unicode is not random. The whole set is divided into 17 "planes", each containing 65536 (0x10000) "code points". A code point here is simply the character number that Unicode assigns. But as mentioned above, Unicode numbers not only individual characters but also their components and service marks (and sometimes a number corresponds to nothing at all, at least for now, though that hardly matters to us), so it is more accurate to talk about the numbers themselves rather than about characters. For brevity, though, I will often write "character" below when I mean "code point".
Unicode planes. As you can see, most of them (planes 4 through 13) are still unused.
Most remarkably, all the main substance lies in plane zero, which is called the Basic Multilingual Plane. If a string contains text in any modern language (including Chinese), it will not leave this plane. But the rest of Unicode cannot be cut off either: emoji, for example, mostly live toward the end of the next plane, the Supplementary Multilingual Plane (which runs from 0x10000 to 0x1FFFF). So UTF-16 does the following: every character in the Basic Multilingual Plane is encoded as is, with the corresponding two-byte number. Some numbers in that range, however, do not denote characters at all; they signal that another pair of bytes must be considered after this one. Combining the values of those four bytes yields a number that covers the entire valid Unicode range. This idea is called "surrogate pairs"; you may have heard of them.
So UTF-16 needs two or (in very rare cases) four bytes per code point. That beats using four bytes all the time, but Latin text (and other ASCII characters) encoded this way wastes half its space on zeros. UTF-8 was designed to fix this: ASCII takes a single byte, as before; codes from 0x80 to 0x7FF take two bytes; from 0x800 to 0xFFFF, three; and from 0x10000 to 0x10FFFF, four. On the one hand, things got better for the Latin alphabet: ASCII compatibility is back, and the distribution is spread more evenly from 1 to 4 bytes. On the other hand, non-Latin alphabets gain nothing over UTF-16, and many now need three bytes instead of two: the range covered by a two-byte sequence shrank 32-fold, from 0xFFFF to 0x7FF, and it includes neither Chinese nor, say, Georgian. Cyrillic and a few other alphabets got lucky: 2 bytes per character.
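As a quick illustration of the size trade-offs just described, the following sketch prints how many bytes a few sample characters take in UTF-8, UTF-16, and UTF-32. It uses only Python's built-in codecs; the sample characters and the helper name encoded_sizes are my own choices for illustration.

```python
# Bytes per character in the three common Unicode encodings (no BOM).
SAMPLES = {
    "Latin A": "A",
    "Cyrillic pe": "\u043f",
    "Georgian an": "\u10d0",
    "Chinese zhong": "\u4e2d",
    "Emoji (grinning face)": "\U0001F600",
}


def encoded_sizes(ch: str) -> tuple[int, int, int]:
    """Return (UTF-8, UTF-16, UTF-32) byte counts for one character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # explicit endianness avoids the BOM
        len(ch.encode("utf-32-le")),
    )


for name, ch in SAMPLES.items():
    print(f"{name:24} U+{ord(ch):04X} {encoded_sizes(ch)}")
```

Georgian and Chinese come out at three bytes in UTF-8 versus two in UTF-16, which is the regression the paragraph above points out.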
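Why does this happen? Let's look at how UTF-8 represents character codes:
0xxxxxxx - 1 byte
110xxxxx 10xxxxxx - 2 bytes
1110xxxx 10xxxxxx 10xxxxxx - 3 bytes
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - 4 bytes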
The bits marked x are the ones that carry the number itself. A two-byte sequence has only 11 such bits (out of 16); the leading bits play a purely auxiliary role. In a four-byte sequence, 21 of the 32 bits go to the code point number. Three bytes (24 bits in total) would seem to be enough, but the service markers eat up too much.
Is that bad? Not really. On the one hand, if space matters a lot, there are compression algorithms that easily squeeze out the redundancy. On the other hand, Unicode's goal was to provide the most universal encoding possible. For example, a string encoded in UTF-8 can be handed to code that previously worked only with ASCII without fear that it will see an ASCII-range character that is not really there (in UTF-8, every byte that starts with a zero bit is exactly an ASCII character). And if we want to cut a small tail off a large string without decoding it from the start (or recover part of the data after a damaged section), it is easy to find the offset where a character begins: just skip the bytes that carry the bit prefix 10.
Why invent something new, then?
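Still, there are occasional situations where compression algorithms such as deflate apply poorly, yet you want strings stored compactly. I personally ran into this problem while thinking about building a prefix tree for a multilingual dictionary.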
Separately, I want to point out one more unpleasant nuance of using UTF-8 in such a data structure. In UTF-8, when a character is written as two bytes, the bits of its number are not contiguous but are split by the bit pair 10 in the middle: 110xxxxx 10xxxxxx. Because of that, when the lower 6 bits in the second byte overflow (that is, on the transition 10111111 → 10000000), the first byte changes as well. So the letter "п" is encoded as the bytes 0xD0 0xBF, while the next letter, "р", is already 0xD1 0x80. In a prefix tree this splits the parent node in two: one for the prefix 0xD0 and another for 0xD1 (even though the whole Cyrillic alphabet could be encoded by the second byte alone).
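The byte values quoted above are easy to check. This small sketch (helper names are mine) prints the UTF-8 bytes of the two neighboring letters and collects the distinct first bytes used by the basic Russian lowercase range U+0430 to U+044F, which is what forces two parent nodes in a byte-wise prefix tree.

```python
def utf8_hex(ch: str) -> str:
    """UTF-8 bytes of a single character as space-separated hex."""
    return ch.encode("utf-8").hex(" ")


def lead_bytes(first_cp: int, last_cp: int) -> set[int]:
    """Distinct first UTF-8 bytes used by the code points in an inclusive range."""
    return {chr(cp).encode("utf-8")[0] for cp in range(first_cp, last_cp + 1)}


print(utf8_hex("\u043f"))  # Cyrillic small letter pe
print(utf8_hex("\u0440"))  # Cyrillic small letter er, the very next code point
print(sorted(hex(b) for b in lead_bytes(0x0430, 0x044F)))
```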
What I ended up with
Faced with this problem, I decided to practice some bit-twiddling and at the same time get to know the structure of Unicode a little better. The result is the UTF-C encoding format ("C" for compact), which spends no more than 3 bytes per code point and very often costs just one extra byte for the entire encoded string. As a result, for many non-ASCII alphabets this encoding turns out to be 30-60% more compact than UTF-8.
I have published example implementations of the encoding and decoding algorithms.
Test results and comparison with UTF-8
I also made a demo page.
Removing redundant bits
Naturally, I took UTF-8 as the basis. The first and most obvious thing to change is to reduce the number of service bits in each byte. For example, the first byte in UTF-8 always starts with either 0 or 11; the prefix 10 appears only on continuation bytes. Let's replace the prefix 11 with 1 and drop the prefix from continuation bytes entirely. What do we get?
0xxxxxxx - 1 byte
10xxxxxx xxxxxxxx - 2 bytes
110xxxxx xxxxxxxx xxxxxxxx - 3 bytes
Wait, where is the four-byte sequence? It is no longer needed: three bytes now give us 21 bits, which is enough for every number up to 0x10FFFF.
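What did we sacrifice? Most importantly, the ability to detect character boundaries from an arbitrary position in the buffer. We can no longer point at an arbitrary byte and find the start of the next character from there. That is a limitation of the format, but in practice it is rarely needed. We can usually walk the buffer from the very beginning (especially with short strings).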
Coverage of languages by 2 bytes has improved too: the two-byte format now offers a 14-bit range, which means codes up to 0x3FFF. The Chinese are out of luck (their characters mostly lie between 0x4E00 and 0x9FFF), but Georgians and many other peoples are better off: their languages now fit into 2 bytes per character as well.
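The three-form layout described in this section can be sketched as a tiny stateless codec. This is my own minimal reading of the bit layouts listed above, not the author's published implementation; the function names encode_draft and decode_draft are hypothetical, and no input validation is attempted.

```python
def encode_draft(text: str) -> bytes:
    """Encode with the first-draft layout: 1, 2 or 3 bytes per code point."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:        # 0xxxxxxx
            out.append(cp)
        elif cp < 0x4000:    # 10xxxxxx xxxxxxxx (14 payload bits)
            out += bytes([0x80 | (cp >> 8), cp & 0xFF])
        else:                # 110xxxxx xxxxxxxx xxxxxxxx (21 payload bits)
            out += bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    return bytes(out)


def decode_draft(data: bytes) -> str:
    """Inverse of encode_draft; must start at a character boundary."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, i = b, i + 1
        elif b < 0xC0:
            cp, i = ((b & 0x3F) << 8) | data[i + 1], i + 2
        else:
            cp, i = ((b & 0x1F) << 16) | (data[i + 1] << 8) | data[i + 2], i + 3
        chars.append(chr(cp))
    return "".join(chars)


georgian = "\u10e5\u10d0\u10e0\u10d7\u10e3\u10da\u10d8"  # a 7-letter Georgian word
print(len(georgian.encode("utf-8")), len(encode_draft(georgian)))
```

Because continuation bytes carry no marker, decode_draft only works when it starts at a character boundary, which is exactly the sacrifice discussed above.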
Introducing encoder state
Now let's think about the properties of the strings themselves. A dictionary mostly contains words written with characters of a single alphabet, and the same is true of many other texts. It would be nice to specify that alphabet once and then give only the number of each letter within it. Let's see whether the layout of characters in the Unicode table can help.
As mentioned above, Unicode is divided into planes of 65536 codes each. That is not a very useful division, though (as already noted, we are in plane zero most of the time). The division into blocks is more interesting. These ranges no longer have a fixed length and are more meaningful: as a rule, each one groups characters of the same alphabet.
The block containing the characters of the Bengali alphabet. Unfortunately, for historical reasons this is an example of rather loose packing: 96 characters are scattered haphazardly across 128 block code points.
Block starts and sizes are always multiples of 16; this is done purely for convenience. In addition, many blocks begin and end on values that are multiples of 128 or even 256. For example, basic Cyrillic occupies 256 code points, from 0x0400 to 0x04FF. That is quite convenient: if we store the prefix 0x04 once, any Cyrillic character can be written in one byte. The catch is that we would lose the ability to get back to ASCII (or to any other characters at all). So here is what we do:
- Two bytes 10yyyyyy yxxxxxxx not only denote the character numbered yyyyyy yxxxxxxx but also change the current alphabet to yyyyyy y0000000 (that is, we remember every bit except the lowest 7).
- One byte 0xxxxxxx is a character of the current alphabet. It is simply added to the offset remembered by the two-byte form. Until the alphabet has been changed, the offset is zero, so ASCII compatibility is preserved.
The same goes for codes that need 3 bytes:
- Three bytes 110yyyyy yxxxxxxx xxxxxxxx denote the character numbered yyyyyy yxxxxxxx xxxxxxxx, change the current alphabet to yyyyyy y0000000 00000000 (everything except the lowest 15 bits is remembered), and set a flag saying we are now in long mode (the flag is cleared when the alphabet changes back to a two-byte one).
- Two bytes 0xxxxxxx xxxxxxxx in long mode are a character of the current alphabet. As before, it is added to the offset remembered by the three-byte form. The only difference is that we now read two bytes (because we have switched to this mode).
Sounds good: as long as we keep encoding characters from the same 7-bit Unicode range, we spend 1 extra byte at the start and one byte per character overall.
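The stateful scheme from the two lists above can be sketched as follows. This is a minimal interpretation of the rules as stated (current alphabet offset plus a long-mode flag), not the final UTF-C format; the names encode_stateful, decode_stateful, offs and long_mode are hypothetical.

```python
def encode_stateful(text: str) -> bytes:
    """Encode with an alphabet offset and a long-mode flag, as described above."""
    offs, long_mode = 0, False
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        rel = cp - offs
        if not long_mode and 0 <= rel < 0x80:
            out.append(rel)                              # 0xxxxxxx
        elif long_mode and 0 <= rel < 0x8000:
            out += bytes([rel >> 8, rel & 0xFF])         # 0xxxxxxx xxxxxxxx
        elif cp < 0x4000:
            out += bytes([0x80 | (cp >> 8), cp & 0xFF])  # 10yyyyyy yxxxxxxx
            offs, long_mode = cp & ~0x7F, False
        else:                                            # 110yyyyy yxxxxxxx xxxxxxxx
            out += bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
            offs, long_mode = cp & ~0x7FFF, True
    return bytes(out)


def decode_stateful(data: bytes) -> str:
    """Inverse of encode_stateful; tracks the same state while reading."""
    offs, long_mode = 0, False
    chars, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80 and not long_mode:
            cp, i = offs + b, i + 1
        elif b < 0x80:
            cp, i = offs + ((b << 8) | data[i + 1]), i + 2
        elif b < 0xC0:
            cp, i = ((b & 0x3F) << 8) | data[i + 1], i + 2
            offs, long_mode = cp & ~0x7F, False
        else:
            cp, i = ((b & 0x1F) << 16) | (data[i + 1] << 8) | data[i + 2], i + 3
            offs, long_mode = cp & ~0x7FFF, True
        chars.append(chr(cp))
    return "".join(chars)


word = "\u043f\u0440\u0438\u0432\u0435\u0442"  # six Cyrillic letters
print(len(word.encode("utf-8")), len(encode_stateful(word)))
```

For a six-letter Cyrillic word this gives one two-byte switch followed by one byte per remaining letter, matching the "1 extra byte at the start" estimate.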
Output of one of the earlier versions. It already beats UTF-8 in many cases, but there is still room for improvement.
What got worse? First, we now have state, namely the current alphabet offset and the long-mode flag. This limits us further: the same characters can now be encoded differently in different contexts. Substring search, for example, has to take this into account rather than simply compare bytes. Second, as soon as the alphabet changes, encoding ASCII characters becomes expensive (and that means not only Latin letters but also basic punctuation, including spaces): they require switching the alphabet back to 0, which costs an extra byte again (and then one more to get back to our main alphabet).
One alphabet is good, two are better
Let's tweak our bit prefixes a little, squeezing one more in alongside the three described above:
0xxxxxxx - 1 byte in normal mode, 2 bytes in long mode
11xxxxxx - 1 byte
100xxxxx xxxxxxxx - 2 bytes
101xxxxx xxxxxxxx xxxxxxxx - 3 bytes
The two-byte form now has one fewer bit available: code points up to 0x1FFF rather than 0x3FFF. That is still noticeably more than two-byte UTF-8 codes cover, and most common languages still fit; the most noticeable casualties are hiragana and katakana.
What is this new code, 11xxxxxx? It is a small "stash" of 64 characters that complements the main alphabet, so I called it the auxiliary alphabet. When we switch the current alphabet, a piece of the old alphabet becomes the auxiliary one. For example, after switching from ASCII to Cyrillic, the stash holds 64 characters containing the Latin letters, digits, space, and comma (the most frequent insertions in non-ASCII text). Switch back to ASCII, and the main part of the Cyrillic alphabet becomes the auxiliary alphabet.
With access to two alphabets, we can handle a large share of texts with minimal alphabet-switching cost (punctuation will usually trigger a return to ASCII, but after that many non-ASCII characters come from the auxiliary alphabet without another switch).
Bonus: by giving the auxiliary alphabet the prefix 11xxxxxx and choosing 0xC0 as its initial offset, we get partial compatibility with CP1252. In other words, many (though not all) Western European texts encoded in CP1252 look exactly the same in UTF-C.
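The CP1252 claim can be checked with a short sketch. It assumes the simplest possible rule for the auxiliary form, namely that a byte 11xxxxxx decodes to the auxiliary offset plus its low six bits; that rule and the helper names are my assumption, and the real implementation may remap the stash differently for specific alphabets.

```python
def decode_aux_byte(byte: int, aux_offs: int = 0xC0) -> str:
    """Decode one 11xxxxxx byte as aux_offs + low six bits (assumed rule)."""
    if byte & 0xC0 != 0xC0:
        raise ValueError("not an auxiliary-alphabet byte")
    return chr(aux_offs + (byte & 0x3F))


def matches_cp1252() -> bool:
    """True if every byte 0xC0..0xFF decodes to the same character as in CP1252."""
    return all(
        decode_aux_byte(b) == bytes([b]).decode("cp1252")
        for b in range(0xC0, 0x100)
    )


print(decode_aux_byte(0xE9), matches_cp1252())
```

The compatibility is only partial because CP1252 bytes from 0x80 to 0xBF fall under other UTF-C prefixes.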
A difficulty arises here, though: how do we derive the auxiliary alphabet from the main one? We could keep the same offset, but unfortunately the structure of Unicode works against us here. Very often the main part of an alphabet does not sit at the start of its block (for example, the Russian capital "А" has the code 0x0410, although the Cyrillic block starts at 0x0400). So by taking the first 64 characters into the stash, we might lose access to the tail of the alphabet.
To fix this, I went through some of the blocks for various languages by hand and specified, for each, the offset of the auxiliary alphabet within the main one. Latin, as an exception, was reordered altogether, in the manner of base64.
Finishing touches
Finally, let's think about what else could be improved.
Note that the form 101xxxxx xxxxxxxx xxxxxxxx can encode numbers up to 0x1FFFFF, while Unicode ends earlier, at 0x10FFFF. In other words, the last code point is represented as 10110000 11111111 11111111. So we can declare that a first byte of the form 1011xxxx (where xxxx is greater than 0) means something else. For example, we could add another 15 characters there that are always available in a single byte, but I decided to do something different.
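The arithmetic behind the spare first bytes can be verified directly. This sketch packs a code point into the 101xxxxx three-byte form and lists the first-byte values that the largest code point, 0x10FFFF, leaves unused; the helper name three_byte_form is my own.

```python
def three_byte_form(cp: int) -> bytes:
    """Pack a code point as 101xxxxx xxxxxxxx xxxxxxxx (21 payload bits)."""
    if not 0 <= cp <= 0x1FFFFF:
        raise ValueError("value does not fit in 21 bits")
    return bytes([0xA0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])


last = three_byte_form(0x10FFFF)
print(" ".join(f"{b:08b}" for b in last))

# First bytes of the form 1011xxxx that no valid code point ever produces.
free_lead_bytes = [b for b in range(0xB0, 0xC0) if b > last[0]]
print([hex(b) for b in free_lead_bytes])
```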
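Let's look at the Unicode blocks that now need three bytes. As already mentioned, these are mostly Chinese characters, but it is hard to do anything with those: there are 21 thousand of them. Hiragana and katakana ended up there too, however, and there are far fewer of those, under two hundred. And since we are on the subject of Japanese, there are also emoji (they are in fact scattered across many places in Unicode, but the main blocks lie in the range 0x1F300 - 0x1FBFF). Now consider that some emoji are assembled from several code points at once (for example, sequences joined by zero-width joiners), so three bytes per code point adds up quickly.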
So we pick a few ranges corresponding to emoji, hiragana, and katakana, renumber them into one contiguous list, and encode them in two bytes instead of three:
1011xxxx xxxxxxxx
Great: the emoji sequences mentioned above now cost two bytes per remapped code point instead of three.
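To make the cost of multi-code-point emoji concrete, here is a small sketch. It assumes, as an illustration only, a four-person family emoji built from four emoji joined by three zero-width joiners (U+200D); the original example character did not survive in this text, so that choice and the name cost_summary are mine.

```python
# Man, ZWJ, woman, ZWJ, girl, ZWJ, boy: one visible glyph, seven code points.
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"


def cost_summary(text: str) -> dict[str, int]:
    """Code point count, UTF-8 size, and size at a flat three bytes per code point."""
    n = len(text)
    return {
        "code_points": n,
        "utf8_bytes": len(text.encode("utf-8")),
        "three_bytes_each": 3 * n,
    }


print(cost_summary(FAMILY))
```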
Let's try to fix one more problem. As we remember, the main alphabet is essentially the top 6 bits, which we keep in mind and glue onto the code of each subsequently decoded character. For the Chinese characters in the block 0x4E00 - 0x9FFF, those top bits are either 0 or 1. This is not very convenient: we would have to keep switching the alphabet between those two values (that is, spend three bytes each time). Note, however, that in long mode we can subtract from the code itself the number of characters that are encoded in short mode (after all the tricks above, that is 10240). The range of Chinese characters then shifts to 0x2600 - 0x77FF, and across that whole range the top 6 bits (out of 21) are zero. As a result, runs of Chinese characters use two bytes per character (which is optimal for such a large range) without triggering alphabet switches.
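The shift described above is simple enough to check numerically. The sketch below uses the 10240 figure quoted in the text and the 15 low bits of the long-mode form; the constant and function names are hypothetical.

```python
SHORT_MODE_CODES = 10240  # number of code points handled by the short forms, per the text


def long_mode_value(cp: int) -> int:
    """Value actually stored in long mode after subtracting the short-mode codes."""
    return cp - SHORT_MODE_CODES


def top_bits(value: int, low_bits: int = 15) -> int:
    """The high part of a 21-bit value that acts as the long-mode alphabet."""
    return value >> low_bits


cjk = range(0x4E00, 0xA000)
print(hex(long_mode_value(0x4E00)), hex(long_mode_value(0x9FFF)))
print({top_bits(cp) for cp in cjk})                   # without the shift
print({top_bits(long_mode_value(cp)) for cp in cjk})  # with the shift
```

Without the shift the block straddles the 0x8000 boundary and needs two different alphabets; with it, a single alphabet covers the whole block.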
Alternative solutions: SCSU, BOCU-1
Unicode experts, on reading just the title of this article, will probably hasten to point out that the Unicode standards already include the Standard Compression Scheme for Unicode (SCSU).
I admit it honestly: I learned of its existence only when I was already deep into writing my own solution. Had I known about it from the start, I would probably have tried to write an implementation instead of inventing my own approach.
Interestingly, SCSU uses ideas very similar to the ones I arrived at myself (instead of "alphabets" it has "windows", and more of them are available than I have). The format has drawbacks too: it sits a little closer to compression algorithms than to encoding ones. In particular, the standard offers many ways to represent the same text but does not say how to choose the best one, so the encoder has to rely on some kind of heuristics. An SCSU encoder that packs well will therefore be more complex and more cumbersome than my algorithm.
For comparison, I ported a relatively simple SCSU implementation to JavaScript. In code size it turned out comparable to UTF-C, but in some cases its output was tens of percent worse (sometimes it can come out ahead, but not by much). For example, Hebrew and Greek texts were encoded by UTF-C 60% better than by SCSU (probably because their alphabets are compact).
Separately, I will add that besides SCSU there is another way to represent Unicode compactly: BOCU-1.
Possible improvements
The algorithm I have presented is not universal by design (this is probably where my goals diverge most from those of the Unicode Consortium). As I already mentioned, it was developed primarily for a single task (storing a multilingual dictionary in a prefix tree), and some of its features may suit other tasks poorly. But the fact that it is not a standard can be an advantage: it is easy to modify to fit your needs.
For example, there is an obvious way to get rid of the state and make the encoding stateless: simply do not update the variables offs, auxOffs, and is21Bit in the encoder and decoder. Sequences of characters from the same alphabet can then no longer be packed efficiently, but the same character is guaranteed to be encoded with the same bytes regardless of context.
You can also tune the encoder to a particular language by changing the default state. For example, to target Russian text, initialize the encoder and decoder with offs = 0x0400 and auxOffs = 0. This makes particular sense in stateless mode. The overall effect resembles using an old eight-bit encoding, but without giving up the ability to insert characters from all of Unicode as needed.
Another drawback mentioned earlier is that a large text encoded in UTF-C offers no quick way to find the character boundary nearest to an arbitrary byte. If you cut, say, the last 100 bytes off an encoded buffer, you risk ending up with garbage you cannot do anything with. The encoding was not designed for storing multi-gigabyte logs, but in general this can be fixed. The byte 0xBF must never appear as the first byte of a character (though it can be the second or third). So when encoding, you can insert the sequence 0xBF 0xBF 0xBF every 10 KB or so. Then, to find a boundary, it is enough to scan the chosen piece until such a marker turns up. Whatever follows the last 0xBF is guaranteed to be the start of a character. (When decoding, this three-byte sequence of course has to be ignored.)
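The marker idea can be sketched independently of the codec itself. The example below assumes the encoder hands over already-encoded characters one at a time (a hypothetical interface) and relies on the property stated above that 0xBF never starts a character; the function names and the tiny interval used in the demo are mine.

```python
MARKER = b"\xBF\xBF\xBF"


def add_sync_markers(encoded_chars: list[bytes], interval: int = 10 * 1024) -> bytes:
    """Join per-character byte strings, inserting a marker roughly every `interval` bytes."""
    out = bytearray()
    since_marker = 0
    for unit in encoded_chars:
        if since_marker >= interval:
            out += MARKER  # always placed between two characters
            since_marker = 0
        out += unit
        since_marker += len(unit)
    return bytes(out)


def next_boundary(data: bytes, start: int) -> int:
    """Return the index just past the next marker run at or after `start`, or -1."""
    pos = data.find(MARKER, start)
    if pos < 0:
        return -1
    pos += len(MARKER)
    # A marker may directly follow trailing 0xBF bytes of a character, so the run
    # can be longer than three bytes; the character starts after the last 0xBF.
    while pos < len(data) and data[pos] == 0xBF:
        pos += 1
    return pos


units = [b"\x41", b"\x84\xbf", b"\x42", b"\x43"]  # hypothetical encoded characters
stream = add_sync_markers(units, interval=2)
print(stream.hex(" "), next_boundary(stream, 1))
```

A natural run of 0xBF bytes is at most two long (the second and third bytes of one character), so three in a row can only come from a marker.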
Summary
If you have read this far, congratulations! I hope that, like me, you learned something new (or refreshed your memory) about how Unicode is structured.
Demo page. The Hebrew example shows the advantage over both UTF-8 and SCSU.
The research described above should not be seen as an attack on the standards. That said, I am satisfied overall with the results of my work.
Finally, let me once again list the cases where using UTF-C is not worth it:
- If your strings are long enough (100-200 characters or more). In that case you should consider compression algorithms such as deflate.
- If you need ASCII transparency, that is, if it matters that encoded sequences contain no ASCII codes that were absent from the original string. You can avoid this requirement if, when talking to third-party APIs (for example, a database), you pass the encoded result as an opaque set of bytes rather than as a string. Otherwise you risk unexpected vulnerabilities.
- If you want to find character boundaries quickly at an arbitrary offset (for example, when part of a string is damaged). This can be done, but only by scanning the string from the start (or by applying the modification described in the previous section).
- If you need to perform operations on string contents quickly (sorting, substring search, concatenation). These require decoding the strings first, so UTF-C will be slower than UTF-8 here (though faster than compression algorithms). Since the same string is always encoded the same way, exact equality comparison needs no decoding and can be done byte by byte.
Source: habr.com