How Linux's sort sorts strings

Introduction

It all started with a short script that was supposed to merge the e-mail addresses of employees, taken from the list of mailing-list users, with the employees' job titles, taken from the HR department database. Both lists were exported to Unicode UTF-8 text files and saved with Unix line endings.

Contents of mail.txt

Иванов Андрей;[email protected]

Contents of buhg.txt

Иванова Алла;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Абаканов Михаил;маляр

To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:

$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted: Иванов Андрей;слесарь

Inspecting the sort result by eye shows that the order is correct on the whole, but when a male and a female surname coincide, the women come before the men:

$> sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

It looks like either a sorting glitch in Unicode or a manifestation of feminism in the sorting algorithm. The first is, of course, more plausible.

Let's put join aside for now and concentrate on sort. We will try to solve the problem by scientific poking. First, let's change the locale from en_US to ru_RU. For sorting it would be enough to set the environment variable LC_COLLATE, but we won't waste time on trifles:

$> LANG=ru_RU.UTF-8 sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

Nothing has changed.

Let's try recoding the file into a single-byte encoding:

$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF8

Again, nothing has changed.

There is nothing for it: we have to look for a solution on the Internet. There is nothing directly about Russian surnames, but there are questions about other sorting oddities. For example, here is one such problem: unix sort treats '-' (dash) characters as invisible. In short, the strings "a-b", "aa", "ac" are sorted as "aa", "a-b", "ac".
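The effect is easy to reproduce outside of sort. The Python sketch below is only an illustration and not what sort does internally: it compares plain code-point order with a key that simply drops the hyphen, on the simplifying assumption that the hyphen carries no weight at all.

```python
# Toy illustration: the same three strings under two comparison rules.
words = ["a-b", "aa", "ac"]

# Plain code-point order: '-' (0x2D) is smaller than any letter.
byte_order = sorted(words)

# Simplified "dictionary" order: the hyphen is treated as invisible.
def ignore_hyphen(s):
    return s.replace("-", "")

dictionary_order = sorted(words, key=ignore_hyphen)

print(byte_order)
print(dictionary_order)
```

The second list reproduces the order quoted in the question, which suggests that the locale's rules, and not the sorting algorithm, are responsible.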

The answer is the same everywhere: use the programmer's locale "C" and you will be happy. Let's try:

$> LANG=C sort buhg.txt
Ёлкина Элла;крановщица
Абаканов Михаил;маляр
Иванов Андрей;слесарь
Иванова Алла;адвокат

Something has changed. The Ivanovs have lined up in the right order, although Yolkina has slipped out of place. Let's go back to the original problem:

$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result

It worked without errors, just as the Internet promised. And that is despite Yolkina being on the first line.

The problem seems to be solved, but just in case, let's try another Russian encoding, Windows CP1251:

$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF8

Oddly enough, the sort result coincides with the "C" locale, and the whole example accordingly runs without errors. Some kind of mysticism.

I don't like mysticism in programming because it usually hides errors. We will have to look seriously at how sort works and what LC_COLLATE affects.

At the end I will try to answer these questions:

  • why were the female surnames sorted incorrectly?
  • why did LANG=ru_RU.CP1251 turn out to be equivalent to LANG=C?
  • why do sort and join have different ideas about the order of sorted strings?
  • why are there errors in all of my examples?
  • finally, how do you sort strings the way you like?

Sorting in Unicode

The first stop is Technical Report No. 10, titled Unicode Collation Algorithm, on unicode.org. The report contains a lot of technical detail, so let me give a brief summary of the main ideas.

Collation, the "comparison" of strings, is the basis of any sorting algorithm. The algorithms themselves may differ ("bubble", "merge", "quick"), but all of them compare pairs of strings to determine the order in which they appear.

Sorting natural-language strings is a rather difficult problem. Even in the simplest single-byte encodings, the order of letters in an alphabet that differs even slightly from the English Latin alphabet no longer coincides with the order of the numeric values that encode those letters. In the German alphabet the letter Ö stands between O and P, while in the CP850 encoding it falls between ÿ and Ü.
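The CP850 example can be checked directly. The following Python snippet is a small sketch that relies only on the standard cp850 codec; it prints the byte value of each letter and then sorts the letters by those bytes.

```python
# Byte values of a few letters in the single-byte CP850 encoding.
letters = "OPÖÜÿ"
for ch in letters:
    print(ch, hex(ch.encode("cp850")[0]))

# Sorting by encoded bytes puts Ö between ÿ and Ü, far away from O and P.
by_bytes = "".join(sorted(letters, key=lambda ch: ch.encode("cp850")))
print(by_bytes)
```

Sorting by raw byte values therefore cannot yield German alphabetical order, whatever sorting algorithm is used.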

You can try to abstract away from a specific encoding and consider "ideal" letters arranged in some order, as is done in Unicode. The encodings UTF-8, UTF-16 or the single-byte KOI8-R (if a limited subset of Unicode is enough) give different numeric representations of the letters, but refer to the same elements of the base table.

It turns out that even if we build a character table from scratch, we cannot assign a universal order to it. In different national alphabets that use the same letters, the order of those letters may differ. For example, in French Æ is considered a ligature and is sorted as the string AE. In Norwegian Æ is a separate letter that comes after Z. By the way, besides ligatures such as Æ there are letters written with several characters. The Czech alphabet has the letter Ch, which stands between H and I.

Besides the differences between alphabets, there are other national traditions that affect sorting. In particular, the question arises: in what order should words made of upper-case and lower-case letters appear in a dictionary? Sorting can also be affected by the use of punctuation. In Spanish, an inverted question mark is placed at the start of an interrogative sentence ("¿Te gusta la música?", that is, "Do you like music?"). In this case it is obvious that interrogative sentences should not be grouped into a separate cluster outside the alphabet, but how should lines with other punctuation marks be sorted?

I will not dwell on sorting strings in languages that differ greatly from the European ones. Note only that in languages written right to left or top to bottom, the characters of a string are most often stored in reading order, and that even non-alphabetic writing systems have their own ways of ordering strings character by character. For example, hieroglyphs can be ordered by style (Chinese character keys) or by pronunciation. To be honest, I have no idea how emoji should be ordered, but something can be devised for them too.

Based on the features listed above, the basic requirements for comparing strings on the basis of Unicode tables were formulated:

  • string comparison does not depend on the position of characters in the code table;
  • sequences of characters that form a single character are reduced to canonical form (A + ring above is the same as Å);
  • when strings are compared, a character is considered in the context of the string and, if necessary, is combined with its neighbours into a single comparison unit (Ch in Czech) or split into several (Æ in French);
  • all national features (alphabet, upper/lower case, punctuation, order of writing systems) must be configurable, down to specifying the order by hand (emoji);
  • comparison matters not only for sorting but in many other places as well, for example for specifying ranges of strings (the {A..z} expansion in bash);
  • comparison must be reasonably fast.
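The canonical-form requirement can be illustrated with Python's standard unicodedata module. This is a minimal sketch of normalisation only, not of the collation algorithm itself:

```python
import unicodedata

decomposed = "A\u030a"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
precomposed = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE

# The raw strings differ, but their normalised forms are identical.
same_raw = decomposed == precomposed
same_nfc = (unicodedata.normalize("NFC", decomposed)
            == unicodedata.normalize("NFC", precomposed))
same_nfd = (unicodedata.normalize("NFD", decomposed)
            == unicodedata.normalize("NFD", precomposed))
print(same_raw, same_nfc, same_nfd)
```

Whichever normal form is chosen, it has to be applied to both strings before their weights are looked up.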

In addition, the authors of the report formulated properties of comparison that algorithm developers must not rely on:

  • the comparison algorithm must not require a separate character set for each language (Russian and Ukrainian share most Cyrillic characters);
  • comparison must not rely on the order of characters in the Unicode tables;
  • the weight of a string must not be an attribute of the string, since the same string may have different weights in different cultural contexts;
  • string weights may change when strings are concatenated or split (x < y does not imply xz < yz);
  • different strings that have the same weights are considered equal from the point of view of the sorting algorithm. Introducing an additional ordering of such strings is possible, but may degrade performance;
  • on repeated sorting, lines with the same weights may be swapped. Stability is a property of a particular sorting algorithm and not of the string comparison algorithm (see the previous point);
  • sorting rules may change over time as cultural traditions are refined or changed.

It is also stipulated that the comparison algorithm knows nothing about the semantics of the strings being processed. Strings consisting only of digits should not be compared as numbers, and in lists of English titles the article is not moved to the end (Beatles, The).

To satisfy all of these requirements, a multi-level (in fact, four-level) table-driven sorting algorithm is proposed.

First, the characters of a string are reduced to canonical form and grouped into comparison units. Each comparison unit is assigned several weights, corresponding to several levels of comparison. The weights of comparison units are elements of ordered sets (in this case, integers) that can be compared as greater or less. The special value IGNORED (0x0) means that the unit does not take part in the comparison at the corresponding level. The comparison of strings may be repeated several times, using the weights of the corresponding levels. At each level, the weights of the comparison units of the two strings are compared with each other in sequence.

In different implementations of the algorithm for different national traditions the values of the coefficients may differ, but the Unicode standard includes a base table of weights, the "Default Unicode Collation Element Table" (DUCET). I would like to note that setting the LC_COLLATE variable is in fact an instruction to select the weight table used by the string comparison function.

The DUCET weights are arranged as follows:

  • at the first level, all letters are reduced to the same case, diacritics are discarded, and punctuation marks (not all of them) are ignored;
  • at the second level, only diacritics are taken into account;
  • at the third level, only case is taken into account;
  • at the fourth level, only punctuation marks are taken into account.

The comparison takes place in several passes: first the first-level coefficients are compared; if the weights are equal, the comparison is repeated with the second-level weights; then, possibly, with the third and the fourth.

The comparison ends as soon as the strings contain corresponding comparison units with different weights. Strings whose weights are equal at all four levels are considered equal.
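The passes described above can be sketched in Python. This is a toy model under loudly stated assumptions: the `weights` function is hypothetical and derives its numbers from code points and `unicodedata`, whereas a real implementation looks every weight up in a table such as DUCET and also handles multi-character comparison units.

```python
import unicodedata

def weights(ch):
    """Hypothetical (L1, L2, L3, L4) weights for one character; 0 means 'ignored'."""
    if not ch.isalnum():
        return (0, 0, 0, ord(ch))                  # punctuation counts only at level 4
    base = unicodedata.normalize("NFD", ch)        # 'ô' -> 'o' + combining circumflex
    primary = ord(base[0].lower())                 # case and diacritics removed
    secondary = 1 + sum(ord(c) for c in base[1:])  # 1 means "no diacritic"
    tertiary = 2 if ch.isupper() else 1            # upper case is "heavier"
    return (primary, secondary, tertiary, 0)

def sort_key(s):
    table = [weights(ch) for ch in unicodedata.normalize("NFC", s)]
    # One list of non-ignored weights per level. Tuples compare element by
    # element, so level 2 is consulted only when level 1 is equal, and so on.
    return tuple([w[level] for w in table if w[level] != 0] for level in range(4))

words = ["role", "Role", "rôle", "co-op", "coop", "roles"]
result = sorted(words, key=sort_key)
print(result)
```

With these toy weights "coop" and "co-op" differ only at the fourth level, "role" and "Role" only at the third, and "role" and "rôle" only at the second.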

This algorithm (together with a heap of additional technical details) gave Report No. 10 its name: "Unicode Collation Algorithm" (UCA).

At this point the sorting behaviour from our example becomes a little clearer. It would be good to compare it with the Unicode standard.

To test UCA implementations there is a special test that uses the weights file implementing DUCET. The weights file contains all sorts of amusing things. For example, it defines the order of mahjong tiles and European dominoes, as well as the order of suits in a deck of cards (characters 1F000 and up). The card suits are arranged according to the rules of bridge (spades, hearts, diamonds, clubs), and the cards within a suit run A, 2, 3 ... K.

Checking by hand that lines are sorted correctly according to DUCET would be rather tedious, but fortunately for us there is a reference implementation of a library for working with Unicode: "International Components for Unicode" (ICU).

The website of this library, developed at IBM, has demo pages, including a page for the string comparison algorithm. We enter our test lines with the default settings and, lo and behold, we get a perfect Russian sort.

Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;адвокат

By the way, the ICU website also clarifies how the comparison algorithm handles punctuation. In the Collation FAQ examples, the apostrophe and the hyphen are ignored.

Unicode has helped us, but the reasons for the strange behaviour of sort in Linux will have to be sought elsewhere.

Sorting in glibc

A quick look at the source code of the sort utility from GNU Core Utils showed that in the utility itself localization boils down to printing the current value of the LC_COLLATE variable when run in debug mode:

$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

String comparisons are done with the standard strcoll function, which means that everything interesting is inside the glibc library.
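The same C library routine is reachable from Python through `locale.strcoll`, which makes for a quick experiment. The sketch below uses the "C" locale only because it is guaranteed to exist; substituting another locale name is an assumption about what is installed on the machine.

```python
import functools
import locale

# "C" is always available; replace it with e.g. "en_US.UTF-8" or
# "ru_RU.UTF-8" if that locale is installed on your system.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["aa", "ac", "a-b"]
# locale.strcoll wraps the C library's string collation function.
ordered = sorted(words, key=functools.cmp_to_key(locale.strcoll))
print(ordered)
```

Under "C" the hyphen sorts before the letters; under a UTF-8 locale on glibc the result may differ, which is exactly the behaviour under investigation.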

The glibc project wiki devotes an article to string comparison. From that article one can gather that sorting in glibc is based on the already familiar UCA (Unicode Collation Algorithm) and/or on the closely related standard ISO 14651 (International string ordering and comparison). Regarding the latter, it should be noted that the site standard.iso.org officially declares ISO 14651 publicly available, but the corresponding link leads to a non-existent page. Google returns several pages of links to official sites offering an electronic copy of the standard for a hundred euros, but on the third or fourth page of search results there are also direct links to the PDF. On the whole, the standard is practically the same as UCA, but it is more boring to read because it lacks vivid examples of national string-sorting peculiarities.

The most interesting piece of information on the wiki is a link to the bug tracker with a discussion of the implementation of string comparison in glibc. From the discussion one can learn that glibc compares strings using ISO's own table, the Common Template Table (CTT), whose address can be found in Annex A of ISO 14651. Between 2000 and 2015 this table had no maintainer in glibc and differed considerably (at least outwardly) from the current version of the standard. From 2015 to 2018 the table was migrated to the new version, and today you can run into both the new version of the table (CentOS 8) and the old one (CentOS 7) in real life.

Now that we have all the information about the algorithm and the auxiliary tables, we can return to the original problem and work out how to sort strings correctly in the Russian locale.

ISO 14651 / 14652

On most Linux distributions the source of the table we are interested in, the CTT, is located in the directory /usr/share/i18n/locales/. The table itself is in the file iso14651_t1_common. That file is then included by the directive copy iso14651_t1_common into the file iso14651_t1, which in turn is included in the national files, among them en_US and ru_RU. On most Linux distributions all the source files are part of the base installation, but if they are missing you will have to install an additional package from the distribution.

The format of the iso14651_t1 file may seem terribly verbose, with non-obvious rules for constructing names, but once you look into it everything is quite simple. The format is described in the standard ISO 14652, a copy of which can be downloaded from the open-std.org website. Another description of the file format can be found in the POSIX specifications from OpenGroup. As an alternative to reading the standards, you can study the source code of the collate_read function in glibc/locale/programs/ld-collate.c.

The structure of the file is as follows:

By default, the character \ is used as the escape character, and the rest of a line after the character # is a comment. Both characters can be redefined, which is what the new version of the table does:

escape_char /
comment_char %

The file may contain tokens of the form <Uxxxx> or <Uxxxxxxxx> (where x is a hexadecimal digit). This is the hexadecimal representation of Unicode code points in the UCS-4 (UTF-32) encoding. All other elements in angle brackets (including <2> and the like) are treated as plain string constants that mean little outside their context.
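The distinction between code-point tokens and plain names can be sketched with a regular expression. The pattern below is an assumption made for illustration: it accepts 4 to 8 hexadecimal digits, because the excerpts in this article also contain five-digit tokens such as <U1D7D1>. The helper name `decode_token` is hypothetical.

```python
import re

# <U....> followed by 4 to 8 hex digits; anything else in angle brackets is just a name.
CODE_POINT = re.compile(r"<U([0-9A-Fa-f]{4,8})>")

def decode_token(token):
    """Return the character for a code-point token, or None for a plain symbolic name."""
    match = CODE_POINT.fullmatch(token)
    return chr(int(match.group(1), 16)) if match else None

print(decode_token("<U0413>"))   # CYRILLIC CAPITAL LETTER GHE
print(decode_token("<U0033>"))   # DIGIT THREE
print(decode_token("<S0033>"))   # None: only a symbolic name
print(decode_token("<2>"))       # None
```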

The line LC_COLLATE tells us that what follows is the data describing string comparison.

First, names are declared for the weights in the comparison table and for combinations of characters. Strictly speaking, the two kinds of names belong to two different entities, but in the real file they are mixed together. The names of weights are declared with the keyword collating-symbol (comparison symbol), because during comparison Unicode characters that have the same weights are treated as equivalent characters.

The total length of this section in the current revision of the file is about 900 lines. I have pulled examples from several places to show how arbitrary the names are and to show the different kinds of syntax.

LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

  • collating-symbol <OSMANYA> registers the string OSMANYA in the table of weight names
  • collating-symbol <S1D000>..<S1D35F> registers a sequence of names consisting of the prefix S and a hexadecimal numeric suffix from 1D000 to 1D35F.
  • FFFF in collating-symbol <SFFFF> looks like a large unsigned integer in hexadecimal, but it is only a name and could have been spelled differently
  • a name of the form <Uxxxx> denotes a code point in the UCS-4 encoding
  • collating-element <U0413_0301> from "<U0413><U0301>" registers a new name for a pair of Unicode code points.

Once the weight names have been declared, the actual weights are assigned. Since only greater-than and less-than relations matter in comparison, the weights are determined simply by the order in which the names are listed. The "lighter" weights are listed first, then the "heavier" ones. Let me remind you that every Unicode character is assigned four different weights. Here they are merged into a single ordered sequence. In theory any symbolic name could be used at any of the four levels, but the comments indicate that the developers mentally divide the names by level.

% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE
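Since only the listing order matters, the weights can be modelled as positions in a list. The Python sketch below uses a handful of names from the excerpt above; the real list is far longer, and how glibc stores it internally is not something this example claims to reproduce.

```python
# A few names in the order in which they appear in the excerpt above.
listing = ["<RES-1>", "<BLK>", "<MIN>", "<WIDE>",
           "<BASE>", "<LOWLINE>",
           "<S0434>", "<S0435>", "<S0436>"]

# The position in the listing is the only thing that matters.
rank = {name: position for position, name in enumerate(listing)}

min_is_lighter = rank["<MIN>"] < rank["<WIDE>"]
de_before_zhe = rank["<S0434>"] < rank["<S0436>"]
print(min_is_lighter, de_before_zhe)
```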

And finally, the actual table of weights.

The weights section is enclosed between the keywords order_start and order_end. Additional parameters of order_start determine the direction in which strings are scanned at each level of comparison. The default is forward. The body of the section consists of lines containing a character code and its four weights. The character code can be given as the character itself, as a code point, or as a symbolic name declared earlier. The weights can likewise be given as symbolic names, code points, or the characters themselves. If code points or characters are used, their weight equals the numeric value of the code point (its position in the Unicode table). Characters that are not listed explicitly are (as far as I understand) considered to be assigned a primary weight equal to their position in the Unicode table. The special weight value IGNORE means that the character is ignored at the corresponding level of comparison.

To demonstrate the structure of the weights, I picked three fairly obvious fragments:

  • characters that are ignored completely
  • characters equivalent to the digit three at the first two levels
  • the beginning of the Cyrillic alphabet, which contains no diacritics and is therefore sorted mainly by the first and third levels.

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end
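An entry of this table is easy to take apart mechanically. The following Python sketch is a simplification: the helper `parse_entry` is hypothetical, it assumes `%` as the comment character as in the excerpt, and it ignores escape characters and quoted strings that the real format allows.

```python
def parse_entry(line, comment_char="%"):
    """Split one order_start entry into (character, [w1, w2, w3, w4])."""
    body = line.split(comment_char, 1)[0].strip()   # drop the trailing comment
    symbol, weights = body.split(None, 1)           # first blank separates the two parts
    return symbol, weights.split(";")

entry = parse_entry("<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE")
ignored = parse_entry("<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)")
print(entry)
print(ignored)
```

Comparing the parsed weights of <U0033> and <UFF13> shows that the two digits share the first two levels and differ only at the third.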

Now we can return to sorting the examples from the beginning of the article. The ambush lies in this part of the weights table:

<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

You can see that in this table the punctuation marks from the ASCII table (including the space) are almost always ignored when strings are compared. The only exception is strings that match in everything except punctuation marks found at the same positions. To the comparison algorithm, the lines from my example (after sorting) look like this:

АбакановМихаилмаляр
ЁлкинаЭллакрановщица
ИвановаАлламаляр
ИвановАндрейслесарь

Given that in the weights table upper-case Russian letters come after lower-case ones (at the third level <CAP> is heavier than <MIN>), the sort order looks absolutely correct.
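The observed order can be reproduced with a first-level key alone. The Python sketch below is a deliberate simplification: it hard-codes the Russian alphabet, ignores everything that is not a letter, and folds case, which is enough for this data but is not the full four-level algorithm.

```python
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

def first_level_key(line):
    # The space and ';' are ignored, case is ignored, only letter order counts.
    return [ALPHABET.index(ch) for ch in line.lower() if ch in ALPHABET]

lines = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]
ordered = sorted(lines, key=first_level_key)
for line in ordered:
    print(line)
```

With the space gone, the eighth letter decides the Ivanovs: "а" from "Алла" is compared with "н" from "Андрей", so Alla comes first.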

When the variable is set to LC_COLLATE=C, a special table is loaded that specifies byte-by-byte comparison:

static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};

Since in Unicode the code point of Ё comes before that of А, the lines are sorted accordingly.
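This is easy to verify in Python, whose default string comparison is by code point. The sketch also sorts by UTF-8 bytes to show that both approaches give the same order for this data.

```python
lines = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]

# U+0401 (Ё) lies before U+0410 (А) in the Unicode table.
print(hex(ord("\u0401")), hex(ord("\u0410")))

by_code_point = sorted(lines)
by_utf8_bytes = sorted(lines, key=lambda s: s.encode("utf-8"))
for line in by_code_point:
    print(line)
```

The space (0x20) is smaller than any Cyrillic letter, which is why "Иванов Андрей" now precedes "Иванова Алла".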

Text and binary tables

Obviously, string comparison is an extremely frequent operation, and parsing the CTT table is a rather expensive procedure. To optimise access to the table, it is compiled into binary form with the localedef command.

The localedef command takes as parameters a file with the table of national features (option -i), in which all characters are represented by Unicode code points, and a file that maps Unicode code points to the characters of a specific encoding (option -f). As a result, binary files are created for the locale whose name is given in the last parameter.

glibc supports two binary file formats: "traditional" and "modern".

The traditional format implies that the name of the locale is the name of a subdirectory of /usr/lib/locale/. This subdirectory holds the binary files LC_COLLATE, LC_CTYPE, LC_TIME and so on. The file LC_IDENTIFICATION contains the formal name of the locale (which may differ from the directory name) and comments.

The modern format stores all locales in a single archive, /usr/lib/locale/locale-archive, which is mapped into the virtual memory of every process that uses glibc. In the modern format the locale name undergoes some canonicalisation: only digits and letters, reduced to lower case, are kept in the encoding name. Thus ru_RU.KOI8-R is saved as ru_RU.koi8r.
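The canonicalisation rule can be written down in a few lines. The Python function below is a sketch of the rule as described here, not glibc's actual code; the name `archive_name` is hypothetical.

```python
def archive_name(locale_name):
    """Lower-case the codeset part and keep only its letters and digits."""
    head, dot, rest = locale_name.partition(".")
    if not dot:
        return locale_name                     # no codeset part, nothing to do
    codeset, at, modifier = rest.partition("@")
    normalised = "".join(ch.lower() for ch in codeset if ch.isalnum())
    return head + "." + normalised + at + modifier

print(archive_name("ru_RU.KOI8-R"))
print(archive_name("en_US.UTF-8"))
print(archive_name("ru_RU"))
```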

The input files are searched for in the current directory, and also in the directories /usr/share/i18n/locales/ and /usr/share/i18n/charmaps/ for CTT files and encoding files respectively.

For example, the command

localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

compiles the file /usr/share/i18n/locales/ru_RU using the encoding file /usr/share/i18n/charmaps/MAC-CYRILLIC.gz and saves the result in /usr/lib/locale/locale-archive under the name ru_RU.maccyrillic.

If you set the variable LANG=en_US.UTF-8, glibc looks for the locale binaries in the following sequence of files and directories:

/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/

If a locale exists in both the traditional and the modern format, priority is given to the modern one.

You can list the compiled locales with the command locale -a.

Preparing your own comparison table

Now, armed with knowledge, you can create your own ideal string comparison table. This table should compare Russian letters correctly, including the letter Ё, and at the same time take punctuation marks into account according to the ASCII table.

Preparing your own sorting table takes two steps: editing the weights table and compiling it into binary form with the localedef command.

So that the comparison table can be adjusted with minimal editing, the ISO 14652 format provides sections for adjusting the weights of an existing table. A section begins with the keyword reorder-after and names the position after which the replacement is made. The section ends with the line reorder-end. If several parts of the table need to be corrected, a separate section is created for each part.

I copied the new versions of the files iso14651_t1_common and ru_RU from the glibc repository into my home directory, ~/.local/share/i18n/locales/, and slightly edited the LC_COLLATE section in ru_RU. The new versions of the files are fully compatible with my version of glibc. If you want to use older versions of the files, you will have to change the symbolic names and the place in the table where the replacement begins.

LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE

Strictly speaking, the fields in LC_IDENTIFICATION should be changed so that they point to the locale ru_MY, but in my example this was not necessary, since I excluded the locale-archive from the locale search.

To make localedef work with the files in my directory, the variable I18NPATH can be used to add an extra directory to search for input files, and the directory for the binary files can be given as a path with slashes:

$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX assumes that LANG may contain absolute paths to directories with locale files, starting with a forward slash, but glibc on Linux resolves all paths relative to a base directory, which can be overridden with the variable LOCPATH. After setting LOCPATH=~/.local/lib/locale/, all localization files are searched for only in my directory. When LOCPATH is set, the locale archive is ignored.

Here is the decisive test:

$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;адвокат

Hooray! We did it!

Working on the mistakes

I have already answered the questions about string sorting posed at the beginning, but a couple of questions remain about the errors, visible and invisible.

Let's go back to the original problem.

Both the sort program and the join program use the same string comparison functions from glibc. How did it happen that join reported a sorting error on lines sorted by the sort command in the en_US.UTF-8 locale? The answer is simple: sort compares the whole line, while join compares only the key, which by default is the beginning of the line up to the first whitespace character. In my example this led to an error message because the order of the first words of the lines did not match the order of the full lines.
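The mismatch can be shown on the two Ivanov lines. The Python sketch below reuses a simplified first-level key (letters only, case folded, hard-coded Russian alphabet) as a stand-in for the locale's comparison; it is an illustration and not a model of join's actual code.

```python
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

def collation_key(text):
    # Simplified stand-in for the locale rules: punctuation and spaces are ignored.
    return [ALPHABET.index(ch) for ch in text.lower() if ch in ALPHABET]

# The order in which sort emitted these two lines.
sorted_lines = ["Иванова Алла;маляр", "Иванов Андрей;слесарь"]

# By default join looks only at the text before the first blank.
join_keys = [line.split()[0] for line in sorted_lines]

whole_lines_in_order = collation_key(sorted_lines[0]) <= collation_key(sorted_lines[1])
keys_in_order = collation_key(join_keys[0]) <= collation_key(join_keys[1])
print(join_keys)
print(whole_lines_in_order, keys_in_order)
```

The whole lines are in order, but the keys are not: "Иванов" is a prefix of "Иванова" and therefore has to come first.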

The "C" locale guarantees that in sorted lines the initial substrings up to the first space are sorted too, but this only masks the error. It is possible to pick data (people with the same surname but different first names) that would produce an incorrect merge without any error message. If we want join to merge the lines of the files by full name, the right way is to specify the field separator explicitly and to sort by the key field and not by the whole line. In that case the merge proceeds correctly and there are no errors in any locale:

$> sort -t ';' -k 1,1 buhg.txt > buhg.srt
$> sort -t ';' -k 1,1 mail.txt > mail.srt
$> join -t ';' buhg.srt mail.srt > result
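For comparison, the same merge can be sketched in Python with a dictionary keyed on the first field, which sidesteps sorting altogether. The function name `join_on_first_field` and the e-mail address are hypothetical, and unlike join this sketch keeps only the last line for a repeated key in the second file.

```python
def join_on_first_field(left_lines, right_lines, sep=";"):
    """Merge 'key;value' lines on the key, roughly like join with a ';' separator."""
    right = {}
    for line in right_lines:
        key, _, value = line.partition(sep)
        right[key] = value
    merged = []
    for line in left_lines:
        key, _, value = line.partition(sep)
        if key in right:                      # unmatched lines are dropped, as in join
            merged.append(sep.join([key, value, right[key]]))
    return merged

buhg = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]
mail = ["Иванов Андрей;andrey@example.com"]   # hypothetical address

result = join_on_first_field(buhg, mail)
print(result)
```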

The example that ran successfully in the CP1251 encoding contains another error. The fact is that in every Linux distribution known to me the packages lack a compiled ru_RU.CP1251 locale. If the compiled locale is not found, sort silently falls back to byte-by-byte comparison, which is what we observed.

By the way, there is one more small glitch related to compiled locales being unavailable. The command LOCPATH=/tmp locale -a prints the list of all locales in the locale-archive, but with LOCPATH set these locales are unavailable to all programs (including locale itself).

$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison

Conclusion

If you are a programmer used to thinking of strings as a set of bytes, your choice is LC_COLLATE=C.

If you are a linguist or a dictionary compiler, you had better compile your own locale.

If you are an ordinary user, you just have to get used to the fact that the command ls -a lists files beginning with a dot mixed in with files beginning with a letter, while Midnight Commander, which uses its own internal functions to sort names, puts the files beginning with a dot at the top of the list.

References

Report No. 10: Unicode Collation Algorithm

Character weights at unicode.org

ICU: an implementation of a library for working with Unicode, from IBM

Sorting test using ICU

Character weights in ISO 14651

Description of the weights file format, ISO 14652

Discussion of string comparison in glibc

Source: www.hab.com
