How Linux sorts strings

Introduction

It all started with a short script that was supposed to combine employee email addresses, taken from the list of mailing-list users, with employee positions, taken from the HR department database. Both lists were exported to Unicode UTF-8 text files and saved with Unix line endings.

Contents of mail.txt

Иванов Андрей;[email protected]

Contents of buhg.txt

Иванова Алла;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Абаканов Михаил;маляр

To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:

$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted: Иванов Андрей;слесарь

Eyeballing the sort output showed that the order is broadly correct, but where a male and a female surname coincide, the female one comes before the male one:

$> sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

It looks like either a bug in Unicode sorting or a manifestation of feminism in the sorting algorithm. The first is, of course, more plausible.

Let's set join aside for now and focus on sort. We will try to solve the problem by trial and error. First, let's change the locale from en_US to ru_RU. For sorting it would be enough to set the LC_COLLATE environment variable, but we won't waste time on trifles:

$> LANG=ru_RU.UTF-8 sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

Nothing changed.

Let's try recoding the files into a single-byte encoding:

$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF8

Again nothing changed.

There is nothing for it: we have to look for a solution on the Internet. Nothing turns up directly about Russian surnames, but there are questions about other sorting oddities. For example, here is one: unix sort treats '-' (dash) characters as invisible. In short, with the dash and the space treated as invisible, the strings "a b", "a-a", "a-c" are sorted as "a-a", "a b", "a-c".

The answer is the same everywhere: use the programmer's locale "C" and you will be happy. Let's try:

$> LANG=C sort buhg.txt
Ёлкина Элла;крановщица
Абаканов Михаил;маляр
Иванов Андрей;слесарь
Иванова Алла;адвокат

Something changed. The Ivanovs lined up in the right order, although Ёлкина (Yolkina) slipped off somewhere. Let's go back to the original problem:

$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result

It worked without errors, as the Internet promised. And that despite Ёлкина being on the first line.

The problem seems to be solved, but just in case, let's try another Russian encoding, Windows CP1251:

$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF8

The sorting result, oddly enough, matches the "C" locale, and the whole example accordingly runs without errors. Some kind of mysticism.

I don't like mysticism in programming, because it usually hides errors. We will have to take a serious look at how sort works and what LC_COLLATE actually affects.

By the end I will try to answer these questions:

  • why were the female surnames sorted incorrectly?
  • why did LANG=ru_RU.CP1251 turn out to be equivalent to LANG=C?
  • why do sort and join have different ideas about the order of sorted strings?
  • why are there errors in all of my examples?
  • and finally, how do you sort strings the way you like?

Sorting in Unicode

The first stop is Technical Report No. 10, titled Unicode Collation Algorithm, on unicode.org. The report contains a lot of technical detail, so let me briefly summarize the main ideas.

Collation, the "comparison" of strings, is the basis of any sorting algorithm. The algorithms themselves can differ ("bubble", "merge", "quick"), but all of them use string comparison to determine the order of the strings.

Sorting natural-language strings is a rather hard problem. Even in the simplest single-byte encodings, as soon as the alphabet differs in any way from the English Latin one, the order of its letters no longer matches the order of the numeric values that encode them. In the German alphabet the letter Ö sits between O and P, while in the CP850 encoding it falls between ÿ and Ü.

You can try to abstract away from a specific encoding and consider "ideal" letters arranged in some order, as Unicode does. The encodings UTF8, UTF16 or the single-byte KOI8-R (if a limited subset of Unicode is enough) give different numeric representations of the letters, but refer to the same elements of the base table.

It turns out that even if we build the character table from scratch, we cannot assign it a universal character order. National alphabets that use the same letters may order them differently. For example, in French Æ is treated as a ligature and sorted as the string AE. In Norwegian Æ is a separate letter located after Z. By the way, besides ligatures such as Æ there are letters written with several characters. The Czech alphabet has the letter Ch, which stands between H and I.

Besides the differences between alphabets, there are other national traditions that affect sorting. In particular: in what order should words made of upper-case and lower-case letters appear in a dictionary? Sorting can also be affected by punctuation. In Spanish, an inverted question mark is placed at the start of an interrogative sentence (¿Te gusta la música?). Here it is obvious that interrogative sentences should not be grouped into a separate cluster outside the alphabet, but how should lines with other punctuation marks be sorted?

I won't dwell on sorting strings in languages that differ greatly from European ones. Note that in languages written right-to-left or top-to-bottom the characters in a string are most likely stored in reading order, and even non-alphabetic writing systems have their own ways of ordering strings character by character. For example, hieroglyphs can be ordered by shape (Chinese character keys) or by pronunciation. To be honest, I have no idea how emoji should be ordered, but something can be devised for them as well.

Based on the features listed above, the basic requirements for comparing strings built on Unicode tables were formulated:

  • string comparison does not depend on the position of characters in the code table;
  • sequences of characters that form a single character are reduced to canonical form (A + combining ring above is the same as Å);
  • when strings are compared, a character is considered in the context of the string and, if necessary, combined with its neighbors into a single comparison unit (Ch in Czech) or split into several (Æ in French);
  • all national features (alphabet, upper/lower case, punctuation, ordering of writing systems) must be configurable, up to assigning the order by hand (emoji);
  • comparison matters not only for sorting but in many other places, for example when specifying ranges of strings (the {A..z} expansion in bash);
  • comparison should be reasonably fast.

In addition, the authors of the report listed properties of comparison that algorithm developers must not rely on:

  • the comparison algorithm should not require a separate character set for each language (Russian and Ukrainian share most Cyrillic characters);
  • comparison should not rely on the order of characters in the Unicode tables;
  • the weight of a string should not be an attribute of the string, since the same string may have different weights in different cultural contexts;
  • the weights of strings may change when strings are concatenated or split (x < y does not imply xz < yz);
  • different strings with the same weights are considered equal from the sorting algorithm's point of view. Additional ordering of such strings can be introduced, but it may degrade performance;
  • on repeated sorting, strings with the same weights may be swapped. Stability is a property of a specific sorting algorithm, not of the string comparison algorithm (see the previous point);
  • sorting rules may change over time as cultural traditions are refined or changed.
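The remark that x < y does not imply xz < yz can be checked with the Czech digraph mentioned earlier. The following Python sketch uses a made-up primary order for a handful of letters; the table `CZECH_ORDER` and the helper names are illustrative assumptions, not data from any real locale.

```python
# Toy primary order for a few Czech letters; "ch" is one collation unit
# that sorts between "h" and "i". Illustrative only, not real locale data.
CZECH_ORDER = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i"]
RANK = {unit: i for i, unit in enumerate(CZECH_ORDER)}


def collation_units(text):
    """Split text into comparison units, taking the digraph 'ch' greedily."""
    units, i = [], 0
    while i < len(text):
        if text.startswith("ch", i):
            units.append("ch")
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units


def czech_key(text):
    """Map each comparison unit to its rank in the toy alphabet."""
    return [RANK[u] for u in collation_units(text)]


x, y, z = "c", "d", "h"
print(czech_key(x) < czech_key(y))          # "c" against "d"
print(czech_key(x + z) < czech_key(y + z))  # "ch" against "dh"
```

Appending the same suffix to both strings turns "c" into the single unit "ch", which is why the second comparison comes out the other way round.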

It is also stipulated that the comparison algorithm knows nothing about the semantics of the strings it processes. Thus strings consisting only of digits are not to be compared as numbers, and in lists of English titles the article is not moved to the end (Beatles, The).

To satisfy all these requirements, a multi-level (in fact four-level) table-driven sorting algorithm is proposed.

First, the characters of the string are reduced to canonical form and grouped into comparison units. Each comparison unit is assigned several weights, one per comparison level. The weights of comparison units are elements of ordered sets (here, integers) that can be compared as greater or less. The special value IGNORED (0x0) means that the unit does not take part in the comparison at the corresponding level. The comparison of two strings may be repeated several times, each time using the weights of the corresponding level. At each level the weights of the comparison units of the two strings are compared with each other one after another.

In different implementations of the algorithm for different national traditions the values of the weights may differ, but the Unicode standard includes a base table of weights, the "Default Unicode Collation Element Table" (DUCET). I would like to note that setting the LC_COLLATE variable is in effect an instruction about which weight table the string comparison function should use.

The weights in DUCET are arranged as follows:

  • at the first level, all letters are reduced to the same case, diacritics are discarded, and punctuation marks (not all of them) are ignored;
  • at the second level, only diacritics are taken into account;
  • at the third level, only case is taken into account;
  • at the fourth level, only punctuation marks are taken into account.

The comparison takes several passes: first the first-level weights are compared; if they coincide, the comparison is repeated with the second-level weights; then, possibly, with the third and fourth.

The comparison ends as soon as the strings contain corresponding comparison units with different weights. Strings whose weights are equal at all four levels are considered equal to each other.
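The multi-pass comparison described above can be sketched in Python as a sort key made of one tuple per level. The weight values below are invented for illustration and are not taken from DUCET, and only three levels are modelled; treat this as a sketch of the idea rather than an implementation of UCA.

```python
# Toy weights: (primary, secondary, tertiary); 0 means "ignored at this level".
# The numbers are made up for illustration and are NOT DUCET values.
WEIGHTS = {
    "a": (1, 1, 1), "A": (1, 1, 2),
    "e": (2, 1, 1), "E": (2, 1, 2),
    "é": (2, 2, 1),   # same letter as "e", different accent
    "-": (0, 0, 0),   # punctuation ignored on all modelled levels
}
LEVELS = 3


def sort_key(text):
    """Build one tuple of non-ignored weights per level.

    Comparing the resulting tuples compares all first-level weights first,
    and looks at the next level only if the previous one is equal.
    """
    key = []
    for level in range(LEVELS):
        weights = (WEIGHTS[ch][level] for ch in text)
        key.append(tuple(w for w in weights if w))
    return tuple(key)


words = ["Ea", "ea", "éa", "e-a", "ae"]
print(sorted(words, key=sort_key))
```

In this toy table "ea" and "e-a" get identical keys, so their relative order depends only on the stability of the sort, which matches the remark that equal-weight strings may be swapped.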

This algorithm (with a heap of additional technical detail) gave Report No. 10 its name: "Unicode Collation Algorithm" (UCA).

This is where the sorting behavior in our example becomes a little clearer. It would be good to compare it with the Unicode standard.

To test implementations of UCA there is a special test that uses the weights file implementing DUCET. All sorts of amusing things can be found in the weights file. For example, it defines the order of mahjong tiles and European dominoes, as well as the order of the suits in a deck of cards (symbols 1F000 and onwards). The card suits are placed according to the rules of bridge (spades, hearts, diamonds, clubs), and the cards within a suit go in the order A, 2, 3, ... K.

Checking by hand that strings are sorted correctly according to DUCET would be rather tedious, but fortunately for us there is a reference implementation of a library for working with Unicode, "International Components for Unicode" (ICU).

The website of this library, which was developed at IBM, has demo pages, including a page for the string comparison algorithm. We enter our test lines with the default settings and, lo and behold, we get perfect Russian sorting.

Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;адвокат

By the way, the ICU website also clarifies how the comparison algorithm handles punctuation marks. In the examples in the Collation FAQ, the apostrophe and the hyphen are ignored.

Unicode helped us, but the reasons for the strange behavior of sort in Linux will have to be sought elsewhere.

Sorting in glibc

A quick look at the source code of the sort utility from GNU Core Utils showed that, in the utility itself, localization comes down to printing the current value of the LC_COLLATE variable when it is run in debug mode:

$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

String comparison is done with the standard function strcoll, which means that everything interesting is inside the glibc library.

The glibc project wiki devotes a single paragraph to string comparison. From that paragraph one can gather that sorting in glibc is based on the already familiar UCA algorithm and/or the closely related standard ISO 14651 (International string ordering and comparison). Regarding the latter, it should be noted that the site standards.iso.org officially declares ISO 14651 to be publicly available, but the corresponding link leads to a non-existent page. Google returns several pages of links to official sites offering an electronic copy of the standard for a hundred euros, but on the third or fourth page of the search results there are also direct links to a PDF. On the whole the standard hardly differs from UCA, but it is more boring to read because it contains no vivid examples of national features of string sorting.

The most interesting item on the wiki was a link to the bug tracker with a discussion of the implementation of string comparison in glibc. From the discussion one can learn that glibc compares strings using ISO's own table, the Common Template Table (CTT), whose address can be found in Annex A of the ISO 14651 standard. Between 2000 and 2015 this table had no maintainer in glibc and differed quite strongly (at least outwardly) from the current version of the standard. From 2015 to 2018 it was adapted to the new version of the table, so today you may meet in real life both the new version of the table (CentOS 8) and the old one (CentOS 7).

Now that we have all the information about the algorithm and the auxiliary tables, we can return to the original problem and work out how to sort strings correctly in the Russian locale.

ISO 14651 / 14652

In most Linux distributions the source of the CTT table we are interested in lives in the directory /usr/share/i18n/locales/. The table itself is in the file iso14651_t1_common. That file is then included, via the directive copy iso14651_t1_common, in the file iso14651_t1, which in turn is included in the national files, among them en_US and ru_RU. In most Linux distributions all the source files are part of the base installation, but if they are missing you will have to install an additional package from the distribution.

The structure of the file iso14651_t1 may look terribly verbose, with non-obvious rules for constructing names, but once you look into it everything is quite simple. The structure is described in the ISO 14652 standard, a copy of which can be downloaded from the open-std.org website. Another description of the file format can be found in the POSIX specifications from The Open Group. As an alternative to reading the standard, you can study the source code of the function collate_read in glibc/locale/programs/ld-collate.c.

The file structure looks like this:

By default the backslash character \ is used as the escape character, and the rest of a line after the character # is a comment. Both characters can be redefined, which is what the new version of the table does:

escape_char /
comment_char %

The file contains tokens of the form <Uxxxx> or <Uxxxxxxxx> (where x is a hexadecimal digit). This is the hexadecimal representation of Unicode code points in the UCS-4 (UTF-32) encoding. All other elements in angle brackets (including <2> and the like) are treated as plain string constants that have little meaning outside their context.

The line LC_COLLATE tells us that what follows is the data describing string comparison.

First, names are defined for the weights in the comparison table and for combinations of characters. Strictly speaking, the two kinds of names belong to two different entities, but in the actual file they are mixed together. Weight names are defined with the keyword collating-symbol ("comparison character"), because during comparison Unicode characters that have the same weights are treated as equivalent characters.

The total length of this section in the current revision of the file is about 900 lines. I pulled examples from several places to show the arbitrariness of the names and the several kinds of syntax.

LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

  • collating-symbol <OSMANYA> registers the string OSMANYA in the table of weight names
  • collating-symbol <S1D000>..<S1D35F> registers a sequence of names made of the prefix S and a hexadecimal numeric suffix from 1D000 to 1D35F.
  • FFFF in collating-symbol <SFFFF> looks like a large unsigned hexadecimal integer, but it is only a name and could just as well have been spelled differently
  • a name of the form <Uxxxx> denotes a code point in the UCS-4 encoding
  • collating-element <U0413_0301> from "<U0413><U0301>" registers a new name for a pair of Unicode code points.
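To make the range syntax concrete, here is a small Python sketch that expands a range such as <S1D000>..<S1D35F> into individual names. It assumes, as described above, a one-letter prefix followed by a hexadecimal suffix; the function name `expand_symbol_range` is hypothetical and this is not how glibc itself is written.

```python
def expand_symbol_range(first, last):
    """Expand names such as 'S1D000'..'S1D35F'.

    Assumes a one-letter prefix followed by hexadecimal digits; the suffix
    is incremented as a hexadecimal number and keeps its original width.
    """
    prefix = first[0]
    if last[0] != prefix:
        raise ValueError("range endpoints must share the same prefix")
    width = len(first) - 1
    start, stop = int(first[1:], 16), int(last[1:], 16)
    return [f"<{prefix}{n:0{width}X}>" for n in range(start, stop + 1)]


names = expand_symbol_range("S1D000", "S1D35F")
print(len(names), names[0], names[-1])
```

Under these assumptions the single range line stands for several hundred separate collating-symbol declarations.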

Once the weight names are defined, the actual weights are specified. Since only greater-than and less-than relations matter for comparison, the weights are determined simply by the order in which the names are listed. "Lighter" weights are listed first, "heavier" ones later. Let me remind you that every Unicode character is assigned four different weights. Here they are all merged into a single ordered sequence. In theory any symbolic name could be used at any of the four levels, but the comments show that the developers mentally separate the names by level.

% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE

Finally, the actual weight table.

The weights section is enclosed between lines with the keywords order_start and order_end. The additional parameters of order_start determine in which direction the strings are scanned at each comparison level. The default is forward. The body of the section consists of lines that contain a character code and its four weights. The character code can be given as the character itself, as a code point, or as a symbolic name defined earlier. Weights can likewise be given as symbolic names, code points, or the characters themselves. If code points or characters are used, their weight equals the numeric value of the code point (the position in the Unicode table). Characters that are not listed explicitly are (as I understand it) considered to be placed in the table with a primary weight equal to their position in the Unicode table. The special weight value IGNORE means that the character is ignored at the corresponding comparison level.

To demonstrate the structure of the weights, I picked three fairly obvious fragments:

  • characters that are ignored completely
  • characters equivalent to the digit three on the first two levels
  • the beginning of the Cyrillic alphabet, which contains no diacritics and is therefore sorted mainly by the first and third levels.

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end
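As a reading aid, the weight lines above can be split into their parts with a few lines of Python. This is a minimal sketch that assumes the new-style comment character % and one entry per line; `parse_order_line` is a hypothetical helper, and glibc's real parser (collate_read) handles far more syntax.

```python
import re

# symbol, whitespace, four ';'-separated weights, optional '%' comment
LINE_RE = re.compile(r"^(<[^>]+>)\s+(\S+)\s*(?:%\s*(.*))?$")


def parse_order_line(line):
    """Split one weight line into (symbol, [weights], comment)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a weight line: {line!r}")
    symbol, weights, comment = match.groups()
    return symbol, weights.split(";"), comment or ""


sample = "<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A"
symbol, weights, comment = parse_order_line(sample)
print(symbol, weights, comment)
```

For the sample line the four weights come out in level order, so the third element is the case weight that distinguishes the capital letter from the small one.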

Now we can return to sorting the examples from the beginning of the article. The ambush lies in this part of the weight table:

<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

You can see that in this table the punctuation marks from the ASCII table (including the space) are almost always ignored when strings are compared. The only exception is strings that are identical in everything except the punctuation marks found at matching positions. To the comparison algorithm, the lines from my example (after sorting) look like this:

АбакановМихаилмаляр
ЁлкинаЭллакрановщица
ИвановаАлламаляр
ИвановАндрейслесарь

Given that in the weight table the Russian capital letters come after the small ones (at the third level <CAP> is heavier than <MIN>), the sorting looks entirely correct.
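The effect can be reproduced with a rough stand-in for the real table. The Python sketch below drops everything that is not a letter, compares case-folded letters first, and uses case only as a tie-break. Folding "ё" into "е" and comparing by code point are simplifying assumptions made for this example only; they are not how the CTT weights are actually defined.

```python
def rough_key(line):
    """Crude two-level key: letters only, case-folded, then case as tie-break."""
    letters = [ch for ch in line if ch.isalpha()]          # drop ' ' and ';'
    primary = "".join(letters).casefold().replace("ё", "е")
    tertiary = [ch.isupper() for ch in letters]            # capitals are "heavier"
    return primary, tertiary


lines = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]
for line in sorted(lines, key=rough_key):
    print(line)
```

With the space gone, "ИвановаАлла..." and "ИвановАндрей..." agree on the first seven letters and differ on the eighth, "а" against "н", which puts Иванова Алла ahead of Иванов Андрей.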

When the variable LC_COLLATE=C is set, a special table is loaded that specifies byte-by-byte comparison:

static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};

Since in Unicode the code point of Ё comes before that of А, the strings are sorted accordingly.
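The same ordering is easy to reproduce in Python, whose default string comparison works by code point, and for UTF-8 data byte order and code point order agree. The sketch below uses the buhg.txt lines from the beginning of the article and only illustrates code-point ordering; it does not call glibc.

```python
lines = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]

# Code points of the characters that decide the order.
for ch in "ЁАа ":
    print(repr(ch), f"U+{ord(ch):04X}")

by_code_point = sorted(lines)                                   # compares code points
by_utf8_bytes = sorted(lines, key=lambda s: s.encode("utf-8"))  # compares raw bytes

for line in by_utf8_bytes:
    print(line)
```

Here Ё (U+0401) precedes А (U+0410), and the space (U+0020) after "Иванов" precedes the letter "а" (U+0430) in "Иванова", so both the odd position of Ёлкина and the order of the two Ivanovs follow from plain numeric comparison.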

Text and binary tables

Obviously, string comparison is an extremely common operation, and parsing the CTT table is a rather expensive procedure. To optimize access to the table, it is compiled into binary form with the localedef command.

The localedef command takes as parameters a file with the table of national features (option -i), in which all characters are represented by Unicode code points, and a file that maps Unicode code points to the characters of a specific encoding (option -f). As a result, binary files are created for the locale whose name is given in the last parameter.

glibc supports two binary file formats: "traditional" and "modern".

The traditional format implies that the locale name is the name of a subdirectory of /usr/lib/locale/. This subdirectory stores the binary files LC_COLLATE, LC_CTYPE, LC_TIME and so on. The file LC_IDENTIFICATION contains the formal name of the locale (which may differ from the directory name) and comments.

The modern format stores all locales in a single archive, /usr/lib/locale/locale-archive, which is mapped into the virtual memory of every process that uses glibc. In the modern format the locale name undergoes some canonicalization: only digits and lower-cased letters remain in the encoding name. Thus ru_RU.KOI8-R is saved as ru_RU.koi8r.
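The canonicalization rule as stated here (keep only digits and lower-cased letters of the encoding name) can be sketched in Python. The function `normalize_locale_name` is a hypothetical helper written for this example; it follows the article's description only and ignores locale modifiers and any other special cases glibc may apply.

```python
def normalize_locale_name(name):
    """Lower-case the encoding part and keep only its letters and digits.

    Follows the rule described in the text; anything beyond that rule
    (for example '@modifier' suffixes) is deliberately not handled.
    """
    if "." not in name:
        return name
    base, codeset = name.split(".", 1)
    cleaned = "".join(ch.lower() for ch in codeset if ch.isalnum())
    return f"{base}.{cleaned}"


for name in ("ru_RU.KOI8-R", "en_US.UTF-8", "en_US.ISO-8859-1", "ru_RU"):
    print(name, "->", normalize_locale_name(name))
```

The results have the same shape as the names that locale -a prints, such as en_US.utf8.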

Input files are searched for in the current directory, and also in the directories /usr/share/i18n/locales/ and /usr/share/i18n/charmaps/ for CTT files and encoding files, respectively.

For example, the command

localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

compiles the file /usr/share/i18n/locales/ru_RU using the encoding file /usr/share/i18n/charmaps/MAC-CYRILLIC.gz and saves the result in /usr/lib/locale/locale-archive under the name ru_RU.maccyrillic

If you set the variable LANG=en_US.UTF-8, glibc looks for the locale binaries in the following sequence of files and directories:

/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/

If a locale exists in both the traditional and the modern format, the modern one takes priority.

You can see the list of compiled locales with the command locale -a.

Preparing your own comparison table

Now, armed with this knowledge, you can create your own ideal string comparison table. The table should compare Russian letters correctly, including the letter Ё, and at the same time take punctuation marks into account in accordance with the ASCII table.

Preparing your own sorting table takes two steps: editing the weight table and compiling it into binary form with the localedef command.

So that a comparison table can be adjusted with minimal editing, the ISO 14652 format provides sections for adjusting the weights of an existing table. A section begins with the keyword reorder-after, which names the position after which the replacement is made. The section ends with the line reorder-end. If several parts of the table need to be corrected, a separate section is created for each of them.

I copied new versions of the files iso14651_t1_common and ru_RU from the glibc repository into my home directory ~/.local/share/i18n/locales/ and slightly edited the LC_COLLATE section in ru_RU. The new versions of the files are fully compatible with my version of glibc. If you want to use older versions of the files, you will have to change the symbolic names and the place in the table where the replacement begins.

LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE

Strictly speaking, the fields in LC_IDENTIFICATION should be changed so that they point to the locale ru_MY, but in my example this was not required, since I excluded the locale-archive from the locale search.

So that localedef works with the files in my directory, the variable I18NPATH can be used to add an extra directory to the search for input files, and the directory for saving the binary files can be given as a path containing slashes:

$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX suggests that LANG may contain absolute paths to directories with locale files, starting with a forward slash, but glibc on Linux counts all paths from a base directory, which can be overridden with the variable LOCPATH. After setting LOCPATH=~/.local/lib/locale/ all files related to localization are searched for only in my directory. When the LOCPATH variable is set, the locale archive is ignored.

Here is the decisive test:

$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;адвокат

Hooray! We did it!

Working through the mistakes

I have already answered the questions about string sorting posed at the beginning, but a couple of questions remain about the errors, both visible and invisible.

Let's return to the original problem.

Both the sort program and the join program use the same string comparison functions from glibc. How did it happen that join reported a sorting error on lines sorted by the sort command in the en_US.UTF-8 locale? The answer is simple: sort compares the whole line, while join compares only the key, which by default is the beginning of the line up to the first whitespace character. In my example this led to an error message because the order of the first words of the lines did not match the order of the complete lines.

The "C" locale guarantees that in sorted lines the leading substrings up to the first space are sorted as well, but this only masks the error. It is possible to pick data (people with the same surname but different first names) that would produce an incorrect merge without any error message. If we want join to merge the lines of the files by full name, the correct way is to specify the field separator explicitly and to sort by the key field rather than by the whole line. In that case the merge proceeds correctly and there are no errors in any locale:

$> sort -t ';' -k 1,1 buhg.txt > buhg.srt
$> sort -t ';' -k 1,1 mail.txt > mail.srt
$> join -t ';' buhg.srt mail.srt > result
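The reason this works is that sorting and merging now use the same key and the same comparison. The Python sketch below imitates that with a simple merge join on the first ';'-separated field. It assumes unique keys in both inputs, uses Python's code-point comparison in place of strcoll, and the e-mail address is a made-up placeholder because the original one is not shown in the article.

```python
def key_field(line, sep=";"):
    """First field of the line: the whole full name, spaces included."""
    return line.split(sep, 1)[0]


def merge_join(left, right, sep=";"):
    """Sort both inputs by the key field, then merge them in a single pass.

    Assumes keys are unique within each input (a simplification).
    """
    left = sorted(left, key=lambda s: key_field(s, sep))
    right = sorted(right, key=lambda s: key_field(s, sep))
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = key_field(left[i], sep), key_field(right[j], sep)
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            out.append(left[i] + sep + right[j].split(sep, 1)[1])
            i += 1
            j += 1
    return out


buhg = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]
mail = ["Иванов Андрей;andrey@example.invalid"]  # placeholder address

print(merge_join(buhg, mail))
```

Because the merge only ever compares the keys it was sorted by, it cannot run into the "is not sorted" situation that arises when the sort looked at whole lines and the join looked at a prefix.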

The example that ran successfully in the CP1251 encoding contains yet another error. The fact is that in all the Linux distributions I know of, the packages do not include a compiled locale ru_RU.CP1251. If a compiled locale is not found, sort silently falls back to byte-by-byte comparison, which is exactly what we observed.

By the way, there is one more small glitch related to compiled locales being unavailable. The command LOCPATH=/tmp locale -a lists all the locales in the locale-archive, but with LOCPATH set these locales are unavailable to every program (including locale itself).

$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison

Conclusion

If you are a programmer used to thinking of strings as a set of bytes, your choice is LC_COLLATE=C.

If you are a linguist or a compiler of dictionaries, you had better compile your own locale.

If you are an ordinary user, you just need to get used to the fact that the command ls -a lists files starting with a dot mixed in with files starting with a letter, while Midnight Commander, which uses its own internal functions to sort names, puts files starting with a dot at the beginning of the list.

References

Report No. 10: Unicode Collation Algorithm

Character weights on unicode.org

ICU: an implementation of a library for working with Unicode, from IBM

Test sorting using ICU

Character weights in ISO 14651

Description of the weights file format, ISO 14652

Discussion of string comparison in glibc

Source: www.habr.com
