How Linux sorts strings

Introduction

It all started with a short script that was supposed to combine the employees' e-mail addresses, taken from the list of mailing-list users, with the employees' job titles, taken from the HR department's database. Both lists were exported to Unicode text files in UTF-8 and saved with Unix line endings.

Contents of mail.txt

Иванов Андрей;[email protected]

Contents of buhg.txt

Иванова Алла;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Абаканов Михаил;маляр

To combine them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:

$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted: Иванов Андрей;слесарь

Inspecting the sort result by eye showed that the sorting is correct on the whole, but where male and female surnames coincide, the women come before the men:

$> sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

It looks like a sorting glitch in Unicode or a manifestation of feminism in the sorting algorithm. The former is, of course, more plausible.

Let us set join aside for now and concentrate on sort. Let's try to solve the problem by scientific poking. First, let's change the locale from en_US to ru_RU. For sorting it would be enough to set the LC_COLLATE environment variable, but we won't waste time on trifles:

$> LANG=ru_RU.UTF-8 sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

Nothing has changed.

Let's try re-encoding the files into a single-byte encoding:

$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF8

Again nothing has changed.

There is nothing for it but to look for a solution on the Internet. There is nothing directly about Russian surnames, but there are questions about other sorting oddities. For example, here is one: Unix sort treats '-' (dash) characters as invisible. In short, the strings "a-b", "aa", "a-c" are sorted as "aa", "a-b", "a-c".
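The effect described in that question can be reproduced without any locale machinery. The sketch below is only an approximation: it assumes the hyphen is the single ignored character, and the helper name ignoring_dashes is made up for this illustration.

```python
def ignoring_dashes(s):
    """Comparison key that drops '-' (an assumption modelling the question above)."""
    return s.replace("-", "")

words = ["a-b", "aa", "a-c"]
by_bytes = sorted(words)                     # plain code point order: '-' < 'a'
by_key = sorted(words, key=ignoring_dashes)  # '-' treated as invisible

print(by_bytes)
print(by_key)
```

The second list matches the order quoted in the question, while the first is what a byte-wise comparison produces.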

The answer is the same everywhere: use the programmer's locale "C" and you will be happy. Let's try:

$> LANG=C sort buhg.txt
Ёлкина Элла;крановщица
Абаканов Михаил;маляр
Иванов Андрей;слесарь
Иванова Алла;адвокат

Something has changed. The Ivanovs are now in the correct order, although Ёлкина (Yolkina) has slid off to the wrong place. Let's return to the original problem:

$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result

It worked without errors, as the Internet promised. And that despite Ёлкина being on the first line.

The problem seems to be solved, but just in case, let's try another Russian encoding, Windows CP1251:

$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF8

The sort result, oddly enough, coincides with the "C" locale, and the whole example accordingly runs without errors. Some kind of mysticism.

I don't like mysticism in programming because it usually masks errors. We will have to look seriously at how sort works and what LC_COLLATE affects.

At the end I will try to answer the questions:

  • why were the female surnames sorted incorrectly?
  • why did LANG=ru_RU.CP1251 turn out to be equivalent to LANG=C?
  • why do sort and join have different ideas about the order of sorted strings?
  • why are there errors in all my examples?
  • finally, how do you sort strings the way you want?

Sorting in Unicode

The first stop is Technical Report No. 10, titled Unicode Collation Algorithm, on the unicode.org website. The report contains a lot of technical detail, so let me give a brief summary of the main ideas.

Collation, the "comparison" of strings, is the basis of any sorting algorithm. The algorithms themselves may differ ("bubble", "merge", "quick"), but they all use a comparison of a pair of strings to determine the order in which they appear.

Sorting strings in a natural language is a rather complex problem. Even in the simplest single-byte encodings, the order of letters in an alphabet that differs in any way from the English Latin alphabet no longer coincides with the order of the numeric values with which those letters are encoded. Thus in the German alphabet the letter Ö stands between O and P, while in the CP850 encoding it falls between ÿ and Ü.

You can try to abstract away from a specific encoding and consider "ideal" letters arranged in some order, as is done in Unicode. The encodings UTF8, UTF16 or the single-byte KOI8-R (if only a limited subset of Unicode is needed) give different numeric representations of letters, but refer to the same elements of the base table.

It turns out that even if we build the character table from scratch, we cannot assign it a universal character order. In different national alphabets that use the same letters, the order of those letters may differ. For example, in French Æ is treated as a ligature and sorted as the string AE. In Norwegian Æ is a separate letter, located after Z. By the way, besides ligatures like Æ there are letters written with several characters. Thus in the Czech alphabet there is the letter Ch, which stands between H and I.

Besides differences in alphabets, there are other national traditions that affect sorting. In particular, the question arises: in what order should words consisting of uppercase and lowercase letters appear in a dictionary? Sorting can also be affected by the use of punctuation marks. In Spanish, an inverted question mark "¿" is placed at the start of an interrogative sentence (the equivalent of "Do you like music?"). In this case it is obvious that interrogative sentences should not be grouped into a separate cluster outside the alphabet, but what about lines with other punctuation marks?

I will not dwell on sorting strings in languages very different from European ones. Note that in languages written right to left or top to bottom, the characters in strings are most likely stored in reading order, and even non-alphabetic writing systems have their own ways of ordering strings character by character. For example, hieroglyphs can be ordered by graphical style (the radicals, or "keys", of Chinese characters) or by pronunciation. To be honest, I have no idea how emoji should be ordered, but you can come up with something for them too.

Based on the above, the basic requirements for comparing strings based on Unicode tables were formulated:

  • string comparison does not depend on the position of characters in the code table;
  • sequences of characters forming a single character are reduced to canonical form (A + ring above is the same as Å);
  • when comparing strings, a character is considered in the context of the string and, if necessary, is combined with its neighbors into one comparison unit (Ch in Czech) or split into several (Æ in French);
  • all national features (alphabet, uppercase/lowercase, punctuation, ordering of writing systems) must be configurable, up to manual assignment of the order (emoji);
  • comparison matters not only for sorting but also in many other places, for example when specifying ranges of strings (substituting {A... z} in Bash);
  • comparison must be performed reasonably quickly.
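The canonical-form requirement from the list above can be checked with Python's standard unicodedata module. This is a small illustration of Unicode normalization in general, not of what any particular sort implementation does internally.

```python
import unicodedata

decomposed = "A\u030A"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE

# As raw code point sequences the two spellings differ...
same_raw = decomposed == precomposed
# ...but they coincide after normalization to either canonical form.
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(same_raw, same_nfc, same_nfd)
```

A comparison function that skipped this step would treat two visually identical strings as different.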

In addition, the authors of the report formulated properties of the comparison that algorithm developers must not rely on:

  • the comparison algorithm must not require a separate character set for each language (Russian and Ukrainian share most Cyrillic characters);
  • comparison must not rely on the order of characters in the Unicode tables;
  • the weight of a string must not be an attribute of the string, since the same string may have different weights in different cultural contexts;
  • string weights may change when strings are concatenated or split (from x < y it does not follow that xz < yz);
  • different strings with identical weights are considered equal from the sorting algorithm's point of view. Introducing an additional ordering of such strings is possible, but it may degrade performance;
  • during repeated sorting, strings with identical weights may be swapped. Stability is a property of a specific sorting algorithm, not of the string comparison algorithm (see the previous item);
  • sorting rules may change over time as cultural traditions are refined or changed.
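The claim that x < y does not imply xz < yz is easy to see with the Czech Ch mentioned earlier. The sketch below uses a toy alphabet fragment in which "ch" is a single unit placed between "h" and "i"; the table and the function name collation_key are assumptions for illustration only, not real locale data.

```python
# Toy fragment of the Czech order: "ch" is one letter between "h" and "i".
ALPHABET = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i"]
RANK = {unit: i for i, unit in enumerate(ALPHABET)}

def collation_key(word):
    """Split a word into comparison units, preferring a two-letter unit."""
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:   # contraction such as "ch"
            key.append(RANK[pair])
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return key

x, y, z = "c", "d", "h"
x_before_y = collation_key(x) < collation_key(y)
xz_before_yz = collation_key(x + z) < collation_key(y + z)
print(x_before_y, xz_before_yz)
```

Appending the same suffix turns "c" into the single unit "ch", which is heavier than "d", so the order of the two strings flips.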

It is also stipulated that the comparison algorithm knows nothing about the semantics of the strings being processed. Thus, strings consisting only of digits should not be compared as numbers, and in lists of English names the article is not given special treatment (Beatles, The).

To satisfy all the requirements listed, a multi-level (in fact four-level) table-driven sorting algorithm is proposed.

First, the characters in the string are reduced to canonical form and grouped into comparison units. Each comparison unit is assigned several weights corresponding to several comparison levels. The weights of comparison units are elements of an ordered set (in this case, integers) that can be compared as greater or less. The special value IGNORED (0x0) means that at the corresponding comparison level this unit does not take part in the comparison. The comparison of strings can be repeated several times, using the weights of the corresponding levels. At each level, the weights of the comparison units of the two strings are compared with each other in sequence.

In different implementations of the algorithm for different national traditions, the values of the coefficients may differ, but the Unicode standard includes a basic weight table, the "Default Unicode Collation Element Table" (DUCET). I would like to note that setting the LC_COLLATE variable is in effect an indication of which weight table to use in the string comparison function.

The DUCET weights are arranged as follows:

  • at the first level, all letters are reduced to a single case, diacritics are discarded, and punctuation marks (not all) are ignored;
  • at the second level, only diacritics are taken into account;
  • at the third level, only case is taken into account;
  • at the fourth level, only punctuation marks are taken into account.

The comparison takes place in several passes: first the first-level coefficients are compared; if the weights coincide, the comparison is repeated with the second-level weights; then, possibly, the third and the fourth.

The comparison ends when the strings contain corresponding comparison units with different weights. Strings that have equal weights at all four levels are considered equal to each other.
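The multi-pass comparison can be sketched in a few lines. The weight table here is a deliberately crude assumption (code points as primary weights, no diacritics, every non-letter treated as punctuation); it is not DUCET, and the names weights and sort_key are invented for the sketch.

```python
def weights(ch):
    """Toy (primary, secondary, tertiary, quaternary) weights; 0 means ignored."""
    if ch.isalpha():
        primary = ord(ch.lower()[0])          # case is folded away at level 1
        secondary = 1                         # this toy table has no diacritics
        tertiary = 2 if ch.isupper() else 1   # lowercase is lighter than uppercase
        return (primary, secondary, tertiary, 0)
    return (0, 0, 0, ord(ch))                 # punctuation only counts at level 4

def sort_key(s):
    """One list of non-ignored weights per level; tuples compare level by level."""
    per_char = [weights(ch) for ch in s]
    return tuple([w[level] for w in per_char if w[level] != 0]
                 for level in range(4))

lines = ["Иванов Андрей;слесарь", "Иванова Алла;маляр"]
ordered = sorted(lines, key=sort_key)
print(ordered)
```

Because the space is ignored at the first level, the primary weights of "Иванова Алла" and "Иванов Андрей" first differ at "а" versus "н", which already hints at why the female surname ends up first.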

This algorithm (with a pile of additional technical details) gave its name to Report No. 10, the "Unicode Collation Algorithm" (UCA).

This is where the sorting behavior from our example becomes a little clearer. It would be good to compare it with the Unicode standard.

To test implementations of the UCA there is a special test, which uses a weights file implementing DUCET. You can find all sorts of amusing things in the weights file. For example, there is the ordering of mahjong tiles and European dominoes, as well as the ordering of suits in a deck of cards (symbols 1F000 and onward). The card suits are placed according to the rules of bridge (spades, hearts, diamonds, clubs), and the cards within a suit are in the sequence A, 2, 3 ... K.

Checking by hand that strings are sorted correctly according to DUCET would be quite tedious, but, fortunately for us, there is a reference implementation of a library for working with Unicode: "International Components for Unicode" (ICU).

On the website of this library, developed at IBM, there are demo pages, including a page for the string comparison algorithm. We enter our test strings with the default settings and, lo and behold, we get a perfect Russian sort.

Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;адвокат

By the way, on the ICU website you can find a clarification of how the comparison algorithm handles punctuation marks. In the Collation FAQ examples, the apostrophe and the hyphen are ignored.

Unicode has helped us, but the reasons for the strange behavior of sort in Linux will have to be sought elsewhere.

Sorting in glibc

A quick look at the source code of the sort utility from GNU Core Utils showed that in the utility itself localization boils down to printing the current value of the LC_COLLATE variable when run in debug mode:

$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

String comparisons are performed by the standard function strcoll, which means that everything interesting is in the glibc library.
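The same strcoll is reachable from Python through the standard locale module, which makes it easy to experiment. The sketch below pins the collation locale to "C" so that the result does not depend on which locales are installed; with another locale the order would depend on the system's tables.

```python
import functools
import locale

# Force the "C" collation so the outcome is independent of installed locales.
locale.setlocale(locale.LC_COLLATE, "C")

lines = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]

# cmp_to_key adapts the two-argument strcoll to sorted()'s key interface.
in_c_locale = sorted(lines, key=functools.cmp_to_key(locale.strcoll))
for line in in_c_locale:
    print(line)
```

In the "C" locale this is a plain code point comparison, so the list comes out in the same order as an ordinary sorted(lines).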

On the wiki of the glibc project, one paragraph is devoted to string comparison. From this paragraph one can understand that sorting in glibc is based on the already familiar UCA (the Unicode collation algorithm) and/or on the closely related standard ISO 14651 (International string ordering and comparison). Regarding the latter standard, it should be noted that on the site standards.iso.org ISO 14651 is officially declared publicly available, but the corresponding link leads to a non-existent page. Google returns several pages with links to official sites offering to buy an electronic copy of the standard for a hundred euros, but on the third or fourth page of search results there are also direct links to a PDF. In general, the standard practically does not differ from the UCA, but it is duller to read because it contains no vivid examples of national string-sorting peculiarities.

The most interesting information on the wiki turned out to be a link to the bug tracker with a discussion of the implementation of string comparison in glibc. From the discussion one can learn that glibc compares strings using the ISO table, the Common Template Table (CTT), whose address can be found in Annex A of the ISO 14651 standard. Between 2000 and 2015 this table in glibc had no maintainer and differed quite strongly (at least outwardly) from the current version of the standard. From 2015 to 2018 the adaptation to the new version of the table took place, and now you have a chance of meeting in real life both the new version of the table (CentOS 8) and the old one (CentOS 7).

Now that we have all the information about the algorithm and the auxiliary tables, we can return to the original problem and understand how to sort strings correctly in the Russian locale.

ISO 14651 / 14652

On most Linux distributions the source code of the CTT table we are interested in is located in the directory /usr/share/i18n/locales/. The table itself is in the file iso14651_t1_common. The directive copy iso14651_t1_common then includes this file in the file iso14651_t1, which in turn is included in the national files, including en_US and ru_RU. On most Linux distributions all the source files are part of the base installation, but if they are missing, you will have to install an additional package from the distribution.

The structure of the iso14651_t1 file may look terribly verbose, with non-obvious rules for constructing names, but if you look into it, everything is quite simple. The structure is described in the standard ISO 14652, a copy of which can be downloaded from the open-std.org website. Another description of the file format can be read in the POSIX specifications from OpenGroup. As an alternative to reading the standard, you can study the source code of the function collate_read in glibc/locale/programs/ld-collate.c.

The file structure looks like this:

By default, the character \ is used as the escape character, and the rest of a line after the character # is a comment. Both characters can be redefined, which is what is done in the new version of the table:

escape_char /
comment_char %

The file will contain tokens in the format <Uxxxx> or <Uxxxxxxxx> (where x is a hexadecimal digit). This is the hexadecimal representation of Unicode code points in the UCS-4 (UTF-32) encoding. All other elements in angle brackets (including <2> and the like) are treated as simple string constants that have little meaning outside of context.

The line LC_COLLATE tells us that what follows is the data describing string comparison.

First, names are declared for the weights in the comparison table and for combinations of characters. Generally speaking, the two kinds of names belong to two different entities, but in the actual file they are mixed. Weight names are declared with the keyword collating-symbol, because when comparing, Unicode characters that have identical weights will be treated as equivalent characters.

The total length of the section in the current revision of the file is about 900 lines. I pulled examples from several places to show the arbitrariness of the names and several kinds of syntax.

LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

  • collating-symbol <OSMANYA> registers the string OSMANYA in the table of weight names
  • collating-symbol <S1D000>..<S1D35F> registers a sequence of names consisting of the prefix S and a hexadecimal numeric suffix from 1D000 to 1D35F.
  • FFFF in collating-symbol <SFFFF> looks like a large unsigned integer in hexadecimal, but it is just a name and could have been spelled any other way
  • a name such as <U0413> denotes a code point in the UCS-4 encoding
  • collating-element <U0413_0301> from "<U0413><U0301>" registers a new name for a pair of Unicode points.

Once the weight names are defined, the weights themselves are set. Since only greater-than/less-than relations matter in comparison, the weights are determined by a simple sequence of listed names. The "lighter" weights are listed first, then the "heavier" ones. Let me remind you that each Unicode character is assigned four different weights. Here they are combined into a single ordered sequence. In theory, any symbolic name could be used at any of the four levels, but the comments indicate that the developers mentally separate the names by level.

% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE

Finally, the actual weight table.

The weights section is enclosed between the keyword lines order_start and order_end. Additional parameters of order_start determine in which direction strings are scanned at each comparison level. The default setting is forward. The body of the section consists of lines containing a character code and its four weights. The character code can be represented by the character itself, a code point, or a symbolic name defined earlier. Weights can also be given as symbolic names, code points, or the characters themselves. If code points or characters are used, their weight is equal to the numeric value of the code point (the position in the Unicode table). Characters not specified explicitly are (as I understand it) considered to be assigned to the table with a primary weight equal to their position in the Unicode table. The special weight value IGNORE means that the character is ignored at the corresponding comparison level.

To demonstrate the structure of the weights, I chose three fairly obvious fragments:

  • characters that are ignored completely
  • characters equivalent to the digit three at the first two levels
  • the beginning of the Cyrillic alphabet, which contains no diacritics and is therefore sorted mainly by the first and third levels.

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end
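Lines of this shape are simple enough to take apart by hand. The sketch below is not how glibc parses the file (that is done by collate_read in C); it is a hypothetical helper, parse_weight_line, which assumes % as the comment character and a weights field without embedded spaces, as in the fragment above.

```python
import re

# <symbol>, whitespace, four ';'-separated weights, optional '% comment'
LINE_RE = re.compile(r"^(<[^>]+>)\s+(\S+)\s*(?:%\s*(.*))?$")

def parse_weight_line(line):
    """Split '<U0430> <S0430>;<BASE>;<MIN>;<U0430> % comment' into its parts."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None                      # e.g. '...', 'order_end', a bare name
    symbol, weight_field, comment = m.groups()
    return symbol, weight_field.split(";"), comment

capital_a = parse_weight_line(
    "<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A")
null_char = parse_weight_line(
    "<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)")
print(capital_a)
print(null_char)
```

Comparing the parsed weights for <U0430> and <U0410> makes the structure visible: the two letters share the first two weights and differ only at the third level (<MIN> versus <CAP>).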

Now we can return to sorting the examples from the beginning of the article. The ambush lies in this part of the weight table:

<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

It can be seen that in this table the punctuation marks from the ASCII table (including the space) are almost always ignored when comparing strings. The only exceptions are strings that coincide in everything except the punctuation marks found in matching positions. To the comparison algorithm, the lines from my example (after sorting) look like this:

АбакановМихаилмаляр
ЁлкинаЭллакрановщица
ИвановаАлламаляр
ИвановАндрейслесарь

Considering that in the weight table uppercase letters in Russian come after lowercase ones (at the third level <CAP> is heavier than <MIN>), the sorting looks absolutely correct.

When the variable LC_COLLATE=C is set, a special table is loaded that specifies byte-by-byte comparison:

static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};

Since in Unicode the code point of Ё comes before that of А, the strings are sorted accordingly.
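The code points involved are easy to inspect. The snippet below only prints facts about Unicode and UTF-8; it says nothing about glibc internals.

```python
# Code point and UTF-8 bytes of the letters that matter in the example.
for ch in "ЁАЯё":
    print(ch, hex(ord(ch)), ch.encode("utf-8").hex())

# UTF-8 preserves code point order, so both comparisons agree.
yo_first_by_code_point = ord("Ё") < ord("А")
yo_first_by_bytes = "Ё".encode("utf-8") < "А".encode("utf-8")
print(yo_first_by_code_point, yo_first_by_bytes)
```

Capital Ё sits just before the main А..я block and small ё just after it, which is why a byte-wise sort puts Ёлкина at the very top.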

Text and binary tables

Obviously, string comparison is an extremely frequent operation, and parsing the CTT table is a rather expensive procedure. To optimize access to the table, it is compiled into binary form by the localedef command.

The localedef command accepts as parameters a file with the table of national features (option -i), in which all characters are represented by Unicode points, and a file that maps Unicode points to the characters of a specific encoding (option -f). As a result, binary files are created for the locale with the name specified in the last parameter.

glibc supports two binary file formats: "traditional" and "modern".

The traditional format means that the locale name is the name of a subdirectory in /usr/lib/locale/. This subdirectory stores the binary files LC_COLLATE, LC_CTYPE, LC_TIME and so on. The file LC_IDENTIFICATION contains the formal name of the locale (which may differ from the directory name) and comments.

The modern format stores all locales in a single archive, /usr/lib/locale/locale-archive, which is mapped into the virtual memory of all processes that use glibc. The locale name in the modern format undergoes some canonization: in the encoding name only digits and letters, reduced to lowercase, remain. Thus ru_RU.KOI8-R will be saved as ru_RU.koi8r.
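The canonization rule as stated here fits in a few lines. The function below, canonical_locale_name, is a hypothetical helper that implements only the rule described in the text (keep letters and digits of the encoding name, lowercased); the real glibc routine may handle further special cases.

```python
def canonical_locale_name(name):
    """Lowercase the codeset part and keep only its letters and digits."""
    base, dot, rest = name.partition(".")
    if not dot:
        return name                          # no codeset part, nothing to do
    codeset, at, modifier = rest.partition("@")
    codeset = "".join(c for c in codeset if c.isalnum()).lower()
    return base + "." + codeset + at + modifier

for n in ("ru_RU.KOI8-R", "ru_RU.MAC-CYRILLIC", "en_US.UTF-8"):
    print(n, "->", canonical_locale_name(n))
```

This explains why a locale requested as en_US.UTF-8 shows up as en_US.utf8 in listings of the archive.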

Input files are searched for in the current directory, as well as in the directories /usr/share/i18n/locales/ and /usr/share/i18n/charmaps/ for CTT files and encoding files, respectively.

For example, the command

localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

will compile the file /usr/share/i18n/locales/ru_RU using the encoding file /usr/share/i18n/charmaps/MAC-CYRILLIC.gz and save the result in /usr/lib/locale/locale-archive under the name ru_RU.maccyrillic

If you set the variable LANG=en_US.UTF-8, glibc will look for the locale binaries in the following sequence of files and directories:

/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/

If a locale occurs in both the traditional and the modern format, priority is given to the modern one.

You can view the list of compiled locales with the command locale -a.

Preparing your own comparison table

Now, armed with knowledge, you can create your own ideal string comparison table. This table should compare Russian letters correctly, including the letter Ё, and at the same time take punctuation marks into account in accordance with the ASCII table.

The process of preparing your own sorting table consists of two stages: editing the weight table and compiling it into binary form with the localedef command.

So that the comparison table can be adjusted with minimal editing effort, the ISO 14652 format provides sections for adjusting the weights of an existing table. A section begins with the keyword reorder-after and an indication of the position after which the replacement is performed. The section ends with the line reorder-end. If several parts of the table need to be adjusted, a section is created for each such part.

I copied new versions of the files iso14651_t1_common and ru_RU from the glibc repository to my home directory ~/.local/share/i18n/locales/ and slightly edited the LC_COLLATE section in ru_RU. The new versions of the files are fully compatible with my version of glibc. If you want to use older versions of the files, you will have to change the symbolic names and the position in the table from which the replacement begins.

LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE

Strictly speaking, the fields in LC_IDENTIFICATION should be changed so that they point to the locale ru_MY, but in my example this was not required, since I excluded the locale-archive archive from the locale search.

So that localedef works with the files in my folder, an additional directory for searching input files can be added through the variable I18NPATH, and the directory for saving the binary files can be specified as a path with slashes:

$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX states that LANG may contain absolute paths to directories with locale files, starting with a forward slash, but glibc in Linux counts all paths from a base directory, which can be overridden through the variable LOCPATH. After setting LOCPATH=~/.local/lib/locale/ all files related to localization will be looked up only in my folder. The locale archive is ignored when the LOCPATH variable is set.

Here is the decisive test:

$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;адвокат

Hooray! We did it!

Correcting the mistakes

I have already answered the questions about string sorting posed at the beginning, but a couple of questions remain about the errors, visible and invisible.

Let's return to the original problem.

Both the sort program and the join program use the same string comparison functions from glibc. How did it happen that join reported a sorting error on lines sorted by the sort command in the en_US.UTF-8 locale? The answer is simple: sort compares the whole line, while join compares only the key, which by default is the beginning of the line up to the first whitespace character. In my example this led to an error message because the ordering of the first words in the lines did not match the ordering of the full lines.
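The mismatch between the two views can be shown on the pair of Ivanov lines. The two key functions below are rough stand-ins, assumed for illustration: whole_line_key imitates a comparison that ignores spaces and semicolons, and join_key takes the text before the first blank.

```python
def whole_line_key(line):
    """Approximate the locale comparison by dropping ignored punctuation."""
    return line.replace(" ", "").replace(";", "").lower()

def join_key(line):
    """join's default key: the text before the first blank."""
    return line.split(None, 1)[0].lower()

lines = ["Иванов Андрей;слесарь", "Иванова Алла;маляр"]
as_sort_sees_it = sorted(lines, key=whole_line_key)
keys = [join_key(line) for line in as_sort_sees_it]
keys_are_sorted = keys == sorted(keys)

print(as_sort_sees_it)
print(keys, keys_are_sorted)
```

The list is ordered as far as the whole-line comparison is concerned, yet the sequence of first words is not, which is exactly the situation join complains about.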

The "C" locale guarantees that in sorted lines the initial substrings up to the first space will also be sorted, but this only masks the error. It is possible to pick data (people with the same surnames but different first names) that, without any error message, would give an incorrect merge result. If we want join to merge the lines of the files by full name, the correct way is to specify the field separator explicitly and to sort by the key field rather than by the whole line. In this case the merge will proceed correctly and there will be no errors in any locale:

$> sort -t ';' -k 1,1 buhg.txt > buhg.srt
$> sort -t ';' -k 1,1 mail.txt > mail.srt
$> join -t ';' buhg.srt mail.srt > result
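For comparison, here is the same merge written out as a sketch. It assumes unique keys and plain code point ordering of the first field; the function join_on_first_field and the address andrey@example.com are hypothetical and stand in for the real data.

```python
def first_field(line, sep=";"):
    return line.split(sep, 1)[0]

def join_on_first_field(left, right, sep=";"):
    """Merge-join two lists that are already sorted by their first field."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = first_field(left[i], sep), first_field(right[j], sep)
        if kl < kr:
            i += 1                      # key only in the left file
        elif kl > kr:
            j += 1                      # key only in the right file
        else:
            rest_l = left[i].split(sep, 1)[1]
            rest_r = right[j].split(sep, 1)[1]
            out.append(sep.join([kl, rest_l, rest_r]))
            i += 1
            j += 1
    return out

buhg = ["Иванова Алла;маляр", "Ёлкина Элла;крановщица",
        "Иванов Андрей;слесарь", "Абаканов Михаил;маляр"]
mail = ["Иванов Андрей;andrey@example.com"]   # hypothetical address

result = join_on_first_field(sorted(buhg, key=first_field),
                             sorted(mail, key=first_field))
print(result)
```

Because both the sort and the merge look at exactly the same key, the result does not depend on how the rest of the line would compare.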

The successful example in the CP1251 encoding contains another error. The point is that in all the Linux distributions known to me, the packages lack a compiled locale ru_RU.CP1251. If a compiled locale is not found, sort silently uses byte-by-byte comparison, which is what we observed.
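Whether a compiled locale is actually available can be probed before trusting a sort result. The sketch below uses Python's standard locale module, where a failed setlocale raises locale.Error instead of falling back silently; try_collation is a made-up helper name, and the outcome for any particular locale depends on the system.

```python
import locale

def try_collation(name):
    """Return True if the named locale can be activated for collation."""
    saved = locale.setlocale(locale.LC_COLLATE)       # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)    # always restore

for name in ("C", "ru_RU.UTF-8", "ru_RU.CP1251"):
    print(name, try_collation(name))
```

On a system where ru_RU.CP1251 has not been compiled, the last line would be expected to report False, which is the condition under which sort quietly switches to byte comparison.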

By the way, there is another small glitch related to the unavailability of compiled locales. The command LOCPATH=/tmp locale -a will list all the locales in locale-archive, but with the LOCPATH variable set, these locales will be unavailable to all programs (including locale itself).

$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison

Conclusion

If you are a programmer accustomed to thinking that strings are a set of bytes, then your choice is LC_COLLATE=C.

If you are a linguist or a dictionary compiler, then you had better compile your own locale.

If you are an ordinary user, then you just need to get used to the fact that the ls -a command lists files beginning with a dot mixed in with files beginning with a letter, while Midnight Commander, which uses its own internal functions for sorting names, puts files beginning with a dot at the start of the list.

References

Report No. 10, Unicode Collation Algorithm

Character weights at unicode.org

ICU, an implementation of a library for working with Unicode, from IBM

Sorting test using ICU

Character weights in ISO 14651

Description of the weights file format, ISO 14652

Discussion of string comparison in glibc

Source: www.habr.com
