Introduction
It all started with a short script that was supposed to merge the e-mail addresses of employees, taken from the mailing system's user list, with their job titles, taken from the HR department's database. Both lists were exported as Unicode text files in UTF-8 and saved with Unix line endings.
Contents of mail.txt
Иванов Андрей;[email protected]
Contents of buhg.txt
Иванова Алла;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Абаканов Михаил;маляр
To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:
$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted: Иванов Андрей;слесарь
Inspecting the sort output by eye showed that the order is correct on the whole, but where a male and a female surname coincide, the women come before the men:
$> sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь
It looks either like a sorting glitch in Unicode or like feminism showing up in the sorting algorithm. The former is, of course, more plausible.
Let's set join aside for now and concentrate on sort. We'll try to solve the problem by scientific trial and error. First, switch the locale from en_US to ru_RU. For sorting, setting the LC_COLLATE environment variable would be enough, but let's not waste time on details:
$> LANG=ru_RU.UTF-8 sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь
Nothing has changed.
Let's try re-encoding the files into a single-byte encoding:
$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF8
Again, nothing has changed.
There is nothing for it but to look for a solution on the Internet. Nothing turns up specifically about Russian surnames, but there are questions about other sorting oddities of the same kind.
The answer is the same everywhere: use the programmer's locale "C" and you will be happy. Let's try:
$> LANG=C sort buhg.txt
Ёлкина Элла;крановщица
Абаканов Михаил;маляр
Иванов Андрей;слесарь
Иванова Алла;адвокат
Something has changed. The Ivanovs (Иванов, Иванова) are now in the right order, although Yolkina (Ёлкина) has drifted off somewhere. Back to the original problem:
$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result
It ran without errors, as the Internet promised. And that despite Yolkina sitting on the first line.
The problem seems to be solved, but just in case, let's try one more Russian encoding, Windows CP1251:
$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF8
Oddly enough, the sort result matches the "C" locale, and the whole example accordingly runs without errors. Some kind of mysticism.
I don't like mysticism in programming, because it usually hides errors. We will have to look seriously at how sort works and at what LC_COLLATE actually affects.
By the end I will try to answer these questions:
- why were the female surnames sorted incorrectly?
- why did LANG=ru_RU.CP1251 turn out to be equivalent to LANG=C?
- why do sort and join disagree about the order of sorted strings?
- why is there an error in every one of my examples?
- and finally, how do you sort strings the way you want?
Sorting in Unicode
The first stop is Unicode technical report No. 10, titled "Unicode Collation Algorithm".
Collation, the "comparison" of strings, is the basis of any sorting algorithm. The algorithms themselves may differ ("bubble", "merge", "quick"), but all of them compare pairs of strings to decide their order.
Sorting strings in a natural language is a rather hard problem. Even in the simplest single-byte encodings, the order of letters in an alphabet that differs at all from the English Latin alphabet no longer matches the order of the numeric values used to encode those letters. In the German alphabet, for instance, the letter Ö sits between O and P, while in the CP850 encoding it falls between ÿ and Ü.
You can try to abstract away from a particular encoding and consider "ideal" letters arranged in some order, as Unicode does. The encodings UTF-8, UTF-16 or the single-byte KOI8-R (when only a small subset of Unicode is needed) give different numeric representations of the letters but refer to the same elements of the base table.
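To make the gap between "numeric order" and "alphabet order" concrete, here is a small Python sketch. It relies only on the standard `cp850`, `koi8_r` and UTF codecs; the variable names are mine and purely illustrative.

```python
# Byte values of a few letters in CP850: numeric order is not alphabet order.
letters = ["O", "P", "Ö", "ÿ", "Ü"]
cp850_codes = {ch: ch.encode("cp850")[0] for ch in letters}
for ch, code in sorted(cp850_codes.items(), key=lambda item: item[1]):
    print(f"{ch}: 0x{code:02X}")

# One abstract character (one Unicode code point), several byte representations.
cyrillic_a = "\u0410"  # CYRILLIC CAPITAL LETTER A
representations = {
    "utf-8": cyrillic_a.encode("utf-8"),
    "utf-16-be": cyrillic_a.encode("utf-16-be"),
    "koi8-r": cyrillic_a.encode("koi8_r"),
}
for name, raw in representations.items():
    print(name, raw.hex())
```

Sorting by these raw byte values is exactly what a byte-wise comparison does, which is why it cannot be expected to follow any alphabet.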
It turns out that even if we build the character table from scratch, we cannot assign it one universal order. National alphabets that use the same letters may order them differently. In French, for example, Æ is treated as a ligature and sorted as the string AE. In Norwegian, Æ is a separate letter placed after Z. Incidentally, besides ligatures such as Æ there are letters written with several characters: the Czech alphabet has the letter Ch, which sits between H and I.
Beyond differences between alphabets, other national traditions also affect sorting. In particular: in what order should words made of upper- and lowercase letters appear in a dictionary? Punctuation can influence sorting as well. Spanish puts an inverted question mark at the start of an interrogative sentence (¿Te gusta la música?). Clearly, interrogative sentences should not be gathered into a separate cluster outside the alphabet, but how should lines with other punctuation marks be ordered?
I will not dwell on sorting strings in languages that differ greatly from European ones. Note only that in languages written right to left or top to bottom, the characters of a string are most likely stored in reading order, and that even non-alphabetic writing systems have their own ways of ordering strings character by character. Hieroglyphs, for example, can be ordered by their written form.
Based on the above, the main requirements for comparing strings on top of the Unicode tables were formulated:
- string comparison does not depend on the position of characters in the code table;
- sequences of characters that form a single character are reduced to canonical form (A + ring above is the same as Å);
- when strings are compared, a character is considered in the context of the string and, if necessary, combined with its neighbours into one comparison unit (Ch in Czech) or split into several (Æ in French);
- all national features (alphabet, upper/lowercase, punctuation, order of scripts) must be configurable, down to specifying the order by hand (emoji);
- comparison matters not only for sorting but in many other places, for example when specifying ranges of strings ({A... z} substitution in bash);
- comparison must be reasonably fast.
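The canonical-form requirement from the list above can be tried out with Python's standard `unicodedata` module. This is only an illustration of normalization as such, not of how any particular collation library applies it.

```python
import unicodedata

decomposed = "A\u030A"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE

# As raw code point sequences the two spellings differ...
print(decomposed == precomposed)

# ...but after canonical normalization they are the same string.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)
```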
In addition, the authors of the report listed properties of comparison that algorithm developers should not rely on:
- the comparison algorithm should not require a separate character set for each language (Russian and Ukrainian share most Cyrillic letters);
- comparison should not rely on the order of characters in the Unicode tables;
- the weight of a string should not be an attribute of the string, since the same string may have different weights in different cultural contexts;
- the weights of strings may change when strings are concatenated or split (x < y does not imply xz < yz);
- different strings with identical weights are considered equal from the sorting algorithm's point of view. Imposing an additional order on such strings is possible but may hurt performance;
- on repeated sorting, strings with identical weights may swap places. Stability is a property of a particular sorting algorithm, not of the string comparison algorithm (see the previous item);
- sorting rules may change over time as cultural traditions are refined or changed.
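The remark that x < y does not imply xz < yz is easy to reproduce with a toy model of the Czech Ch. The alphabet fragment and the `collation_key` helper below are my own simplification (a single level, lowercase only), not a real Czech tailoring.

```python
# Fragment of a Czech-like alphabet in which "ch" is one unit placed after "h".
CZECH_LIKE_ORDER = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j", "k"]
RANK = {unit: position for position, unit in enumerate(CZECH_LIKE_ORDER)}

def collation_key(text):
    """Split text into comparison units, treating 'ch' as a single unit."""
    key = []
    i = 0
    while i < len(text):
        if text[i:i + 2] == "ch":
            key.append(RANK["ch"])
            i += 2
        else:
            key.append(RANK[text[i]])
            i += 1
    return key

x, y, z = "c", "d", "h"
print(collation_key(x) < collation_key(y))          # c sorts before d
print(collation_key(x + z) < collation_key(y + z))  # but "ch" sorts after "dh"
```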
It is also stipulated that the comparison algorithm knows nothing about the semantics of the strings it processes. Strings consisting only of digits are therefore not compared as numbers, and in lists of English names the article is not dropped (Beatles, The).
To satisfy all of these requirements, a multi-level (in practice four-level) table-driven sorting algorithm is proposed.
First, the characters of a string are reduced to canonical form and grouped into comparison units. Each comparison unit is assigned several weights, one for each level of comparison. The weights are elements of an ordered set (here, integers) that can be compared as greater or smaller. The special value IGNORED (0x0) means that the unit takes no part in the comparison at the corresponding level. The comparison of two strings may be repeated several times, each time using the weights of the next level. At each level the weights of the comparison units of the two strings are compared with each other in sequence.
In different implementations of the algorithm for different national traditions the values of the weights may differ, but the Unicode standard includes a base table of weights, the "Default Unicode Collation Element Table" (DUCET). Note that setting the LC_COLLATE variable is in effect a way of selecting the weight table used by the string comparison function.
The DUCET weights are arranged as follows:
- at the first level, all letters are reduced to one case, diacritics are discarded and punctuation marks (not all of them) are ignored;
- at the second level, only diacritics are taken into account;
- at the third level, only case is taken into account;
- at the fourth level, only punctuation marks are taken into account.
Comparison proceeds in several passes: first the first-level weights are compared; if they coincide, the comparison is repeated with the second-level weights; then, if necessary, with the third and the fourth.
Comparison stops as soon as the strings contain corresponding comparison units with different weights. Strings whose weights are equal at all four levels are considered equal to each other.
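The multi-pass comparison described above can be sketched in a few lines of Python. The `WEIGHTS` table is invented for this illustration (it is not DUCET), with 0 playing the role of the "ignored" weight.

```python
IGNORE = 0

# character -> (level 1: base letter, level 2: diacritic, level 3: case, level 4: punctuation)
WEIGHTS = {
    "a": (1, 1, 1, IGNORE),
    "A": (1, 1, 2, IGNORE),
    "á": (1, 2, 1, IGNORE),
    "b": (2, 1, 1, IGNORE),
    "B": (2, 1, 2, IGNORE),
    "-": (IGNORE, IGNORE, IGNORE, 1),
}

def level_weights(text, level):
    """Weights of one level, skipping units that are ignored at that level."""
    return [WEIGHTS[ch][level] for ch in text if WEIGHTS[ch][level] != IGNORE]

def compare(left, right):
    """Return (sign, deciding_level); sign is -1, 0 or 1, level is 1..4 or None."""
    for level in range(4):
        left_w, right_w = level_weights(left, level), level_weights(right, level)
        if left_w != right_w:
            return (-1 if left_w < right_w else 1), level + 1
    return 0, None

for pair in [("a-b", "aa"), ("áb", "ab"), ("ab", "aB"), ("a-b", "ab"), ("ab", "ab")]:
    print(pair, compare(*pair))
```

Note how "a-b" sorts after "aa" although the hyphen has a smaller code than any letter: at the first level the hyphen simply does not exist.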
This algorithm (with a pile of additional technical details) gave report No. 10 its name: "Unicode Collation Algorithm" (UCA).
This is where the sorting behaviour from our example starts to become a little clearer. It would be good to check it against the Unicode standard.
Experimenting with UCA
Checking by hand that the lines are sorted correctly according to DUCET would be tedious but, fortunately for us, there is a reference implementation of a library for working with Unicode: "International Components for Unicode" (ICU).
The website of this library, developed at IBM, has demo pages, including one for the string comparison algorithm. With the default settings our test lines come out like this:
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;адвокат
By the way, the ICU website also explains how the comparison algorithm treats punctuation marks.
Unicode has helped us, but for the reasons behind the strange behaviour of sort in Linux we will have to look elsewhere.
Sorting in glibc
A quick look at the source code of the sort utility from GNU Core Utils showed that, in the utility itself, localization boils down to printing the current value of the LC_COLLATE variable when run in debug mode:
$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules
String comparison is done with the standard function strcoll, which means that everything interesting lives in the glibc library.
The glibc project wiki has a section devoted to string comparison.
The most interesting information on the wiki was a link to further reading.
Now that we have all the information about the algorithm and the auxiliary tables, we can go back to the original problem and work out how strings are sorted in the Russian locale.
ISO 14651 / 14652
In most Linux distributions the source of the table we are interested in, the CTT, lives in the directory /usr/share/i18n/locales/. The table itself is in the file iso14651_t1_common. The directive copy iso14651_t1_common then pulls it into the file iso14651_t1, which in turn is included in the national files, among them en_US and ru_RU. Most Linux distributions ship all the source files in the base installation; if they are missing, you will have to install an additional package from the distribution.
The format of the file iso14651_t1 may look terribly verbose, with non-obvious rules for building names, but once you look into it everything is simple. The format is described in the ISO 14652 standard, a copy of which is available for download.
The file is laid out as follows. By default the backslash (\) is the escape character, and everything from the # character to the end of the line is a comment. Both characters can be redefined, which is what the current version of the table does:
escape_char /
comment_char %
The file contains tokens of the form <Uxxxx> or <Uxxxxxxxx> (where x is a hexadecimal digit). This is the hexadecimal representation of Unicode code points in the UCS-4 (UTF-32) encoding. All other elements in angle brackets (including names such as <2>) are treated as plain string constants that have no meaning outside their context.
The line LC_COLLATE tells us that what follows is the data describing string comparison.
First, names are declared for the weights in the comparison table and for combinations of characters. Strictly speaking, these two kinds of names belong to two different entities, but in the actual file they are mixed together. Weight names are declared with the keyword collating-symbol, because when strings are compared, Unicode characters that have the same weight are treated as equivalent characters.
The whole section is about 900 lines long in the current revision of the file. I pulled examples from several places to show how arbitrary the names are and to cover several kinds of syntax.
LC_COLLATE
collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"
- collating-symbol <OSMANYA> registers the string OSMANYA in the table of weight names;
- collating-symbol <S1D000>..<S1D35F> registers a sequence of names made of the prefix S and a hexadecimal numeric suffix from 1D000 to 1D35F;
- FFFF in collating-symbol <SFFFF> looks like a large unsigned integer in hexadecimal, but it is only a name like any other;
- the name <U0413> denotes a code point in the UCS-4 encoding;
- collating-element <U0413_0301> from "<U0413><U0301>" registers a new name for a pair of Unicode code points.
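As a rough illustration of the range syntax in the list above, the following Python sketch expands a `<S1D000>..<S1D35F>` specification into individual names. The helper `expand_symbol_range` is hypothetical and assumes a single-letter prefix and a hexadecimal suffix, as described in the text; it is not how localedef parses the file.

```python
def expand_symbol_range(spec, prefix="S"):
    """Expand '<S1D000>..<S1D35F>' into ['<S1D000>', ..., '<S1D35F>'] (illustrative only)."""
    first, last = (part.strip("<>") for part in spec.split(".."))
    width = len(first) - len(prefix)           # keep the original zero padding
    start = int(first[len(prefix):], 16)
    end = int(last[len(prefix):], 16)
    return [f"<{prefix}{value:0{width}X}>" for value in range(start, end + 1)]

names = expand_symbol_range("<S1D000>..<S1D35F>")
print(len(names), names[0], names[-1])
```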
Once the weight names are declared, the weights themselves are defined. Since only greater-than and less-than relations matter for comparison, the weights are defined simply by the order in which the names are listed: "lighter" weights come first, "heavier" ones later. Recall that every Unicode character is assigned four different weights. Here they are all merged into one ordered sequence. In theory any symbolic name could be used at any of the four levels, but the comments show that the developers mentally split the names by level.
% Symbolic weight assignments
% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE
Finally, the actual weight table.
The weights section is enclosed between lines with the keywords order_start and order_end. The optional parameters of order_start define the direction in which strings are scanned at each level of comparison; the default is forward. The body of the section consists of lines that contain a character code and its four weights. The character code can be given as the character itself, as a code point, or as a symbolic name defined earlier. Weights can likewise be given as symbolic names, code points, or the characters themselves. When code points or characters are used, their weight equals the numeric value of the code point (the position in the Unicode table). Characters that are not listed explicitly are (as far as I understand) treated as if they were assigned a primary weight equal to their position in the Unicode table. The special weight value IGNORE means that the character is ignored at the corresponding level of comparison.
To show the structure of the weights, I picked three fairly self-explanatory fragments:
- characters that are ignored completely;
- characters equivalent to the digit three at the first two levels;
- the beginning of the Cyrillic alphabet, which has no diacritics and is therefore ordered mainly by the first and third levels.
order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end
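The line format of the weights section is simple enough to pick apart with a regular expression. The parser below is a hypothetical sketch for lines shaped like the excerpt above; it ignores escapes, continuation lines and every other directive of the real format.

```python
import re

# symbol, four ';'-separated weights, optional '% comment'
ENTRY = re.compile(r"^(<[^>]+>)\s+(\S+)(?:\s+%\s*(.*))?$")

def parse_order_line(line):
    """Return (symbol, [weights], comment), or None for lines that are not entries."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    symbol, weights, comment = match.groups()
    return symbol, weights.split(";"), comment or ""

def code_point_char(symbol):
    """Turn a token such as '<U0430>' into the character it names."""
    return chr(int(symbol[2:-1], 16))

small_a = parse_order_line("<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A")
capital_a = parse_order_line("<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A")

# Same first- and second-level weights, different third-level weight.
print(code_point_char(small_a[0]), small_a[1])
print(code_point_char(capital_a[0]), capital_a[1])
```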
Now we can go back to sorting the examples from the beginning of the article. The ambush lies in this part of the weight table:
<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...
As you can see, in this table the punctuation marks from the ASCII range (the space included) are almost always ignored when strings are compared. The only exception is strings that match in everything except the punctuation marks found at corresponding positions. To the comparison algorithm, the lines from my example (after sorting) look like this:
АбакановМихаилмаляр
ЁлкинаЭллакрановщица
ИвановаАлламаляр
ИвановАндрейслесарь
Given that in the weight table capital Russian letters come after lowercase ones (at the third level <CAP> is heavier than <MIN>), the sort order looks absolutely correct.
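The effect is easy to reproduce without glibc. The sketch below approximates only the first comparison level: it drops everything that is not a letter, folds case and treats Ё as Е. The sample names are my reading of the test data, and `rough_primary_key` is a deliberately crude stand-in for the real weight table.

```python
lines = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]

def rough_primary_key(line):
    """Letters only, lowercased, with 'ё' folded into 'е' (first level only)."""
    letters = "".join(ch for ch in line.lower() if ch.isalpha())
    return letters.replace("ё", "е")

ordered = sorted(lines, key=rough_primary_key)
for line in ordered:
    print(f"{rough_primary_key(line):<24} <- {line}")
```

With the space gone, the eighth letter decides: the "а" of "Алла" against the "н" of "Андрей", so the female surname wins.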
When the variable is set to LC_COLLATE=C, a special table is loaded that specifies byte-by-byte comparison:
static const uint32_t collseqwc[] =
{
8, 1, 8, 0x0, 0xff,
/* 1st-level table */
6 * sizeof (uint32_t),
/* 2nd-level table */
7 * sizeof (uint32_t),
/* 3rd-level table */
L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',
...
L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};
Since in Unicode the code point of Ё comes before that of А, the strings are sorted accordingly.
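Byte-wise ordering can be imitated in Python by sorting on the UTF-8 encoding of each line. The sample names are again my reading of the test data; nothing here calls into glibc.

```python
lines = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]

# Compare raw UTF-8 bytes, roughly what the "C" locale does.
byte_order = sorted(lines, key=lambda line: line.encode("utf-8"))
for line in byte_order:
    print(line.encode("utf-8")[:2].hex(), line)

# Ё (U+0401) precedes А (U+0410), and the space (0x20) precedes every Cyrillic byte.
print(hex(ord("Ё")), hex(ord("А")))
```

This explains both effects seen earlier: Ёлкина jumps to the top, and "Иванов " sorts before "Иванова" because the space is the smallest byte involved.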
Text and binary tables
Obviously, string comparison is an extremely common operation, and parsing the CTT table is a rather expensive procedure. To speed up access, the table is compiled into binary form with the localedef command.
The localedef command takes as parameters a file with the table of national features (option -i), in which all characters are represented by Unicode code points, and a file that maps Unicode code points to the characters of a specific encoding (option -f). As a result, binary files are created for the locale whose name is given in the last parameter.
glibc supports two binary file formats: "traditional" and "modern".
In the traditional format the locale name is the name of a subdirectory of /usr/lib/locale/. This subdirectory stores the binary files LC_COLLATE, LC_CTYPE, LC_TIME and so on. The file LC_IDENTIFICATION contains the formal name of the locale (which may differ from the directory name) and comments.
The modern format stores all locales in a single archive, /usr/lib/locale/locale-archive, which is mapped into the virtual memory of every process that uses glibc. In the modern format the locale name is canonicalized: in the encoding name only digits and letters remain, and the letters are lowercased. So ru_RU.KOI8-R is stored as ru_RU.koi8r.
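The canonicalization rule just described fits in a few lines of Python. `normalize_codeset` is a hypothetical helper that implements only the rule as stated in the text (keep letters and digits of the encoding name, lowercase them); the real glibc routine may handle further corner cases.

```python
def normalize_codeset(locale_name):
    """'ru_RU.KOI8-R' -> 'ru_RU.koi8r'; names without an encoding are returned as-is."""
    if "." not in locale_name:
        return locale_name
    base, codeset = locale_name.split(".", 1)
    modifier = ""
    if "@" in codeset:                       # keep a modifier such as '@euro' untouched
        codeset, rest = codeset.split("@", 1)
        modifier = "@" + rest
    normalized = "".join(ch.lower() for ch in codeset if ch.isalnum())
    return f"{base}.{normalized}{modifier}"

for name in ["ru_RU.KOI8-R", "en_US.UTF-8", "ru_RU.MAC-CYRILLIC", "C"]:
    print(name, "->", normalize_codeset(name))
```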
Input files are looked up in the current directory and in the directories /usr/share/i18n/locales/ and /usr/share/i18n/charmaps/, for CTT files and encoding files respectively.
For example, the command
localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC
compiles the file /usr/share/i18n/locales/ru_RU using the encoding file /usr/share/i18n/charmaps/MAC-CYRILLIC.gz and stores the result in /usr/lib/locale/locale-archive under the name ru_RU.maccyrillic
If you set the variable LANG=en_US.UTF-8, glibc looks for the locale binaries in the following sequence of files and directories:
/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/
If a locale exists in both the traditional and the modern format, the modern one takes priority.
You can list the compiled locales with the command locale -a.
Preparing your own comparison table
Now, armed with this knowledge, you can build your own ideal string comparison table. It should compare Russian letters correctly, including the letter Ё, and at the same time take punctuation marks into account according to the ASCII table.
Preparing your own sorting table takes two steps: editing the weight table and compiling it into binary form with the localedef command.
So that a comparison table can be adjusted with minimal editing, the ISO 14652 format provides sections for adjusting the weights of an existing table. A section starts with the keyword reorder-after and the position after which the replacement is made, and ends with the line reorder-end. If several parts of the table need correcting, a separate section is created for each part.
I copied fresh versions of the files iso14651_t1_common and ru_RU from the glibc repository into my home directory, ~/.local/share/i18n/locales/, and slightly edited the LC_COLLATE section in ru_RU. The fresh versions of the files are fully compatible with my version of glibc. If you want to use older versions of the files, you will have to change the symbolic names and the place in the table where the replacement starts.
LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE
Strictly speaking, the fields in LC_IDENTIFICATION should also be changed so that they point to the locale ru_MY, but in my example this was not required, because I excluded the locale-archive archive from the locale search.
To make localedef work with the files in my folder, the I18NPATH variable can be used to add an extra directory to search for input files, and the directory for the compiled binary files can be given as a path containing slashes:
$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8
POSIX states that LANG may contain absolute paths to directories with locale files, starting with a forward slash, but glibc on Linux resolves all paths relative to a base directory, which can be overridden with the LOCPATH variable. After setting LOCPATH=~/.local/lib/locale/, all files related to localization are looked up only in my folder. With LOCPATH set, the locale archive is ignored.
Here is the decisive test:
$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;адвокат
Hooray! We did it!
Fixing the mistakes
I have already answered the questions about string sorting raised at the beginning, but a few questions remain about the errors, both visible and invisible.
Let's go back to the original problem.
Both sort and join use the same string comparison functions from glibc. So how did join report a sorting error on lines sorted by sort in the en_US.UTF-8 locale? The answer is simple: sort compares the whole line, while join compares only the key, which by default is the beginning of the line up to the first whitespace character. In my example this produced an error message because the order of the first words of the lines did not match the order of the complete lines.
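The mismatch between whole-line order and key order can be shown with a small Python model. The collation key is the same crude "letters only" approximation of the first comparison level (an assumption of this sketch, not glibc's strcoll), and `first_out_of_order` is a hypothetical stand-in for the order check that join performs on its key field.

```python
lines = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]

def approx_key(text):
    """Letters only, lowercased, 'ё' folded into 'е'."""
    return "".join(ch for ch in text.lower() if ch.isalpha()).replace("ё", "е")

def first_field(line):
    """Default join key: everything up to the first whitespace character."""
    return line.split(None, 1)[0]

def first_out_of_order(sorted_lines):
    """1-based number of the first line whose key is smaller than the previous key."""
    for index in range(1, len(sorted_lines)):
        previous, current = sorted_lines[index - 1], sorted_lines[index]
        if approx_key(first_field(current)) < approx_key(first_field(previous)):
            return index + 1
    return None

whole_line_sorted = sorted(lines, key=approx_key)   # what sort does
print([first_field(line) for line in whole_line_sorted])
print(first_out_of_order(whole_line_sorted))        # what join then trips over
```

In this model the offending line is the fourth one, the same position that join complained about.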
The "C" locale guarantees that in sorted lines the leading substrings up to the first space are sorted as well, but this only masks the error. It is possible to pick data (people with the same surname but different first names) that will produce a wrong merged file without any error message. If we want join to merge the lines of the files by full name, the correct way is to specify the field separator explicitly and to sort by the key field rather than by the whole line. In that case the merge works correctly and without errors in any locale:
$> sort -t ';' -k 1,1 buhg.txt > buhg.srt
$> sort -t ';' -k 1,1 mail.txt > mail.srt
$> join -t ';' buhg.srt mail.srt > result
The example that ran successfully in the CP1251 encoding contains yet another error. In every Linux distribution known to me, the packages lack a compiled ru_RU.CP1251 locale. If the compiled locale is not found, sort silently falls back to byte-by-byte comparison, which is what we observed.
By the way, there is one more small glitch related to compiled locales being unavailable. The command LOCPATH=/tmp locale -a prints the list of all locales in locale-archive, but with LOCPATH set these locales are unavailable to all programs (including locale itself).
$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8
$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules
$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison
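Because the fallback to byte comparison is silent, it can be worth checking up front whether a collation locale is actually available. A minimal Python sketch using the standard `locale` module follows; `collation_locale_available` is my own helper name, and the result depends entirely on the locales installed on the machine where it runs (and on LOCPATH, if set).

```python
import locale

def collation_locale_available(name):
    """Try to select `name` for LC_COLLATE, then restore the previous setting."""
    saved = locale.setlocale(locale.LC_COLLATE)   # query the current value
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return False
    else:
        return True
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)

for candidate in ["C", "en_US.UTF-8", "ru_RU.CP1251"]:
    print(candidate, collation_locale_available(candidate))
```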
Conclusion
If you are a programmer used to thinking of strings as sets of bytes, your choice is LC_COLLATE=C.
If you are a linguist or a dictionary compiler, you are better off compiling your own locale.
If you are an ordinary user, you just have to get used to the fact that ls -a lists files starting with a dot interleaved with files starting with a letter, while Midnight Commander, which uses its own internal functions to sort names, puts files starting with a dot at the top of the list.
Source: www.habr.com