How Linux sorts strings

Introduction

It all started with a small script that was supposed to merge the e-mail addresses of employees, taken from a mailing list, with their job titles, taken from the HR department's database. Both lists were exported to Unicode UTF-8 text files and saved with Unix line endings.

Contents of mail.txt

Иванов Андрей;[email protected]

Contents of buhg.txt

Иванова Алла;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Абаканов Михаил;маляр

To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:

$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted: Иванов Андрей;слесарь

Looking at the sorted output by eye showed that the sorting is correct on the whole, but where a male and a female surname coincide, the female one comes before the male one:

$> sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

It looks like either a sorting glitch in Unicode or a manifestation of feminism in the sorting algorithm. The former is, of course, more plausible.

Let's put join aside for now and concentrate on sort. We will try to solve the problem by scientific poking. First, let's change the locale from en_US to ru_RU. For sorting it would be enough to set the LC_COLLATE environment variable, but we won't waste time on trifles:

$> LANG=ru_RU.UTF-8 sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

Nothing has changed.

Let's try re-encoding the file into a single-byte encoding:

$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF8

Again nothing has changed.

There is nothing for it, we have to look for a solution on the Internet. There is nothing directly about Russian surnames, but there are questions about other sorting oddities. For example, here is one problem: unix sort treats '-' (dash) characters as invisible. In short, the strings "a-b", "aa", "a-c" are sorted as "aa", "a-b", "a-c".
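The effect can be reproduced with a few lines of Python. This is only a sketch: the three strings are hypothetical samples, and the key function that drops dashes is my stand-in for a locale that ignores them, not the code sort actually runs.

```python
# Hypothetical sample strings; '-' (0x2D) sorts before 'a' (0x61) by code point.
strings = ["a-b", "aa", "a-c"]


def dash_invisible(s):
    """Comparison key that drops dashes, mimicking a locale that ignores them."""
    return s.replace("-", "")


byte_order = sorted(strings)                       # plain code-point comparison
dash_ignored = sorted(strings, key=dash_invisible)  # dash treated as invisible

print(byte_order)
print(dash_ignored)
```

With plain code-point comparison the dash sorts before any letter, so the two orders differ.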

The answer is the same everywhere: use the programmer's locale "C" and you will be happy. Let's try:

$> LANG=C sort buhg.txt
Ёлкина Элла;крановщица
Абаканов Михаил;маляр
Иванов Андрей;слесарь
Иванова Алла;маляр

Something has changed. The Ivanovs have lined up in the right order, although Yolkina (Ёлкина) has slipped off somewhere. Let's go back to the original problem:

$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result

It worked without errors, just as the Internet promised. And that despite Yolkina being on the first line.

The problem seems to be solved, but just in case, let's try another Russian encoding, Windows CP1251:

$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF8

The sorting result, oddly enough, matches the "C" locale, and the whole example accordingly runs without errors. Some kind of mysticism.

I don't like mysticism in programming, because it usually hides errors. We will have to look seriously into how sort works and what LC_COLLATE affects.

At the end I will try to answer these questions:

  • why were the female surnames sorted incorrectly?
  • why did LANG=ru_RU.CP1251 turn out to be equivalent to LANG=C?
  • why do sort and join have different ideas about the order of sorted strings?
  • why are there errors in all my examples?
  • and finally, how do you sort strings the way you want?

Sorting in Unicode

The first stop is Technical Report No. 10, titled Unicode Collation Algorithm, on unicode.org. The report contains a lot of technical detail, so let me give a short summary of the main ideas.

Collation, the "comparison" of strings, is the basis of any sorting algorithm. The algorithms themselves may differ ("bubble", "merge", "quick"), but all of them compare a pair of strings to determine their relative order.

Sorting strings in a natural language is a rather hard problem. Even in the simplest single-byte encodings, the order of letters in an alphabet, even one that differs only slightly from the English Latin alphabet, no longer matches the order of the numeric values used to encode those letters. In the German alphabet the letter Ö stands between O and P, while in the CP850 encoding it falls between ÿ and Ü.
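The CP850 claim is easy to check with Python's built-in cp850 codec. This is a small sketch for illustration only; the variable names are mine.

```python
# Byte values of a few letters in the single-byte CP850 encoding.
letters = ["O", "P", "\u00ff", "\u00d6", "\u00dc"]  # O, P, ÿ, Ö, Ü
cp850 = {ch: ch.encode("cp850")[0] for ch in letters}

for ch in letters:
    print(ch, hex(cp850[ch]))

# Sorting by raw byte value puts Ö far away from O and P,
# between ÿ and Ü, instead of between O and P as in the German alphabet.
by_byte = sorted(letters, key=lambda ch: cp850[ch])
print(by_byte)
```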

You can try to abstract away from a particular encoding and consider "ideal" letters arranged in some order, as is done in Unicode. The encodings UTF-8, UTF-16 or the single-byte KOI8-R (if only a limited subset of Unicode is needed) give different numeric representations of the letters, but refer to the same elements of the base table.

It turns out that even if we build the character table from scratch, we cannot assign it a universal character order. In different national alphabets that use the same letters, the order of those letters may differ. For example, in French Æ is treated as a ligature and sorted as the string AE. In Norwegian Æ is a separate letter that comes after Z. By the way, besides ligatures like Æ there are letters written with several characters. In the Czech alphabet there is the letter Ch, which stands between H and I.

Besides differences in alphabets, there are other national traditions that affect sorting. In particular, the question arises: in what order should words made of upper-case and lower-case letters appear in a dictionary? Sorting can also be affected by the use of punctuation. In Spanish, an inverted question mark is placed at the beginning of an interrogative sentence (as in ¿Te gusta la música?, "Do you like music?"). In this case it is obvious that interrogative sentences should not be grouped into a separate cluster outside the alphabet, but how should lines with other punctuation marks be sorted?

I will not dwell on sorting strings in languages very different from the European ones. I will only note that in languages written right to left or top to bottom, the characters of a string are most likely stored in reading order, and even non-alphabetic writing systems have their own ways of ordering strings character by character. For example, hieroglyphs can be ordered by the way they are drawn (the keys of Chinese characters) or by pronunciation. To be honest, I have no idea how emoji should be ordered, but something can be thought up for them too.

Based on the features listed above, the basic requirements for comparing strings on the basis of Unicode tables were formulated:

  • string comparison does not depend on the position of the characters in the code table;
  • sequences of characters that form a single character are reduced to canonical form (A + upper ring is the same as Å);
  • when comparing strings, a character is considered in the context of the string and, if necessary, combined with its neighbours into one comparison unit (Ch in Czech) or split into several (Æ in French);
  • all national features (alphabet, upper/lower case, punctuation, the order of writing systems) must be configurable, right down to an explicitly specified order (emoji);
  • comparison matters not only for sorting but in many other places, for example when specifying ranges of strings (the {A..z} substitution in bash);
  • comparison must be reasonably fast.
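The canonical-form requirement from the list above can be tried out with Python's standard unicodedata module. This is a minimal sketch; the variable names are chosen just for this illustration.

```python
import unicodedata

decomposed = "A\u030a"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
precomposed = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE

# The raw code-point sequences differ...
raw_equal = decomposed == precomposed

# ...but after canonical normalization they are the same string.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(raw_equal, nfc_equal, nfd_equal)
```

A comparison function that skips this step would treat the two spellings of Å as different strings.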

In addition, the authors of the report listed properties of comparison that algorithm developers must not rely on:

  • the comparison algorithm must not require a separate character set for each language (Russian and Ukrainian share most of the Cyrillic characters);
  • comparison must not rely on the order of characters in the Unicode tables;
  • the weight of a string must not be an attribute of the string, because the same string may have different weights in different cultural contexts;
  • string weights may change when strings are concatenated or split (from x < y it does not follow that xz < yz);
  • different strings with identical weights are considered equal from the point of view of the sorting algorithm. Introducing an additional ordering for such strings is possible, but it may hurt performance;
  • on repeated sorting, lines with identical weights may swap places. Stability is a property of a particular sorting algorithm, not of the string comparison algorithm (see the previous item);
  • sorting rules may change over time as cultural traditions are refined or change.
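The point about concatenation is easiest to see with the Czech Ch. Below is a sketch under a loud assumption: the "alphabet" is the plain Latin one with the digraph ch inserted between h and i, and diacritics are left out entirely, so it is a toy and not the real Czech collation.

```python
import string

# Simplified Czech primary order: plain Latin letters with the digraph "ch"
# placed between "h" and "i". Diacritics are deliberately left out.
ORDER = list(string.ascii_lowercase)
ORDER.insert(ORDER.index("h") + 1, "ch")
WEIGHT = {unit: i for i, unit in enumerate(ORDER)}


def collation_units(s):
    """Split a lowercase word into comparison units, treating 'ch' as one unit."""
    units, i = [], 0
    while i < len(s):
        if s.startswith("ch", i):
            units.append("ch")
            i += 2
        else:
            units.append(s[i])
            i += 1
    return units


def czech_key(s):
    """Sort key: the list of weights of the word's comparison units."""
    return [WEIGHT[u] for u in collation_units(s)]


x, y, z = "c", "d", "h"
print(czech_key(x) < czech_key(y))          # c sorts before d
print(czech_key(x + z) < czech_key(y + z))  # but "ch" does not sort before "dh"

words = sorted(["chata", "cena", "idea", "hrad"], key=czech_key)
print(words)
```

Here x < y holds, yet appending the same z to both flips the order, because "c" followed by "h" becomes a different comparison unit.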

It is also stated separately that the comparison algorithm knows nothing about the semantics of the strings being processed. So strings consisting only of digits must not be compared as numbers, and in lists of English titles the article is not treated specially (Beatles, The).

To satisfy all of these requirements, a multi-level (in fact, four-level) table-driven sorting algorithm is proposed.

First, the characters of a string are reduced to canonical form and grouped into comparison units. Each comparison unit is assigned several weights, one for each comparison level. The weights of comparison units are elements of ordered sets (in this case, integers) that can be compared as greater or less. The special value IGNORED (0x0) means that at the corresponding comparison level this unit does not take part in the comparison. The comparison of two strings can be repeated several times, using the weights of the corresponding levels. At each level, the weights of the comparison units of the two strings are compared one after another.

In different implementations of the algorithm for different national traditions the weights may differ, but the Unicode standard includes a base table of weights, the "Default Unicode Collation Element Table" (DUCET). I would like to note that setting the LC_COLLATE variable is in fact an instruction to choose a weight table in the string comparison function.

The DUCET weights are arranged as follows:

  • at the first level, all letters are reduced to the same case, diacritical marks are discarded, and punctuation marks (not all of them) are ignored;
  • at the second level, only diacritical marks are taken into account;
  • at the third level, only case is taken into account;
  • at the fourth level, only punctuation marks are taken into account.

The comparison is done in several passes: first the first-level weights are compared; if they are equal, the comparison is repeated with the second-level weights; then, possibly, with the third and the fourth.

The comparison ends as soon as the strings contain corresponding comparison units with different weights. Strings whose weights are equal at all four levels are considered equal to each other.
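The multi-pass comparison can be sketched in a few lines of Python. Everything in this block is an illustration under assumptions: the weight table is a made-up toy with six characters, not DUCET, and the names TOY_TABLE, sort_key and compare are mine.

```python
IGNORE = 0

# Toy weight table: character -> (level 1, level 2, level 3, level 4).
# Level 1: base letter, level 2: diacritics, level 3: case, level 4: punctuation.
TOY_TABLE = {
    "a": (1, 1, 1, IGNORE),
    "A": (1, 1, 2, IGNORE),
    "\u00e1": (1, 2, 1, IGNORE),  # a with acute
    "b": (2, 1, 1, IGNORE),
    "B": (2, 1, 2, IGNORE),
    "-": (IGNORE, IGNORE, IGNORE, 1),
}


def sort_key(s):
    """One tuple of weights per level, with ignored units dropped at that level."""
    return tuple(
        tuple(TOY_TABLE[ch][level] for ch in s if TOY_TABLE[ch][level] != IGNORE)
        for level in range(4)
    )


def compare(x, y):
    """Return -1, 0 or 1; tuple comparison walks the levels in order."""
    kx, ky = sort_key(x), sort_key(y)
    return (kx > ky) - (kx < ky)


words = ["ab", "a-b", "Ab", "\u00e1b", "aa", "b"]
ordered = sorted(words, key=sort_key)
print(ordered)
```

Because the key is a tuple of per-level tuples, Python compares all first-level weights before it ever looks at the second level, which is the pass structure described above.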

This algorithm (with a pile of additional technical details) gave Report No. 10 its name: "Unicode Collation Algorithm" (UCA).

This is where the sorting behaviour in our example starts to make a little more sense. It would be good to compare it with the Unicode standard.

To test UCA implementations there is a special test that uses a weights file implementing DUCET. You can find all sorts of amusing things in the weights file. For example, it contains an ordering for mahjong tiles and European dominoes, as well as an ordering of suits in a deck of cards (character 1F000 and onward). The card suits are placed according to the rules of bridge, PCBT in the Russian initials (spades, hearts, diamonds, clubs), and the cards within a suit go in the order A, 2, 3 ... K.

Checking by hand that lines are sorted correctly according to DUCET would be rather tedious, but luckily for us there is a reference implementation of a library for working with Unicode: "International Components for Unicode" (ICU).

The website of this library, developed at IBM, has demo pages, including a page for the string comparison algorithm. We enter our test lines with the default settings and, lo and behold, we get perfect Russian sorting.

Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;маляр

By the way, on the ICU website you can find a clarification of how the comparison algorithm handles punctuation marks. In the examples in the Collation FAQ, the apostrophe and the hyphen are ignored.

Unicode has helped us, but the reasons for the strange behaviour of sort in Linux will have to be sought elsewhere.

Sorting in glibc

A quick look through the source code of the sort utility from GNU Core Utils showed that in the utility itself, localization boils down to printing the current value of the LC_COLLATE variable when running in debug mode:

$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

Strings are compared with the standard strcoll function, which means that everything interesting is inside the glibc library.
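The same strcoll is reachable from Python through the standard locale module, which makes it easy to experiment without sort. A hedged sketch: only the "C" locale is used here because it is always available, and the helper name sort_with_strcoll is mine.

```python
import functools
import locale

lines = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]


def sort_with_strcoll(items, collate_locale):
    """Sort with strcoll under the given LC_COLLATE, then restore the old setting."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current value
    try:
        locale.setlocale(locale.LC_COLLATE, collate_locale)
        return sorted(items, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


c_sorted = sort_with_strcoll(lines, "C")
for line in c_sorted:
    print(line)
```

Passing another locale name, for example "ru_RU.UTF-8", goes through the same code path, but locale.setlocale raises locale.Error when that locale is not installed, so the result depends on the system.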

On the glibc project wiki a single paragraph is devoted to string comparison. From this paragraph one can gather that sorting in glibc is based on the already familiar UCA (Unicode collation algorithm) and/or on the closely related standard ISO 14651 (International string ordering and comparison). Regarding the latter, it should be noted that on standards.iso.org ISO 14651 is declared to be publicly available, but the corresponding link leads to a page that does not exist. Google returns several pages of links to official sites offering an electronic copy for a hundred euros, but on the third or fourth page of the results there are also direct links to the PDF. On the whole, the standard hardly differs from UCA, but it is duller to read because it has no vivid examples of national sorting peculiarities.

The most interesting thing on the wiki was a link to the bug tracker with a discussion of how string comparison is implemented in glibc. From the discussion one can learn that glibc compares strings using ISO's own table, the Common Template Table (CTT), whose address can be found in annex A of the ISO 14651 standard. Between 2000 and 2015 this table had no maintainer in glibc and differed quite a lot (at least outwardly) from the current version of the standard. From 2015 to 2018 the adaptation to the new version of the table took place, and now you have a chance to meet in real life both the new version of the table (CentOS 8) and the old one (CentOS 7).

Now that we have all the information about the algorithm and the auxiliary tables, we can return to the original problem and figure out how to sort strings correctly in the Russian locale.

ISO 14651 / 14652

The source of the CTT table we are interested in is, in most Linux distributions, located in the directory /usr/share/i18n/locales/. The table itself is in the file iso14651_t1_common. This file is then included by the directive copy iso14651_t1_common into the file iso14651_t1, which in turn is included into the national files, including en_US and ru_RU. In most Linux distributions all the source files are part of the base installation, but if they are missing you will have to install an additional package from the distribution.

The structure of the file iso14651_t1 may seem terribly verbose, with non-obvious rules for constructing names, but once you look into it, everything is quite simple. The structure is described in the ISO 14652 standard, a copy of which can be downloaded from open-std.org. Another description of the file format can be read in the POSIX specification from OpenGroup. As an alternative to reading the standard, you can study the source code of the collate_read function in glibc/locale/programs/ld-collate.c.

The structure of the file looks like this:

By default, the \ character is used as the escape character, and the rest of a line after a # character is a comment. Both characters can be redefined, which is exactly what is done in the new version of the table:

escape_char /
comment_char %

The file will contain symbols of the form <Uxxxx> or <Uxxxxxxxx> (where x is a hexadecimal digit). This is the hexadecimal representation of a Unicode code point in the UCS-4 (UTF-32) encoding. All other elements in angle brackets (including <2> and the like) are treated as plain string constants that have little meaning outside their context.
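To read such files by eye it helps to turn code-point symbols like <U0413> back into characters. Below is a small sketch; the function name decode_symbols and the regular expression are my own, and the pattern accepts any number of hexadecimal digits because symbols such as <U1D7D1> also occur in the table.

```python
import re

# Matches code-point symbols such as <U0413> or <U1D7D1>.
CODE_POINT = re.compile(r"<U([0-9A-Fa-f]+)>")


def decode_symbols(text):
    """Replace every code-point symbol with the character it denotes."""
    return CODE_POINT.sub(lambda m: chr(int(m.group(1), 16)), text)


# CYRILLIC CAPITAL LETTER GHE followed by COMBINING ACUTE ACCENT
pair = decode_symbols("<U0413><U0301>")
# MATHEMATICAL BOLD DIGIT THREE
bold_three = decode_symbols("<U1D7D1>")
# Other names in angle brackets are left untouched.
untouched = decode_symbols("<BASE>")

print(pair, bold_three, untouched)
```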

The line LC_COLLATE tells us that what follows is the data describing string comparison.

First, names are declared for the weights in the comparison table and for combinations of characters. Strictly speaking, the two kinds of names belong to two different entities, but in the actual file they are mixed together. The names of weights are declared with the keyword collating-symbol (the "comparison symbol"), because during comparison Unicode characters that have the same weights are treated as equivalent characters.

The total length of this section in the current revision of the file is about 900 lines. I pulled examples from several places to show the arbitrariness of the names and the several kinds of syntax.

LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

  • collating-symbol <OSMANYA> registers the string OSMANYA in the table of weight names
  • collating-symbol <S1D000>..<S1D35F> registers a sequence of names consisting of the prefix S and a hexadecimal numeric suffix from 1D000 to 1D35F
  • the FFFF in collating-symbol <SFFFF> looks like a large unsigned integer in hexadecimal, but it is just a name and could have looked like anything
  • a name of the form <Uxxxx> denotes a code point in the UCS-4 encoding
  • collating-element <U0413_0301> from "<U0413><U0301>" registers a new name for a pair of Unicode code points.

Once the names of the weights are defined, the actual weights are declared. Since only the greater/less relation matters in a comparison, the weights are determined simply by the order in which the names are listed. The "lighter" weights are listed first, then the "heavier" ones. Let me remind you that every Unicode character is assigned four different weights. Here they are combined into a single ordered sequence. In theory, any symbolic name could be used at any of the four levels, but the comments indicate that the developers mentally separate the names by level.

% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE

Finally, the weight table itself.

The weights section is enclosed between lines containing the keywords order_start and order_end. Additional parameters of order_start determine the direction in which strings are scanned at each comparison level. The default is forward. The body of the section consists of lines that contain a character code and its four weights. The character code can be given as the character itself, a code point, or a symbolic name defined earlier. The weights can likewise be given as symbolic names, code points, or the characters themselves. If code points or characters are used, their weight is the numeric value of the code point (the position in the Unicode table). Characters that are not listed explicitly are (as I understand it) considered to be assigned a primary weight that matches their position in the Unicode table. The special weight value IGNORE means that the character is ignored at the corresponding comparison level.

To demonstrate the structure of the weights I chose three fairly obvious fragments:

  • characters that are ignored completely
  • characters that are equivalent to the digit three at the first two levels
  • the beginning of the Cyrillic alphabet, which has no diacritical marks and is therefore sorted mainly by the first and third levels.

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end
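To see at which level two entries of the table diverge, a line can be split mechanically. This is a sketch that assumes the "%" comment character from the new table and handles only the simple one-symbol, four-weight lines shown in the excerpt; parse_weight_line is a name I made up.

```python
def parse_weight_line(line, comment_char="%"):
    """Split one order_start entry into (symbol, [four weights])."""
    body = line.split(comment_char, 1)[0].strip()  # drop the trailing comment
    symbol, weights = body.split(None, 1)          # symbol, then the weight list
    return symbol, [w.strip() for w in weights.split(";")]


small_a = parse_weight_line(
    "<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A")
capital_a = parse_weight_line(
    "<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A")
null = parse_weight_line(
    "<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)")

# First level (1-based) at which the small and the capital letter differ.
first_difference = next(
    level
    for level, (a, b) in enumerate(zip(small_a[1], capital_a[1]), start=1)
    if a != b
)
print(small_a, capital_a, first_difference)
```

For the two Cyrillic letters the first two weights coincide and the third one is where <MIN> and <CAP> part ways.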

Now we can return to sorting the examples from the beginning of the article. The catch lies in this part of the weight table:

<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

You can see that in this table the punctuation marks from the ASCII table (including the space) are almost always ignored when strings are compared. The only exception is lines that match in everything except punctuation marks found at the same positions. To the comparison algorithm, the lines from my example (after sorting) look like this:

АбакановМихаилмаляр
ЁлкинаЭллакрановщица
ИвановаАлламаляр
ИвановАндрейслесарь

Considering that in the weight table capital letters in Russian come after lowercase ones (at the third level <CAP> is heavier than <MIN>), the sorting looks perfectly correct.
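The whole effect can be reproduced with a toy model of the table in Python. Loud assumptions: the primary weights are simply positions in the Russian alphabet, every non-letter is ignored at the first three levels, and the fourth weight is the code point. It mimics the excerpts above, not the real glibc data.

```python
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
IGNORE = 0


def weights(ch):
    """Four toy weights for one character, modelled on the table excerpts."""
    low = ch.lower()
    if low in ALPHABET:
        primary = ALPHABET.index(low) + 1
        tertiary = 1 if ch == low else 2  # <MIN> is lighter than <CAP>
        return (primary, 1, tertiary, ord(ch))
    # Space, ';' and other punctuation: ignored at the first three levels.
    return (IGNORE, IGNORE, IGNORE, ord(ch))


def sort_key(line):
    """One tuple of non-ignored weights per comparison level."""
    per_char = [weights(ch) for ch in line]
    return tuple(
        tuple(w[level] for w in per_char if w[level] != IGNORE)
        for level in range(4)
    )


def as_seen_at_level_one(line):
    """The characters that actually take part in the first-level comparison."""
    return "".join(ch for ch in line if weights(ch)[0] != IGNORE)


buhg = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]
en_us_like = sorted(buhg, key=sort_key)
for line in en_us_like:
    print(as_seen_at_level_one(line))
```

With the space ignored, "Иванова Алла" and "Иванов Андрей" first differ at the eighth letter, "а" against "н", which is why the female surname ends up first.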

When the variable LC_COLLATE=C is set, a special table is loaded that specifies byte-by-byte comparison:

static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};

Since in Unicode the code point of Ё comes before that of А, the strings are sorted accordingly.
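A quick check in Python, as a sketch: sorting by code points and sorting by raw UTF-8 bytes give the same order, and in both cases Ё (U+0401) lands before А (U+0410). The list below repeats the contents of buhg.txt.

```python
buhg = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]

# Python compares str objects by code point and bytes objects byte by byte.
by_code_point = sorted(buhg)
by_utf8_bytes = sorted(buhg, key=lambda s: s.encode("utf-8"))

yo, a = "\u0401", "\u0410"  # CYRILLIC CAPITAL LETTER IO and CAPITAL LETTER A
print(hex(ord(yo)), hex(ord(a)))
for line in by_utf8_bytes:
    print(line)
```

The space (0x20) is also lighter than any letter here, which is why "Иванов Андрей" comes before "Иванова Алла" in this order.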

Text and binary tables

Obviously, string comparison is an extremely common operation, and parsing the CTT table is a rather expensive procedure. To speed up access to the table, it is compiled into binary form with the localedef command.

The localedef command takes as parameters a file with the table of national features (option -i), in which all characters are represented by Unicode code points, and a file that maps Unicode code points to the characters of a particular encoding (option -f). As a result, binary files are created for the locale whose name is given in the last parameter.

glibc supports two binary file formats: "traditional" and "modern".

The traditional format means that the name of the locale is the name of a subdirectory in /usr/lib/locale/. This subdirectory holds the binary files LC_COLLATE, LC_CTYPE, LC_TIME and so on. The file LC_IDENTIFICATION contains the formal name of the locale (which may differ from the directory name) and comments.

The modern format stores all locales in a single archive, /usr/lib/locale/locale-archive, which is mapped into the virtual memory of every process that uses glibc. The locale name in the modern format is canonicalized: in the encoding name only digits and letters, reduced to lower case, are kept. So ru_RU.KOI8-R is saved as ru_RU.koi8r.
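The naming rule just described can be written down in a few lines. This is only a sketch of the rule as stated in the text, not glibc's own code: the function name archive_name is hypothetical, and locale modifiers such as "@euro" are deliberately not handled.

```python
def archive_name(locale_name):
    """Canonicalize the encoding part: keep letters and digits, lower-cased."""
    if "." not in locale_name:
        return locale_name  # no encoding part, nothing to normalize
    base, codeset = locale_name.split(".", 1)
    normalized = "".join(ch.lower() for ch in codeset if ch.isalnum())
    return f"{base}.{normalized}"


for name in ("ru_RU.KOI8-R", "en_US.UTF-8", "en_US"):
    print(name, "->", archive_name(name))
```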

Input files are searched for in the current directory, and also in the directories /usr/share/i18n/locales/ and /usr/share/i18n/charmaps/ for CTT files and encoding files respectively.

For example, the command

localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

will compile the file /usr/share/i18n/locales/ru_RU using the encoding file /usr/share/i18n/charmaps/MAC-CYRILLIC.gz and save the result in /usr/lib/locale/locale-archive under the name ru_RU.maccyrillic

If you set the variable LANG=en_US.UTF-8, glibc will look for the locale binaries in the following sequence of files and directories:

/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/

If a locale exists in both the traditional and the modern format, the modern one takes priority.

You can view the list of compiled locales with the command locale -a.

Preparing your own comparison table

Now, armed with this knowledge, you can create your own ideal string comparison table. This table should compare Russian letters correctly, including the letter Ё, and at the same time take punctuation marks into account according to the ASCII table.

Preparing your own sorting table consists of two steps: editing the weight table and compiling it into binary form with the localedef command.

So that the comparison table can be adjusted with minimal editing effort, the ISO 14652 format provides sections for adjusting the weights of an existing table. A section begins with the keyword reorder-after followed by the position after which the replacement is made. The section ends with the line reorder-end. If several parts of the table need to be corrected, a separate section is created for each part.

I copied the new versions of the files iso14651_t1_common and ru_RU from the glibc repository to my home directory, ~/.local/share/i18n/locales/, and slightly edited the LC_COLLATE section in ru_RU. The new versions of the files are fully compatible with my version of glibc. If you want to use the old versions of the files, you will have to change the symbolic names and the place in the table where the replacement starts.

LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE
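Typing such lines by hand is tedious, so they can be generated. The sketch below rests on an assumption the listing does not confirm: that the elided lines cover the space plus every ASCII punctuation character, each in the same pattern as the lines that are shown. The helper name reorder_line is mine.

```python
import string
import unicodedata


def reorder_line(ch):
    """Format one weight line in the same pattern as the listing above."""
    code = f"{ord(ch):04X}"
    return f"<U{code}> <S{code}>;<BASE>;<MIN>;<U{code}> % {unicodedata.name(ch)}"


# Assumption: the space plus all ASCII punctuation, in code-point order.
chars = " " + string.punctuation
generated = [reorder_line(ch) for ch in chars]

print(generated[0])
print(generated[-1])
```

The output would still have to be pasted between reorder-after and reorder-end and checked against the symbolic names that exist in your version of the table.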

Strictly speaking, the fields in the LC_IDENTIFICATION section should be changed so that they point to the locale ru_MY, but in my example this was not required, because I excluded the locale-archive from the locale search.

So that localedef works with the files in my folder, an additional directory for finding input files can be added through the I18NPATH variable, and the directory for saving the binary files can be given as a path containing slashes:

$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX assumes that in LANG you can write absolute paths to directories with locale files, starting with a forward slash, but glibc on Linux resolves all paths relative to a base directory, which can be overridden through the LOCPATH variable. After setting LOCPATH=~/.local/lib/locale/, all localization-related files will be searched for only in my folder. The locale archive is ignored when LOCPATH is set.

Here is the decisive test:

$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;маляр

Hooray! We did it!

Fixing the mistakes

I have already answered the questions about sorting strings that were posed at the beginning, but a couple of questions remain about errors, both visible and invisible.

Let's go back to the original problem.

Both the sort program and the join program use the same string comparison functions from glibc. How did it happen that join reported a sorting error on lines sorted by the sort command in the en_US.UTF-8 locale? The answer is simple: sort compares the whole line, while join compares only the key, which by default is the beginning of the line up to the first whitespace character. In my example this resulted in an error message because the order of the first words of the lines did not match the order of the whole lines.

The "C" locale guarantees that in sorted lines the leading substrings up to the first space are sorted as well, but this only masks the error. It is possible to pick data (people with the same surname but different first names) that would give an incorrect merge result without any error message. If we want join to merge the lines of the files by full name, the right way is to specify the field separator explicitly and to sort by the key field rather than by the whole line. In that case the merge will proceed correctly and there will be no errors in any locale:

$> sort -t ';' -k 1,1 buhg.txt > buhg.srt
$> sort -t ';' -k 1,1 mail.txt > mail.srt
$> join -t ';' buhg.srt mail.srt > result
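The difference between the two keys can be modelled in Python. This is a simplified sketch: only the first comparison level is modelled (letters by their position in the Russian alphabet, everything else ignored), and the helper names are mine, so it illustrates the idea and is not a replica of what sort and join do internally.

```python
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"


def primary_key(s):
    """First-level key only: alphabet positions; case and punctuation ignored."""
    return [ALPHABET.index(ch) for ch in s.lower() if ch in ALPHABET]


def first_unsorted(keys):
    """1-based number of the first key smaller than its predecessor, or None."""
    for i in range(1, len(keys)):
        if primary_key(keys[i]) < primary_key(keys[i - 1]):
            return i + 1
    return None


buhg = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]

# sort compares whole lines; join's default key is the text before the first blank.
whole_line_sorted = sorted(buhg, key=primary_key)
join_keys = [line.split()[0] for line in whole_line_sorted]
problem_line = first_unsorted(join_keys)

# With ';' as the separator in both tools, they compare the same field.
field_sorted = sorted(buhg, key=lambda line: primary_key(line.split(";")[0]))
field_keys = [line.split(";")[0] for line in field_sorted]

print(problem_line, first_unsorted(field_keys))
```

In this model the whitespace-delimited keys go out of order on the fourth line, the same line number as in the join error message, while the keys delimited by ";" stay in order.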

The example that ran successfully in the CP1251 encoding contains yet another error. The thing is that in all the Linux distributions known to me, the packages do not include a compiled ru_RU.CP1251 locale. If the compiled locale is not found, sort silently falls back to byte-by-byte comparison, which is what we observed.

By the way, there is one more small glitch related to compiled locales being unavailable. The command LOCPATH=/tmp locale -a will list all the locales in locale-archive, but with the LOCPATH variable set, these locales will be unavailable to all programs (including locale itself).

$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison

Conclusion

If you are a programmer who is used to thinking of strings as sets of bytes, your choice is LC_COLLATE=C.

If you are a linguist or a dictionary compiler, you had better compile your own locale.

If you are an ordinary user, you just need to get used to the fact that the command ls -a lists files starting with a dot mixed in with files starting with a letter, while Midnight Commander, which uses its own internal functions for sorting names, puts files starting with a dot at the top of the list.

References

Report No. 10: Unicode collation algorithm

Character weights on unicode.org

ICU: an implementation of a library for working with Unicode, from IBM.

Sorting test using ICU

Character weights in ISO 14651

Description of the weights file format, ISO 14652

Discussion of string comparison in glibc

Source: www.habr.com
