How Linux sorts strings

Introduction

It all started with a short script that was supposed to join the e-mail addresses of employees, taken from the mailing-list user list, with the employees' job titles, taken from the HR department database. Both lists were exported to Unicode UTF-8 text files and saved with Unix line endings.

Contents of mail.txt

Иванов Андрей;[email protected]

Contents of buhg.txt

Иванова Алла;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Абаканов Михаил;маляр

To join them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:

$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted: Иванов Андрей;слесарь

Eyeballing the sort result showed that the sorting is correct in general, but where a masculine and a feminine surname coincide, the feminine form comes before the masculine one:

$> sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

It looks like either a sorting glitch in Unicode or a manifestation of feminism in the sorting algorithm. The first, of course, sounds more plausible.

Let's put join aside for now and focus on sort. Let's try to solve the problem by scientific poking. First, let's change the locale from en_US to ru_RU. For sorting it would be enough to set the LC_COLLATE environment variable, but we won't waste time on trifles:

$> LANG=ru_RU.UTF-8 sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

Nothing has changed.

Let's try re-encoding the files into a single-byte encoding:

$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF8

Nothing has changed again.

Nothing for it, we'll have to look for a solution on the Internet. There is nothing directly about Russian surnames, but there are questions about other sorting oddities. For example, here is one problem: Unix sort treats the '-' (dash) character as invisible. In short, the strings "a-b", "aa", "ac" are sorted as "aa", "a-b", "ac".
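The reported ordering can be imitated without any locale support. The sketch below is only an emulation under my own assumptions: the helper name sort_ignoring_dash is hypothetical, the key is simply the line with every '-' deleted, and the keys are compared byte by byte. It is not how sort itself is implemented.

```shell
# Hypothetical helper: order lines as if '-' were invisible.
# Assumes the input lines contain no tab characters.
# Steps: prepend a key with every '-' removed, sort on the key only
# (byte by byte), then drop the key and keep the original line.
sort_ignoring_dash() {
    awk '{ key = $0; gsub(/-/, "", key); print key "\t" $0 }' |
        LC_ALL=C sort -t "$(printf '\t')" -k 1,1 |
        cut -f 2-
}

printf 'a-b\naa\nac\n' | sort_ignoring_dash
```

This prints aa, a-b, ac, the order quoted in the question, whereas a plain byte-wise sort of the same three strings puts a-b first.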

The answer is the same everywhere: use the programmer's locale "C" and you will be happy. Let's try:

$> LANG=C sort buhg.txt
Ёлкина Элла;крановщица
Абаканов Михаил;маляр
Иванов Андрей;слесарь
Иванова Алла;маляр

Something has changed. The Ivanovs lined up in the right order, although Yolkina slipped off somewhere. Let's go back to the original problem:

$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result

It worked without errors, as the Internet promised. And that despite Yolkina on the first line.

The problem seems solved, but just in case, let's try one more Russian encoding, Windows CP1251:

$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF8

The sort result, oddly enough, coincides with the "C" locale, and the whole example accordingly runs without errors. Some kind of mysticism.

I don't like mysticism in programming because it usually masks errors. We'll have to take a serious look at how sort works and what LC_COLLATE affects.

At the end I'll try to answer these questions:

  • why were the feminine surnames sorted incorrectly?
  • why did LANG=ru_RU.CP1251 turn out to be equivalent to LANG=C?
  • why do sort and join have different ideas about the order of sorted strings?
  • why are there errors in all my examples?
  • and, finally, how to sort strings the way you like

Sorting in Unicode

The first stop is Technical Report No. 10, titled Unicode Collation Algorithm, on the unicode.org website. The report contains a lot of technical detail, so let me give a short summary of the key ideas.

Collation, the "comparison" of strings, is the basis of any sorting algorithm. The algorithms themselves may differ ("bubble", "merge", "quick"), but all of them use string comparison to determine the order in which strings appear.

Sorting strings in a natural language is a rather complex problem. Even in the simplest single-byte encodings, the order of letters in an alphabet that differs at all from the English Latin alphabet no longer coincides with the order of the numeric values these letters are encoded with. Thus in the German alphabet the letter Ö stands between O and P, while in the CP850 encoding it falls between ÿ and Ü.

You can try to abstract from a particular encoding and consider "ideal" letters arranged in some order, as is done in Unicode. The encodings UTF8, UTF16, or the single-byte KOI8-R (if a limited subset of Unicode is enough) give different numeric representations of the letters but refer to the same elements of the base table.

It turns out that even if we build the character table from scratch, we cannot assign it a universal character order. Different national alphabets that use the same letters may order those letters differently. For example, in French Æ is treated as a ligature and sorted as the string AE. In Norwegian Æ is a separate letter that comes after Z. By the way, besides ligatures like Æ there are letters written with several characters. Thus the Czech alphabet has the letter Ch, which stands between H and I.

Besides differences in alphabets, there are other national traditions that affect sorting. In particular, the question arises: in what order should words consisting of uppercase and lowercase letters appear in a dictionary? Sorting can also be affected by the use of punctuation. In Spanish, an inverted question mark is placed at the beginning of an interrogative sentence (for example, ¿Te gusta la música?). In this case it is obvious that interrogative sentences should not be grouped into a separate cluster outside the alphabet, but how should strings with other punctuation marks be sorted?

I won't dwell on sorting strings in languages that differ greatly from European ones. Note that in languages written right-to-left or top-to-bottom, the characters in strings are most likely stored in reading order, and even non-alphabetic writing systems have their own ways of ordering strings character by character. For example, hieroglyphs can be ordered by shape (Chinese character radicals) or by pronunciation. To be honest, I have no idea how emoji are supposed to be ordered, but you can come up with something for them too.

Based on the features listed above, the basic requirements for string comparison based on Unicode tables were formulated:

  • string comparison does not depend on the position of characters in the code table;
  • sequences of characters that form a single character are reduced to canonical form (A + ring above is the same as Å);
  • when comparing strings, a character is considered in the context of the string and, if necessary, combined with its neighbors into one collation unit (Ch in Czech) or split into several (Æ in French);
  • all national peculiarities (alphabet, upper/lower case, punctuation, order of writing systems) must be configurable, down to manual assignment of order (emoji);
  • comparison matters not only for sorting but in many other places, for example for specifying ranges of strings (the {A..z} substitution in bash);
  • comparison must be reasonably fast.

In addition, the authors of the report listed properties of comparison that algorithm developers should not rely on:

  • the comparison algorithm should not require a separate character set for each language (Russian and Ukrainian share most Cyrillic characters);
  • comparison should not rely on the order of characters in the Unicode tables;
  • the weight of a string should not be an attribute of the string, since the same string may have different weights in different cultural contexts;
  • string weights may change when strings are concatenated or split (x < y does not imply xz < yz);
  • different strings with equal weights are considered equal from the sorting algorithm's point of view. Introducing an additional ordering for such strings is possible but may degrade performance;
  • during repeated sorting, strings with equal weights may be swapped. Stability is a property of a particular sorting algorithm, not of the string comparison algorithm (see the previous item);
  • sorting rules may change over time as cultural traditions are refined or change.

It is also stipulated that the comparison algorithm knows nothing about the semantics of the strings being processed. Thus, strings consisting only of digits should not be compared as numbers, and in lists of English titles the article is not set aside (Beatles, The).

To satisfy all of these requirements, a multi-level (in practice four-level) table-driven comparison algorithm is proposed.

First, the characters of a string are reduced to canonical form and grouped into collation elements. Each collation element is assigned several weights corresponding to several comparison levels. The weights of collation elements are elements of ordered sets (in this case, integers) that can be compared as greater or less. The special value IGNORED (0x0) means that the element takes no part in the comparison at the corresponding level. String comparison can be repeated several times, using the weights of the corresponding levels. At each level, the weights of the collation elements of the two strings are compared with each other in sequence.

In different implementations of the algorithm for different national traditions the weight values may differ, but the Unicode standard includes a base weight table, the "Default Unicode Collation Element Table" (DUCET). I would note that setting the LC_COLLATE variable is in fact an instruction on which weight table to select in the string comparison function.

The DUCET weights are arranged as follows:

  • at the first level, all letters are reduced to the same case, diacritics are discarded, and punctuation marks (not all of them) are ignored;
  • at the second level, only diacritics are taken into account;
  • at the third level, only case is taken into account;
  • at the fourth level, only punctuation marks are taken into account.

Comparison proceeds in several passes: first the first-level weights are compared; if they coincide, the comparison is repeated with the second-level weights; then, possibly, with the third and the fourth.

Comparison ends when the strings contain corresponding collation elements with different weights. Strings whose weights are equal at all four levels are considered equal to each other.
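The pass-by-pass idea can be tried out on a toy scale. The sketch below is an illustration under loud assumptions, not UCA or DUCET: it handles ASCII text only, skips the diacritics level, invents its own keys, and uses the hypothetical name toy_collate. A level decides the order only when all earlier levels tie.

```shell
# Toy multi-level comparison for ASCII-only lines (not real UCA or DUCET weights).
# Assumes the input lines contain no tab characters.
# Level 1 key: letters and digits only, case folded.
# Level 3 key: case pattern, 0 for lowercase or digit, 1 for uppercase,
#              so lowercase sorts before uppercase on a level 1 tie.
# Last key:    the original line, standing in for the punctuation level.
toy_collate() {
    LC_ALL=C awk '{
        k1 = tolower($0); gsub(/[^a-z0-9]/, "", k1)
        k3 = $0
        gsub(/[^A-Za-z0-9]/, "", k3)
        gsub(/[a-z0-9]/, "0", k3)
        gsub(/[A-Z]/, "1", k3)
        print k1 "\t" k3 "\t" $0
    }' |
        LC_ALL=C sort -t "$(printf '\t')" -k 1,1 -k 2,2 -k 3 |
        cut -f 3-
}

printf '%s\n' 'Ivanov Andrey' 'ivanov andrey' 'Ivanova Alla' | toy_collate
```

With the space ignored at the first level, "Ivanova Alla" comes out first, followed by "ivanov andrey" and then "Ivanov Andrey", which differ only at the case level.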

This algorithm (with a heap of additional technical detail) gave Report No. 10 its name: the "Unicode Collation Algorithm" (UCA).

This is where the sorting behavior in our example becomes a little clearer. It would be nice to compare it with the Unicode standard.

To test UCA implementations there is a special test that uses a weight file implementing DUCET. All sorts of amusing things can be found in the weight file. For example, there is an ordering of mahjong tiles and European dominoes, as well as an ordering of the suits in a deck of cards (code point 1F000 and onward). The suits are arranged according to the rules of bridge, SHDC (spades, hearts, diamonds, clubs), and the cards within a suit run A, 2, 3 ... K.

Manually checking that strings are sorted correctly according to DUCET would be rather tedious, but, luckily for us, there is a reference implementation of a library for working with Unicode: "International Components for Unicode" (ICU).

On the website of this library, developed at IBM, there are demo pages, including a page for the string comparison algorithm. We enter our test lines with the default settings and, lo and behold, get perfect Russian sorting.

Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;маляр

By the way, on the ICU website you can find a clarification of how the comparison algorithm handles punctuation. In the Collation FAQ examples, the apostrophe and the hyphen are ignored.

Unicode helped, but for the reasons behind the strange behavior of sort on Linux we'll have to look elsewhere.

Sorting in glibc

A quick look at the source code of the sort utility from GNU Core Utils showed that in the utility itself localization boils down to printing the current value of the LC_COLLATE variable when running in debug mode:

$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

String comparison is done with the standard function strcoll, which means that everything interesting is in the glibc library.

On the glibc project wiki, one paragraph is devoted to string comparison. From this paragraph one can gather that sorting in glibc is based on the already familiar UCA (Unicode Collation Algorithm) and/or on the closely related standard ISO 14651 (International string ordering and comparison). Regarding the latter standard, it should be noted that on the site standards.iso.org ISO 14651 is officially declared publicly available, but the corresponding link leads to a non-existent page. Google returns several pages of links to official sites offering an electronic copy of the standard for a hundred euros, but on the third or fourth page of the search results there are also direct links to a PDF. In general, the standard hardly differs from UCA, but it is more boring to read because it contains no vivid examples of national string-sorting peculiarities.

The most interesting information on the wiki was a link to the bug tracker with a discussion of the string comparison implementation in glibc. From the discussion one can learn that glibc compares strings using ISO's own table, the Common Template Table (CTT), which can be found in Annex A of the ISO 14651 standard. Between 2000 and 2015 this table in glibc had no maintainer and differed quite a lot (at least outwardly) from the current version of the standard. From 2015 to 2018 the table was adapted to the new version, and at present you have a chance to meet in real life both the new version of the table (CentOS 8) and the old one (CentOS 7).

Now that we have all the information about the algorithm and the auxiliary tables, we can return to the original problem and understand how to sort strings correctly in the Russian locale.

ISO 14651 / 14652

In most Linux distributions the source code of the CTT table we are interested in is located in the directory /usr/share/i18n/locales/. The table itself is in the file iso14651_t1_common. This file is then included by the directive copy iso14651_t1_common into the file iso14651_t1, which in turn is included into the national files, including en_US and ru_RU. In most Linux distributions all the source files are part of the base installation, but if they are missing, you will have to install an additional package from the distribution.

The structure of the file iso14651_t1 may look terribly verbose, with unobvious naming rules, but once you look into it, everything is quite simple. The structure is described in the standard ISO 14652, a copy of which can be downloaded from the website open-std.org. Another description of the file format can be read in the POSIX specifications from the OpenGroup. As an alternative to reading the standard, you can study the source code of the function collate_read in glibc/locale/programs/ld-collate.c.

The file structure looks like this:

By default, the character \ is used as the escape character, and the rest of a line after the character # is a comment. Both characters can be redefined, which is what is done in the new version of the table:

escape_char /
comment_char %

The file will contain tokens in the format <Uxxxx> or <Uxxxxxxxx> (where x is a hexadecimal digit). This is the hexadecimal representation of Unicode code points in the UCS-4 (UTF-32) encoding. All other elements in angle brackets (including <2> and the like) are treated as simple string constants that mean little outside their context.

The line LC_COLLATE tells us that what follows is data describing string comparison.

First, names are defined for the weights in the comparison table and for combinations of characters. Generally speaking, the two kinds of names belong to two different entities, but in the actual file they are mixed. Weight names are given by the keyword collating-symbol (collation character) because, when comparing, Unicode characters that have the same weights are treated as equivalent characters.

The total length of this section in the current revision of the file is about 900 lines. I pulled examples from several places to show the arbitrariness of the names and several kinds of syntax.

LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

  • collating-symbol <OSMANYA> registers the string OSMANYA in the table of weight names
  • collating-symbol <S1D000>..<S1D35F> registers a sequence of names consisting of the prefix S and a hexadecimal numeric suffix from 1D000 to 1D35F
  • FFFF in collating-symbol <SFFFF> looks like a large unsigned integer in hexadecimal, but it is just a name that could look like anything
  • the name <U0413> means the code point U0413 in the UCS-4 encoding
  • collating-element <U0413_0301> from "<U0413><U0301>" registers a new name for a pair of Unicode code points.

After the weight names are defined, the weights themselves are described. Since only greater-than and less-than relations matter in comparison, the weights are determined simply by the order in which the names are listed. The "lighter" weights are listed first, then the "heavier" ones. Let me remind you that each Unicode character is assigned four different weights. Here they are merged into a single ordered sequence. In theory, any symbolic name can be used at any of the four levels, but the comments indicate that the developers mentally separate the names by level.

% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE

Finally, the actual weight table.

The weights section is enclosed between lines with the keywords order_start and order_end. Additional parameters of order_start determine the direction in which strings are scanned at each comparison level. The default is forward. The body of the section consists of lines containing a character code and its four weights. The character code may be given as the character itself, a code point, or a previously defined symbolic name. Weights may likewise be given as symbolic names, code points, or the characters themselves. If code points or characters are used, their weight equals the numeric value of the code point (the position in the Unicode table). Characters that are not listed explicitly are (as I understand it) considered to be assigned a primary weight equal to their position in the Unicode table. The special weight value IGNORE means that the character is ignored at the corresponding comparison level.

To illustrate the layout of the weights, I picked three fairly obvious fragments:

  • characters that are ignored completely
  • characters equivalent to the digit three at the first two levels
  • the beginning of the Cyrillic alphabet, which has no diacritics and is therefore sorted mainly by the first and third levels.

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end
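To make the layout of such entries easier to read, they can be split into the character symbol and its per-level weights. The helper below is a rough sketch under my own assumptions: the name parse_weights is hypothetical, '%' is taken to be the comment character, and quoted multi-symbol weights, which the real file may contain, are not handled.

```shell
# Hypothetical helper: split weight-table entries into symbol and per-level weights.
# Assumes the comment character is the percent sign, as in the new version of the table.
parse_weights() {
    awk '{
        sub(/ *%.*$/, "")                 # drop the trailing comment
        if ($0 !~ /^<[^>]+> +[^ ]/) next  # keep only "<symbol> weights" lines
        sym = $1
        n = split($2, lv, ";")
        gsub(/[<>]/, "", sym)
        line = sym
        for (i = 1; i <= n; i++) {
            gsub(/[<>]/, "", lv[i])
            line = line " L" i "=" lv[i]
        }
        print line
    }'
}

printf '%s\n' \
    'order_start forward;forward;forward;forward,position' \
    '<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE' \
    '<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A' \
    'order_end' | parse_weights
```

For the two entries in this sample the output is "U0020 L1=IGNORE L2=IGNORE L3=IGNORE L4=U0020" and "U0410 L1=S0430 L2=BASE L3=CAP L4=U0410"; the order_start and order_end lines are skipped.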

Now we can return to sorting the examples from the beginning of the article. The catch lies in this part of the weight table:

<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

It can be seen that in this table the punctuation marks from the ASCII table (including the space) are almost never taken into account when comparing strings. The only exception is strings that match in everything except the punctuation marks found at the same positions. To the comparison algorithm, the lines from my example (after sorting) look like this:

АбакановМихаилмаляр
ЁлкинаЭллакрановщица
ИвановаАлламаляр
ИвановАндрейслесарь

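At the first level, where case is folded and the space and the semicolon are ignored, the keys of the last two lines are ивановаалла... and ивановандрей..., which first differ at the eighth letter, where а precedes н. That in the weight table Russian capital letters come after lowercase ones (at the third level <CAP> is heavier than <MIN>) would matter only if the earlier levels tied. So the sorting looks absolutely correct.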
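The view of the lines shown above can be reproduced by deleting the ignored characters. This is a minimal sketch: the helper name strip_ignored is hypothetical, and it removes only the space and the semicolon, leaving case folding aside.

```shell
# Rough view of what the first comparison levels see:
# delete the characters the table ignores (only space and semicolon here).
strip_ignored() {
    tr -d ' ;'
}

printf '%s\n' 'Иванова Алла;маляр' 'Иванов Андрей;слесарь' | strip_ignored
```

The two lines come out as ИвановаАлламаляр and ИвановАндрейслесарь, matching the listing above.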

If you set the variable LC_COLLATE=C, a special table is loaded that specifies byte-by-byte comparison:

static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};

Since in Unicode the code point of Ё comes before that of А, the strings are sorted accordingly.
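The byte values behind this are easy to inspect. The helper below (utf8_hex is a hypothetical name) prints the UTF-8 bytes of a string; Ё is U+0401, А is U+0410 and И is U+0418, and in UTF-8 their byte order follows the code point order.

```shell
# Print the UTF-8 bytes of a string as hex, to see what byte-wise comparison looks at.
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

utf8_hex 'Ё'   # d081
utf8_hex 'А'   # d090
utf8_hex 'И'   # d098
```

Because d081 is smaller than d090, a byte-wise sort places Ёлкина ahead of Абаканов.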

Text and binary tables

Obviously, string comparison is an extremely frequent operation, and parsing the CTT table is a rather expensive procedure. To optimize access to the table, it is compiled into binary form with the localedef command.

The localedef command takes as parameters a file with the table of national peculiarities (the -i option), in which all characters are represented by Unicode code points, and a file that maps Unicode code points to the characters of a particular encoding (the -f option). As a result, binary files are created for the locale whose name is given in the last parameter.

glibc supports two binary file formats: "traditional" and "modern".

The traditional format assumes that the locale name is the name of a subdirectory in /usr/lib/locale/. This subdirectory stores the binary files LC_COLLATE, LC_CTYPE, LC_TIME and so on. The file LC_IDENTIFICATION contains the formal name of the locale (which may differ from the directory name) and comments.

The modern format stores all locales in a single archive, /usr/lib/locale/locale-archive, which is mapped into the virtual memory of all processes that use glibc. In the modern format the locale name undergoes some canonicalization: in the encoding name only digits and letters, converted to lowercase, are kept. Thus ru_RU.KOI8-R will be saved as ru_RU.koi8r.
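The naming rule just described can be approximated in a few lines. This is a sketch of the rule as stated here, not the glibc code: the function name normalize_locale_name is hypothetical, and names with an @modifier part are assumed not to occur.

```shell
# Approximation of the archive-name rule described above (not the glibc code itself).
# Assumes the name has no @modifier part.
normalize_locale_name() {
    case "$1" in
        *.*)
            base=${1%%.*}       # language and territory, e.g. ru_RU
            codeset=${1#*.}     # encoding name, e.g. KOI8-R
            codeset=$(printf '%s' "$codeset" |
                LC_ALL=C tr -cd '[:alnum:]' |
                LC_ALL=C tr '[:upper:]' '[:lower:]')
            printf '%s.%s\n' "$base" "$codeset"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}

normalize_locale_name ru_RU.KOI8-R        # ru_RU.koi8r
normalize_locale_name ru_RU.MAC-CYRILLIC  # ru_RU.maccyrillic
```

The same rule explains why a locale requested as en_US.UTF-8 shows up in listings as en_US.utf8.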

Input files are searched for in the current directory, and also in the directories /usr/share/i18n/locales/ and /usr/share/i18n/charmaps/ for CTT files and encoding files, respectively.

For example, the command

localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

will compile the file /usr/share/i18n/locales/ru_RU using the encoding file /usr/share/i18n/charmaps/MAC-CYRILLIC.gz and save the result in /usr/lib/locale/locale-archive under the name ru_RU.maccyrillic.

If you set the variable LANG=en_US.UTF-8, glibc will look for the locale binaries in the following sequence of files and directories:

/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/

If a locale occurs in both the traditional and the modern format, the modern one takes precedence.

You can view the list of compiled locales with the command locale -a.

Preparing your own comparison table

Now, armed with this knowledge, you can create your own ideal string comparison table. This table should compare Russian letters correctly, including the letter Ё, and at the same time take punctuation into account in accordance with the ASCII table.

The process of preparing your own sorting table consists of two stages: editing the weight table and compiling it into binary form with the localedef command.

So that the comparison table can be adjusted with minimal editing effort, the ISO 14652 format provides sections for adjusting the weights of an existing table. A section begins with the keyword reorder-after and an indication of the position after which the replacement is made. The section ends with the line reorder-end. If several parts of the table need to be corrected, a section is created for each such part.

I copied new versions of the files iso14651_t1_common and ru_RU from the glibc repository into my home directory ~/.local/share/i18n/locales/ and slightly edited the LC_COLLATE section in ru_RU. The new versions of the files are fully compatible with my version of glibc. If you want to use older versions of the files, you will have to change the symbolic names and the place in the table where the replacement begins.

LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE

Strictly speaking, the fields in LC_IDENTIFICATION should be changed so that they point to the locale ru_MY, but in my example this was not required, since I excluded the locale-archive archive from the locale search.

For localedef to work with the files in my folder, the I18NPATH variable can be used to add an extra directory in which to search for input files, and the directory for saving the binary files can be given as a path containing slashes:

$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX assumes that LANG may contain absolute paths to directories with locale files, starting with a forward slash, but glibc on Linux computes all paths from a base directory, which can be overridden with the LOCPATH variable. After setting LOCPATH=~/.local/lib/locale/, all localization-related files will be looked up only in my folder. The locale archive is ignored when LOCPATH is set.

Here is the decisive test:

$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;маляр

Hooray! We did it!

Fixing the mistakes

I have already answered the string-sorting questions posed at the beginning, but a couple of questions remain about mistakes, visible and invisible.

Let's go back to the original problem.

Both the sort program and the join program use the same string comparison functions from glibc. How did it happen that join reported a sorting error on lines sorted by the sort command in the en_US.UTF-8 locale? The answer is simple: sort compares the whole line, while join compares only the key, which by default is the beginning of the line up to the first whitespace character. In my example this produced an error message because the ordering of the first words in the lines did not match the ordering of the whole lines.
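The mismatch between whole-line order and key order can be checked directly. The sketch below rests on assumptions of mine: the helper keys_sorted is hypothetical, the names are Latin stand-ins for the Cyrillic data, and the byte-wise "C" order is used for the check so that it does not depend on which locales are installed.

```shell
# Hypothetical check: are the join keys (text up to the first space) in order?
# The byte-wise "C" order is used so the check does not depend on installed locales.
keys_sorted() {
    cut -d ' ' -f 1 | LC_ALL=C sort -c 2>/dev/null
}

# A whole-line order that ignores the space puts "Ivanova Alla" first,
# but as bare keys Ivanov must precede Ivanova.
if printf '%s\n' 'Abakanov Mikhail' 'Ivanova Alla' 'Ivanov Andrey' | keys_sorted; then
    echo 'keys are sorted'
else
    echo 'keys are NOT sorted'
fi
```

The check reports that the keys are not sorted, which is the situation join complained about.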

The "C" locale guarantees that in sorted strings the leading substrings up to the first space are sorted as well, but that only masks the mistake. It is possible to pick data (people with the same surname but different first names) that, without any error message, would produce an incorrect result of joining the files. If we want join to combine the lines of the files by full name, then the right way is to specify the field separator explicitly and to sort by the key field rather than by the whole line. In that case the join will be performed correctly and there will be no errors in any locale:

$> sort -t ';' -k 1,1 buhg.txt > buhg.srt
$> sort -t ';' -k 1,1 mail.txt > mail.srt
$> join -t ';' buhg.srt mail.srt > result
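For a reproducible end-to-end run, the whole sequence can be put into one script. Several things here are my assumptions rather than part of the original setup: the files are created in a temporary directory, the e-mail address is a made-up placeholder, and LC_ALL=C is pinned only so that the sketch runs on systems where the other locales are not installed.

```shell
# Self-contained rerun with the article's data; the e-mail address is a made-up placeholder.
workdir=$(mktemp -d)
printf '%s\n' 'Иванова Алла;маляр' 'Ёлкина Элла;крановщица' \
    'Иванов Андрей;слесарь' 'Абаканов Михаил;маляр' > "$workdir/buhg.txt"
printf '%s\n' 'Иванов Андрей;andrey@example.org' > "$workdir/mail.txt"

# Sort both files on the first ';'-separated field only.
LC_ALL=C sort -t ';' -k 1,1 "$workdir/buhg.txt" > "$workdir/buhg.srt"
LC_ALL=C sort -t ';' -k 1,1 "$workdir/mail.txt" > "$workdir/mail.srt"

# Join on that same field with the same separator.
result=$(LC_ALL=C join -t ';' "$workdir/buhg.srt" "$workdir/mail.srt")
printf '%s\n' "$result"
rm -r "$workdir"
```

The script prints the single joined record for Иванов Андрей, with the job title followed by the placeholder address. The quotes around ';' matter: an unquoted semicolon would end the command in the shell.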

The successfully executed example in the CP1251 encoding contains yet another mistake. The point is that in all the Linux distributions known to me, the packages lack a compiled ru_RU.CP1251 locale. If a compiled locale is not found, sort silently uses byte-by-byte comparison, which is what we observed.
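Since the fallback is silent, it can be worth checking for the locale before relying on it. The helper below is a hypothetical sketch: has_locale reads a locale -a style list from standard input and compares names case-insensitively with '-' removed, which is only a crude stand-in for the canonicalization of encoding names.

```shell
# Hypothetical helper: is the wanted locale in a `locale -a` style list read from stdin?
# Names are compared case-insensitively with '-' removed, a crude stand-in for
# the canonicalization of encoding names.
has_locale() {
    wanted=$(printf '%s\n' "$1" | LC_ALL=C tr '[:upper:]' '[:lower:]' | sed 's/-//g')
    LC_ALL=C tr '[:upper:]' '[:lower:]' | sed 's/-//g' | grep -qxF -e "$wanted"
}

# Typical use, assuming the locale command is available:
#   locale -a | has_locale ru_RU.CP1251 || echo 'sort will fall back to byte comparison'
printf '%s\n' C POSIX en_US.utf8 | has_locale en_US.UTF-8 && echo 'en_US.UTF-8 is listed'
```

With the sample list above, en_US.UTF-8 is found through its en_US.utf8 entry, while a request for ru_RU.CP1251 would fail.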

By the way, there is one more small glitch related to the unavailability of compiled locales. The command LOCPATH=/tmp locale -a will list all the locales in locale-archive, but with the LOCPATH variable set, these locales will be unavailable to all programs (including locale itself).

$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison

Conclusion

If you are a programmer used to thinking of strings as a set of bytes, then your choice is LC_COLLATE=C.

If you are a linguist or a dictionary compiler, you had better compile your own locale.

If you are an ordinary user, you just need to get used to the fact that the ls -a command lists files starting with a dot mixed in with files starting with a letter, while Midnight Commander, which uses its own internal functions to sort names, puts files starting with a dot at the top of the list.

References

Report No. 10: Unicode Collation Algorithm

Character weights on unicode.org

ICU, a reference implementation of a library for working with Unicode, from IBM

Sorting test using ICU

Character weights in ISO 14651

Description of the weight file format, ISO 14652

Discussion of string comparison in glibc

Source: www.habr.com
