How Linux's sort sorts strings

Introduction

It all started with a short script that was supposed to combine employees' email addresses, taken from the list of mailing-list users, with employees' job titles, taken from the HR department's database. Both lists were exported to Unicode UTF-8 text files and saved with Unix line endings.

Contents of mail.txt

Иванов Андрей;[email protected]

Contents of buhg.txt

Иванова Алла;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Абаканов Михаил;маляр

To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:

$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted: Иванов Андрей;слесарь

Inspecting the sort result by eye showed that the sorting is correct on the whole, but where a male and a female surname share the same stem, the female one comes before the male one:

$> sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

It looks like either a sorting glitch in Unicode or a manifestation of feminism in the sorting algorithm. The first is, of course, more plausible.

Let's set join aside for now and focus on sort. We'll try to solve the problem by the method of scientific poking. First, let's change the locale from en_US to ru_RU. For sorting it would be enough to set the LC_COLLATE environment variable, but we won't waste time on trifles:

$> LANG=ru_RU.UTF-8 sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

Nothing changed.

Let's try re-encoding the files into a single-byte encoding:

$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF8

Again nothing changed.

There's nothing for it: we have to look for a solution on the Internet. There is nothing directly about Russian surnames, but there are questions about other sorting oddities. For example, here is one problem: Unix sort treats the '-' (dash) character as invisible. In short, the strings "a-b", "aa", "ac" are sorted as "aa", "a-b", "ac".

The answer is the same everywhere: use the programmer's locale "C" and you will be happy. Let's try:

$> LANG=C sort buhg.txt
Ёлкина Элла;крановщица
Абаканов Михаил;маляр
Иванов Андрей;слесарь
Иванова Алла;маляр

Something changed. The Ivanovs lined up in the right order, although Yolkina slipped off somewhere. Let's return to the original problem:

$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result

It worked without errors, as the Internet promised. And that despite Yolkina being in the first line.

The problem seems solved, but just in case, let's try another Russian encoding, Windows CP1251:

$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF8

The sort result, oddly enough, matches that of the "C" locale, and the whole example accordingly runs without errors. Some kind of mysticism.

I don't like mysticism in programs because it usually masks errors. We will have to look seriously at how sort works and what LC_COLLATE affects.

At the end I will try to answer these questions:

  • why were the female surnames sorted incorrectly?
  • why did LANG=ru_RU.CP1251 turn out to be equivalent to LANG=C?
  • why do sort and join have different ideas about the order of sorted strings?
  • why are there errors in all of my examples?
  • finally, how do you sort strings to your own liking?

Sorting in Unicode

The first stop is Technical Report No. 10, titled Unicode Collation Algorithm, on unicode.org. The report contains a lot of technical detail, so let me give a brief summary of the main ideas.

Collation, the "comparison" of strings, is the basis of any sorting algorithm. The algorithms themselves may differ ("bubble", "merge", "quick"), but all of them use a comparison of two strings to determine the order in which they appear.
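The separation between the comparison and the sorting algorithm can be sketched in a few lines of Python. This is only an illustration: the compare function below is a toy case-insensitive collation invented for the example, and bubble_sort is a deliberately naive helper, not anything taken from the report.

```python
from functools import cmp_to_key

def compare(a: str, b: str) -> int:
    """Toy collation (an assumption for this sketch): case-insensitive order.

    Returns -1, 0 or 1, like strcoll-style comparison functions do.
    """
    ka, kb = a.lower(), b.lower()
    return (ka > kb) - (ka < kb)

def bubble_sort(items, cmp):
    """Naive bubble sort that learns about order only by calling cmp()."""
    result = list(items)
    for end in range(len(result) - 1, 0, -1):
        for i in range(end):
            if cmp(result[i], result[i + 1]) > 0:
                result[i], result[i + 1] = result[i + 1], result[i]
    return result

words = ["pear", "Apple", "banana"]
by_bubble = bubble_sort(words, compare)
by_builtin = sorted(words, key=cmp_to_key(compare))  # different algorithm, same comparison
print(by_bubble, by_builtin)
```

Both calls are driven by the same comparison, so swapping the sorting algorithm does not change the resulting order.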

Sorting strings in a natural language is a rather complex problem. Even in the simplest single-byte encodings, the order of letters in an alphabet that differs in any way from the English Latin alphabet no longer matches the order of the numeric values those letters are encoded with. For example, in the German alphabet the letter Ö stands between O and P, while in the CP850 encoding it falls between ÿ and Ü.

You can try to abstract away from a specific encoding and consider "ideal" letters arranged in some order, as is done in Unicode. The UTF8 and UTF16 encodings, or single-byte KOI8-R (if a limited subset of Unicode is enough), give different numeric representations of the letters but refer to the same elements of the base table.

It turns out that even if we build a character table from scratch, we cannot assign it a universal order. In different national alphabets that use the same letters, the order of those letters may differ. For example, in French Æ is considered a ligature and sorted as the string AE. In Norwegian Æ is a separate letter that comes after Z. By the way, besides ligatures like Æ there are letters written with several characters. For example, the Czech alphabet has the letter Ch, which stands between H and I.
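The Czech and French cases can be sketched as a small tokenizer that turns a string into comparison units. Everything here is an assumption made for illustration: the function name, the diacritic-free Czech alphabet, and the sample words are not taken from any standard table.

```python
def collation_units(text, contractions=(), expansions=None):
    """Split text into comparison units.

    contractions: character sequences treated as ONE unit (Czech "ch").
    expansions:   single characters treated as SEVERAL units (French "æ" -> "ae").
    """
    expansions = expansions or {}
    lowered = text.lower()
    units, i = [], 0
    while i < len(lowered):
        for seq in contractions:
            if lowered.startswith(seq, i):
                units.append(seq)
                i += len(seq)
                break
        else:
            # No contraction matched here: emit the character or its expansion.
            units.extend(expansions.get(lowered[i], lowered[i]))
            i += 1
    return units

# Simplified Czech alphabet without diacritics: "ch" sits between "h" and "i".
CZECH_ORDER = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")

def czech_key(word):
    return [CZECH_ORDER.index(u) for u in collation_units(word, contractions=("ch",))]

words = ["chata", "cena", "hrad", "jaro"]
print(sorted(words))                 # plain code point order
print(sorted(words, key=czech_key))  # "chata" moves after "hrad"
print(collation_units("Lætitia", expansions={"æ": "ae"}))
```

The same tokenizer handles both directions: a contraction merges neighbours into one unit, an expansion splits one character into several.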

Besides differences in alphabets, there are other national traditions that affect sorting. In particular, the question arises: in what order should words made up of upper-case and lower-case letters appear in a dictionary? Sorting can also be affected by how punctuation is used. In Spanish, an inverted question mark is placed at the beginning of an interrogative sentence (¿Te gusta la música?). In this case it is obvious that interrogative sentences should not be grouped into a separate cluster outside the alphabet, but how should strings with other punctuation marks be sorted?

I will not dwell on sorting strings in languages very different from the European ones. I'll only note that in languages written right-to-left or top-to-bottom the characters in a string are stored in reading order, and even non-alphabetic writing systems have their own ways of ordering strings character by character. For example, hieroglyphs can be ordered by stroke style (Chinese character radicals) or by pronunciation. Honestly, I have no idea how emoji should be ordered, but something can be devised for them too.

Based on the features listed above, the basic requirements for comparing strings on the basis of the Unicode tables were formulated:

  • string comparison does not depend on the position of characters in the code table;
  • sequences of characters that form a single character are reduced to canonical form (A + ring above is the same as Å);
  • when comparing strings, a character is considered in the context of the string and, if necessary, combined with its neighbours into a single comparison unit (Ch in Czech) or split into several (Æ in French);
  • all national features (alphabet, upper/lower case, punctuation, order of writing systems) must be configurable, up to assigning the order by hand (emoji);
  • comparison matters not only for sorting but in many other places, for example for specifying ranges (the {A..z} substitution in bash);
  • comparison should be reasonably fast.
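The canonical-form requirement from the list above can be checked with Python's standard unicodedata module. This is a small sketch of normalization only, not of the full comparison algorithm; the variable names are chosen for the example.

```python
import unicodedata

decomposed = "A\u030a"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
precomposed = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE (Å)

# As raw code point sequences the two strings differ...
same_raw = decomposed == precomposed

# ...but they become identical once both are brought to one canonical form.
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(same_raw, same_nfc, same_nfd)
```

A comparison function that skips this step would treat two visually identical strings as different.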

In addition, the authors of the report formulated properties of comparison that algorithm developers should not rely on:

  • the comparison algorithm should not require a separate character set for each language (Russian and Ukrainian share most Cyrillic characters);
  • comparison should not rely on the order of characters in the Unicode tables;
  • a string's weight should not be an attribute of the string, since the same string may have different weights in different cultural contexts;
  • string weights may change on concatenation or splitting (x < y does not imply xz < yz);
  • different strings with the same weights are considered equal from the point of view of the sorting algorithm. Introducing an additional ordering of such strings is possible, but it may hurt performance;
  • on repeated sorting, strings with equal weights may be swapped. Stability is a property of a particular sorting algorithm, not of the string comparison algorithm (see the previous item);
  • sorting rules may change over time as cultural traditions are refined or change.

It is also stipulated that the comparison algorithm knows nothing about the semantics of the strings being processed. Thus, strings consisting only of digits should not be compared as numbers, and in lists of English names the article should not be stripped (Beatles, The).

To satisfy all these requirements, a multi-level (in fact, four-level) collation algorithm is proposed.

First, the characters in the string are reduced to canonical form and grouped into comparison units. Each comparison unit is assigned several weights, corresponding to several comparison levels. The weights of comparison units are elements of an ordered set (in this case, integers) that can be compared as greater or less. The special value IGNORED (0x0) means that the unit takes no part in the comparison at the corresponding level. The comparison of strings may be repeated several times, using the weights of the corresponding levels. At each level, the weights of the comparison units of the two strings are compared with each other in sequence.

In different implementations of the algorithm for different national traditions the weight values may differ, but the Unicode standard includes a base table of weights, the "Default Unicode Collation Element Table" (DUCET). I want to note that setting the LC_COLLATE variable is in effect a way of choosing the weight table used by the string comparison function.

The DUCET weights are arranged as follows:

  • at the first level, all letters are reduced to the same case, diacritics are dropped, and punctuation marks (not all of them) are ignored;
  • at the second level, only diacritics are taken into account;
  • at the third level, only case is taken into account;
  • at the fourth level, only punctuation marks are taken into account.

The comparison takes place in several passes: first, the first-level weights are compared; if those weights are equal, the comparison is repeated with the second-level weights; then possibly with the third and the fourth.

The comparison ends as soon as the strings contain corresponding comparison units with different weights. Strings whose weights are equal at all four levels are considered equal to each other.
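The multi-pass comparison can be imitated with a sort key made of four lists, one per level. This is a minimal sketch under loud assumptions: the weights are improvised from Unicode character categories and code points rather than read from DUCET, and four_level_key is a name invented for the example.

```python
import unicodedata

def four_level_key(text):
    """Build (level1, level2, level3, level4) weights for a string.

    Assumed simplification, not the real DUCET:
      level 1: base letters, case folded, diacritics and punctuation dropped
      level 2: combining marks (diacritics) only
      level 3: case only (lower case is lighter than upper case)
      level 4: punctuation and separators only
    """
    level1, level2, level3, level4 = [], [], [], []
    for ch in unicodedata.normalize("NFD", text):
        category = unicodedata.category(ch)
        if category.startswith("M"):      # combining mark
            level2.append(ord(ch))
        elif category[0] in "PZ":         # punctuation or separator
            level4.append(ord(ch))
        else:
            level1.append(ch.lower())
            level3.append(1 if ch.isupper() else 0)
    return (level1, level2, level3, level4)

words = ["rôle", "Role", "ro-le", "role"]
ordered = sorted(words, key=four_level_key)
print(ordered)
```

Python compares tuples element by element, so a later level is consulted only when all earlier levels are equal, which is the pass structure described above.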

This algorithm (with a heap of additional technical details) gave its name to Report No. 10: "Unicode Collation Algorithm" (UCA).

At this point the sorting behaviour in our example becomes a little clearer. It would be good to compare it with the Unicode standard.

To test UCA implementations there is a special test that uses a weights file implementing DUCET. You can find all sorts of amusing things in the weights file. For example, it defines an order for mahjong tiles and European dominoes, as well as an order for the suits in a deck of cards (characters 1F000 and onward). The card suits are placed according to the rules of bridge (spades, hearts, diamonds, clubs), and the cards within a suit go in the sequence A, 2, 3 ... K.

Checking by hand that strings are sorted correctly according to DUCET would be rather tedious, but, luckily for us, there is a reference implementation of a library for working with Unicode: "International Components for Unicode" (ICU).

The website of this library, developed at IBM, has demo pages, including a page for the string comparison algorithm. We enter our test lines with the default settings and, lo and behold, we get a perfect Russian sort.

Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;маляр

By the way, on the ICU website you can find a description of how the comparison algorithm handles punctuation marks. In the Collation FAQ examples, the apostrophe and the hyphen are ignored.

Unicode helped us, but the reasons for the strange behaviour of sort on Linux will have to be sought elsewhere.

Sorting in glibc

A quick look at the source code of the sort utility from GNU Core Utils showed that in the utility itself localization boils down to printing the current value of the LC_COLLATE variable when run in debug mode:

$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

String comparison is done with the standard strcoll function, which means everything interesting is in the glibc library.

The glibc project wiki devotes one paragraph to string comparison. From that paragraph one can gather that sorting in glibc is based on the algorithm we already know, UCA (the Unicode Collation Algorithm), and/or on the closely related standard ISO 14651 (International string ordering and comparison). Regarding the latter standard, it should be noted that on the site standards.iso.org ISO 14651 is officially declared publicly available, but the corresponding link leads to a nonexistent page. Google returns several pages with links to official sites offering an electronic copy of the standard for a hundred euros, but on the third or fourth page of search results there are also direct links to the PDF. On the whole, the standard hardly differs from UCA, but it is more boring to read because it contains no clear examples of national string-sorting features.

The most interesting information on the wiki is a link to the bug tracker with a discussion of the implementation of string comparison in glibc. From the discussion one can learn that glibc compares strings using the ISO Common Template Table (CTT), the address of which can be found in Annex A of the ISO 14651 standard. Between 2000 and 2015 this table in glibc had no maintainer and differed quite strongly (at least outwardly) from the current version of the standard. From 2015 to 2018 it was adapted to the new version of the table, and today you have a chance of meeting in real life both the new version of the table (CentOS 8) and the old one (CentOS 7).

Now that we have all the information about the algorithm and the auxiliary tables, we can return to the original problem and work out how to sort strings correctly in the Russian locale.

ISO 14651 / 14652

The source of the CTT table we are interested in is located, on most Linux distributions, in the directory /usr/share/i18n/locales/. The table itself is in the file iso14651_t1_common. That file is then included by the directive copy iso14651_t1_common into the file iso14651_t1, which in turn is included into the national files, including en_US and ru_RU. On most Linux distributions all the source files are part of the base installation, but if they are missing, you will have to install an additional package from the distribution.

The format of the file iso14651_t1 may look terribly verbose, with non-obvious naming rules, but once you look into it, everything is quite simple. The format is described in the ISO 14652 standard, a copy of which can be downloaded from the website open-std.org. Another description of the file format can be read in the POSIX specification from the Open Group. As an alternative to reading the standards, you can study the source code of the function collate_read in glibc/locale/programs/ld-collate.c.

The file structure looks like this:

By default the character \ is used as the escape character, and the end of a line after the character # is a comment. Both characters can be redefined, which is what is done in the new version of the table:

escape_char /
comment_char %

Further on, the file contains tokens of the form <Uxxxx> or <Uxxxxxxxx> (where x is a hexadecimal digit). This is the hexadecimal representation of Unicode code points in the UCS-4 (UTF-32) encoding. All other elements in angle brackets (including look-alikes such as <SFFFF>) are treated as simple string constants that have little meaning outside their context.

The line LC_COLLATE tells us that what follows is the data describing string comparison.

First, names are declared for the weights in the comparison table and for character combinations. Strictly speaking, the two kinds of names belong to two different entities, but in the actual file they are mixed together. Weight names are declared with the keyword collating-symbol ("comparison character"), because when comparing, Unicode characters that have the same weights are treated as equivalent characters.

The total length of this section in the current revision of the file is about 900 lines. I pulled examples from several places to show the arbitrariness of the names and the several kinds of syntax.

LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

  • collating-symbol <OSMANYA> registers the string OSMANYA in the table of weight names
  • collating-symbol <S1D000>..<S1D35F> registers a sequence of names consisting of the prefix S and a hexadecimal numeric suffix from 1D000 to 1D35F.
  • FFFF in collating-symbol <SFFFF> looks like the largest unsigned integer in hexadecimal, but it is just a name that could look like anything
  • the name <U0413> denotes a code point in the UCS-4 encoding
  • collating-element <U0413_0301> from "<U0413><U0301>" registers a new name for a pair of Unicode code points.
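The range form of collating-symbol can be unrolled with a few lines of Python. This is a sketch that follows the description in the list above (a fixed prefix plus a hexadecimal counter); the function name and the assumption that the prefix is passed in explicitly are mine, and glibc's own parser may well handle more cases.

```python
def expand_symbol_range(spec: str, prefix: str = "S"):
    """Expand "<S1D000>..<S1D35F>" into the individual symbolic names.

    Assumes both ends share the same prefix and the same number of hex digits.
    """
    first, last = (part.strip("<>") for part in spec.split(".."))
    width = len(first) - len(prefix)           # number of hex digits to keep
    low = int(first[len(prefix):], 16)
    high = int(last[len(prefix):], 16)
    return [f"<{prefix}{value:0{width}X}>" for value in range(low, high + 1)]

names = expand_symbol_range("<S1D000>..<S1D35F>")
print(len(names), names[0], names[1], names[-1])
```

Each generated name is still only a label: its weight comes from where it is later listed in the weight-assignment section, not from the number it happens to contain.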

Once the weight names are declared, the weights themselves are defined. Since only greater-than/less-than relations matter in comparison, the weights are defined simply by the order in which the names are listed: "lighter" weights are listed first, then "heavier" ones. Let me remind you that each Unicode character is assigned four different weights. Here they are combined into a single ordered sequence. In theory, any symbolic name could be used at any of the four levels, but the comments indicate that the developers mentally separate the names by level.

% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE

Finally, the weight table itself.

The weights section is enclosed between the keyword lines order_start and order_end. Additional parameters of order_start determine the direction in which strings are scanned at each comparison level. The default is forward. The body of the section consists of lines that contain a character code and its four weights. The character code may be represented by the character itself, by a code point, or by a previously defined symbolic name. Weights may likewise be given as symbolic names, code points, or the characters themselves. If code points or characters are used, their weight equals the numeric value of the code point (the position in the Unicode table). Characters that are not listed explicitly are (as I understand it) considered assigned to the table with a primary weight equal to their position in the Unicode table. The special weight value IGNORE means that the character is ignored at the corresponding comparison level.

To demonstrate the structure of the weights, I picked three fairly obvious fragments:

  • characters that are ignored completely
  • characters equivalent to the digit three at the first two levels
  • the beginning of the Cyrillic alphabet, which contains no diacritics and is therefore sorted mostly by the first and third levels.

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end
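Lines of this section are easy to take apart mechanically. The sketch below parses a few lines copied from the listing above into a dictionary; parse_weight_line is a hypothetical helper written for this example, it assumes % as the comment character, and it ignores keyword lines, escapes, and quoted strings that a real parser has to handle.

```python
SAMPLE = """
% fragment in the format of the listing above
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
"""

def parse_weight_line(line, comment_char="%"):
    """Return (symbol, [w1, w2, w3, w4]) or None for blank/comment lines."""
    body = line.split(comment_char, 1)[0].strip()
    if not body:
        return None
    symbol, weights = body.split(None, 1)
    return symbol, [w.strip() for w in weights.split(";")]

table = {}
for raw in SAMPLE.splitlines():
    parsed = parse_weight_line(raw)
    if parsed:
        symbol, weights = parsed
        table[symbol] = weights

small_a, capital_a = table["<U0430>"], table["<U0410>"]
print(small_a[0] == capital_a[0])   # same first-level weight
print(small_a[2], capital_a[2])     # they differ only at the third level
```

Read this way, the table shows directly that а and А are the same letter at the first level and are told apart only by the <MIN> and <CAP> weights at the third.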

Now we can return to sorting the examples from the beginning of the article. The ambush lies in this part of the weight table:

<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

You can see that in this table the punctuation marks from the ASCII table (including the space) are almost always ignored when comparing strings. The only exception is strings that match in everything except the punctuation marks found at corresponding positions. To the comparison algorithm, the lines from my example (after sorting) look like this:

АбакановМихаилмаляр
ЁлкинаЭллакрановщица
ИвановаАлламаляр
ИвановАндрейслесарь

Considering that in the weight table upper-case Russian letters come after lower-case ones (at the third level <CAP> is heavier than <MIN>), the sorting looks perfectly correct.
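The observed order can be reproduced with a rough Python model of this table. It is a sketch under stated assumptions: letters are ranked by their position in the Russian alphabet, everything else (space, semicolon) is ignored at the first level, case is a third-level weight, and punctuation is consulted last. The names ALPHABET, RANK, and glibc_like_key are invented for the example.

```python
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
RANK = {letter: index for index, letter in enumerate(ALPHABET)}

def glibc_like_key(line):
    """Approximate multi-level key: letters, then case, then punctuation."""
    letters = [ch for ch in line if ch.lower() in RANK]
    level1 = [RANK[ch.lower()] for ch in letters]            # punctuation ignored
    level3 = [1 if ch.isupper() else 0 for ch in letters]    # <MIN> before <CAP>
    level4 = [ord(ch) for ch in line if ch.lower() not in RANK]
    return (level1, level3, level4)

buhg = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]
ordered = sorted(buhg, key=glibc_like_key)
print("\n".join(ordered))
```

With the space ignored, the eighth letter decides the Ivanov case: "а" (from "Алла") is compared with "н" (from "Андрей"), so Иванова Алла lands before Иванов Андрей.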

When the variable LC_COLLATE=C is set, a special table is loaded that specifies byte-by-byte comparison:

static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};

Since in Unicode the code point of Ё comes before that of А, the strings are sorted accordingly.
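Byte-by-byte comparison is easy to imitate in Python by sorting on the UTF-8 encoded bytes. This sketch assumes the lines of buhg.txt shown at the start of the article and only approximates what the "C" locale does inside strcoll.

```python
buhg = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]

# Compare raw UTF-8 bytes instead of using any collation table.
c_like_order = sorted(buhg, key=lambda line: line.encode("utf-8"))
print("\n".join(c_like_order))

# Ё is U+0401 and А is U+0410, so Ё sorts first in byte order.
print("Ё".encode("utf-8"), "А".encode("utf-8"))
```

The same byte order also explains why the Ivanovs come out "right" here: the space after Иванов is byte 0x20, which is smaller than any byte of the letter а.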

Text and binary tables

Obviously, string comparison is an extremely common operation, and parsing the CTT table is a rather expensive procedure. To speed up access to the table, it is compiled into binary form with the localedef command.

The localedef command takes as parameters a file with the table of national features (option -i), in which all characters are represented by Unicode code points, and a file mapping Unicode code points to the characters of a specific encoding (option -f). As a result, binary files are created for the locale with the name given in the last parameter.

glibc supports two binary file formats: "traditional" and "modern".

The traditional format means that the locale name is the name of a subdirectory in /usr/lib/locale/. This subdirectory stores the binary files LC_COLLATE, LC_CTYPE, LC_TIME and so on. The file LC_IDENTIFICATION contains the formal name of the locale (which may differ from the directory name) and comments.

The modern format stores all locales in a single archive, /usr/lib/locale/locale-archive, which is mapped into the virtual memory of every process that uses glibc. In the modern format the locale name undergoes some canonicalization: in the encoding name only digits and letters remain, reduced to lower case. Thus ru_RU.KOI8-R will be saved as ru_RU.koi8r.
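The canonicalization rule described above fits in a few lines of Python. This is a sketch of the rule as stated in the text (keep letters and digits of the encoding name, lower-cased); normalize_codeset is a hypothetical name, and modifiers such as @euro are not handled.

```python
def normalize_codeset(locale_name: str) -> str:
    """Canonicalize the encoding part of a locale name, e.g. KOI8-R -> koi8r."""
    if "." not in locale_name:
        return locale_name                      # no encoding part to normalize
    base, codeset = locale_name.split(".", 1)
    normalized = "".join(ch for ch in codeset if ch.isalnum()).lower()
    return f"{base}.{normalized}"

for name in ("ru_RU.KOI8-R", "en_US.UTF-8", "ru_RU.MAC-CYRILLIC", "en_US"):
    print(name, "->", normalize_codeset(name))
```

This is why a locale requested as en_US.UTF-8 shows up in listings as en_US.utf8.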

Input files are searched for in the current directory, and also in the directories /usr/share/i18n/locales/ and /usr/share/i18n/charmaps/ for CTT files and encoding files respectively.

For example, the command

localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

will compile the file /usr/share/i18n/locales/ru_RU using the encoding file /usr/share/i18n/charmaps/MAC-CYRILLIC.gz and save the result in /usr/lib/locale/locale-archive under the name ru_RU.maccyrillic

If you set the variable LANG=en_US.UTF-8, glibc will look for the locale binaries in the following sequence of files and directories:

/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/

If a locale exists in both the traditional and the modern format, the modern one takes precedence.

You can view the list of compiled locales with the command locale -a.

Preparing your own comparison table

Now, armed with this knowledge, you can create your own ideal string comparison table. This table should compare Russian letters correctly, including the letter Ё, and at the same time take punctuation marks into account in accordance with the ASCII table.

Preparing your own table consists of two stages: editing the weight table and compiling it into binary form with the localedef command.

So that the comparison table can be adjusted with minimal editing effort, the ISO 14652 format provides sections for adjusting the weights of an existing table. A section begins with the keyword reorder-after followed by the position after which the replacement is performed. The section ends with the line reorder-end. If several parts of the table need to be corrected, a separate section is created for each such part.

I copied fresh versions of the files iso14651_t1_common and ru_RU from the glibc repository to my home directory ~/.local/share/i18n/locales/ and slightly edited the LC_COLLATE section in ru_RU. The new versions of the files are fully compatible with my version of glibc. If you want to use older versions of the files, you will have to change the symbolic names and the place in the table where the replacement starts.

LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE

In fact, the fields in LC_IDENTIFICATION should also have been changed so that they point to the locale ru_MY, but in my example this was not required, since I excluded the locale archive from the locale search.

To make localedef work with the files in my folder, an additional directory for searching input files can be added through the I18NPATH variable, and the directory for saving the binary files can be specified as a path containing slashes:

$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX assumes that LANG may contain absolute paths to directories with locale files, starting with a forward slash, but glibc on Linux resolves all paths relative to a base directory, which can be overridden through the LOCPATH variable. After setting LOCPATH=~/.local/lib/locale/ all locale-related files will be searched for only in my folder. The locale archive is ignored when LOCPATH is set.

And here is the decisive test:

$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;маляр

Hooray! We did it!

Working on the bugs

I have already answered the questions about string sorting posed at the beginning, but two questions about errors remain: the visible ones and the invisible ones.

Let's return to the original problem.

Both the sort program and the join program use the same string comparison functions from glibc. How did it happen that join reported a sorting error on lines sorted by the sort command in the en_US.UTF-8 locale? The answer is simple: sort compares the whole line, while join compares only the key, which by default is the beginning of the line up to the first whitespace character. In my example this led to an error message because the ordering of the first words of the lines did not match the ordering of the full lines.
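The mismatch between whole-line order and key order can be shown with a small Python model. It is a sketch under assumptions: first_level imitates only the first comparison level (letters ranked by the Russian alphabet, everything else ignored), and join_key imitates join's default key as the text up to the first whitespace. Both names are invented for the example.

```python
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
RANK = {letter: index for index, letter in enumerate(ALPHABET)}

def first_level(text):
    """First-level weights only: punctuation and spaces are ignored."""
    return [RANK[ch.lower()] for ch in text if ch.lower() in RANK]

def join_key(line):
    """Default join key: everything up to the first whitespace."""
    return line.split()[0]

buhg = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]

sorted_lines = sorted(buhg, key=first_level)          # what sort does: whole lines
keys = [join_key(line) for line in sorted_lines]      # what join looks at
keys_in_order = all(
    first_level(a) <= first_level(b) for a, b in zip(keys, keys[1:])
)
print(keys, keys_in_order)
```

The keys come out as Абаканов, Ёлкина, Иванова, Иванов: as whole lines the order is valid, but as keys Иванова precedes its own prefix Иванов, which is exactly what join complains about.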

The "C" locale guarantees that in sorted lines the initial substrings up to the first space will also be sorted, but this only masks the error. It is possible to pick data (people with the same surname but different first names) that, without any error message, would give an incorrect result when the files are joined. If we want join to merge the lines of the files by full name, the right way is to specify the field separator explicitly and to sort by the key field rather than by the whole line. In that case the join will proceed correctly and there will be no errors in any locale:

$> sort -t ';' -k 1,1 buhg.txt > buhg.srt
$> sort -t ';' -k 1,1 mail.txt > mail.srt
$> join -t ';' buhg.srt mail.srt > result
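The same idea, sorting and joining on one explicitly delimited key field, can be sketched in Python. The helper names and the two email addresses are hypothetical and exist only for this example; the merge loop assumes each key occurs at most once per file and uses plain code point comparison, roughly as in the "C" locale.

```python
def first_field(line, sep=";"):
    return line.split(sep, 1)[0]

def sort_by_key(lines, sep=";"):
    """Sort by the first field only, not by the whole line."""
    return sorted(lines, key=lambda line: first_field(line, sep))

def merge_join(left, right, sep=";"):
    """Join two lists that are sorted by the same key, one pass over each."""
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = first_field(left[i], sep), first_field(right[j], sep)
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            joined.append(left[i] + sep + right[j].split(sep, 1)[1])
            i += 1
            j += 1
    return joined

buhg = ["Иванова Алла;маляр", "Ёлкина Элла;крановщица",
        "Иванов Андрей;слесарь", "Абаканов Михаил;маляр"]
mail = ["Иванова Алла;alla@example.com",       # hypothetical address
        "Иванов Андрей;andrey@example.com"]    # hypothetical address

result = merge_join(sort_by_key(buhg), sort_by_key(mail))
print("\n".join(result))
```

Because both the sort and the join use the same key and the same comparison, the merge never sees keys out of order, whatever that comparison happens to be.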

The successfully executed example in the CP1251 encoding contains another error. The fact is that in all the Linux distributions known to me, the packages do not include a compiled ru_RU.CP1251 locale. If the compiled locale is not found, sort silently falls back to byte-by-byte comparison, which is what we observed.

By the way, there is another small quirk related to the unavailability of compiled locales. The command LOCPATH=/tmp locale -a will list all the locales in locale-archive, but with the LOCPATH variable set, these locales will be unavailable to all programs (including locale itself).

$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison

Conclusion

If you are a programmer accustomed to thinking of strings as sets of bytes, your choice is LC_COLLATE=C.

If you are a linguist or a dictionary compiler, you had better compile your own locale.

If you are an ordinary user, you just need to get used to the fact that the command ls -a lists files starting with a dot mixed in with files starting with a letter, whereas Midnight Commander, which uses its own internal functions for sorting names, puts files starting with a dot at the beginning of the list.

References

Report No. 10: Unicode Collation Algorithm

Character weights on unicode.org

ICU: an implementation of a library for working with Unicode, from IBM.

Sorting test using ICU

Character weights in ISO 14651

Description of the weights file format, ISO 14652

Discussion of string comparison in glibc

source: www.habr.com
