How to sort strings in Linux

Introduction

It all started with a short script that was supposed to combine employee e-mail addresses, taken from the user list of the mailing list, with employee positions, taken from the HR department database. Both lists were exported to Unicode UTF-8 text files and saved with Unix line endings.

Contents of mail.txt

Иванов Андрей;[email protected]

Contents of buhg.txt

Иванова Алла;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Абаканов Михаил;маляр

To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:

$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted: Иванов Андрей;слесарь

Inspecting the sort output by eye showed that the sorting is correct overall, but when a male and a female surname share the same stem, the female one comes before the male one:

$> sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

It looks like either a glitch in Unicode sorting or a manifestation of feminism in the sorting algorithm. The first, of course, is more plausible.

Let's put join aside for now and focus on sort. Let's try to solve the problem by scientific poking. First, let's change the locale from en_US to ru_RU. For sorting it would be enough to set the LC_COLLATE environment variable, but we won't waste time on trifles:

$> LANG=ru_RU.UTF-8 sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

Nothing changed.

Let's try re-encoding the files into a single-byte encoding:

$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF8

Again, nothing changed.

Nothing to be done, we'll have to look for a solution on the Internet. There is nothing directly about Russian surnames, but there are questions about other sorting oddities. For example, here is one problem: unix sort treats '-' (dash) characters as invisible. In short, the strings "a-b", "aa", "a-c" are sorted as "aa", "a-b", "a-c".
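The "invisible dash" effect can be imitated outside of sort. The sketch below is only an approximation: it assumes the strings in question are "a-b", "aa" and "a-c", and it models the locale-aware comparison by simply dropping '-' from the sort key (the helper name ignore_dashes is made up for the illustration).

```python
strings = ["a-b", "aa", "a-c"]

# Plain code point order: '-' (U+002D) sorts before every letter.
byte_order = sorted(strings)

def ignore_dashes(s):
    """Hypothetical key: compare as if dashes were not there."""
    return s.replace("-", "")

# Rough stand-in for a locale where the dash is ignored.
dash_blind_order = sorted(strings, key=ignore_dashes)

print(byte_order)        # ['a-b', 'a-c', 'aa']
print(dash_blind_order)  # ['aa', 'a-b', 'a-c']
```

The first result is what a byte-oriented comparison gives; the second matches the order described in the question.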

The answer is the same everywhere: use the programmer's locale "C" and you will be happy. Let's try:

$> LANG=C sort buhg.txt
Ёлкина Элла;крановщица
Абаканов Михаил;маляр
Иванов Андрей;слесарь
Иванова Алла;адвокат

Something changed. The Ivanovs lined up in the correct order, although Yolkina slipped off somewhere. Let's return to the original problem:

$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result

It worked without errors, just as the Internet promised. And that despite Yolkina being on the first line.

The problem seems to be solved, but just in case, let's try another Russian encoding - Windows CP1251:

$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF8

The sort result, oddly enough, matches the "C" locale, and the whole example therefore runs without errors. Some kind of miracle.

I don't like miracles in programming because they usually hide errors. We'll have to look seriously into how sort works and what LC_COLLATE affects.

At the end I'll try to answer these questions:

  • why were the female surnames sorted incorrectly?
  • why did LANG=ru_RU.CP1251 turn out to be equivalent to LANG=C?
  • why do sort and join have different ideas about the order of sorted strings?
  • why are there errors in all my examples?
  • finally, how do you sort strings the way you like?

Sorting in Unicode

The first stop is Technical Report No. 10, entitled Unicode collation algorithm, on the unicode.org site. The report contains a lot of technical detail, so let me give a brief summary of the main ideas.

Collation - "comparing" strings - is the basis of any sorting algorithm. The algorithms themselves may differ ("bubble", "merge", "quick"), but all of them compare a pair of strings to determine the order in which they appear.

Sorting strings in a natural language is a fairly complex problem. Even in the simplest single-byte encodings, the order of letters in an alphabet that differs even slightly from the English Latin alphabet no longer coincides with the order of the numeric values by which these letters are encoded. In the German alphabet the letter Ö stands between O and P, while in the CP850 encoding it falls between ÿ and Ü.

You can try to abstract away from a specific encoding and consider "ideal" letters arranged in some order, as is done in Unicode. The encodings UTF8, UTF16 or the single-byte KOI8-R (if a limited subset of Unicode is enough) give different numeric representations of the letters, but refer to the same elements of the base table.

It turns out that even if we build the character table from scratch, we cannot assign it a universal character order. In different national alphabets that use the same letters, the order of these letters may differ. For example, in French Æ is considered a ligature and is sorted as the string AE. In Norwegian Æ is a separate letter that comes after Z. By the way, besides ligatures like Æ there are letters written with several characters. In the Czech alphabet there is the letter Ch, which stands between H and I.

Besides differences in alphabets, there are other national traditions that affect sorting. In particular, the question arises: in what order should words consisting of uppercase and lowercase letters appear in a dictionary? Sorting can also be affected by the use of punctuation. In Spanish, an inverted question mark is placed at the beginning of an interrogative sentence (¿Te gusta la música?, "Do you like music?"). In this case it is obvious that interrogative sentences should not be grouped into a separate cluster outside the alphabet, but how should strings with other punctuation marks be sorted?

I won't dwell on sorting strings in languages that differ greatly from European ones. Note that in languages written right to left or top to bottom, the characters in a string are usually stored in reading order, and even non-alphabetic writing systems have their own ways of ordering strings character by character. For example, hieroglyphs can be ordered by shape (Chinese character radicals) or by pronunciation. To be honest, I don't know how emoji should be ordered, but something can be invented for them too.

Based on the features listed above, the basic requirements for comparing strings using Unicode tables were formulated:

  • string comparison does not depend on the position of characters in the code table;
  • sequences of characters that form a single character are reduced to canonical form (A + ring above is the same as Å);
  • when strings are compared, a character is considered in the context of the string and, if necessary, is combined with its neighbors into one comparison unit (Ch in Czech) or split into several (Æ in French);
  • all national features (alphabet, uppercase/lowercase, punctuation, order of script types) must be configurable, up to manual assignment of the order (emoji);
  • comparison matters not only for sorting but in many other places, for example for specifying ranges of strings (the {A..z} substitution in bash);
  • comparison should be reasonably fast.

In addition, the authors of the report listed properties of comparison that algorithm developers should not rely on:

  • the comparison algorithm should not require a separate character set for each language (Russian and Ukrainian share most Cyrillic characters);
  • comparison should not rely on the order of characters in the Unicode tables;
  • the weight of a string should not be an attribute of the string, since the same string may have different weights in different cultural contexts;
  • string weights may change when strings are concatenated or split (x < y does not imply xz < yz);
  • different strings with equal weights are considered equal from the point of view of the sorting algorithm. Introducing an additional ordering of such strings is possible, but it may degrade performance;
  • on repeated sorting, strings with equal weights may be swapped. Stability is a property of a particular sorting algorithm, not of the string comparison algorithm (see the previous item);
  • sorting rules may change over time as cultural traditions are refined or change.
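The remark that x < y does not imply xz < yz is easy to check with the Czech letter Ch. The following is a toy sketch, not a real Czech collation: the shortened alphabet, the RANK table and the collation_key helper are all assumptions made for the illustration.

```python
# Toy Czech-like alphabet: "ch" is a single letter placed between "h" and "i".
ALPHABET = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j", "k"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

def collation_key(word):
    """Split a word into comparison units, treating "ch" as one unit."""
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return key

x, y, z = "c", "d", "h"
print(collation_key(x) < collation_key(y))          # True:  c < d
print(collation_key(x + z) < collation_key(y + z))  # False: ch > dh
```

Appending the same suffix to both strings flips the result, because "c" followed by "h" becomes a different comparison unit.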

It is also stated that the comparison algorithm knows nothing about the semantics of the strings being processed. So strings consisting only of digits should not be compared as numbers, and in lists of English titles the leading article is not moved to the end (Beatles, The).

To satisfy all of these requirements, a multi-level (in fact, four-level) table-based sorting algorithm is proposed.

First, the characters in a string are reduced to canonical form and grouped into comparison units. Each comparison unit is assigned several weights corresponding to several comparison levels. The weights of comparison units are elements of ordered sets (in this case, integers) that can be compared as greater or less. The special value IGNORED (0x0) means that the unit does not take part in the comparison at the corresponding level. The comparison of two strings may be repeated several times, each time using the weights of the corresponding level. At each level, the weights of the comparison units of the two strings are compared one after another.

In different implementations of the algorithm for different national traditions the values of the weights may differ, but the Unicode standard includes a base weight table - the "Default Unicode Collation Element Table" (DUCET). I'd like to point out that setting the LC_COLLATE variable is in effect a choice of the weight table used by the string comparison function.

The weights in DUCET are arranged as follows:

  • at the first level, all letters are reduced to the same case, diacritics are discarded, and punctuation marks (not all of them) are ignored;
  • at the second level, only diacritics are taken into account;
  • at the third level, only case is taken into account;
  • at the fourth level, only punctuation marks are taken into account.

The comparison takes place in several passes: first the first-level weights are compared; if they match, the comparison is repeated with the second-level weights; then, possibly, with the third and the fourth.

The comparison ends as soon as the strings contain corresponding comparison units with different weights. Strings with equal weights at all four levels are considered equal.
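The pass-by-pass comparison can be sketched in a few lines. This is a simplified illustration, not the full algorithm from the report: the WEIGHTS table is invented, covers only six characters, and the names compare and level_weights are hypothetical.

```python
from functools import cmp_to_key

IGNORE = 0  # a unit with this weight is skipped at the given level

# Made-up weights: (first, second, third, fourth level).
WEIGHTS = {
    "a": (10, 1, 1, IGNORE),
    "A": (10, 1, 2, IGNORE),
    "b": (11, 1, 1, IGNORE),
    "B": (11, 1, 2, IGNORE),
    " ": (IGNORE, IGNORE, IGNORE, 4),
    "-": (IGNORE, IGNORE, IGNORE, 5),
}

def level_weights(s, level):
    """Weights of one level, with ignored units left out."""
    return [WEIGHTS[ch][level] for ch in s if WEIGHTS[ch][level] != IGNORE]

def compare(s1, s2):
    """Return -1, 0 or 1, moving to the next level only on a tie."""
    for level in range(4):
        w1, w2 = level_weights(s1, level), level_weights(s2, level)
        if w1 != w2:
            return -1 if w1 < w2 else 1
    return 0

print(compare("ab", "a-b"))  # -1: equal on levels 1-3, decided on level 4
print(sorted(["a-b", "aa", "Ab", "ab"], key=cmp_to_key(compare)))
# ['aa', 'ab', 'a-b', 'Ab']
```

Note how "a-b" lands between "ab" and "Ab": the dash only matters after letters and case have failed to separate the strings.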

This algorithm (with many more technical details) gave Report No. 10 its name - "Unicode Collation Algorithm" (UCA).

At this point the sorting behavior in our example becomes a little clearer. It would be nice to compare it with the Unicode standard.

For testing implementations of UCA there is a special test that uses a weights file implementing DUCET. You can find all sorts of amusing things in the weights file. For example, it defines the order of mahjong tiles and European dominoes, as well as the order of suits in a deck of cards (code points 1F000 and up). The card suits are ordered according to the rules of bridge - spades, hearts, diamonds, clubs - and the cards within a suit go in the order A, 2, 3, ... K.

Checking by hand that strings are sorted correctly according to DUCET would be rather tedious, but luckily for us there is a reference implementation of a library for working with Unicode - "International Components for Unicode" (ICU).

The site of this library, which was developed at IBM, has demo pages, including a page for the string comparison algorithm. We enter our test lines with the default settings and, lo and behold, we get perfect Russian sorting.

Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;адвокат

By the way, on the ICU site you can find a description of how the comparison algorithm handles punctuation. In the examples of the Collation FAQ, the apostrophe and the hyphen are ignored.

Unicode helped us, but the reasons for the strange behavior of sort in Linux will have to be sought elsewhere.

Sorting in glibc

A quick look at the source code of the sort utility from GNU Core Utilities showed that, in the utility itself, localization comes down to printing the current value of the LC_COLLATE variable when running in debug mode:

$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

Strings are compared with the standard function strcoll, which means that everything interesting is in the glibc library.

On the wiki of the glibc project, one paragraph is devoted to string comparison. From this paragraph you can gather that sorting in glibc is based on the already familiar UCA algorithm (The Unicode collation algorithm) and/or on the closely related standard ISO 14651 (International string ordering and comparison). Regarding the latter, it should be noted that on the site standards.iso.org ISO 14651 is officially declared to be publicly available, but the corresponding link leads to a non-existent page. Google returns several pages of links to official sites offering an electronic copy of the standard for a hundred euros, but on the third or fourth page of the search results there are also direct links to the PDF. Overall, the standard hardly differs from UCA, but it is more boring to read because it contains no vivid examples of national sorting peculiarities.

The most interesting thing on the wiki was a link to the bug tracker with a discussion of the string comparison implementation in glibc. From the discussion you can learn that glibc compares strings using ISO's Common Template Table (CTT), whose address can be found in Annex A of the ISO 14651 standard. Between 2000 and 2015 this table in glibc had no maintainer and differed quite a lot (at least on the surface) from the current version of the standard. From 2015 to 2018 it was adapted to the new version of the table, and now you have a chance to meet in real life both the new version of the table (CentOS 8) and the old one (CentOS 7).

Now that we have all the information about the algorithm and the auxiliary tables, we can return to the original problem and work out how to sort strings correctly in the Russian locale.

ISO 14651/14652

In most Linux distributions the source of the CTT table we are interested in is located in the directory /usr/share/i18n/locales/. The table itself is in the file iso14651_t1_common. This file is then included by the directive copy iso14651_t1_common into the file iso14651_t1, which in turn is included into the national files, among them en_US and ru_RU. In most Linux distributions all the source files are part of the base installation, but if they are missing, you will have to install an additional package from the distribution.

The structure of the file iso14651_t1 may seem terribly verbose, with non-obvious rules for constructing names, but once you look into it, everything is quite simple. The structure is described in the standard ISO 14652, a copy of which can be downloaded from the site open-std.org. Another description of the file format can be found in the POSIX specifications from OpenGroup. As an alternative to reading the standard, you can study the source code of the function collate_read in glibc/locale/programs/ld-collate.c.

The file structure looks like this:

By default, the character \ is used as the escape character, and the rest of a line after the character # is a comment. Both characters can be redefined, which is what is done in the new version of the table:

escape_char /
comment_char %

The file contains tokens in the format <Uxxxx> or <Uxxxxxxxx> (where x is a hexadecimal digit). This is the hexadecimal representation of Unicode code points in the UCS-4 (UTF-32) encoding. All other elements in angle brackets (including <2> and the like) are treated as plain string constants that have little meaning outside their context.

The line LC_COLLATE tells us that what follows is the data describing string comparison.

First, names are defined for the weights in the comparison table and for combinations of characters. Generally speaking, the two kinds of names belong to two different entities, but in the actual file they are intermixed. Weight names are defined by the keyword collating-symbol (comparison character), because during comparison Unicode characters with the same weights are considered equivalent.

The total length of this section in the current revision of the file is about 900 lines. I pulled examples from several places to show the arbitrariness of the names and the several kinds of syntax.

LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

  • collating-symbol <OSMANYA> registers the string OSMANYA in the table of weight names
  • collating-symbol <S1D000>..<S1D35F> registers a sequence of names consisting of the prefix S and a hexadecimal numeric suffix from 1D000 to 1D35F.
  • FFFF in collating-symbol <SFFFF> looks like a large unsigned integer in hexadecimal, but it is just a name that could look like anything
  • the name <U0413> denotes a code point in the UCS-4 encoding
  • collating-element <U0413_0301> from "<U0413><U0301>" registers a new name for a pair of Unicode code points.

Once the weight names are defined, the weights themselves are defined. Since only greater-than and less-than relations matter in comparison, the weights are determined simply by the order in which the names are listed. "Lighter" weights are listed first, then "heavier" ones. Let me remind you that every Unicode character is assigned four different weights. Here they are combined into a single ordered sequence. In theory, any symbolic name could be used at any of the four levels, but the comments indicate that the developers mentally separate the names by level.

% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE

Finally, the weight table itself.

The weights section is enclosed between lines with the keywords order_start and order_end. Additional parameters of order_start determine the direction in which strings are scanned at each comparison level. The default is forward. The body of the section consists of lines that contain a character code and its four weights. The character code can be given as the character itself, as a code point, or as a symbolic name defined earlier. Weights can likewise be given as symbolic names, code points, or the characters themselves. If code points or characters are used, their weight equals the numeric value of the code point (the position in the Unicode table). Characters that are not listed explicitly are (as far as I understand) considered to be assigned to the table with a primary weight matching their position in the Unicode table. The special weight IGNORE means that the character is ignored at the corresponding comparison level.

To demonstrate the structure of the weights, I picked three fairly obvious fragments:

  • characters that are ignored completely
  • characters that are equivalent to the digit three at the first two levels
  • the beginning of the Cyrillic alphabet, which has no diacritics and is therefore sorted mainly by the first and third levels.

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end
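To make the layout of these lines concrete, here is a small sketch that splits one weight line into the character name and its four weights. It is an assumption-laden helper, not the parser used by glibc: it assumes '%' as the comment character, ignores escapes and quoted weights, and the names parse_weight_line and code_point are made up.

```python
def parse_weight_line(line, comment_char="%"):
    """Split '<Uxxxx> w1;w2;w3;w4 % comment' into (symbol, [w1, w2, w3, w4])."""
    body = line.split(comment_char, 1)[0].strip()
    symbol, weights = body.split(None, 1)
    return symbol, weights.strip().split(";")

def code_point(symbol):
    """Numeric code point of a single-character name such as '<U0410>'."""
    return int(symbol[2:-1], 16)

symbol, weights = parse_weight_line(
    "<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A"
)
print(symbol)                   # <U0410>
print(weights)                  # ['<S0430>', '<BASE>', '<CAP>', '<U0410>']
print(chr(code_point(symbol)))  # А
```

Reading the result: the capital А shares its first-level weight <S0430> with the small а and differs from it only at the third level (<CAP> versus <MIN>).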

Now we can return to sorting the examples from the beginning of the article. The trap lies in this part of the weight table:

<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

You can see that in this table the punctuation marks from the ASCII table (including the space) are almost always ignored when strings are compared. The only exception is strings that match in everything except the punctuation marks found at corresponding positions. To the comparison algorithm, the lines from my example (after sorting) look like this:

АбакановМихаилмаляр
ЁлкинаЭллакрановщица
ИвановаАлламаляр
ИвановАндрейслесарь

Given that in the weight table uppercase Russian letters come after lowercase ones (at the third level <CAP> is heavier than <MIN>), the sort order looks entirely correct.
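The same order can be reproduced with a toy model of the table. The sketch below is an approximation under stated assumptions: it knows only Russian letters, treats Ё as a separate letter after Е, drops every other character, and uses a made-up sort_key with just a first level (letters) and a third level (case).

```python
RUSSIAN = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
PRIMARY = {ch: i + 1 for i, ch in enumerate(RUSSIAN)}

def sort_key(line):
    """First-level weights, then case; spaces and ';' are ignored."""
    letters = [ch for ch in line if ch.lower() in PRIMARY]
    primary = [PRIMARY[ch.lower()] for ch in letters]
    tertiary = [1 if ch.isupper() else 0 for ch in letters]  # CAP after MIN
    return (primary, tertiary)

lines = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]

for line in sorted(lines, key=sort_key):
    print(line)
# Абаканов Михаил;маляр
# Ёлкина Элла;крановщица
# Иванова Алла;маляр
# Иванов Андрей;слесарь
```

With the space ignored, "Иванова Алла" is compared as "ИвановаАлла" against "ИвановАндрей", and at the eighth letter "а" beats "н".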

When the variable LC_COLLATE=C is set, a special table is loaded that specifies byte-by-byte comparison:

static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};

Since in Unicode the code point of Ё comes before А, the strings are sorted accordingly.
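The effect of byte-by-byte comparison is easy to reproduce without glibc. The sketch below is only an imitation of the "C" locale: it sorts the lines of buhg.txt by their UTF-8 bytes, assuming the file contents shown at the beginning of the article.

```python
lines = [
    "Иванова Алла;маляр",
    "Ёлкина Элла;крановщица",
    "Иванов Андрей;слесарь",
    "Абаканов Михаил;маляр",
]

# Compare raw UTF-8 bytes, roughly what sort does under LC_COLLATE=C.
c_locale_order = sorted(lines, key=lambda s: s.encode("utf-8"))

for line in c_locale_order:
    print(line)
# Ёлкина Элла;крановщица
# Абаканов Михаил;маляр
# Иванов Андрей;слесарь
# Иванова Алла;маляр

print(hex(ord("Ё")), hex(ord("А")))  # 0x401 0x410
```

Ё (U+0401) precedes the whole block А..я, and the space (0x20) is smaller than any byte of a Cyrillic letter, which puts "Иванов Андрей" before "Иванова Алла".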

Text and binary tables

Obviously, string comparison is an extremely common operation, and parsing the CTT table is a rather expensive procedure. To optimize access to the table, it is compiled into binary form by the localedef command.

The localedef command takes as parameters a file with the table of national settings (the -i option), in which all characters are represented by Unicode code points, and a file that maps Unicode code points to the characters of a specific encoding (the -f option). As a result, binary files are created for the locale whose name is given in the last parameter.

glibc supports two binary file formats: "traditional" and "modern".

The traditional format implies that the locale name is the name of a subdirectory in /usr/lib/locale/. This subdirectory stores the binary files LC_COLLATE, LC_CTYPE, LC_TIME and so on. The file LC_IDENTIFICATION contains the formal name of the locale (which may differ from the directory name) and comments.

The modern format implies that all locales are stored in a single archive, /usr/lib/locale/locale-archive, which is mapped into the virtual memory of every process that uses glibc. In the modern format the locale name is canonicalized - only digits and letters, reduced to lowercase, remain in the encoding name. So ru_RU.KOI8-R is stored as ru_RU.koi8r.
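The canonicalization rule described above can be written down in a few lines. This is a sketch of the rule as stated in the text, not the actual glibc code; the function name normalize_codeset is an assumption, and locale modifiers such as @euro are not handled.

```python
def normalize_codeset(locale_name):
    """Keep only letters and digits of the encoding part, in lowercase."""
    base, dot, codeset = locale_name.partition(".")
    if not dot:
        return locale_name  # no encoding part, nothing to normalize
    normalized = "".join(ch.lower() for ch in codeset if ch.isalnum())
    return f"{base}.{normalized}"

print(normalize_codeset("ru_RU.KOI8-R"))        # ru_RU.koi8r
print(normalize_codeset("en_US.UTF-8"))         # en_US.utf8
print(normalize_codeset("ru_RU.MAC-CYRILLIC"))  # ru_RU.maccyrillic
```

This is why a locale requested as en_US.UTF-8 shows up as en_US.utf8 in the output of locale -a.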

Input files are looked up in the current directory, as well as in the directories /usr/share/i18n/locales/ and /usr/share/i18n/charmaps/ for CTT files and encoding files, respectively.

For example, the command

localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

will compile the file /usr/share/i18n/locales/ru_RU using the encoding file /usr/share/i18n/charmaps/MAC-CYRILLIC.gz and save the result in /usr/lib/locale/locale-archive under the name ru_RU.maccyrillic

If you set the variable LANG=en_US.UTF-8, glibc will look for the locale binaries in the following sequence of files and directories:

/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/

If a locale exists in both the traditional and the modern format, the modern one takes priority.

You can view the list of compiled locales with the command locale -a.

Preparing your own comparison table

Now, armed with this knowledge, you can create your own ideal string comparison table. This table should compare Russian letters correctly, including the letter Ё, and at the same time take punctuation marks into account according to the ASCII table.

Preparing your own sorting table consists of two stages: editing the weight table and compiling it into binary form with the localedef command.

So that the comparison table can be adjusted with minimal editing effort, the ISO 14652 format provides sections for adjusting the weights of an existing table. A section begins with the keyword reorder-after followed by the position after which the replacement is made. The section ends with the line reorder-end. If several parts of the table need to be adjusted, a separate section is created for each part.

I copied the new versions of the files iso14651_t1_common and ru_RU from the glibc repository into my home directory ~/.local/share/i18n/locales/ and slightly edited the LC_COLLATE section in ru_RU. The new versions of the files are fully compatible with my version of glibc. If you want to use the old versions of the files, you will have to change the symbolic names and the place in the table where the replacement begins.

LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE

Strictly speaking, the fields in LC_IDENTIFICATION should have been changed so that they point to the locale ru_MY, but in my example this was not required, since I excluded the locale archive from the locale search.

So that localedef can work with the files in my folder, an additional directory for input files can be added through the I18NPATH variable, and the directory for saving the binary files can be specified as a path containing slashes:

$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX assumes that in LANG you can write absolute paths to directories with locale files, starting with a leading slash, but glibc in Linux counts all paths from a base directory, which can be overridden with the LOCPATH variable. After setting LOCPATH=~/.local/lib/locale/ all locale-related files will be looked up only in my folder. The locale archive is ignored when the LOCPATH variable is set.

Here is the decisive test:

$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;адвокат

Hooray! We did it!

Working on the mistakes

I have already answered the questions about string sorting posed at the beginning, but a couple of questions about errors - visible and invisible - remain.

Let's return to the original problem.

Both the sort program and the join program use the same string comparison functions from glibc. How did it happen that join reported a sorting error on lines sorted by the sort command in the en_US.UTF-8 locale? The answer is simple: sort compares the whole line, while join compares only the key, which by default is the beginning of the line up to the first whitespace character. In my example this led to an error message because the order of the first words of the lines did not match the order of the complete lines.

The "C" locale guarantees that in sorted lines the leading substrings up to the first space are sorted as well, but this only masks the error. It is possible to pick data (people with the same surname but different first names) that would produce a wrong merge result without any error message. If we want join to merge the lines of the files by full name, the correct way is to specify the field separator explicitly and sort by the key field, not by the whole line. In that case the merge proceeds correctly and there are no errors in any locale:

$> sort -t ';' -k 1,1 buhg.txt > buhg.srt
$> sort -t ';' -k 1,1 mail.txt > mail.srt
$> join -t ';' buhg.srt mail.srt > result
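The mismatch between whole-line order and key order can be shown on two lines. The sketch below is a toy model, not the behavior of any particular locale: collate is a made-up stand-in for a comparison that ignores spaces and ';', and is_sorted mimics the order check that join performs on its keys.

```python
def collate(s):
    """Toy key: ignore spaces and ';', compare letters case-insensitively."""
    return s.replace(" ", "").replace(";", "").lower()

def is_sorted(keys):
    """True if every key is not greater than the next one."""
    return all(collate(a) <= collate(b) for a, b in zip(keys, keys[1:]))

lines = ["Иванов Андрей;слесарь", "Иванова Алла;маляр"]

# Sorting whole lines, then taking join's default key (text before whitespace).
whole_line_sorted = sorted(lines, key=collate)
default_keys = [line.split()[0] for line in whole_line_sorted]

# Sorting by the first ';'-separated field and using the same field as the key.
field_sorted = sorted(lines, key=lambda line: collate(line.split(";")[0]))
field_keys = [line.split(";")[0] for line in field_sorted]

print(default_keys, is_sorted(default_keys))  # ['Иванова', 'Иванов'] False
print(field_keys, is_sorted(field_keys))      # ['Иванова Алла', 'Иванов Андрей'] True
```

When sorting and merging agree on what the key is, the key sequence is sorted by construction, whatever the comparison rules are.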

The example in the CP1251 encoding, which ran successfully, contains another error. The fact is that in all Linux distributions known to me the packages do not include a compiled ru_RU.CP1251 locale. If the compiled locale is not found, sort silently falls back to byte-by-byte comparison, which is what we observed.

By the way, there is one more small glitch related to the unavailability of compiled locales. The command LOCPATH=/tmp locale -a lists all the locales in locale-archive, but with the LOCPATH variable set these locales are unavailable to all programs (including locale itself).

$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison

Conclusion

If you are a programmer who is used to thinking of strings as a set of bytes, your choice is LC_COLLATE=C.

If you are a linguist or a dictionary compiler, you had better compile your own locale.

If you are an ordinary user, you just need to get used to the fact that the command ls -a lists files starting with a dot mixed in with files starting with a letter, while Midnight Commander, which uses its own internal functions to sort names, puts the files starting with a dot at the beginning of the list.

References

Report No. 10: Unicode collation algorithm

Character weights on unicode.org

ICU - an implementation of a library for working with Unicode, from IBM.

Sorting test using ICU

Character weights in ISO 14651

Description of the weights file format, ISO 14652

Discussion of string comparison in glibc

Source: www.habr.com
