How Linux's sort sorts strings

Introduction

It all started with a short script that was supposed to combine employees' e-mail addresses, obtained from the list of mailing-list subscribers, with employee positions obtained from the HR database. Both lists were exported to Unicode text files in UTF-8 and stored with Unix line endings.

Content mail.txt

Иванов Андрей;[email protected]

Content buhg.txt

Иванова Алла;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Абаканов Михаил;маляр

To merge them, the files were sorted by the Unix command sort and fed to the input of the Unix program join, which unexpectedly failed with an error:

$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted: Иванов Андрей;слесарь

Inspecting the sort output by eye showed that the sorting is generally correct, but where male and female surnames share a common stem, the female ones come before the male ones:

$> sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

Looks like either a sorting glitch in Unicode or a manifestation of feminism in the sorting algorithm. The first is, of course, more plausible.

Let's set join aside for now and focus on sort. Let's try to solve the problem by trial and error. First, change the locale from en_US to ru_RU. For sorting it would be enough to set the environment variable LC_COLLATE, but we will not waste time on trifles:
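As a side note, the locale variables form a hierarchy: LC_ALL overrides LC_COLLATE, which in turn overrides LANG. A quick sketch (the sample strings are my own) that forces the only locale guaranteed to exist everywhere, "C":

```shell
# LC_ALL beats LC_COLLATE, which beats LANG, so forcing LC_ALL=C
# gives plain byte comparison regardless of the other variables.
printf 'b\nB\na\n' | LANG=ru_RU.UTF-8 LC_ALL=C sort
# In byte order, capital letters (0x41..0x5A) come before
# lowercase ones (0x61..0x7A), so the output is: B, a, b.
```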

$> LANG=ru_RU.UTF-8 sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванова Алла;маляр
Иванов Андрей;слесарь

Nothing changed.

Let's try to recode the files into a single-byte encoding:

$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF-8

Again, nothing has changed.

Nothing for it, we have to look for a solution on the Internet. There is nothing directly about Russian surnames, but there are questions about other sorting oddities. For example, here's a problem: unix sort treats '-' (dash) characters as invisible. In short, the strings "a-b", "aa", "a-c" are sorted as "aa", "a-b", "a-c".
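The dash behavior is easy to check against plain byte order. A small sketch (the strings are my own, not from the question cited above): under LC_ALL=C the '-' byte (0x2D) sorts before any letter, while UTF-8 locales demote it to a lower comparison level:

```shell
# Byte order: '-' (0x2D) < 'a' (0x61), so "a-c" comes first.
# A UTF-8 locale would instead typically give: aa, a-c, ab.
printf 'ab\naa\na-c\n' | LC_ALL=C sort
# → a-c, aa, ab
```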

The answer is standard everywhere: use the programmer's locale "C" and you will be happy. We try:

$> LANG=C sort buhg.txt
Ёлкина Элла;крановщица
Абаканов Михаил;маляр
Иванов Андрей;слесарь
Иванова Алла;адвокат

Something has changed. The Ivanovs lined up in the correct order, although Yolkina has slipped to the top. Let's go back to the original problem:

$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result

It worked without errors, as the Internet promised. And this despite Yolkina in the first line.

The problem seems to be solved, but just in case, let's try another Russian encoding - Windows CP1251:

$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF-8

The sort result, oddly enough, matches the "C" locale, and the whole example accordingly passes without errors. Downright mystical.

I don't like mysticism in programming because it usually masks errors. We'll have to take a serious look at how sort works and what LC_COLLATE affects.

At the end I will try to answer the questions:

  • why the women's surnames were sorted incorrectly;
  • why LANG=ru_RU.CP1251 turned out to be equivalent to LANG=C;
  • why sort and join have different ideas about the order of sorted strings;
  • why there are errors in all my examples;
  • and finally, how to sort strings to your liking.

Sorting in Unicode

Our first stop is Technical Report #10, titled Unicode collation algorithm, on unicode.org. The report contains a lot of technical details, so I'll take the liberty of giving a summary of the main ideas.

Collation - the "comparison" of strings - is the basis of any sorting algorithm. The algorithms themselves may differ ("bubble", "merge", "quick"), but they all use comparison of a pair of strings to determine their order.

Sorting strings in a natural language is a rather difficult problem. Even in the simplest single-byte encodings, the order of letters in the alphabet, even if only slightly different from the English Latin alphabet, will no longer coincide with the order of the numerical values by which those letters are encoded. Thus, in the German alphabet the letter Ö sits between O and P, but in the CP850 encoding it falls between ÿ and Ü.

You can try to abstract away from a specific encoding and consider "ideal" letters arranged in some order, as is done in Unicode. The encodings UTF-8, UTF-16, or the single-byte KOI8-R (if a limited subset of Unicode suffices) give different numerical representations of the letters but refer to the same elements of the base table.

It turns out that even if we build a symbol table from scratch, we cannot assign it a universal symbol order. In different national alphabets that use the same letters, the order of those letters may differ. For example, in French Æ is treated as a ligature and sorted as the string AE. In Norwegian, Æ is a separate letter located after Z. By the way, besides ligatures like Æ, there are letters written with several characters. Thus, the Czech alphabet has a letter Ch, which sits between H and I.

In addition to the differences in alphabets, there are other national traditions that affect sorting. In particular, the question arises: in what order should words consisting of uppercase and lowercase letters appear in a dictionary? Sorting can also be affected by peculiarities in the use of punctuation marks. In Spanish, an inverted question mark is placed at the beginning of an interrogative sentence (¿Do you like music?). In this case it is obvious that interrogative sentences should not be grouped in a separate cluster outside the alphabet, but how should lines with different punctuation marks be sorted?

I will not dwell on string sorting in languages very different from European ones. Note that languages written right-to-left or top-to-bottom tend to store characters in strings in reading order, and even non-alphabetic scripts have their own ways of ordering strings character by character. For example, hieroglyphs can be ordered by form (Chinese character radicals) or by pronunciation. How emoji should be ordered I honestly have no idea, but something could be thought up for them too.

Based on the features listed above, the main requirements for comparing strings based on Unicode tables were formulated:

  • string comparison does not depend on the position of characters in the code table;
  • sequences of characters that form a single character are reduced to canonical form (A + a combining ring above is the same as Å);
  • when comparing strings, a character is considered in the context of a string and, if necessary, combined with neighbors into one unit of comparison (Ch in Czech) or split into several (Æ in French);
  • all national features (alphabet, uppercase/lowercase, punctuation marks, order of types of writing) must be configured up to manual assignment of the order (emoji);
  • comparison is important not only for sorting, but also in many other places, for example, for specifying ranges of strings (expanding {a..z} in bash);
  • comparison should be fast enough.

In addition, the authors of the report formulated comparison properties that algorithm developers should not rely on:

  • the comparison algorithm should not require a separate character set for each language (Russian and Ukrainian languages ​​share most Cyrillic characters);
  • the comparison must not be based on the order of characters in Unicode tables;
  • the weight of the string should not be an attribute of the string, since the same string in different cultural contexts can have different weights;
  • string weights can change when strings are merged or split (from x < y it does not follow that xz < yz);
  • different strings that have the same weights are considered equal in terms of the sorting algorithm. The introduction of additional ordering of such strings is possible, but it may degrade performance;
  • strings with equal weights may be swapped during repeated sorts. Stability is a property of a particular sorting algorithm, not of the string comparison algorithm (see the previous point);
  • sorting rules may change over time as cultural practices are refined/changed.

It is also stipulated that the comparison algorithm knows nothing about the semantics of the processed strings. Thus, strings consisting only of digits should not be compared as numbers, and in lists of English names the article should not be ignored (The Beatles).
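The point about numeric semantics is easy to see with sort itself, which compares digits as characters unless told otherwise with -n (the sample data is mine):

```shell
# Character-by-character comparison: '1' < '2' < '9', so "10"
# ends up before "2" and "9".
printf '10\n9\n2\n' | LC_ALL=C sort      # → 10, 2, 9
printf '10\n9\n2\n' | LC_ALL=C sort -n   # → 2, 9, 10
```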

In order to meet all these requirements, a multilevel (actually four-level) tabular sorting algorithm is proposed.

First, the characters in the string are converted to canonical form and grouped into comparison units. Each comparison unit is assigned several weights corresponding to several levels of comparison. The weights of comparison units are elements of ordered sets (in this case, integers) that can be compared as greater or less. The special value IGNORED (0x0) means that the unit does not participate in comparison at the corresponding level. The string comparison can be repeated several times, using the weights of the appropriate levels. At each level, the weights of the comparison units of the two strings are compared with each other in sequence.

In different implementations of the algorithm for different national traditions, the weight values may differ, but the Unicode standard includes a basic weight table - the "Default Unicode Collation Element Table" (DUCET). I want to note that setting the variable LC_COLLATE is in effect choosing the weight table used by the string comparison function.

The DUCET weights are arranged as follows:

  • at the first level, all letters are reduced to the same case, diacritical marks are discarded, and punctuation marks (though not all) are ignored;
  • at the second level, only diacritics are taken into account;
  • the third level is case only;
  • at the fourth level, only punctuation marks are taken into account.

The comparison takes place in several passes: first the first-level weights are compared; if they match, the strings are compared again with the second-level weights; then perhaps a third and a fourth time.

The comparison ends as soon as the strings contain matching comparison units with different weights. Strings whose weights are equal at all four levels are considered equal to each other.
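A multilevel comparison can be crudely emulated with sort keys: the primary key is the case-folded string and the secondary key is the raw string, so case only breaks ties. A sketch with ASCII data of my own (real UCA weights would of course differ):

```shell
# Level 1: case-folded string; level 2: the raw string.
# "a"/"A" share a primary weight and are ordered only at level 2.
printf 'B\na\nA\nb\n' \
  | awk '{ print tolower($0) "\t" $0 }' \
  | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2 \
  | cut -f2
# → A, a, B, b  (byte order puts capitals first at the tie-break)
```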

This algorithm (with a bunch of additional technical details) gave report #10 its name - the Unicode Collation Algorithm (UCA).

This is where the sorting behavior from our example becomes a bit clearer. It would be nice to compare it with the Unicode standard.

For testing implementations of the UCA there is a special test that uses a weights file implementing DUCET. In the weights file you can find various amusements. For example, there is the order of mahjong tiles and European dominoes, as well as the order of suits in a deck of cards (code point 1F000 and onward). The suits are placed according to the rules of bridge - spades, hearts, diamonds, clubs - and the cards within a suit in the sequence A, 2, 3 ... K.

Manual verification of the correct sorting of strings in accordance with DUCET would be quite tedious, but fortunately for us, there is a reference library implementation for working with Unicode - "International Components for Unicode" (ICU).

On the site of this library, developed at IBM, there are demo pages, including a string collation page. We enter our test strings with the default settings and, lo and behold, we get perfect Russian sorting.

Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;адвокат

By the way, on the ICU website you can find a clarification of how the comparison algorithm handles punctuation marks. In the Collation FAQ examples, the apostrophe and hyphen are ignored.

Unicode has helped us, but to find the causes of sort's strange behavior on Linux we will have to look elsewhere.

Sorting in glibc

A quick look at the source code of the sort utility from GNU Core Utilities showed that in the utility itself localization comes down to printing the current value of LC_COLLATE when run in debug mode:

$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

String comparison is performed by the standard function strcoll, which means that everything interesting is in the glibc library.

The glibc project wiki devotes one paragraph to string comparison. From this paragraph one can gather that glibc sorting is based on the already familiar UCA (the Unicode collation algorithm) and/or on the closely related standard ISO 14651 (International string ordering and comparison). Regarding the latter standard, it should be noted that on the site standards.iso.org ISO 14651 is officially declared publicly available, but the corresponding link leads to a non-existent page. Google turns up several pages with links to official sites offering to sell an electronic copy of the standard for a hundred euros, but on the third or fourth page of the search results there are also direct links to the PDF. In general, the standard differs little from the UCA, but it is more boring to read because it does not contain vivid examples of national peculiarities of string sorting.

The most interesting thing on the wiki was a link to the bug tracker with a discussion of the string comparison implementation in glibc. From the discussion one can see that glibc uses the ISO Common Template Table (CTT), whose location is given in Annex A of the ISO 14651 standard. Between 2000 and 2015 this table in glibc had no maintainer and differed quite noticeably (at least outwardly) from the current version of the standard. From 2015 to 2018 an adaptation to the new version of the table took place, and at the moment you have a chance to meet in real life both the new version of the table (CentOS 8) and the old one (CentOS 7).

Now that we have all the information about the algorithm and auxiliary tables, we can return to the original problem and understand how to properly sort strings in the Russian locale.

ISO 14651 / 14652

On most Linux distributions, the source code of the CTT table we are interested in is in the directory /usr/share/i18n/locales/. The table itself is in the file iso14651_t1_common. This file is then included via the copy directive into the file iso14651_t1, which in turn is included in the national files, including en_US and ru_RU. On most Linux distributions all the source files are included in the base installation, but if they are not, you will have to install an additional package from the distribution.

The structure of the iso14651_t1 file may seem terribly verbose, with non-obvious naming rules, but if you dig in, everything is quite simple. The structure is described in the ISO 14652 standard, a copy of which can be downloaded from open-std.org. Another description of the file format can be found in the POSIX specifications from the Open Group. As an alternative to reading the standard, you can study the source code of the function collate_read in glibc/locale/programs/ld-collate.c.

The file structure looks like this:

By default the backslash is the escape character, and the rest of a line after the # character is a comment. Both characters can be redefined, which is done in the new version of the table:

escape_char /
comment_char %

The file contains tokens of the form <Uxxxx> or <Uxxxxxxxx> (where x is a hexadecimal digit). This is the hexadecimal representation of Unicode code points in the UCS-4 (UTF-32) encoding. All other elements in angle brackets (including <2> and the like) are considered simple string constants that do not mean much out of context.

The line LC_COLLATE tells us that the data describing string comparison begins next.

First, names are defined for the weights in the comparison table and names for character combinations. Generally speaking, the two kinds of names belong to two different entities, but they are mixed in the actual file. Weight names are introduced by the keyword collating-symbol ("comparison character"), because Unicode characters that share the same weights are treated as equivalent characters when compared.

The total length of this section in the current revision of the file is about 900 lines. I pulled examples from several places to show the arbitrary-looking names and the several kinds of syntax.

LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

  • collating-symbol <OSMANYA> registers the string OSMANYA in the table of weight names;
  • collating-symbol <S1D000>..<S1D35F> registers a sequence of names consisting of the prefix S and a hexadecimal numeric suffix from 1D000 to 1D35F;
  • FFFF in collating-symbol <SFFFF> looks like a large unsigned integer in hexadecimal, but it is just a name and could look quite different;
  • a name of the form <Uxxxxxxxx> denotes a code point in the UCS-4 encoding;
  • collating-element ... from "..." registers a new name for a pair of Unicode code points.

Once the weight names are defined, the actual weights are listed. Since only greater/less relations matter in comparison, the weights are determined simply by the order in which the names are enumerated. The "lighter" weights are listed first, then the "heavier" ones. Recall that each Unicode character is assigned four different weights; here they are combined into a single ordered sequence. In theory, any symbolic name can be used at any of the four levels, but the comments show that the developers mentally separate the names into levels.

% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE

Finally, the actual weight table.

The weight section is enclosed between lines with the keywords order_start and order_end. Options to order_start determine the direction in which strings are scanned at each comparison level; the default is forward. The body of the section consists of lines containing a character code and its four weights. A character code can be represented by the character itself, a code point, or a symbolic name defined earlier. Weights can likewise be given as symbolic names, code points, or the characters themselves. If code points or characters are used, their weight equals the numerical value of the code point (the position in the Unicode table). Characters not explicitly listed in the table are (as I understand it) assigned a primary weight matching their position in the Unicode table. The special weight value IGNORE means that the character is ignored at the corresponding comparison level.

To demonstrate the structure of the weights, I chose three rather obvious fragments:

  • characters that are completely ignored
  • characters equivalent to the number three in the first two levels
  • the beginning of the Cyrillic alphabet, which contains no diacritics and is therefore sorted mainly by the first and third levels.

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end

Now we can return to sorting the examples from the beginning of the article. The trap lies in this part of the weight table:

<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

It can be seen that in this table the punctuation marks from the ASCII table (including the space) are almost always ignored when comparing strings. The only exceptions are strings that match in everything except the punctuation marks occurring in matching positions. The lines from my example (after sorting) look like this to the comparison algorithm:

АбакановМихаилмаляр
ЁлкинаЭллакрановщица
ИвановаАлламаляр
ИвановАндрейслесарь

Given that in the weight table the Russian capital letters come after the lowercase ones (at the third level <MIN> is lighter than <CAP>), the sorting looks absolutely correct.
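The same trap can be reproduced with Latin data by emulating the primary level by hand: fold case, delete spaces and punctuation, and compare what remains byte by byte (the names below are hypothetical stand-ins for the Ivanovs):

```shell
# Primary-level emulation: after folding case and stripping " " and
# ";", the keys are "smithaannpainter" and "smithboblocksmith", and
# 'a' < 'b' puts "Smitha Ann" first - female before male again.
printf 'Smith Bob;locksmith\nSmitha Ann;painter\n' \
  | awk '{ s = tolower($0); gsub(/[ ;]/, "", s); print s "\t" $0 }' \
  | LC_ALL=C sort -t "$(printf '\t')" -k1,1 \
  | cut -f2
# → Smitha Ann;painter
#   Smith Bob;locksmith
```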

When the variable LC_COLLATE=C is set, a special table is loaded that specifies byte-by-byte comparison:

static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};

Since the Unicode code point of Ё (U+0401) comes before А (U+0410), the strings are sorted accordingly.

Text and binary tables

Obviously, string comparison is an extremely common operation, and parsing the CTT table is a rather expensive procedure. To optimize access to the table, it is compiled into binary form with the localedef command.

The localedef command takes as parameters a file with the table of national peculiarities (option -i), in which all characters are represented by Unicode code points, and a file mapping Unicode code points to the characters of a specific encoding (option -f). As a result, binary files are created for the locale with the name given in the last parameter.

glibc supports two binary file formats: "traditional" and "modern".

The traditional format implies that the locale name is the name of a subdirectory in /usr/lib/locale/. This subdirectory contains the binary files LC_COLLATE, LC_CTYPE, LC_TIME, and so on. The file LC_IDENTIFICATION contains the formal locale name (which may differ from the directory name) and comments.

The modern format stores all locales in a single archive /usr/lib/locale/locale-archive, which is mapped into the virtual memory of every process using glibc. The locale name in the modern format undergoes some canonicalization - only digits and letters reduced to lowercase remain in the encoding name. So ru_RU.KOI8-R will be saved as ru_RU.koi8r.
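The canonical form of the encoding name (keep only alphanumerics, lowercase everything) is easy to reproduce with tr:

```shell
# Keep letters and digits (and the newline), then lowercase:
# "KOI8-R" becomes "koi8r", just as it is stored in the archive.
echo 'KOI8-R' | tr -cd '[:alnum:]\n' | tr '[:upper:]' '[:lower:]'
# → koi8r
```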

Input files are searched for in the current directory, as well as in the directories /usr/share/i18n/locales/ and /usr/share/i18n/charmaps/ for CTT files and encoding files, respectively.

For example, the command

localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

will compile the file /usr/share/i18n/locales/ru_RU using the encoding file /usr/share/i18n/charmaps/MAC-CYRILLIC.gz and store the result in /usr/lib/locale/locale-archive under the name ru_RU.maccyrillic.

If you set the variable LANG=en_US.UTF-8, then glibc will look for the binary locale files in the following sequence of files and directories:

/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/

If a locale occurs in both the traditional and modern formats, then the modern one takes precedence.

You can view the list of compiled locales with the command locale -a.

Preparing your comparison table

Now, armed with knowledge, we can create our own ideal string comparison table. This table should correctly compare Russian letters, including the letter Ё, and at the same time take punctuation marks into account in accordance with the ASCII table.

The process of preparing your own sort table consists of two stages: editing the weight table and compiling it into binary form with the localedef command.

So that the comparison table can be adjusted with minimal editing, the ISO 14652 format provides sections for adjusting the weights of an existing table. A section starts with the keyword reorder-after and an indication of the position after which the replacement is performed. A section ends with the line reorder-end. If several parts of the table need to be corrected, a section is created for each such part.

I copied the new versions of the files iso14651_t1_common and ru_RU from the glibc repository into my home directory ~/.local/share/i18n/locales/ and slightly edited the LC_COLLATE section in ru_RU. The new versions of the files are fully compatible with my version of glibc. If you want to use older versions of the files, you will have to change the symbolic names and the place where the replacement starts in the table.

LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE

Strictly speaking, it would also be necessary to change the fields in LC_IDENTIFICATION so that they point to the ru_MY locale, but in my example this was not required, since I excluded the locale-archive from the locale search.

So that localedef works with the files in my folder, the I18NPATH variable can specify an additional directory to search for input files, and the directory for saving the binaries can be given as a path with slashes:

$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX says that in LANG you can write absolute paths to directories with locale files, starting with a slash, but glibc on Linux counts all paths from a base directory, which can be overridden via the LOCPATH variable. After setting LOCPATH=~/.local/lib/locale/, all localization-related files will be searched for only in my folder. The locale archive is ignored when LOCPATH is set.

Here is the final test:

$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
Абаканов Михаил;маляр
Ёлкина Элла;крановщица
Иванов Андрей;слесарь
Иванова Алла;адвокат

Hooray! We did it!

Some errors

I have already answered the questions about string sorting posed at the beginning, but there remain a couple of questions about errors - visible and invisible.

Let's return to the original problem.

Both the sort program and the join program use the same string comparison functions from glibc. How did it happen that join reported a sort error on lines sorted by sort in the en_US.UTF-8 locale? The answer is simple: sort compares the entire line, while join compares only the key, which by default is the beginning of the line up to the first whitespace character. In my example this led to an error message, because the sort order of the first words in the lines did not match the sort order of the complete lines.

The "C" locale guarantees that in sorted lines the initial substrings up to the first space are also sorted, but this only masks the error. It is possible to pick data (people with the same surname but different given names) that, without any error message, would produce an incorrect file merge. If we want join to match file lines by full name, the correct way is to specify the field separator explicitly and sort by the key field, not by the entire line. In that case the merge will work correctly in any locale and there will be no errors:

$> sort -t ';' -k 1 buhg.txt > buhg.srt
$> sort -t ';' -k 1 mail.txt > mail.srt
$> join -t ';' buhg.srt mail.srt > result

The successfully executed CP1251 example contains another error. The fact is that in all Linux distributions known to me, the packages lack a compiled ru_RU.CP1251 locale. If the compiled locale is not found, sort silently falls back to byte-by-byte comparison, which is what we observed.

By the way, there is another small glitch associated with inaccessible compiled locales. The command LOCPATH=/tmp locale -a will return a list of all locales in locale-archive, but with this LOCPATH set, those locales will be unavailable to all programs (including locale itself):

$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison

Conclusion

If you are a programmer used to thinking of strings as collections of bytes, then your choice is LC_COLLATE=C.

If you are a linguist or dictionary compiler, then you'd better compile your locale.

If you are an ordinary user, then you just need to get used to the fact that the command ls -a prints files starting with a dot interspersed with files starting with a letter, while Midnight Commander, which uses its own internal functions to sort names, moves files starting with a dot to the beginning of the list.
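The dot-file behavior is again byte order versus collation: under LC_ALL=C the dot (0x2E) sorts before all letters (the file names below are my own):

```shell
# '.' (0x2E) < 'M' (0x4D) < 'b' (0x62): dot-files group first.
# A UTF-8 locale would typically interleave them instead:
# .bashrc, bin, Makefile, .profile.
printf '.profile\nbin\n.bashrc\nMakefile\n' | LC_ALL=C sort
# → .bashrc, .profile, Makefile, bin
```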

References

Report #10 Unicode collation algorithm

Character weights at unicode.org

ICU - IBM's library implementation for working with Unicode

Sorting test with ICU

Character weights in ISO 14651

Description of the weights file format in ISO 14652

String comparison discussion in glibc

Source: habr.com
