Bug 28414

Summary: Spanish locale at glibc seems to be bad
Product: [Retired] Red Hat Linux
Reporter: Carlos Perells Marmn <carlos>
Component: glibc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED NOTABUG
QA Contact: Aaron Brown <abrown>
Severity: high
Priority: medium
Version: 7.0
CC: fweimer
Hardware: i386   
OS: Linux   
Last Closed: 2001-02-20 13:18:49 UTC

Description Carlos Perells Marmn 2001-02-20 12:01:44 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.76 [es] (X11; U; Linux 2.4.0-0.99.11 i686)


I have a problem with the "sort" command: it does not sort correctly with the
Spanish locale. I know that the sort order differs between LANG=C and
LANG=es_ES, but that is not the problem. Here is a mail I sent to
Derek Tattersall <dlt> about this problem:

<mail>
Hello, I have some questions about bugs 21913, 20975 and other
related ones...


As you can see at http://czyborra.com/charsets/iso8859.html#ISO-8859-1,
the '-' symbol must sort before '0' (zero). But when I sort the file
"data" with the locale set to es_ES@euro and the Red Hat 7.1 beta sort,
I get the file "data.es.rh71". When I use the Red Hat 6.2 sort on the
same Red Hat 7.1 machine (I copied the binary from rh 6.2 to rh 7.1), I
get the file "data.es.rh62", which in my opinion is the correct answer.

When I sort the same file "data" with the locale changed to C, I get a
good sort, the same as the Red Hat 6.2 one.

The explanation you gave in Bugzilla is fine, but as you can see it is
not sorting as ISO-8859-1 (I know that es_ES@euro is not ISO-8859-1 but
ISO-8859-15, but the only difference is the "EURO" character, isn't it?).

Any idea?

Thanks in advance.
</mail>
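
On the question in the mail about ISO-8859-1 versus ISO-8859-15: the following is a minimal sketch, assuming Python's standard `iso8859-1` and `iso8859-15` codecs, that lists the byte values the two charsets decode differently. It suggests the euro sign is not the only difference (eight positions differ), but also that '-' and '0' are identical in both, so the choice of charset by itself would not explain the reported order.

```python
# Byte values that ISO-8859-1 and ISO-8859-15 decode to different characters.
differing = [
    b for b in range(256)
    if bytes([b]).decode("iso8859-1") != bytes([b]).decode("iso8859-15")
]

for b in differing:
    old = bytes([b]).decode("iso8859-1")
    new = bytes([b]).decode("iso8859-15")
    print(f"0x{b:02X}: {old!r} -> {new!r}")

# '-' (0x2D) and '0' (0x30) are plain ASCII and decode the same in both.
same_ascii = all(
    bytes([b]).decode("iso8859-1") == bytes([b]).decode("iso8859-15")
    for b in (0x2D, 0x30)
)
print(len(differing), same_ascii)
```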

Here are the attached files:

<data>
0-------------------------------------------------------------0.COM.
0-------------------------------------------------------------1.COM.
0-------------------------------------------------------------2.COM.
0-------------------------------------------------------------3.COM.
0-------------------------------------------------------------4.COM.
0------------------------------------------------------------0.COM.
0-----------------------------------------------------------0.COM.
0----------------------------------------------------------0.COM.
0---------------------------------------------------------0.COM.
0--------------------------------------------------------0.COM.
0--------------------------0.COM.
0-------------------------0.COM.
0------------------------0.COM.
0-----------------------0.COM.
0----------------------0.COM.
0---------------------0.COM.

;
000000000000000000000000000000000000000000000000000000000000000.COM.
000000000000000000000000000000000000000000000000000000000000001.COM.
00000000000000000000000000000000000000000000000000000000000000.COM.
0000000000000000000000000000000000000000000000000000000000000.COM.
0000000000000000000000.COM.
0000000000000000000002.COM.
000000000000000000000.COM.
000000000000000000001A.COM.
00000000000000000000.COM.
0000000000000000000.COM.
000000000000000000.COM.
00000000000000000.COM.
0000000000000000.COM.
000000000000000.COM.
00000000000000.COM.
0000000000000.COM.
000000000000.COM.
00000000000.COM.
</data>

<data.es.rh62>
0-------------------------------------------------------------0.COM.
0-------------------------------------------------------------1.COM.
0-------------------------------------------------------------2.COM.
0-------------------------------------------------------------3.COM.
0-------------------------------------------------------------4.COM.
0------------------------------------------------------------0.COM.
0-----------------------------------------------------------0.COM.
0----------------------------------------------------------0.COM.
0---------------------------------------------------------0.COM.
0--------------------------------------------------------0.COM.
0--------------------------0.COM.
0-------------------------0.COM.
0------------------------0.COM.
0-----------------------0.COM.
0----------------------0.COM.
0---------------------0.COM.
00000000000.COM.
000000000000.COM.
0000000000000.COM.
00000000000000.COM.
000000000000000.COM.
0000000000000000.COM.
00000000000000000.COM.
000000000000000000.COM.
0000000000000000000.COM.
00000000000000000000.COM.
000000000000000000000.COM.
0000000000000000000000.COM.
0000000000000000000000000000000000000000000000000000000000000.COM.
00000000000000000000000000000000000000000000000000000000000000.COM.
000000000000000000000000000000000000000000000000000000000000000.COM.
000000000000000000000000000000000000000000000000000000000000001.COM.
0000000000000000000002.COM.
000000000000000000001A.COM.
;
</data.es.rh62>

<data.es.rh71>
;
000000000000000000000000000000000000000000000000000000000000000.COM.
000000000000000000000000000000000000000000000000000000000000001.COM.
00000000000000000000000000000000000000000000000000000000000000.COM.
0000000000000000000000000000000000000000000000000000000000000.COM.
0000000000000000000000.COM.
0000000000000000000002.COM.
000000000000000000000.COM.
000000000000000000001A.COM.
00000000000000000000.COM.
0000000000000000000.COM.
000000000000000000.COM.
00000000000000000.COM.
0000000000000000.COM.
000000000000000.COM.
00000000000000.COM.
0000000000000.COM.
000000000000.COM.
00000000000.COM.
0-------------------------0.COM.
0------------------------0.COM.
0-----------------------0.COM.
0----------------------0.COM.
0---------------------0.COM.
0--------------------------0.COM.
0---------------------------------------------------------0.COM.
0--------------------------------------------------------0.COM.
0-------------------------------------------------------------0.COM.
0------------------------------------------------------------0.COM.
0-----------------------------------------------------------0.COM.
0----------------------------------------------------------0.COM.
0-------------------------------------------------------------1.COM.
0-------------------------------------------------------------2.COM.
0-------------------------------------------------------------3.COM.
0-------------------------------------------------------------4.COM.
</data.es.rh71>

Here is his answer:

<answer>
I have talked with our local expert on the i18n problems, and he suggests that
your problem is a problem with glibc.  That is, the tables are wrong for
Spanish.  I suggest that you file a bug with as much detail as you can provide
against glibc.

</answer>

So, here is the bug report. It happens with all the versions of glibc
that you have released, from Red Hat 7.0 to the current Red Hat 7.1 beta.

Reproducible: Always
Steps to Reproduce:
You only need to sort a file with the locale set to es_ES.

As you can see, this is a BIG bug. At work we sort a lot of big files
(3 GB or more) and we need these files to be sorted correctly, so we have
gone back to the Red Hat 6.2 sort, which sorts as if LANG=C were set.

Comment 1 Jakub Jelinek 2001-02-20 13:00:40 UTC
No, glibc (and sort) is correct on this.
LC_COLLATE=es_ES@euro sort sorts the way entries are ordered in a Spanish
dictionary, not by ISO-8859-15 character values.
Use
LC_ALL=C sort
if you want ASCII sorting.
The data you posted above should sort the same way under e.g. en_US as it
does under es_ES. Just imagine that -, . and ; were replaced with nothing and
the result were then sorted: you would get the same order as the es_ES
collating sort.
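
The "replace -, . and ; with nothing" rule of thumb can be sketched in a few lines. This is only an approximation under stated assumptions: the input lines are shortened, hypothetical stand-ins for the "data" attachment, plain Python string comparison stands in for `LC_ALL=C sort`, and the helper `first_level_key` is made up for illustration. It models the first collation level only, not the actual glibc tables.

```python
import re

# Shortened, hypothetical stand-ins for the lines in the "data" attachment.
lines = ["0---0.COM.", "0---1.COM.", ";", "000.COM.", "001.COM.", "0000.COM."]

def first_level_key(line):
    """Drop '-', '.' and ';' so only digits and letters take part."""
    return re.sub(r"[-.;]", "", line)

# Character-value order, roughly what LC_ALL=C sort gives for ASCII input.
c_order = sorted(lines)

# Approximation of the first collation level only; a real locale breaks
# ties between equal keys with further weight levels not modelled here.
approx_locale_order = sorted(lines, key=first_level_key)

print(c_order)
print(approx_locale_order)
```

With this input the first list puts the dashed names first and ";" last, as in data.es.rh62, while the second puts ";" first and the dashed names last, as in data.es.rh71.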

Comment 2 Carlos Perells Marmn 2001-02-20 13:18:45 UTC
Please, if you read the link to the table that I sent you
(http://czyborra.com/charsets/iso8859.html#ISO-8859-1) you can see that the "-"
character must go BEFORE the "0" (zero) character, so the result of the sort
command is wrong (according to that table). I know that the sort I get with
LC_ALL=en_US is not the same as the LC_ALL=es_ES one, but I also know that the
LC_ALL=es_ES sort result is not the correct one for es_ES (or es_ES@euro).

Please, have you read Derek Tattersall's answer?

Thanks.

Comment 3 Jakub Jelinek 2001-02-20 13:36:55 UTC
Sorry, but the table has nothing to do with this issue. From the table you can
derive that strcmp("-B", "0A") < 0, but that does not imply
strcoll("-B", "0A") < 0 in most locales (including Spanish). i18n sorting is
not about comparing character values; it is a complex set of rules. Most West
European locales use ISO/IEC TR 14652 for this. Please look into a printed
Spanish dictionary and you will find that e.g. a hyphen is not treated as a
separate letter when the letter characters differ, as in this order:
aa
a-b
ac
I don't know whom Derek talked to about this, but the point is: if you
want to sort the way Unix sorted from the 70s until several years ago,
you can use LC_ALL=C sort; if you want to sort the way people have sorted
things for centuries, use your own locale.
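
The strcmp/strcoll distinction can be tried out with Python's `locale` module, which wraps the C library's collation. This is a hedged sketch: `codepoint_cmp` and `collate_cmp` are illustrative helpers, `es_ES.UTF-8` is an assumed locale name that may not be installed (the helper then returns None), and the hyphen-stripping key at the end only imitates the dictionary order from the comment above rather than using real locale tables.

```python
import locale

def codepoint_cmp(a, b):
    """Sign of a code point comparison, like C strcmp for ASCII input."""
    return (a > b) - (a < b)

def collate_cmp(a, b, locale_name):
    """Sign of locale.strcoll under locale_name, or None if it cannot be set."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        result = locale.strcoll(a, b)
        return (result > 0) - (result < 0)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)

print(codepoint_cmp("-B", "0A"))
print(collate_cmp("-B", "0A", "C"))
# Assumed locale name; the result depends on the installed locale data.
print(collate_cmp("-B", "0A", "es_ES.UTF-8"))

words = ["a-b", "ac", "aa"]
by_value = sorted(words)
# Imitation of dictionary order: the hyphen is ignored on the first pass.
dictionary_like = sorted(words, key=lambda w: w.replace("-", ""))
print(by_value)
print(dictionary_like)
```

The first two comparisons are negative because "-" has a lower character value than "0". What the third prints depends on the locale data on the machine, which is exactly the point of the comment above.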