Bug 750173

Summary: sort in F15 behaves differently than sort in CentOS and Debian for de_DE.UTF-8 locale
Product: Fedora
Reporter: Till Maas <opensource>
Component: glibc
Assignee: Jeff Law <law>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: rawhide
CC: fweimer, jakub, kdudka, law, maxamillion, opensource, ovasik, pfrankli, schwab, twaugh
Last Closed: 2012-07-18 22:15:16 UTC

Description Till Maas 2011-10-31 09:11:30 UTC
Description of problem:
I would like to use a sort command with the de_DE.UTF-8 locale that produces the same output on all distributions, but this seems to be impossible. I am not sure which sort implementation is wrong. Please tell me if Fedora's is correct.

Version-Release number of selected component (if applicable):
coreutils-8.10-fc15

How reproducible:
always

Steps to Reproduce:
1. cat test
a-b/!
ab
abc
2. cat test2
a-b/
Ab
Abc
2b. cat test3
Abc
Abcd
a-bc/!

3. on Fedora 15:
$ LC_ALL=C sort test
a-b/!
ab
abc
$ LC_ALL=de_DE.UTF-8 sort test
ab
a-b/!
abc

$ LC_ALL=C sort test2
Ab
Abc
a-b

$ LC_ALL=de_DE.UTF-8 sort test2
Ab
a-b
Abc

$ LC_ALL=de_DE.UTF-8 sort test3
Abc
a-bc/!
Abcd

4. on CentOS 5:
$ LC_ALL=C sort test -> same as Fedora 15
$ LC_ALL=de_DE.UTF-8 sort test -> same as Fedora 15
$ LC_ALL=C sort test2 -> same as Fedora 15
$ LC_ALL=de_DE.UTF-8 sort test2
a-b
Ab
Abc

$ LC_ALL=de_DE.UTF-8 sort test3
a-bc/!
Abc
Abcd

5. on Debian 6.0.3:
$ LC_ALL=C sort test -> same as Fedora 15

$ LC_ALL=de_DE.UTF-8 sort test3
Abc
Abcd
a-bc/!
 
Actual results:
sort behaves differently on different systems

Expected results:
sort behaves the same
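
The comparison in the steps above can be condensed into a small script. This is only a sketch: `sort_under` and `test3_lines` are hypothetical helper names, and the locale list is an assumption that should be adjusted to whatever `locale -a` reports on the system under test. If a requested locale is not installed, sort typically falls back to C-locale byte order without reporting an error, so it is worth checking `locale -a` before comparing results.

```shell
#!/bin/sh
# Hypothetical helper: sort stdin under the locale named in $1.
sort_under() {
    LC_ALL="$1" sort
}

# Hypothetical helper: emit the contents of the report's test3 file.
test3_lines() {
    printf '%s\n' 'Abc' 'Abcd' 'a-bc/!'
}

# The locale names below are assumptions; adjust them to installed locales.
for loc in C de_DE.UTF-8 de_DE; do
    printf '== %s ==\n' "$loc"
    test3_lines | sort_under "$loc"
done
```

Running the same script on each distribution and diffing the output shows exactly which locales disagree.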

Comment 1 Ondrej Vasik 2011-10-31 09:49:10 UTC
Well, that would be hard, as multibyte support in sort varies between Linux distributions. In Fedora it is added by coreutils-i18n.patch. As there is no upstream for this patch, it may vary (and does vary) between distributions.

Sorting depends on the LC_COLLATE and LC_NUMERIC settings from glibc - which may differ on different systems as well.

I would say this is not a bug, and my only recommendation here is to use the C locale, where the output is predictable and more consistent between systems.
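
A minimal sketch of that recommendation, using the lines of the reporter's "test" file fed in out of order. The variable name `sorted` is only for illustration.

```shell
# Lines from the report's "test" file, deliberately given out of order.
sorted=$(printf '%s\n' 'abc' 'ab' 'a-b/!' | LC_ALL=C sort)
printf '%s\n' "$sorted"
# In the C locale lines are compared byte by byte, and "-" (0x2D)
# sorts before "b" (0x62), so this prints:
# a-b/!
# ab
# abc
```

Setting LC_ALL for the single command, rather than exporting it, leaves the rest of the session in the user's normal locale.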

Comment 2 Till Maas 2011-10-31 16:31:39 UTC
Thank you for the fast reply.
IMHO there can be only one order that is correct for the shown lists. Also no multibyte characters are included, therefore the sort order for de_DE.UTF-8 should match the order for de_DE.* locales on Fedora, which is also not the case. And afaics coreutils is still developed by upstream, why won't they accept the patch?

Comment 3 Ondrej Vasik 2011-10-31 18:04:21 UTC
It doesn't matter; locales affect the sorting order - LC_COLLATE and LC_NUMERIC affect how sort behaves. Additionally, the multibyte patch is quite "stupid" - it sorts everything via the multibyte path in multibyte locales (and the multibyte path is 2-20+ times slower in the case of sort). I really recommend using LC_ALL=C for consistent results.

As for the second part - yes, coreutils upstream is active, but the multibyte patch has the wrong design; it has to be rewritten from scratch to be accepted by upstream (too much duplicate code, too big a performance impact, almost no test coverage - in fact, activating only one 'cut' test for multibyte discovered two bugs in the patch). It's far from being acceptable for upstream, but I have to keep it in Fedora for legacy reasons.

Comment 4 Ondrej Vasik 2012-07-13 13:51:44 UTC
Cleanup - this is caused by the locale-specific collation order from glibc, so I am moving it there. There is nothing I can do about it in coreutils. Still, likely notabug.

Comment 5 Jeff Law 2012-07-18 22:15:00 UTC
As far as I know, the F15 collation order is the most correct.

CentOS 5 is probably using the slightly out-of-date bits from RHEL 5. DIACRIT_FORWARD is one of the changes that are probably missing from the glibc of that era.

Can't speak for why Debian differs....