Bug 750173 - sort in F15 behaves different than sort in CentOS and debian for de_DE.UTF-8 locale
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Jeff Law
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-10-31 09:11 UTC by Till Maas
Modified: 2016-11-24 16:04 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-07-18 22:15:16 UTC
Type: ---



Description Till Maas 2011-10-31 09:11:30 UTC
Description of problem:
I would like to use a sort command with the de_DE.UTF-8 locale that creates the same output on all distributions, but it seems to be impossible. I am not sure which sort implementation is wrong. Please tell me if Fedora's is correct.

Version-Release number of selected component (if applicable):
coreutils-8.10-fc15

How reproducible:
always

Steps to Reproduce:
1. cat test
a-b/!
ab
abc
2. cat test2
a-b/
Ab
Abc
2b. cat test3
Abc
Abcd
a-bc/!

3. on Fedora 15:
$ LC_ALL=C sort test
a-b/!
ab
abc
$ LC_ALL=de_DE.UTF-8 sort test
ab
a-b/!
abc

$ LC_ALL=C sort test2
Ab
Abc
a-b

$ LC_ALL=de_DE.UTF-8 sort test2
Ab
a-b
Abc

$ LC_ALL=de_DE.UTF-8 sort test3
Abc
a-bc/!
Abcd

4. on CentOS 5:
$ LC_ALL=C sort test -> same as Fedora 15
$ LC_ALL=de_DE.UTF-8 sort test -> same as Fedora 15
$ LC_ALL=C sort test2 -> same as Fedora 15
$ LC_ALL=de_DE.UTF-8 sort test2
a-b
Ab
Abc

$ LC_ALL=de_DE.UTF-8 sort test3
a-bc/!
Abc
Abcd

5. on Debian 6.0.3:
$ LC_ALL=C sort test -> same as Fedora 15

$ LC_ALL=de_DE.UTF-8 sort test3
Abc
Abcd
a-bc/!
 
Actual results:
sort behaves differently on different systems

Expected results:
sort behaves the same

Comment 1 Ondrej Vasik 2011-10-31 09:49:10 UTC
Well, that would be hard: multibyte support in sort varies between Linux distributions. In Fedora it is added by coreutils-i18n.patch. As there is no upstream for this patch, it may vary (and does vary) between distributions.

Sorting depends on the LC_COLLATE and LC_NUMERIC settings from glibc - which may differ on different systems as well.

I would say this is not a bug, and my only recommendation here is to use the C locale, where the output is predictable and more consistent between systems.
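
A minimal sketch of that recommendation, assuming a POSIX shell and `mktemp -d` being available. The file contents mirror the reporter's "test" file; the variable names `workdir` and `c_sorted` are only illustrative.

```shell
#!/bin/sh
# Recreate the reporter's "test" file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'ab' 'abc' 'a-b/!' > "$workdir/test"

# LC_ALL=C applies to this one sort invocation only. It overrides LANG
# and every LC_* variable, so lines are compared byte by byte.
c_sorted=$(LC_ALL=C sort "$workdir/test")
printf '%s\n' "$c_sorted"

rm -rf "$workdir"
```

Because the comparison is on raw byte values, '-' (0x2d) sorts before 'b' (0x62), which gives the `a-b/!`, `ab`, `abc` order shown for `LC_ALL=C` in the description.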

Comment 2 Till Maas 2011-10-31 16:31:39 UTC
Thank you for the fast reply.
IMHO there can be only one order that is correct for the shown lists. Also, no multibyte characters are included, so the sort order for de_DE.UTF-8 should match the order for the de_DE.* locales on Fedora, which is also not the case. And afaics coreutils is still developed upstream, so why won't they accept the patch?
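
The claim that the input holds no multibyte characters can be checked with a small sketch like the following. It assumes a POSIX shell with `tr`, `wc` and `mktemp`; the helper name `count_non_ascii_bytes` is hypothetical, and the sample data mirrors the "test3" file.

```shell
#!/bin/sh
# Hypothetical helper: count the bytes in a file that fall outside
# 7-bit ASCII. tr deletes every byte in the octal range 000-177, and
# wc -c counts whatever is left.
count_non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c
}

sample=$(mktemp)
printf '%s\n' 'Abc' 'Abcd' 'a-bc/!' > "$sample"
non_ascii=$(count_non_ascii_bytes "$sample")
echo "non-ASCII bytes: $non_ascii"
rm -f "$sample"
```

A count of zero only confirms that the file is plain ASCII. It says nothing about which code path sort takes, since that depends on the active locale rather than on the data.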

Comment 3 Ondrej Vasik 2011-10-31 18:04:21 UTC
It doesn't matter, locales affect the sorting order - LC_COLLATE and LC_NUMERIC affect how sort behaves. Additionally, the multibyte patch is quite "stupid" - it sorts everything via the multibyte path with multibyte locales (and the multibyte path is 2-20+ times slower in the case of sort). I really recommend using LC_ALL=C for consistent results.
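
Since collation is controlled by LC_COLLATE, a narrower variant is to pin only that category and leave the rest of the locale alone. This is a sketch assuming a POSIX shell; the function name `sort_bytewise` is hypothetical, and the input mirrors the "test3" file.

```shell
#!/bin/sh
# Hypothetical wrapper: force byte-order collation without touching the
# other locale categories (messages, character classes, ...).
# LC_ALL, when set, takes precedence over LC_COLLATE, so it is unset
# inside the subshell first. The caller's environment is not changed.
sort_bytewise() {
    ( unset LC_ALL; LC_COLLATE=C sort "$@" )
}

collate_sorted=$(printf '%s\n' 'Abc' 'a-bc/!' 'Abcd' | sort_bytewise)
printf '%s\n' "$collate_sorted"
```

In byte order the uppercase 'A' (0x41) sorts before the lowercase 'a' (0x61), so `a-bc/!` comes last. Whether this variant avoids the slower multibyte path of the Fedora patch is not something the thread establishes; `LC_ALL=C` remains the option the comment itself recommends.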

To the second part - yes, coreutils upstream is active, but the multibyte patch has the wrong design; it has to be rewritten from scratch to be accepted by upstream (too much duplicate code, too big a performance impact, almost no test coverage (in fact, activating only one 'cut' test for multibyte discovered two bugs in the patch) ...). It's far from being acceptable for upstream (but I have to keep it in Fedora for legacy reasons).

Comment 4 Ondrej Vasik 2012-07-13 13:51:44 UTC
Cleanup - this is caused by the locale-specific collation order from glibc, so I am moving it there - there is nothing I can do about it in coreutils. Still, likely notabug.

Comment 5 Jeff Law 2012-07-18 22:15:00 UTC
As far as I know, the F15 collation order is the most correct.

CentOS 5 is probably using the slightly out-of-date bits from RHEL 5. DIACRIT_FORWARD is one of the changes that are probably missing from the glibc of that era.

Can't speak for why Debian differs....

