Description of problem:
I would like to use a sort command with the de_DE.UTF-8 locale that produces the same output on all distributions, but that seems to be impossible. I am not sure which sort implementation is wrong. Please tell me if Fedora's is correct.

Version-Release number of selected component (if applicable):
coreutils-8.10-fc15

How reproducible:
always

Steps to Reproduce:
1. cat test
a-b/!
ab
abc

2. cat test2
a-b/
Ab
Abc

2b. cat test3
Abc
Abcd
a-bc/!

3. on Fedora 15:
$ LC_ALL=C sort test
a-b/!
ab
abc
$ LC_ALL=de_DE.UTF-8 sort test
ab
a-b/!
abc
$ LC_ALL=C sort test2
Ab
Abc
a-b
$ LC_ALL=de_DE.UTF-8 sort test2
Ab
a-b
Abc
$ LC_ALL=de_DE.UTF-8 sort test3
Abc
a-bc/!
Abcd

4. on CentOS 5:
$ LC_ALL=C sort test
-> same as Fedora 15
$ LC_ALL=de_DE.UTF-8 sort test
-> same as Fedora 15
$ LC_ALL=C sort test2
-> same as Fedora 15
$ LC_ALL=de_DE.UTF-8 sort test2
a-b
Ab
Abc
$ LC_ALL=de_DE.UTF-8 sort test3
a-bc/!
Abc
Abcd

5. on Debian 6.0.3:
$ LC_ALL=C sort test
-> same as Fedora 15
$ LC_ALL=de_DE.UTF-8 sort test3
Abc
Abcd
a-bc/!

Actual results:
sort behaves differently on different systems

Expected results:
sort behaves the same
Well, that would be hard: multibyte support in sort varies between Linux distributions. In Fedora it is added by coreutils-i18n.patch. As there is no upstream for this patch, it may vary (and does vary) between distributions. Sorting also depends on the LC_COLLATE and LC_NUMERIC settings from glibc, which may differ between systems as well. I would say this is not a bug, and my only recommendation here is to use the C locale, where the output is predictable and more consistent between systems.
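To see which collation and numeric locales a given shell would actually hand to sort, something like the following may help. It is only a sketch: `effective_locale` is a hypothetical helper name, and it assumes the standard `locale` utility is installed and prints one `NAME=value` line per category.

```shell
# Print the effective value of one locale category (default: LC_COLLATE).
# Assumes the POSIX `locale` utility is available on the system.
effective_locale() {
    category=${1:-LC_COLLATE}
    # `locale` prints lines such as LC_COLLATE="de_DE.UTF-8"; drop the quotes.
    locale | sed -n "s/^${category}=//p" | tr -d '"'
}

effective_locale LC_COLLATE
effective_locale LC_NUMERIC
```

Running this on each system being compared shows whether the same locale names are in effect. It does not show whether the glibc collation data behind those names is identical.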
Thank you for the fast reply. IMHO there can be only one order that is correct for the lists shown. Also, no multibyte characters are included, so the sort order for de_DE.UTF-8 should match the order for the other de_DE.* locales on Fedora, which is also not the case. And afaics coreutils is still developed upstream, so why won't they accept the patch?
It doesn't matter; locales affect the sorting order. LC_COLLATE and LC_NUMERIC affect how sort behaves. Additionally, the multibyte patch is quite "stupid": with a multibyte locale it sorts everything via the multibyte path, and that path is 2-20+ times slower in the case of sort. I really recommend using LC_ALL=C for consistent results. As for the second part: yes, coreutils upstream is active, but the multibyte patch has the wrong design and would have to be rewritten from scratch to be accepted upstream. It has too much duplicate code, too big a performance impact, and almost no test coverage (in fact, activating just one 'cut' test for multibyte discovered two bugs in the patch). It is far from being acceptable for upstream, but I have to keep it in Fedora for legacy reasons.
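As a minimal sketch of that recommendation, a small wrapper can pin the locale for the sort call only, leaving the rest of the script's environment alone. The name `sort_bytewise` is made up for illustration; the input is the reporter's first test file.

```shell
# Sort in plain byte order, independent of the system's locale data.
# LC_ALL overrides LC_COLLATE and LANG, but only for this one command.
sort_bytewise() {
    LC_ALL=C sort -- "$@"
}

# The reporter's "test" file, fed on stdin:
printf 'ab\nabc\na-b/!\n' | sort_bytewise
# prints:
# a-b/!
# ab
# abc
```

This matches the LC_ALL=C output the report shows for Fedora 15, CentOS 5 and Debian. The trade-off is that the result is byte order, not German dictionary order: uppercase letters sort before lowercase ones and punctuation is not ignored.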
Cleanup: as this is caused by the locale-specific collation order from glibc, I am moving it there. There is nothing I can do about it in coreutils. Still, likely notabug.
As far as I know, the F15 collation order is the most correct. CentOS 5 is probably using the slightly out-of-date bits from RHEL 5. DIACRIT_FORWARD is one of the changes that are probably missing from the glibc of that era. I can't say why Debian differs.