Description of problem:
When sorting on one of several fields in a locale such as en_US.utf8, "sort" compares more than the specified key.

Version-Release number of selected component (if applicable):
coreutils-8.21-20.fc20.x86_64

How reproducible:
Every time

Steps to Reproduce:
1. printf 'a b!x\na-b-c!x\n' | LANG=en_US.utf8 ltrace -e strcoll sort -s --debug -k1,1 -t!

Actual results:
sort: using ‘en_US.utf8’ sorting rules
sort->strcoll("a b!x", "a-b-c!x") = 21
a-b-c!x
_____
a b!x
___
+++ exited (status 0) +++

Expected results:
sort: using ‘en_US.utf8’ sorting rules
sort->strcoll("a b", "a-b-c") = -1
a b!x
___
a-b-c!x
_____
+++ exited (status 0) +++

(That is, strcoll should be called only on the key, not on the whole line, and the output should be in the opposite order.)

Additional info:
This is a shortened version of a bug report I filed upstream:
http://debbugs.gnu.org/cgi/bugreport.cgi?bug=18540
It appears that this bug is not present upstream, but was introduced in the Fedora packaging.
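To illustrate why passing the whole line to the collation function changes the order, here is a minimal Python sketch. It is not the coreutils code. `toy_collate` is a hypothetical stand-in for strcoll that simply ignores spaces and punctuation, which is a rough simplification of how en_US.utf8 treats them at the first comparison level. `first_field` is likewise a hypothetical helper that mimics `-k1,1 -t!`.

```python
from functools import cmp_to_key


def toy_collate(a, b):
    """Toy stand-in for strcoll (assumption): compare letters and digits only."""
    ka = [c for c in a if c.isalnum()]
    kb = [c for c in b if c.isalnum()]
    return (ka > kb) - (ka < kb)


def first_field(line, sep="!"):
    """Return the text before the first separator, like -k1,1 -t!."""
    return line.split(sep, 1)[0]


def sort_whole_line(lines):
    """Compare entire lines: the behaviour reported as 'actual results'."""
    return sorted(lines, key=cmp_to_key(toy_collate))


def sort_by_first_field(lines, sep="!"):
    """Compare only the key: the behaviour reported as 'expected results'."""
    def compare(a, b):
        return toy_collate(first_field(a, sep), first_field(b, sep))
    # sorted() is stable, so ties keep input order, as with sort -s
    return sorted(lines, key=cmp_to_key(compare))


lines = ["a b!x", "a-b-c!x"]
print(sort_whole_line(lines))
print(sort_by_first_field(lines))
```

With the whole line, the trailing "x" takes part in the comparison ("abx" against "abcx"), so "a-b-c!x" sorts first. With only the key, "ab" is a prefix of "abc", so "a b!x" sorts first.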
Yes, upstream doesn't have support for multibyte locales at all. Fedora has that support through a very problematic and complex downstream patch, and we are now working on rewriting it so that it is more likely to be acceptable upstream. That being said, I can't guarantee that a deeper investigation of these issues will happen. Sort in particular is pretty complex, and the multibyte code is fragile at best. I recommend using the C locale where possible.
> upstream doesn't have support for multibyte locales at all.

Doesn't it? I downloaded coreutils 8.22, ran configure and make, and experimented a little. It appears to work as expected. In the C locale, the order is byte order. In en_US, a, å and ä are sorted together, as are o and ö. In sv_SE, å, ä, and ö are considered letters with distinct places at the end of the alphabet, though not in numerical byte order. My build of sort seems to get that right.

I built this ON a Fedora system, but I didn't modify the sources. And, of course, the terminal ran in a UTF-8 locale.

Am I missing something?

mimmi$ ( echo har ; echo hår ; echo här ; echo hor ; echo hör ) | env LANG=C src/sort
har
hor
här
hår
hör
mimmi$ ( echo har ; echo hår ; echo här ; echo hor ; echo hör ) | env LANG=en_US.utf8 src/sort
har
hår
här
hor
hör
mimmi$ ( echo har ; echo hår ; echo här ; echo hor ; echo hör ) | env LANG=sv_SE.utf8 src/sort
har
hor
hår
här
hör
(In reply to Göran Uddeborg from comment #2)
> > upstream doesn't have support for multibyte locales at all.

Well, it DOES use strcoll, so it supports multibyte collation. But what the downstream patch adds is things like honoring multibyte space characters as field separators (upstream still uses only space and tab, rather than all UTF-8 space sequences), and attempts to use multibyte character boundaries rather than raw byte indices for things like -k 1.5,1.7 (using the 5th-7th characters, rather than the 5th-7th bytes, when sorting by a substring of field one).

If you want to see what the downstream patch is adding, look at coreutils-i18n.patch applied as part of the source rpm. But it is quite hairy in its current form, hence why upstream hasn't incorporated it.
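The difference between character positions and byte positions for a key such as -k 1.5,1.7 can be sketched in Python as follows. This is only an illustration, not how either sort implementation is written. The helpers `char_key` and `byte_key` and the sample word are hypothetical, and positions are 1-based and inclusive as in the -k syntax.

```python
def char_key(field, start, end):
    """Select the start..end characters (1-based, inclusive)."""
    return field[start - 1:end]


def byte_key(field, start, end, encoding="utf-8"):
    """Select the start..end bytes (1-based, inclusive) of the encoded field."""
    return field.encode(encoding)[start - 1:end]


field = "hårdvara"  # "å" occupies two bytes in UTF-8
print(char_key(field, 5, 7))
print(byte_key(field, 5, 7))
```

Because "å" is encoded as two bytes, every later character sits one byte further along than its character index suggests, so the two selections cover different text. A byte range can also start or end in the middle of a multibyte character, which is one reason this kind of code is delicate.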
Sorry for the confusion, Eric; my "support" comment was not clear. I'm sure everyone in this bz is aware of it, but for reference, Andreas Schwab fixed the bug: https://build.opensuse.org/package/view_file/Base:System/coreutils/sort-keycompare-mb.patch?expand=1 . We will include it in the Fedora i18n patches right after a quick review.
Fix built in Rawhide: coreutils-8.23-4.fc22. I will check what should be included in a possible update for f21/f20.
coreutils-8.22-19.fc21 has been submitted as an update for Fedora 21. https://admin.fedoraproject.org/updates/coreutils-8.22-19.fc21
Package coreutils-8.22-19.fc21:
* should fix your issue,
* was pushed to the Fedora 21 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing coreutils-8.22-19.fc21'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-12937/coreutils-8.22-19.fc21
then log in and leave karma (feedback).
coreutils-8.22-19.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.
coreutils-8.21-22.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/coreutils-8.21-22.fc20
coreutils-8.21-22.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.