Bug 1146185

Summary: "sort" looks at more than the flags specify in non-C locales
Product: [Fedora] Fedora
Component: coreutils
Version: 20
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Göran Uddeborg <goeran>
Assignee: Ondrej Vasik <ovasik>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: admiller, eblake, kdudka, kzak, ooprala, ovasik, p, twaugh
Fixed In Version: coreutils-8.21-22.fc20
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-11-01 16:21:45 UTC
Bug Blocks: 1148347

Description Göran Uddeborg 2014-09-24 17:34:20 UTC
Description of problem:
When sorting on one of several fields and using e.g. the en_US.utf8 locale, "sort" looks at more than the specified keys when sorting.

Version-Release number of selected component (if applicable):
coreutils-8.21-20.fc20.x86_64

How reproducible:
Every time

Steps to Reproduce:
1. printf 'a b!x\na-b-c!x\n' | LANG=en_US.utf8 ltrace -e strcoll sort -s --debug -k1,1 -t!

Actual results:
sort: using ‘en_US.utf8’ sorting rules
sort->strcoll("a b!x", "a-b-c!x")                = 21
a-b-c!x
_____
a b!x
___
+++ exited (status 0) +++


Expected results:
sort: using ‘en_US.utf8’ sorting rules
sort->strcoll("a b", "a-b-c")                    = -1
a b!x
___
a-b-c!x
_____
+++ exited (status 0) +++

(That is, strcoll is called only on the key, not on the whole line, and the output is in the opposite order.)
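
The difference between the two results can be illustrated with a small Python sketch. It is not how sort is implemented. `toy_collate_key` is a hypothetical stand-in for en_US.utf8 collation that assumes spaces and punctuation are ignored on the first pass and only break ties, and `field_key` is an illustrative helper that mimics `-t! -k1,1`.

```python
def toy_collate_key(text):
    """Hypothetical stand-in for en_US.utf8 collation (an assumption, not glibc's rules):
    compare letters and digits first, use the raw string only to break ties."""
    primary = "".join(ch for ch in text if ch.isalnum())
    return (primary, text)


def field_key(line, sep="!", field=1):
    """Illustrative helper: return the 1-based field, like sort -t SEP -k FIELD,FIELD."""
    return line.split(sep)[field - 1]


lines = ["a b!x", "a-b-c!x"]

# Reported behaviour: the whole line is collated, so "abx" is compared with "abcx".
whole_line_order = sorted(lines, key=toy_collate_key)

# Expected behaviour: only the first field is collated, so "ab" is compared with "abc".
key_only_order = sorted(lines, key=lambda line: toy_collate_key(field_key(line)))

print(whole_line_order)  # ['a-b-c!x', 'a b!x']
print(key_only_order)    # ['a b!x', 'a-b-c!x']
```

Under this toy collation the text after the separator decides the whole-line comparison ("x" sorts after "c"), which is why the order flips once the comparison is restricted to the key.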

Additional info:
This is a shortened version of a bug report I filed upstream: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=18540  It appears that this bug is not present upstream, but was introduced by the Fedora packaging.

Comment 1 Ondrej Vasik 2014-09-25 05:51:15 UTC
Yes, upstream doesn't have support for multibyte locales at all. Fedora has the support through a very problematic and complex downstream patch, and we are now working on rewriting it so that it is more likely to be accepted upstream. That said, I can't guarantee that a deeper investigation of these issues will happen. sort in particular is pretty complex, and the multibyte code is fragile at best. I recommend using the C locale where possible.

Comment 2 Göran Uddeborg 2014-09-25 19:58:05 UTC
> upstream doesn't have support for multibyte locales at all.

Doesn't it?  I downloaded coreutils 8.22, ran configure and make, and experimented a little.  It appears to work as expected.  In the C locale, it's byte order.  In en_US, a, å and ä are sorted together, as are o and ö.  In sv_SE, å, ä, and ö are considered letters with distinct places at the end of the alphabet, but not in byte numerical order.  My build of sort seems to get that right.  I built this on a Fedora system, but I didn't modify the sources.  And, of course, the terminal ran in a UTF-8 locale.

Am I missing something?

mimmi$ ( echo har ; echo hår ; echo här ; echo hor ; echo hör ) | env LANG=C src/sort
har
hor
här
hår
hör
mimmi$ ( echo har ; echo hår ; echo här ; echo hor ; echo hör ) | env LANG=en_US.utf8 src/sort
har
hår
här
hor
hör
mimmi$ ( echo har ; echo hår ; echo här ; echo hor ; echo hör ) | env LANG=sv_SE.utf8 src/sort
har
hor
hår
här
hör
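
The same comparison can be approximated in Python as a rough cross-check. This is a sketch, not a statement about how sort works internally: `sorted_in_locale` is an illustrative helper, it relies on the C library's collation through `locale.strxfrm`, and it returns None when the named locale is not installed. Only the byte-order result is fixed; the locale-aware results depend on the system's locale data.

```python
import locale

words = ["har", "hår", "här", "hor", "hör"]

# C locale: plain byte order of the UTF-8 encoding.
byte_order = sorted(words, key=lambda w: w.encode("utf-8"))
print("C:", byte_order)  # C: ['har', 'hor', 'här', 'hår', 'hör']


def sorted_in_locale(items, name):
    """Sort with the named locale's collation, or return None if it is not installed."""
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return sorted(items, key=locale.strxfrm)
    except locale.Error:
        return None
    finally:
        # Restore whatever collation setting was active before.
        locale.setlocale(locale.LC_COLLATE, saved)


for name in ("en_US.utf8", "sv_SE.utf8"):
    print(name, sorted_in_locale(words, name))
```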

Comment 3 Eric Blake 2014-09-25 20:03:52 UTC
(In reply to Göran Uddeborg from comment #2)
> > upstream doesn't have support for multibyte locales at all.

Well, it DOES use strcoll, so it supports multibyte collation.  But what the downstream patch adds is things like honoring multibyte space characters as field separators (upstream still uses only space and tab, rather than all UTF-8 space sequences), and attempts to use multibyte character boundaries rather than raw byte indices for things like -k 1.5,1.7 (using the 5th-7th characters, rather than the 5th-7th bytes, when sorting by a substring of field one).

If you want to see what the downstream patch is adding, look at coreutils-i18n.patch applied as part of the source rpm.  But it is quite hairy in its current form, which is why upstream hasn't incorporated it.
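
The byte-versus-character distinction for something like -k 1.5,1.7 can be shown with a short Python sketch. The field value is a made-up example, and the slicing only illustrates the two interpretations of "positions 5 to 7"; it is not taken from either the upstream code or the downstream patch.

```python
# Hypothetical field value "smörgåsbord"; ö and å each take two bytes in UTF-8.
field = "sm\u00f6rg\u00e5sbord"

# Character positions 5-7 (1-based), as a multibyte-aware -k 1.5,1.7 would select.
by_chars = field[4:7]

# Byte positions 5-7 of the UTF-8 encoding, as byte-indexed key handling would select.
by_bytes = field.encode("utf-8")[4:7]

print(by_chars)  # gås
print(by_bytes)  # b'rg\xc3'
```

The byte-indexed slice starts one character early and ends in the middle of the two-byte sequence for å, so it is not even valid UTF-8 on its own.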

Comment 4 Ondrej Vasik 2014-09-30 11:52:13 UTC
Sorry for the confusion, Eric; I was not clear with my "support" comment. I'm sure everyone in this bz is aware of it, but for reference, Andreas Schwab fixed the bug (https://build.opensuse.org/package/view_file/Base:System/coreutils/sort-keycompare-mb.patch?expand=1). We will include it in the Fedora i18n patches right after a quick review.

Comment 5 Ondrej Vasik 2014-10-01 13:54:08 UTC
Fix built in Rawhide: coreutils-8.23-4.fc22. I will check what should be included in a possible update for f21/f20.

Comment 6 Fedora Update System 2014-10-15 10:06:20 UTC
coreutils-8.22-19.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/coreutils-8.22-19.fc21

Comment 7 Fedora Update System 2014-10-16 17:17:55 UTC
Package coreutils-8.22-19.fc21:
* should fix your issue,
* was pushed to the Fedora 21 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing coreutils-8.22-19.fc21'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-12937/coreutils-8.22-19.fc21
then log in and leave karma (feedback).

Comment 8 Fedora Update System 2014-11-01 16:21:45 UTC
coreutils-8.22-19.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 9 Fedora Update System 2015-05-14 18:51:54 UTC
coreutils-8.21-22.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/coreutils-8.21-22.fc20

Comment 10 Fedora Update System 2015-05-30 15:37:30 UTC
coreutils-8.21-22.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.