Bug 1046735

Summary: sort command always sorts in C locale
Product: [Fedora] Fedora
Reporter: Krzysztof Halasa <khalasa>
Component: coreutils
Assignee: Ondrej Vasik <ovasik>
Status: CLOSED DUPLICATE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 20
CC: admiller, kdudka, kzak, ooprala, ovasik, p, twaugh
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2013-12-26 20:18:27 UTC
Type: Bug

Description Krzysztof Halasa 2013-12-26 19:38:50 UTC
Description of problem:
sort command seems to ignore locale completely

Version-Release number of selected component (if applicable):
coreutils-8.21-18.fc20.x86_64

How reproducible:
Set locale to e.g. pl_PL.UTF-8 and try to sort data with national multibyte characters.

Steps to Reproduce:
1. export LC_ALL=pl_PL.UTF-8
2. echo -e "ą\na\nb\nc\nć\nł\nz\nż\nź" | sort

Actual results:
a
b
c
z
ą
ć
ł
ź
ż

Expected results:
a
ą
b
c
ć
ł
z
ź
ż
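
To check the expected order independently of sort(1), the same lines can be ordered with the C library's own collation. The following is only a minimal sketch: sort_lines_by_locale() is a hypothetical helper, not coreutils code, and it assumes the caller has already selected the locale.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: each element is a char pointer. strcoll() compares
   according to the LC_COLLATE category of the current locale. */
static int compare_by_collation(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcoll(a, b);
}

/* Hypothetical helper (not part of coreutils): sort an array of
   NUL-terminated strings in place using the current locale's collation. */
void sort_lines_by_locale(const char **lines, size_t count)
{
    assert(lines != NULL || count == 0);
    qsort(lines, count, sizeof *lines, compare_by_collation);
}
```

A caller would run setlocale(LC_ALL, "") from <locale.h> first, so that LC_ALL=pl_PL.UTF-8 from the environment takes effect. Without that call the program stays in the "C" locale and strcoll() orders strings the same way strcmp() does.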

Additional info:
Looks like a problem with the i18n coreutils patch. This part:
@@ -2689,14 +3311,6 @@ compare (struct line const *a, struct li
     diff = - NONZERO (blen);
   else if (blen == 0)
     diff = 1;
-  else if (hard_LC_COLLATE)
-    {
-      /* Note xmemcoll0 is a performance enhancement as
-         it will not unconditionally write '\0' after the
-         passed in buffers, which was seen to give around
-         a 3% increase in performance for short lines.  */
-      diff = xmemcoll0 (a->text, alen + 1, b->text, blen + 1);
-    }
   else if (! (diff = memcmp (a->text, b->text, MIN (alen, blen))))
     diff = alen < blen ? -1 : alen != blen;

removes the call to xmemcoll0(), leaving the final comparison to memcmp(), which is not locale-aware. Bringing the removed part back restores the correct default sort order for me, though I guess it doesn't eliminate the problem completely (for example, it doesn't fix "sort -d", which erroneously ignores multibyte letters).
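
For illustration, here is a small sketch of why the memcmp() path yields the order shown under "Actual results". bytewise_compare() is a hypothetical standalone function written in the style of the remaining fallback branch; it is not the actual coreutils code.

```c
#include <assert.h>
#include <string.h>

/* Byte-wise comparison in the style of the fallback branch quoted above:
   memcmp() over the common prefix, then the shorter string sorts first.
   memcmp() compares bytes as unsigned char and never consults the locale. */
int bytewise_compare(const char *a, size_t alen,
                     const char *b, size_t blen)
{
    assert(a != NULL && b != NULL);
    size_t n = alen < blen ? alen : blen;
    int diff = memcmp(a, b, n);
    if (diff == 0)
        diff = alen < blen ? -1 : alen != blen;
    return diff;
}
```

In UTF-8, "ą" is encoded as the two bytes 0xC4 0x85, while "z" is the single byte 0x7A. Since 0x7A is less than 0xC4, bytewise_compare("z", 1, "\xc4\x85", 2) returns a negative value, so "z" sorts before "ą" regardless of LC_COLLATE.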

Comment 1 Ondrej Vasik 2013-12-26 20:18:27 UTC
Thanks for the report.
Yes, I know about this. AFAIK Ondrej Oprala (who introduced this regression) already has an improvement; unfortunately he hasn't pushed the changes to git so far (I expect he will push them once he is back from vacation, in January). Closing as a duplicate, as it was already reported in #1001775 (just bringing xmemcoll back breaks ~10 multibyte checks, so the fix has to be improved there)...

*** This bug has been marked as a duplicate of bug 1001775 ***