Bug 773551

Summary: 'sort -u' merges distinct utf8 strings
Product: [Fedora] Fedora
Reporter: Zdeněk Pavlas <zpavlas>
Component: glibc
Assignee: Carlos O'Donell <codonell>
Status: CLOSED EOL
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 23
CC: fweimer, jakub, jzeleny, kdudka, law, maxamillion, mnewsome, ovasik, pfrankli, schwab, twaugh
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2016-12-20 12:10:47 UTC
Attachments:
Description Flags
strcoll/wcscoll reproducer none
strcoll/wcscoll reproducer none

Description Zdeněk Pavlas 2012-01-12 08:39:31 UTC
Description of problem:

Unicode characters U+D39C and U+BD04 (ed 8e 9c and eb b4 84 in UTF-8) compare as equal.

Version-Release number of selected component (if applicable):

sort (GNU coreutils) 8.5

How reproducible:

Always.

Steps to Reproduce:

$ echo 7Y6cCuu0hAo= | base64 -d >file
$ cat file
펜
봄
$ sort -u file
펜

Actual results:

Only the first line is printed.

Expected results:

Both lines are printed.

Additional info:

F14, cs_CZ.utf8
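
For reference, the byte sequences quoted above can be checked directly against the base64 payload from the steps to reproduce. This is only a small sketch; the variable name `hex` is made up for illustration, and it assumes `base64`, `od` and `tr` are available.

```shell
# Decode the test input from the report and dump it as one contiguous hex string.
hex=$(echo 7Y6cCuu0hAo= | base64 -d | od -An -tx1 | tr -d ' \n')
echo "$hex"
# prints: ed8e9c0aebb4840a
# i.e. ed 8e 9c (U+D39C), newline 0a, eb b4 84 (U+BD04), newline 0a
```

Both characters are three bytes long in UTF-8 and differ in every byte, so any byte-wise comparison keeps them apart.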

Comment 1 Ondrej Vasik 2012-01-12 10:28:53 UTC
Thanks for the report, confirmed. The issue also occurs with 8.15 in Rawhide, so I am moving the version to Rawhide; F14 is EOL. (There are quite a lot of similar issues with the multibyte patch; this downstream patch is pure evil ;) )
As a workaround, you can use LC_ALL=C, which avoids the multibyte code path, in most cases.

Note: sort file | uniq doesn't work either for these characters with multibyte locales.
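
A minimal sketch of the LC_ALL=C workaround, using the input from the description. The variable names (`input`, `unique_lines`, `piped_lines`) are hypothetical. In the C locale the comparison is byte-wise, so neither `sort -u` nor `sort | uniq` should merge the two lines.

```shell
# Recreate the two-line input from the report in a temporary file.
input=$(mktemp)
echo 7Y6cCuu0hAo= | base64 -d > "$input"

# In the C locale, sort and uniq compare raw bytes, so the lines stay distinct.
unique_lines=$(LC_ALL=C sort -u "$input" | wc -l | tr -d ' ')
piped_lines=$(LC_ALL=C sort "$input" | LC_ALL=C uniq | wc -l | tr -d ' ')

echo "sort -u: $unique_lines, sort | uniq: $piped_lines"
# prints: sort -u: 2, sort | uniq: 2
rm -f "$input"
```

The trade-off is that the resulting order is plain byte order, not the order a language-specific locale would give.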

Comment 2 Roman Kollár 2012-11-29 16:37:33 UTC
Created attachment 654410 [details]
strcoll/wcscoll reproducer

This looks like a glibc problem to me. strcoll() in sort returns 0 for different multibyte strings.

Reproducer output:
3
3
en_US.UTF-8
strcoll: 0
wcscoll: 0

Comment 3 Jeff Law 2012-12-19 17:33:00 UTC
I think the way to go here is to first reproduce on F14, then check F18 or rawhide.  Based on comment 1, I think there's a reasonable chance this has already been fixed.

Comment 4 Siddhesh Poyarekar 2012-12-19 18:06:53 UTC
Ah, I had checked this in one of my idle moments some weeks ago and missed reporting back.  This is still reproducible in rawhide and on latest upstream.  Sorry I didn't note this earlier.

Comment 5 Roman Kollár 2012-12-19 19:17:38 UTC
Created attachment 666338 [details]
strcoll/wcscoll reproducer

Fixed the wcscoll() call; the result is the same.

Comment 6 Fedora Admin XMLRPC Client 2013-01-28 20:08:52 UTC
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 7 Fedora End Of Life 2013-04-03 19:15:43 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could also affect pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Comment 9 Fedora End Of Life 2015-01-09 21:55:18 UTC
This message is a notice that Fedora 19 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 19. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained. Approximately 4 (four) weeks from now this bug will
be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 19 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 10 Jan Kurik 2015-07-15 15:12:09 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle.
Changing version to '23'.

(As we did not run this process for some time, it could also affect pre-Fedora 23 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora23

Comment 11 Fedora End Of Life 2016-11-24 10:36:08 UTC
This message is a reminder that Fedora 23 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 23. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '23'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 23 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 12 Fedora End Of Life 2016-12-20 12:10:47 UTC
Fedora 23 changed to end-of-life (EOL) status on 2016-12-20. Fedora 23 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 13 Carlos O'Donell 2016-12-20 13:56:32 UTC
Collation rules in cs_CZ and en_US don't provide UTF-8 codepoint sorting for undefined ranges.

If you want UTF-8 codepoint sorting you have to use C.utf8 or a locale that has information for the range in question.

e.g.

LANG=C.utf8 ./rhbz773551 
3
3
C.utf8
strcoll: 1
wcscoll: 5784

LANG=ko_KR.utf8 ./rhbz773551 
3
3
ko_KR.utf8
strcoll: 1
wcscoll: 5784

The bug might be that users expect en_US to provide code-point sorting for characters outside of the collation ordering, which would be useful.
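
The same comparison can be tried at the `sort -u` level across several locales. This is a rough sketch, not the attached reproducer: `count_unique` and `have_locale` are hypothetical helper names, and the results for anything other than the C locale depend on the installed glibc version and locale data, so no output is claimed for them here.

```shell
# Print how many lines survive `sort -u` for a given locale ($1) and file ($2).
count_unique() {
    LC_ALL=$1 sort -u "$2" | wc -l | tr -d ' '
}

# Succeed if the locale is installed. Names are compared case-insensitively
# and with hyphens removed, so "C.utf8" also matches "C.UTF-8".
have_locale() {
    want=$(printf '%s' "$1" | tr 'A-Z' 'a-z' | sed 's/-//g')
    locale -a 2>/dev/null | tr 'A-Z' 'a-z' | sed 's/-//g' | grep -qxF "$want"
}

# Recreate the two-line input from the description.
input=$(mktemp)
echo 7Y6cCuu0hAo= | base64 -d > "$input"

for loc in C C.utf8 ko_KR.utf8 en_US.utf8 cs_CZ.utf8; do
    if have_locale "$loc"; then
        echo "$loc: $(count_unique "$loc" "$input") unique line(s)"
    else
        echo "$loc: not installed, skipped"
    fi
done
rm -f "$input"
```

A count of 1 for a locale means its collation treats the two strings as equal; a count of 2 means they are kept distinct. The availability check matters because sort falls back to the C locale when the requested one cannot be set, which would otherwise hide the problem.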

Comment 14 Carlos O'Donell 2016-12-20 14:04:51 UTC
https://sourceware.org/bugzilla/show_bug.cgi?id=18927