Bug 462184

Summary: uniq: uniq on two lines of codepoints 0xc4d and 0xc3f respectively reports 0xc4d as output
Product: [Fedora] Fedora
Reporter: Caolan McNamara <caolanm>
Component: coreutils
Assignee: Kamil Dudka <kdudka>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: medium
Version: 9
CC: kdudka, ovasik, twaugh
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2008-09-17 07:08:19 EDT
Attachments:
Description       Flags
input demo        none
minimal example   none
strcoll test      none

Description Caolan McNamara 2008-09-13 11:38:36 EDT
With coreutils-6.10-30.fc9 and the attached utf-8 text file with unicode code points

0C4D;TELUGU SIGN VIRAMA
and
0C3F;TELUGU VOWEL SIGN I
on two separate lines

then uniq input.file 
gives an output of just
0C4D;TELUGU SIGN VIRAMA

which hampered my efforts to make a frequency list of characters in a Telugu wordlist :-(
Comment 1 Caolan McNamara 2008-09-13 11:40:26 EDT
Created attachment 316669 [details]
input demo
Comment 2 Kamil Dudka 2008-09-15 03:33:48 EDT
Thank you for the report. Did you try setting LC_ALL=C?

$ cat demo.txt | uniq | hexdump
0000000 b1e0 0a8d
0000004

$ cat demo.txt | LC_ALL=C uniq | hexdump
0000000 b1e0 0a8d b0e0 0abf
0000008
Comment 3 Caolan McNamara 2008-09-15 04:36:24 EDT
Yeah, I can work around it with "C" for uniq (and for sort as well, which is presumably the same problem, so there's no point filing an extra issue about that).
Comment 4 Kamil Dudka 2008-09-15 05:00:53 EDT
And what is the default LC_ALL in your configuration? What does the 'locale' command say?
Comment 5 Caolan McNamara 2008-09-15 06:05:17 EDT
any .utf8 locale, e.g. en_US.UTF8

[caolan@vain tmp]$ locale
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
[caolan@vain tmp]$ uniq -c inputdemo 
      2 ్
Comment 6 Kamil Dudka 2008-09-17 07:03:17 EDT
Created attachment 316941 [details]
minimal example

Please compile the attached test on your system, run it with LC_ALL=C and with LC_ALL=en_US.utf8, and see what happens.
Comment 7 Kamil Dudka 2008-09-17 07:08:19 EDT
This is not a coreutils bug, since strcoll from glibc behaves this way. Try the attached example. If you think this behavior is not correct (I am not sure right now), open a new ticket against glibc.
Comment 8 Caolan McNamara 2008-09-17 07:27:26 EDT
Created attachment 316945 [details]
strcoll test

Given the strcoll hint: strcoll indeed gives the wrong result under F-9 but, happily, the right result under rawhide/F-10, so that's good enough for me.