Bug 462184 - uniq: uniq on two lines of codepoints 0xc4d and 0xc3f respectively reports 0xc4d as output
Status: CLOSED NOTABUG
Product: Fedora
Classification: Fedora
Component: coreutils
Version: 9
Hardware: All Linux
Priority: medium, Severity: medium
Assigned To: Kamil Dudka
QA Contact: Fedora Extras Quality Assurance
Reported: 2008-09-13 11:38 EDT by Caolan McNamara
Modified: 2008-09-17 07:27 EDT
Doc Type: Bug Fix
Last Closed: 2008-09-17 07:08:19 EDT


Attachments
input demo (8 bytes, text/plain)
2008-09-13 11:40 EDT, Caolan McNamara
minimal example (312 bytes, application/octet-stream)
2008-09-17 07:03 EDT, Kamil Dudka
strcoll test (396 bytes, text/plain)
2008-09-17 07:27 EDT, Caolan McNamara

Description Caolan McNamara 2008-09-13 11:38:36 EDT
With coreutils-6.10-30.fc9 and the attached UTF-8 text file containing the Unicode code points

0C4D;TELUGU SIGN VIRAMA
and
0C3F;TELUGU VOWEL SIGN I
on two separate lines

then uniq input.file 
gives an output of just
0C4D;TELUGU SIGN VIRAMA

which hampered my efforts to make a frequency list of characters in a Telugu wordlist :-(
Comment 1 Caolan McNamara 2008-09-13 11:40:26 EDT
Created attachment 316669
input demo
Comment 2 Kamil Dudka 2008-09-15 03:33:48 EDT
Thank you for the report. Did you try setting LC_ALL=C?

$ cat demo.txt | uniq | hexdump
0000000 b1e0 0a8d
0000004

$ cat demo.txt | LC_ALL=C uniq | hexdump
0000000 b1e0 0a8d b0e0 0abf
0000008
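
For readers decoding the two dumps above: hexdump's default format prints 16-bit words, so on a little-endian machine (an assumption here, it is not stated in this report) each pair of bytes appears swapped. A minimal Python sketch that undoes the swap and lists the code points; the helper names are made up for illustration and it assumes an even number of bytes:

```python
def hexdump_words_to_bytes(words):
    """Undo hexdump's default 16-bit little-endian word display."""
    out = bytearray()
    for word in words.split():
        # "b1e0" is the word 0xb1e0, stored in the file as bytes e0 b1.
        out += bytes.fromhex(word)[::-1]
    return bytes(out)


def describe(words):
    """Return the code points of the decoded UTF-8 text, e.g. 'U+0C4D'."""
    text = hexdump_words_to_bytes(words).decode("utf-8")
    return ["U+%04X" % ord(ch) for ch in text]


# Output of plain uniq (first dump above).
print(describe("b1e0 0a8d"))
# Output of LC_ALL=C uniq (second dump above).
print(describe("b1e0 0a8d b0e0 0abf"))
```

The first dump decodes to U+0C4D plus a newline; the second also contains U+0C3F plus a newline, i.e. both input lines survive under LC_ALL=C.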
Comment 3 Caolan McNamara 2008-09-15 04:36:24 EDT
Yeah, I can work around it with "C" for uniq (and for sort as well, which is presumably the same problem, so no point filing an extra issue about that)
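
For the character frequency list mentioned in the description, another way to avoid locale collation entirely is to count code points directly instead of piping through sort and uniq. A rough Python sketch, not something used in this report; the function name and the sample word are invented for illustration:

```python
from collections import Counter


def char_frequencies(lines):
    """Count each code point; str equality here does not depend on the locale."""
    counts = Counter()
    for line in lines:
        counts.update(line.rstrip("\n"))
    return counts


# Hypothetical one-word wordlist: U+0C24 U+0C46 U+0C32 U+0C41 U+0C17 U+0C41.
sample = ["\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41\n"]

# Most frequent first, ties broken by code point.
for ch, n in sorted(char_frequencies(sample).items(), key=lambda kv: (-kv[1], kv[0])):
    print("%7d U+%04X" % (n, ord(ch)))
```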
Comment 4 Kamil Dudka 2008-09-15 05:00:53 EDT
And what is the default LC_ALL in your configuration? What does the 'locale' command say?
Comment 5 Caolan McNamara 2008-09-15 06:05:17 EDT
any .utf8 locale, e.g. en_US.UTF8

[caolan@vain tmp]$ locale
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
[caolan@vain tmp]$ uniq -c inputdemo 
      2 ్
Comment 6 Kamil Dudka 2008-09-17 07:03:17 EDT
Created attachment 316941
minimal example

Please compile the attached test on your system, run it with LC_ALL=C and with LC_ALL=en_US.utf8, and see what happens.
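
The attachment itself is not reproduced in this thread. As a rough equivalent of that kind of check, here is a Python sketch that compares the two characters under both locales. The helper name strcoll_in is invented for illustration, and note that CPython's locale.strcoll typically goes through the C library's wide-character wcscoll rather than strcoll itself, so it only approximates the C test:

```python
import locale


def strcoll_in(locale_name, a, b):
    """Return locale.strcoll(a, b) under locale_name, or None if it is not installed."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return locale.strcoll(a, b)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


VIRAMA = "\u0c4d"        # TELUGU SIGN VIRAMA
VOWEL_SIGN_I = "\u0c3f"  # TELUGU VOWEL SIGN I

for name in ("C", "en_US.utf8"):
    result = strcoll_in(name, VIRAMA, VOWEL_SIGN_I)
    print(name, "not available" if result is None else result)
```

A result of 0 means the C library considers the two strings equal in that locale; what is printed depends on the installed glibc and locale data.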
Comment 7 Kamil Dudka 2008-09-17 07:08:19 EDT
This is not a coreutils bug, since strcoll from glibc behaves this way. Try the attached example. If you think this behavior is not correct (I am not sure right now), open a new ticket against glibc.
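
To illustrate why the comparison function alone decides the outcome, here is a simplified Python model of uniq's adjacent-line logic with a pluggable comparator. It is a sketch, not the coreutils code; treats_all_as_equal is a hypothetical stand-in for a collation that returns 0 for these two characters:

```python
def uniq(lines, compare):
    """Keep a line only if compare() says it differs from the last kept line."""
    kept = []
    for line in lines:
        if not kept or compare(kept[-1], line) != 0:
            kept.append(line)
    return kept


def bytewise(a, b):
    """Byte-order comparison of the UTF-8 encodings, as in the C locale."""
    ea, eb = a.encode("utf-8"), b.encode("utf-8")
    return (ea > eb) - (ea < eb)


def treats_all_as_equal(a, b):
    """Hypothetical stand-in for a collation that reports the lines as equal."""
    return 0


lines = ["\u0c4d", "\u0c3f"]
print(len(uniq(lines, bytewise)))             # both lines kept
print(len(uniq(lines, treats_all_as_equal)))  # second line merged into the first
```

With the bytewise comparator both lines are kept; with a comparator that returns 0 the second line is merged into the first, which matches the "2" count reported by uniq -c in Comment 5.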
Comment 8 Caolan McNamara 2008-09-17 07:27:26 EDT
Created attachment 316945
strcoll test

Given the strcoll hint: strcoll indeed gives the wrong result under F-9 but, happily, the right result under rawhide/F-10, so that's good enough for me.
