Bug 122586 - grep returns incorrect results when UTF8 charactersets are used.
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: grep
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Tim Waugh
Reported: 2004-05-05 18:56 EDT by Jacob Wilkins
Modified: 2007-11-30 17:07 EST
Doc Type: Bug Fix
Last Closed: 2004-05-06 04:02:50 EDT

Description Jacob Wilkins 2004-05-05 18:56:06 EDT
Description of problem:

The grep command returns incorrect results when a UTF-8 locale is
used, but functions correctly when LANG=C, LANG=en_US, or LANG=en_GB.

Version-Release number of selected component (if applicable):

How reproducible:

This has been consistently reproducible across multiple machines.

Steps to Reproduce:
1. wget http://www.jacobwilkins.com/grep.txt
2. export LANG=en_US.UTF-8
3. grep '[A-Z][a-z][a-z][a-z][a-z], [A-Z]' grep.txt

This will return 26 lines, some of which clearly do not match the regex.

4. export LANG=C
5. grep '[A-Z][a-z][a-z][a-z][a-z], [A-Z]' grep.txt

This returns the correct 8 matching lines.
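
The comparison in the steps above can be condensed into a short script. This is only a sketch: sample.txt and its three lines are made up for illustration (they are not the contents of grep.txt), and the count under en_US.UTF-8 is deliberately left unstated, since it depends on the installed locale data and the grep version.

```shell
#!/bin/sh
# Hypothetical sample data; NOT the grep.txt from the report.
printf '%s\n' 'Smith, John' 'smith, john' 'SMITH, JOHN' > sample.txt

pattern='[A-Z][a-z][a-z][a-z][a-z], [A-Z]'

# grep -c prints the number of matching lines.
# LC_ALL overrides LANG and every other LC_* variable for this one command.
count_c=$(LC_ALL=C grep -c "$pattern" sample.txt)
count_utf8=$(LC_ALL=en_US.UTF-8 grep -c "$pattern" sample.txt)

echo "C locale:           $count_c"
echo "en_US.UTF-8 locale: $count_utf8"
```

If the two counts differ, the range expressions are being interpreted differently in the two locales. If en_US.UTF-8 is not installed, the second command will most likely fall back to the C locale and the counts will be identical.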
Comment 1 Tim Waugh 2004-05-06 04:02:50 EDT
I think you mean to use '[[:upper:]]' and '[[:lower:]]'.  [A-Z] is not
a case-sensitive class in anything but the C locale.
Comment 2 Michael Jennings (KainX) 2004-05-06 11:55:10 EDT
[A-Z] is a range of characters between 'A' and 'Z' inclusive.  Even in
UTF-8 encoding, this still includes only capital letters.  So how is
that not case sensitive?
Comment 3 Tim Waugh 2004-05-06 12:27:40 EDT
It's not to do with the encoding, but the locale.  Really,
'[[:upper:]]' is what you want to use -- that's what it's for.

See bug #76328 for a full discussion of why this is so.
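
As a sketch of that suggestion, the pattern from the report can be rewritten with POSIX character classes. The sample input and the file name classes.txt are made up for illustration, and only the C-locale result is stated in the comment.

```shell
#!/bin/sh
# Hypothetical sample input, not taken from the report's grep.txt.
printf '%s\n' 'Smith, John' 'smith, john' 'SMITH, JOHN' > classes.txt

# Same shape as the original pattern, with POSIX character classes in
# place of the [A-Z] and [a-z] ranges. \{4\} is the basic-regex interval
# for "exactly four of the preceding item".
posix_pattern='[[:upper:]][[:lower:]]\{4\}, [[:upper:]]'

matches=$(LC_ALL=C grep "$posix_pattern" classes.txt)
echo "$matches"   # prints: Smith, John
```

Character classes are defined by the locale's LC_CTYPE category rather than by its collation order, so with LANG=en_US.UTF-8 the same command should select the same line for this ASCII-only input, assuming that locale is installed.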
Comment 4 Michael Jennings (KainX) 2004-05-06 13:38:48 EDT
Fair enough.

For the record, however, Miroslav is wrong, at least in one respect. 
RHEL offers 4 en_US locales:  en_US, en_US.iso88591, en_US.iso885915,
and en_US.utf8.  Of these four, only the "en_US.utf8" locale exhibits
the incorrect behavior.  So his claim that "en_US" would behave the
same way is false.
Comment 5 Michael Jennings (KainX) 2004-05-06 13:41:30 EDT
BTW, when I said "incorrect," I meant "unexpected but apparently
POSIXly correct."  :-)
