Red Hat Bugzilla – Bug 122586
grep returns incorrect results when UTF-8 character sets are used.
Last modified: 2007-11-30 17:07:01 EST
Description of problem:
The grep command returns incorrect results when a UTF-8 character set is
used, but functions correctly when LANG=C, LANG=en_US, or LANG=en_GB.
Version-Release number of selected component (if applicable):
This has been consistently reproducible across multiple machines.
Steps to Reproduce:
1. wget http://www.jacobwilkins.com/grep.txt
2. export LANG=en_US.UTF-8
3. grep '[A-Z][a-z][a-z][a-z][a-z], [A-Z]' grep.txt
This will return 26 lines, some of which clearly do not match the regex.
4. export LANG=C
5. grep '[A-Z][a-z][a-z][a-z][a-z], [A-Z]' grep.txt
This returns the correct 8 matching lines.
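The comparison in the steps above can be sketched as a small script. This is a minimal sketch, assuming a POSIX shell and that both locales are installed. `count_matches` is a hypothetical helper, and the inline sample lines are made up to stand in for grep.txt, which is not reproduced here. It sets LC_ALL rather than LANG so that other LC_* variables in the environment cannot override the locale under test.

```shell
#!/bin/sh
# Hypothetical helper: count the lines on stdin that match the regex
# from the report, with grep running under the locale named in $1.
count_matches() {
    LC_ALL="$1" grep -c '[A-Z][a-z][a-z][a-z][a-z], [A-Z]'
}

# Made-up sample input (an assumption, not taken from grep.txt).
sample='Smith, John
smith, john
JONES, AMY'

# In the C locale only "Smith, John" matches.
printf 'C:           %s\n' "$(printf '%s\n' "$sample" | count_matches C)"

# The count here depends on the collation tables of the installed
# en_US.UTF-8 locale, so no expected value is given.
printf 'en_US.UTF-8: %s\n' "$(printf '%s\n' "$sample" | count_matches en_US.UTF-8)"
```

To run it against the file from the report, `count_matches en_US.UTF-8 < grep.txt` and `count_matches C < grep.txt` should give the two counts described above (26 and 8 on the affected systems).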
I think you mean to use '[[:upper:]]' and '[[:lower:]]'. [A-Z] is not
a case-sensitive class in anything but the C locale.
[A-Z] is a range of characters between 'A' and 'Z' inclusive. Even in
UTF-8 encoding, this still includes only capital letters. So how is
that not case sensitive?
It's not to do with the encoding, but the locale. Really,
'[[:upper:]]' is what you want to use -- that's what it's for.
See bug #76328 for a full discussion of why this is so.
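For illustration, the suggestion to use character classes might look like the following. This is a minimal sketch, not a tested fix for the reporter's data: `match_names` is a hypothetical helper and the sample lines are made up. The point is that `[[:upper:]]` and `[[:lower:]]` depend on the locale's character classification, while a range such as `[A-Z]` depends on its collation order.

```shell
#!/bin/sh
# Hypothetical helper: the regex from the report, rewritten with POSIX
# character classes in place of the [A-Z] and [a-z] ranges.
match_names() {
    grep '[[:upper:]][[:lower:]][[:lower:]][[:lower:]][[:lower:]], [[:upper:]]'
}

# Made-up sample input (an assumption, not taken from grep.txt).
# Only the lines with a capital, four lower-case letters, ", " and a
# capital are printed: "Smith, John" and "Jones, Amy".
printf '%s\n' 'Smith, John' 'smith, john' 'Jones, Amy' 'JONES, AMY' | match_names
```

For ASCII input like this, the result should not change between LANG=C and LANG=en_US.UTF-8.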
For the record, however, Miroslav is wrong, at least in one respect.
RHEL offers 4 en_US locales: en_US, en_US.iso88591, en_US.iso885915,
and en_US.utf8. Of these four, only the "en_US.utf8" locale exhibits
the incorrect behavior. So his claim that "en_US" would behave the
same way is false.
BTW, when I said "incorrect," I meant "unexpected but apparently
POSIXly correct." :-)