Bug 826997 - grep -i (case-insensitive) is broken with UTF8
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: grep
Version: 6.2
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jaroslav Škarvada
QA Contact: Jan Kepler
Docs Contact:
Depends On:
Blocks: 836160
Reported: 2012-05-31 07:49 EDT by Strahinja Kustudic
Modified: 2014-01-30 04:21 EST

See Also:
Fixed In Version: grep-2.6.3-4.el6
Doc Type: Bug Fix
Doc Text:
Cause: The code that handles case-insensitive searches assumed that converting a string to lowercase cannot change its size in bytes. That assumption does not hold for all multibyte characters.
Consequence: When a case-insensitive search involved text whose lowercase form has a smaller byte size, grep could truncate its output.
Fix: The grep code was modified to handle cases where the byte size changes during the conversion to lowercase.
Result: Case-insensitive searches work correctly and no longer truncate the grep output.
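
The byte-size change described in the Doc Text can be illustrated with the character from the report. This is a minimal sketch, not part of the fix: it assumes the C library lowercases İ (U+0130) to plain ASCII i, as the simple Unicode case mapping does, and it only compares the encoded sizes. The variable names are illustrative.

```shell
# U+0130 (İ) encoded in UTF-8 is the two bytes 0xC4 0xB0 (octal \304 \260).
upper_bytes=$(printf '\304\260' | wc -c | tr -d '[:space:]')
# Its assumed lowercase form, ASCII i, is a single byte.
lower_bytes=$(printf 'i' | wc -c | tr -d '[:space:]')
echo "upper: $upper_bytes byte(s), lower: $lower_bytes byte(s)"
```

A buffer offset computed on the lowercased copy is therefore one byte short for every such character when applied to the original text.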
Clone Of:
Clones: 828844
Last Closed: 2013-06-25 10:18:43 EDT
Type: Bug

Attachments
Backported fix (including test) (12.69 KB, patch)
2012-07-11 09:00 EDT, Jaroslav Škarvada
Backported fix (including tests) (14.13 KB, patch)
2012-07-20 03:24 EDT, Jaroslav Škarvada

Description Strahinja Kustudic 2012-05-31 07:49:32 EDT
I reported this bug in the grep bug tracker, but since this is an important bug, I'm submitting it here, so it might get patched in RHEL6.

Since version 2.6.1, grep doesn't work correctly when you use a case-insensitive search with UTF-8 encoding and the input contains a UTF-8 character such as 'İ'. Here is an example:

# Without -i switch everything works correctly
$ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep 'AA'
AA UTF8 char İ 12345
AA 12345

# With -i it breaks
$ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep -i 'AA'
AA UTF8 char İ 12345AA 12345

As you can see, it somehow deletes the newline character at the end of the line that contains the UTF-8 'İ' character.

Everything works correctly in versions 2.5.4 and below. It is broken from 2.6.1 up to the latest version (currently 2.6.12) and, of course, in grep-2.6.3-2.el6.x86_64.

This is a big concern, since it will break scripts that filter UTF-8 input using the -i switch.
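
The reproducer above can be turned into a scripted check that counts output lines. This is only a sketch: the variable name and messages are illustrative, and -a (--text) is added so the result does not depend on grep's binary-file detection in locales where İ is not a valid character.

```shell
# Both input lines contain "AA", so a working grep -i must print two lines.
# \304\260 is the UTF-8 encoding of İ (U+0130).
matched=$(printf 'AA UTF8 char \304\260 12345\nAA 12345\n' \
    | grep -a -i 'AA' | wc -l | tr -d '[:space:]')

if [ "$matched" -eq 2 ]; then
    echo "ok: grep -i kept both lines separate"
else
    echo "broken: expected 2 lines, got $matched"
fi
```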
Comment 2 Jaroslav Škarvada 2012-06-05 08:45:10 EDT
Thanks for reporting, cloning to Fedora.
Comment 3 Jaroslav Škarvada 2012-07-11 09:00:09 EDT
Created attachment 597568 [details]
Backported fix (including test)
Comment 4 Jaroslav Škarvada 2012-07-20 03:24:26 EDT
Created attachment 599320 [details]
Backported fix (including tests)

Added fix for s390 (accepted upstream).
Added turkish-I test case.
Comment 7 richard rigby 2013-01-29 07:15:18 EST
not sure if this is the exact same bug, but we are seeing some strange behaviour on rhel 6.3/grep-2.6.3-3.el6.x86_64:

$ locale
LANG=en_GB
LC_CTYPE="en_GB"
LC_NUMERIC="en_GB"
LC_TIME="en_GB"
LC_COLLATE="en_GB"
LC_MONETARY="en_GB"
LC_MESSAGES="en_GB"
LC_PAPER="en_GB"
LC_NAME="en_GB"
LC_ADDRESS="en_GB"
LC_TELEPHONE="en_GB"
LC_MEASUREMENT="en_GB"
LC_IDENTIFICATION="en_GB"
LC_ALL=
$ echo "a" | /bin/grep "[A-Z]"
$ echo "b" | /bin/grep "[A-Z]"
b
$ echo "b" | /bin/grep "[B-Z]"
$ export LC_ALL="en_GB.utf8"
$ echo "b" | /bin/grep "[A-Z]"

add a '--color', e.g.:

printf "%s\n" b b a b A | grep --color "[A-Z]"

shows that grep is matching correctly, but still prints non-matching lines.
Comment 8 Jaroslav Škarvada 2013-02-05 10:29:18 EST
(In reply to comment #7)
This one is not a bug. Collating in UTF-8 locales may be really strange, like aAbB..., so you cannot use the ASCII intervals. Rather use character classes, e.g.: 
$ grep [[:upper:]] # uppercase letters
$ grep [[:alpha:]] # letters

For details see man.
Comment 9 Jaroslav Škarvada 2013-02-05 10:32:25 EST
(In reply to comment #8)
> (In reply to comment #7)
> This one is not a bug. Collating in UTF-8 locales may be really strange,
> like aAbB..., so you cannot use the ASCII intervals. Rather use character
> classes, e.g.: 
> $ grep [[:upper:]] # uppercase letters
> $ grep [[:alpha:]] # letters
> 
> For details see man.

Or use LANG=C grep [A-Z]
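
The two suggestions can be sketched as follows, using the input from comment 7. The patterns are quoted so the shell does not expand them as globs. LC_ALL is used here instead of LANG because LC_ALL takes precedence over both LANG and LC_COLLATE, which matters when LC_ALL is already exported as in comment 7. The variable names are illustrative.

```shell
# Force the C locale for this one command so [A-Z] means the ASCII range.
by_range=$(printf '%s\n' b b a b A | LC_ALL=C grep '[A-Z]')
# Character classes do not depend on the locale's collation order.
by_class=$(printf '%s\n' b b a b A | grep '[[:upper:]]')

echo "range in C locale: $by_range"
echo "character class:   $by_class"
```

Both commands should select only the line containing A.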
Comment 10 richard rigby 2013-02-06 09:53:14 EST
(In reply to comment #8)
> This one is not a bug. Collating in UTF-8 locales may be really strange,
> like aAbB..., so you cannot use the ASCII intervals. Rather use character
> classes, e.g.: 
> $ grep [[:upper:]] # uppercase letters
> $ grep [[:alpha:]] # letters
> 

thanks for your comments. i asked some colleagues for their thoughts on this:

#--

It's a bug, but I'm not sure with what.  2.6.3 colours correctly, but gives
the same result.  2.7 colours as per redhat's grep.  So as a minimum, the
colouring *has* to be a bug, surely?  As to the other bug...

grep-2.8 returns the result you're expecting.  Hmm, fish in the Changelog,
only one thing looks particularly exciting and relates to processing of
ranges.

My hunch is it's 99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee that's made the
difference.

Let's double check that.

$ printf "%s\n" b b a b A | /tmp/grep-0fdedfb32dda12320e10df7973b9f5e72d2ac66b/bin/grep --color "[A-Z]"
b
b
b
A
$ printf "%s\n" b b a b A | /tmp/grep-99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee/bin/grep --color "[A-Z]"
A

commit 99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee
Author: Paolo Bonzini <bonzini@gnu.org>
Date:   Tue Sep 21 17:00:55 2010 +0200

     dfa: process range expressions consistently with system regex

     The actual meaning of range expressions in glibc is not exactly strcoll,
     which makes the behavior of grep hard to predict when compiled with the
     system regex.  Leave to the system regex matcher the decision of which
     single-byte characters are matched by a range expression.

     This partially reverts a change made in commit 0d38a8bb (which made
     sense at the time, but not now that src/dfa.c is not doing multibyte
     character set matching anymore).

     * src/dfa.c (in_coll_range): Remove.
     (parse_bracket_exp): Use system regex to find which single-char
     bytes match a range expression.

#-

What *I* don't get, is that /usr/share/locale/en_GB/charset is listed as
UTF-8, but why does en_GB behave differently to en_GB.utf8?

#--

as per the comments above, downloading grep 2.8, and building from source produces the expected results (at least what *i* would expect).

thanks again,

richard
Comment 12 Jaroslav Škarvada 2013-06-03 12:14:52 EDT
(In reply to richard rigby from comment #10)
This would require another bugzilla.
Comment 16 errata-xmlrpc 2013-06-25 10:18:43 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0977.html
