Red Hat Bugzilla – Bug 826997
grep -i (case-insensitive) is broken with UTF8
Last modified: 2014-01-30 04:21:24 EST
I reported this bug in the grep bug tracker, but since it is an important bug I'm also submitting it here, so that it might get patched in RHEL6.

Since version 2.6.1, grep doesn't work correctly if you use a case-insensitive search with UTF8 encoding and the input contains a UTF8 character such as 'İ'. Here is the example:

    # Without the -i switch everything works correctly
    $ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep 'AA'
    AA UTF8 char İ 12345
    AA 12345

    # With -i it breaks
    $ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep -i 'AA'
    AA UTF8 char İ 12345AA 12345

As you can see, it somehow deletes the newline character at the end of the line that contains the UTF8 'İ' character.

Everything works correctly in versions 2.5.4 and below. It is broken from 2.6.1 up to the latest version (which is 2.6.12 at the moment), and of course it is broken in grep-2.6.3-2.el6.x86_64.

This is a big concern, since it will break scripts which filter UTF8 input using the -i switch.
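A minimal sketch of how the symptom above could be checked from a script, assuming a UTF-8 locale is active when it runs. The helper name `count_matches` is hypothetical and not part of grep. `printf` with octal escapes (`\304\260` is the UTF-8 encoding of 'İ') is used instead of `echo -e` only because it behaves the same across shells.

```shell
#!/bin/sh
# count_matches is a hypothetical helper for this sketch, not a grep feature.
# $1 = pattern; input is read from stdin.
# Prints the number of lines that "grep -i" writes out.
count_matches() {
    grep -i -- "$1" | wc -l | tr -d ' '
}

# Two input lines, the first one containing U+0130 (bytes 0xC4 0xB0).
n=$(printf 'AA UTF8 char \304\260 12345\nAA 12345\n' | count_matches 'AA')

# Both lines contain "AA", so two output lines are expected.
# An affected grep joins them into one, so the count drops to 1.
if [ "$n" -eq 2 ]; then
    echo "ok: grep -i printed 2 lines"
else
    echo "FAIL: expected 2 lines, got $n"
fi
```

Counting output lines rather than comparing text keeps the check independent of how the terminal renders 'İ'.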
Thanks for reporting, cloning to Fedora.
Created attachment 597568
Backported fix (including test)
Created attachment 599320
Backported fix (including tests)

Added fix for s390 (accepted upstream).
Added turkish-I test case.
not sure if this is the exact same bug, but we are seeing some strange behaviour on rhel 6.3/grep-2.6.3-3.el6.x86_64:

    $ locale
    LANG=en_GB
    LC_CTYPE="en_GB"
    LC_NUMERIC="en_GB"
    LC_TIME="en_GB"
    LC_COLLATE="en_GB"
    LC_MONETARY="en_GB"
    LC_MESSAGES="en_GB"
    LC_PAPER="en_GB"
    LC_NAME="en_GB"
    LC_ADDRESS="en_GB"
    LC_TELEPHONE="en_GB"
    LC_MEASUREMENT="en_GB"
    LC_IDENTIFICATION="en_GB"
    LC_ALL=
    $ echo "a" | /bin/grep "[A-Z]"
    $ echo "b" | /bin/grep "[A-Z]"
    b
    $ echo "b" | /bin/grep "[B-Z]"
    $ export LC_ALL="en_GB.utf8"
    $ echo "b" | /bin/grep "[A-Z]"

add a '--color', e.g.:

    printf "%s\n" b b a b A | grep --color "[A-Z]"

shows that grep is matching correctly, but still prints non-matching lines.
(In reply to comment #7)

This one is not a bug. Collation order in UTF-8 locales can be surprising, for example aAbB..., so you cannot rely on ASCII ranges such as [A-Z]. Use character classes instead, e.g.:

    $ grep '[[:upper:]]'   # uppercase letters
    $ grep '[[:alpha:]]'   # letters

For details see the grep man page.
(In reply to comment #8)
> (In reply to comment #7)
> This one is not a bug. Collating in UTF-8 locales may be really strange,
> like aAbB..., so you cannot use the ASCII intervals. Rather use character
> classes, e.g.:
> $ grep [[:upper:]] # uppercase letters
> $ grep [[:alpha:]] # letters
>
> For details see man.

Or use LANG=C grep [A-Z]
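The two workarounds suggested above can be sketched as follows, using the same five-line input as comment #7. The variable names are only for illustration. One assumption to flag: this sketch uses `LC_ALL=C` rather than `LANG=C`, because `LC_ALL` takes precedence over `LANG`, and the session in comment #7 had `LC_ALL` exported.

```shell
#!/bin/sh
# Workaround 1: a character class does not depend on collation order.
by_class=$(printf '%s\n' b b a b A | grep '[[:upper:]]')

# Workaround 2: force the C locale for this one command, so that
# the range [A-Z] is interpreted in plain byte order.
by_range=$(printf '%s\n' b b a b A | LC_ALL=C grep '[A-Z]')

echo "character class matched: $by_class"
echo "range in C locale matched: $by_range"
```

Quoting the bracket expressions keeps the shell from treating them as filename globs before grep sees them.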
(In reply to comment #8)
> This one is not a bug. Collating in UTF-8 locales may be really strange,
> like aAbB..., so you cannot use the ASCII intervals. Rather use character
> classes, e.g.:
> $ grep [[:upper:]] # uppercase letters
> $ grep [[:alpha:]] # letters
>

thanks for your comments. i asked some colleagues for their thoughts on this:

#--
It's a bug, but I'm not sure with what. 2.6.3 colours correctly, but gives the same result. 2.7 colours as per redhat's grep. So as a minimum, the colouring *has* to be a bug, surely?

As to the other bug... grep-2.8 returns the result you're expecting. Hmm, fish in the Changelog, only one thing looks particularly exciting and relates to processing of ranges. My hunch is it's 99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee that's made the difference. Let's double check that.

    $ printf "%s\n" b b a b A | /tmp/grep-0fdedfb32dda12320e10df7973b9f5e72d2ac66b/bin/grep --color "[A-Z]"
    b
    b
    b
    A
    $ printf "%s\n" b b a b A | /tmp/grep-99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee/bin/grep --color "[A-Z]"
    A

    commit 99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee
    Author: Paolo Bonzini <bonzini@gnu.org>
    Date:   Tue Sep 21 17:00:55 2010 +0200

        dfa: process range expressions consistently with system regex

        The actual meaning of range expressions in glibc is not exactly strcoll,
        which makes the behavior of grep hard to predict when compiled with the
        system regex. Leave to the system regex matcher the decision of which
        single-byte characters are matched by a range expression.

        This partially reverts a change made in commit 0d38a8bb (which made
        sense at the time, but not now that src/dfa.c is not doing multibyte
        character set matching anymore).

        * src/dfa.c (in_coll_range): Remove.
        (parse_bracket_exp): Use system regex to find which single-char
        bytes match a range expression.

#-
What *I* don't get, is that /usr/share/locale/en_GB/charset is listed as UTF-8, but why does en_GB behave differently to en_GB.utf8?
#--

as per the comments above, downloading grep 2.8, and building from source produces the expected results (at least what *i* would expect).

thanks again,
richard
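One way to look into the en_GB versus en_GB.utf8 question raised above is to print, for each locale, the character map the system reports next to what `[A-Z]` matches. This is only a sketch: `range_matches` is a hypothetical helper, the locale names are taken from the comments and may not be installed on a given machine (in which case the C library typically falls back to the C locale), and the output depends on the installed grep and libc, so no expected output is shown.

```shell
#!/bin/sh
# range_matches is a hypothetical helper for this sketch, not a grep feature.
# $1 = locale name, $2 = bracket expression.
# Prints the matching input lines joined by single spaces.
range_matches() {
    printf '%s\n' b b a b A | LC_ALL="$1" grep -- "$2" | paste -s -d ' ' -
}

for loc in C en_GB en_GB.utf8; do
    # "locale charmap" reports the character encoding of the given locale.
    charmap=$(LC_ALL="$loc" locale charmap 2>/dev/null)
    printf '%-12s charmap=%-16s [A-Z] matches: %s\n' \
        "$loc" "${charmap:-unknown}" "$(range_matches "$loc" '[A-Z]')"
done
```

If the two en_GB variants report different character maps, that difference would be a natural place to start when comparing their behaviour.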
(In reply to richard rigby from comment #10) This would require another bugzilla.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-0977.html