826997 – grep -i (case-insensitive) is broken with UTF8

Summary: grep -i (case-insensitive) is broken with UTF8

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	grep
Sub Component:
Version:	6.2
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Jaroslav Škarvada
QA Contact:	Jan Kepler
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	836160

Reported:	2012-05-31 11:49 UTC by Strahinja Kustudic
Modified:	2014-01-30 09:21 UTC
CC List:	2 users
Fixed In Version:	grep-2.6.3-4.el6
Doc Type:	Bug Fix
Doc Text:	Cause: The code handling case-insensitive searches assumed that converting a string to lowercase cannot change its size in bytes. This is not true. Consequence: grep could truncate its output if a case-insensitive search involved a string whose byte size shrinks when converted to lowercase. Fix: The grep code was modified to correctly handle cases where the byte size changes during the conversion to lowercase. Result: Case-insensitive searches work correctly and no longer truncate the grep output.
Clone Of:
Clones:	828844
Environment:
Last Closed:	2013-06-25 14:18:43 UTC
Target Upstream Version:
Embargoed:

Attachments
Backported fix (including test) (12.69 KB, patch) 2012-07-11 13:00 UTC, Jaroslav Škarvada
Backported fix (including tests) (14.13 KB, patch) 2012-07-20 07:24 UTC, Jaroslav Škarvada

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2013:0977	0	normal	SHIPPED_LIVE	grep bug fix update	2013-06-25 18:17:29 UTC

Description Strahinja Kustudic 2012-05-31 11:49:32 UTC

I reported this bug in the grep bug tracker, but since this is an important bug, I'm submitting it here, so it might get patched in RHEL6.

Since version 2.6.1, grep doesn't work correctly if you use a case-insensitive search with UTF8 encoding and the input contains a UTF8 character. Here is an example:

# Without -i switch everything works correctly
$ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep 'AA'
AA UTF8 char İ 12345
AA 12345

# With -i it breaks
$ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep -i 'AA'
AA UTF8 char İ 12345AA 12345

As you can see, it somehow deletes the newline character at the end of the line that contains the UTF8 'İ' character.

Everything works correctly in versions 2.5.4 and below; it's broken from 2.6.1 to the latest version (which is atm 2.6.12), and of course it's broken in grep-2.6.3-2.el6.x86_64.

This is a big concern, since it will break scripts which filter UTF8 input using the -i switch.
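
For reference, a minimal sketch of why the byte size can change during lowercasing, which is the cause given in the Doc Text above. It assumes 'İ' (U+0130) is encoded in UTF-8 as the two bytes C4 B0 and that its lowercase form is the one-byte ASCII 'i'; the actual mapping depends on the C library and locale. The helper name byte_len is made up for illustration.

```shell
# Hypothetical helper: count the bytes in a string, independent of the locale.
byte_len() {
    printf '%s' "$1" | LC_ALL=C wc -c | tr -d '[:space:]'
}

# U+0130 written as octal escapes so the script itself stays ASCII-only.
upper_dotted_i=$(printf '\304\260')
# Assumed lowercase mapping for U+0130 (may differ per C library and locale).
lower_i='i'

echo "uppercase form: $(byte_len "$upper_dotted_i") byte(s)"
echo "lowercase form: $(byte_len "$lower_i") byte(s)"
```

Any code that lowercases a buffer and keeps using the original byte offsets will be off by one for every such character, which would be consistent with the missing newline shown above.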

Comment 2 Jaroslav Škarvada 2012-06-05 12:45:10 UTC

Thanks for reporting, cloning to Fedora.

Comment 3 Jaroslav Škarvada 2012-07-11 13:00:09 UTC

Created attachment 597568 [details]
Backported fix (including test)

Comment 4 Jaroslav Škarvada 2012-07-20 07:24:26 UTC

Created attachment 599320 [details]
Backported fix (including tests)

Added fix for s390 (accepted upstream).
Added turkish-I test case.
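
A rough regression check along the lines of the reproducer in the description could look like the sketch below. This is not the test from the attached patch, only an illustration; the function name count_matches is made up. It has to be run under a UTF-8 locale to exercise the bug at all, and which UTF-8 locales are installed is system-specific, so none is hard-coded here.

```shell
# Hypothetical helper: count the lines that grep -i prints for the sample
# input from the description. \304\260 is the UTF-8 encoding of U+0130 (İ).
count_matches() {
    printf 'AA UTF8 char \304\260 12345\nAA 12345\n' \
        | grep -i 'AA' | wc -l | tr -d '[:space:]'
}

# Both input lines contain "AA", so a working grep prints two separate lines.
if [ "$(count_matches)" = "2" ]; then
    echo "ok: both matching lines were printed separately"
else
    echo "FAIL: expected 2 output lines, got $(count_matches)"
fi
```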

Comment 7 richard rigby 2013-01-29 12:15:18 UTC

not sure if this is the exact same bug, but we are seeing some strange behaviour on rhel 6.3/grep-2.6.3-3.el6.x86_64:

$ locale
LANG=en_GB
LC_CTYPE="en_GB"
LC_NUMERIC="en_GB"
LC_TIME="en_GB"
LC_COLLATE="en_GB"
LC_MONETARY="en_GB"
LC_MESSAGES="en_GB"
LC_PAPER="en_GB"
LC_NAME="en_GB"
LC_ADDRESS="en_GB"
LC_TELEPHONE="en_GB"
LC_MEASUREMENT="en_GB"
LC_IDENTIFICATION="en_GB"
LC_ALL=
$ echo "a" | /bin/grep "[A-Z]"
$ echo "b" | /bin/grep "[A-Z]"
b
$ echo "b" | /bin/grep "[B-Z]"
$ export LC_ALL="en_GB.utf8"
$ echo "b" | /bin/grep "[A-Z]"

add a '--color', e.g.:

printf "%s\n" b b a b A | grep --color "[A-Z]"

shows that grep is matching correctly, but still prints non-matching lines.

Comment 8 Jaroslav Škarvada 2013-02-05 15:29:18 UTC

(In reply to comment #7)
This one is not a bug. Collating in UTF-8 locales may be really strange, like aAbB..., so you cannot use the ASCII intervals. Rather use character classes, e.g.: 
$ grep '[[:upper:]]' # uppercase letters
$ grep '[[:alpha:]]' # letters

For details see man.

Comment 9 Jaroslav Škarvada 2013-02-05 15:32:25 UTC

(In reply to comment #8)
> (In reply to comment #7)
> This one is not a bug. Collating in UTF-8 locales may be really strange,
> like aAbB..., so you cannot use the ASCII intervals. Rather use character
> classes, e.g.: 
> $ grep '[[:upper:]]' # uppercase letters
> $ grep '[[:alpha:]]' # letters
> 
> For details see man.

Or use LANG=C grep [A-Z]
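
A minimal sketch of both workarounds, using the input from comment 7. The function names are made up for illustration. Two assumptions worth stating: LC_ALL takes precedence over LANG, so in a shell where LC_ALL has been exported (as in comment 7) the override has to be done through LC_ALL; and the patterns are quoted so the shell cannot expand the brackets as a glob.

```shell
# Hypothetical helper: force the C locale, where [A-Z] is a plain ASCII range.
# LC_ALL overrides LANG and the individual LC_* variables.
upper_ascii() {
    LC_ALL=C grep '[A-Z]'
}

# Hypothetical helper: use a character class, which does not depend on the
# collation order of the current locale.
upper_class() {
    grep '[[:upper:]]'
}

printf '%s\n' b b a b A | upper_ascii
printf '%s\n' b b a b A | upper_class
```

With this input only the line "A" should be selected in both cases.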

Comment 10 richard rigby 2013-02-06 14:53:14 UTC

(In reply to comment #8)
> This one is not a bug. Collating in UTF-8 locales may be really strange,
> like aAbB..., so you cannot use the ASCII intervals. Rather use character
> classes, e.g.: 
> $ grep '[[:upper:]]' # uppercase letters
> $ grep '[[:alpha:]]' # letters
> 

thanks for your comments. i asked some colleagues for their thoughts on this:

#--

It's a bug, but I'm not sure with what.  2.6.3 colours correctly, but gives
the same result.  2.7 colours as per redhat's grep.  So as a minimum, the
colouring *has* to be a bug, surely?  As to the other bug...

grep-2.8 returns the result you're expecting.  Hmm, fish in the Changelog,
only one thing looks particularly exciting and relates to processing of
ranges.

My hunch is it's 99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee that's made the
difference.

Let's double check that.

$ printf "%s\n" b b a b A | /tmp/grep-0fdedfb32dda12320e10df7973b9f5e72d2ac66b/bin/grep --color "[A-Z]"
b
b
b
A
$ printf "%s\n" b b a b A | /tmp/grep-99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee/bin/grep --color "[A-Z]"
A

commit 99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee
Author: Paolo Bonzini <bonzini>
Date:   Tue Sep 21 17:00:55 2010 +0200

     dfa: process range expressions consistently with system regex

     The actual meaning of range expressions in glibc is not exactly strcoll,
     which makes the behavior of grep hard to predict when compiled with the
     system regex.  Leave to the system regex matcher the decision of which
     single-byte characters are matched by a range expression.

     This partially reverts a change made in commit 0d38a8bb (which made
     sense at the time, but not now that src/dfa.c is not doing multibyte
     character set matching anymore).

     * src/dfa.c (in_coll_range): Remove.
     (parse_bracket_exp): Use system regex to find which single-char
     bytes match a range expression.

#--

What *I* don't get, is that /usr/share/locale/en_GB/charset is listed as
UTF-8, but why does en_GB behave differently to en_GB.utf8?

#--

as per the comments above, downloading grep 2.8, and building from source produces the expected results (at least what *i* would expect).

thanks again,

richard

Comment 12 Jaroslav Škarvada 2013-06-03 16:14:52 UTC

(In reply to richard rigby from comment #10)
This would require another bugzilla.

Comment 16 errata-xmlrpc 2013-06-25 14:18:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0977.html
