Bug 826997
| Summary: | grep -i (case-insensitive) is broken with UTF8 | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Strahinja Kustudic <kustodian> | ||||||
| Component: | grep | Assignee: | Jaroslav Škarvada <jskarvad> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Jan Kepler <jkejda> | ||||||
| Severity: | high | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | 6.2 | CC: | jkejda, r.rigby | ||||||
| Target Milestone: | rc | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | All | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | grep-2.6.3-4.el6 | Doc Type: | Bug Fix | ||||||
| Doc Text: |
Cause:
The code handling case-insensitive searches assumed that converting a string to lowercase cannot change its size in bytes. This assumption does not hold for multibyte encodings such as UTF-8.
Consequence:
grep could truncate its output when a case-insensitive search used a pattern whose lowercase form is shorter in bytes than the original.
Fix:
The grep code was modified to correctly handle cases where the byte size changes during conversion to lowercase.
Result:
Case-insensitive searches work correctly and no longer truncate the grep output.
|
Story Points: | --- | ||||||
| Clone Of: | |||||||||
| : | 828844 (view as bug list) | Environment: | |||||||
| Last Closed: | 2013-06-25 14:18:43 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 836160 | ||||||||
| Attachments: |
|
||||||||
|
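The Cause in the Doc Text above rests on one fact: lowercasing a UTF-8 string can change its length in bytes. The following is a minimal Python sketch of that fact only. It is not grep's code, the helper name `lowercase_byte_sizes` is hypothetical, and Python's `str.lower()` follows Unicode case mappings, which may differ from the locale-dependent conversion grep uses.

```python
def lowercase_byte_sizes(text, encoding="utf-8"):
    """Return (bytes before lowercasing, bytes after lowercasing)."""
    lowered = text.lower()
    return len(text.encode(encoding)), len(lowered.encode(encoding))


# U+0041 LATIN CAPITAL LETTER A, U+212A KELVIN SIGN, U+2126 OHM SIGN
for ch in ("A", "\u212a", "\u2126"):
    before, after = lowercase_byte_sizes(ch)
    print(f"U+{ord(ch):04X}: {before} -> {after} bytes")

# Output:
# U+0041: 1 -> 1 bytes
# U+212A: 3 -> 1 bytes
# U+2126: 3 -> 2 bytes
```

Any code that computes a byte offset or buffer length on the lowercased copy and then applies it to the original text can therefore end up pointing at the wrong position, which is the kind of mismatch the Doc Text describes.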
Description
Strahinja Kustudic
2012-05-31 11:49:32 UTC
Thanks for reporting, cloning to Fedora.

Created attachment 597568
Backported fix (including test)

Created attachment 599320
Backported fix (including tests)
Added fix for s390 (accepted upstream).
Added Turkish-I test case.
not sure if this is the exact same bug, but we are seeing some strange behaviour on rhel 6.3/grep-2.6.3-3.el6.x86_64:

$ locale
LANG=en_GB
LC_CTYPE="en_GB"
LC_NUMERIC="en_GB"
LC_TIME="en_GB"
LC_COLLATE="en_GB"
LC_MONETARY="en_GB"
LC_MESSAGES="en_GB"
LC_PAPER="en_GB"
LC_NAME="en_GB"
LC_ADDRESS="en_GB"
LC_TELEPHONE="en_GB"
LC_MEASUREMENT="en_GB"
LC_IDENTIFICATION="en_GB"
LC_ALL=
$ echo "a" | /bin/grep "[A-Z]"
$ echo "b" | /bin/grep "[A-Z]"
b
$ echo "b" | /bin/grep "[B-Z]"
$ export LC_ALL="en_GB.utf8"
$ echo "b" | /bin/grep "[A-Z]"

add a '--color', e.g.:

printf "%s\n" b b a b A | grep --color "[A-Z]"

shows that grep is matching correctly, but still prints non-matching lines.

(In reply to comment #7)
This one is not a bug. Collating in UTF-8 locales may be really strange, like aAbB..., so you cannot use the ASCII intervals. Rather use character classes, e.g.:
$ grep [[:upper:]] # uppercase letters
$ grep [[:alpha:]] # letters

For details see man.

(In reply to comment #8)
> (In reply to comment #7)
> This one is not a bug. Collating in UTF-8 locales may be really strange,
> like aAbB..., so you cannot use the ASCII intervals. Rather use character
> classes, e.g.:
> $ grep [[:upper:]] # uppercase letters
> $ grep [[:alpha:]] # letters
>
> For details see man.

Or use LANG=C grep [A-Z]

(In reply to comment #8)
> This one is not a bug. Collating in UTF-8 locales may be really strange,
> like aAbB..., so you cannot use the ASCII intervals. Rather use character
> classes, e.g.:
> $ grep [[:upper:]] # uppercase letters
> $ grep [[:alpha:]] # letters
>

thanks for your comments. i asked some colleagues for their thoughts on this:

#--
It's a bug, but I'm not sure with what. 2.6.3 colours correctly, but gives the same result. 2.7 colours as per redhat's grep. So as a minimum, the colouring *has* to be a bug, surely?

As to the other bug... grep-2.8 returns the result you're expecting.

Hmm, fish in the Changelog, only one thing looks particularly exciting and relates to processing of ranges. My hunch is it's 99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee that's made the difference. Let's double check that.

$ printf "%s\n" b b a b A | /tmp/grep-0fdedfb32dda12320e10df7973b9f5e72d2ac66b/bin/grep --color "[A-Z]"
b
b
b
A
$ printf "%s\n" b b a b A | /tmp/grep-99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee/bin/grep --color "[A-Z]"
A

commit 99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee
Author: Paolo Bonzini <bonzini>
Date: Tue Sep 21 17:00:55 2010 +0200

dfa: process range expressions consistently with system regex

The actual meaning of range expressions in glibc is not exactly strcoll, which makes the behavior of grep hard to predict when compiled with the system regex. Leave to the system regex matcher the decision of which single-byte characters are matched by a range expression.

This partially reverts a change made in commit 0d38a8bb (which made sense at the time, but not now that src/dfa.c is not doing multibyte character set matching anymore).

* src/dfa.c (in_coll_range): Remove.
(parse_bracket_exp): Use system regex to find which single-char bytes match a range expression.

#-
What *I* don't get, is that /usr/share/locale/en_GB/charset is listed as UTF-8, but why does en_GB behave differently to en_GB.utf8?

#--
as per the comments above, downloading grep 2.8, and building from source produces the expected results (at least what *i* would expect).

thanks again,
richard

(In reply to richard rigby from comment #10)
This would require another bugzilla.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0977.html |
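The first three results in comment #7 ("a" not matched by [A-Z], "b" matched by [A-Z], "b" not matched by [B-Z]) are consistent with the interleaved collation order "aAbB..." mentioned in comment #8. The sketch below models a range as "every character that sorts between the two endpoints" under two orderings. It is a simplified illustration: the `INTERLEAVED` order and the `in_range` helper are assumptions made for this example, not glibc's actual en_GB collation tables, and the quoted commit message notes that glibc's range semantics are not exactly strcoll.

```python
import string

# Hypothetical collation order that interleaves cases: a A b B ... z Z
INTERLEAVED = "".join(
    lower + upper
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase)
)

# Code point order, as in the C locale: A..Z followed by a..z
CODEPOINT = "".join(sorted(string.ascii_letters))


def in_range(ch, start, end, order):
    """True if ch sorts between start and end (inclusive) in the given order."""
    return order.index(start) <= order.index(ch) <= order.index(end)


print(in_range("a", "A", "Z", INTERLEAVED))  # False: 'a' sorts before 'A'
print(in_range("b", "A", "Z", INTERLEAVED))  # True: 'b' sorts between 'A' and 'Z'
print(in_range("b", "B", "Z", INTERLEAVED))  # False: 'b' sorts before 'B'
print(in_range("b", "A", "Z", CODEPOINT))    # False: lowercase follows 'Z'
```

Under code point order the range covers only the uppercase letters, which matches the suggestion in the thread to run grep with LANG=C or to use [[:upper:]] when uppercase letters are what is meant.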