147259 – grep and UTF-8 don't play nicely together.

Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	grep
Version:	4.0
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Tim Waugh
QA Contact:	Mike McLean

Reported:	2005-02-05 15:46 UTC by Charlie Brady
Modified:	2007-11-30 22:07 UTC
CC List:	0 users
Doc Type:	Bug Fix
Last Closed:	2005-02-08 10:56:51 UTC

Description Charlie Brady 2005-02-05 15:46:26 UTC

grep and UTF-8 don't play nicely together: in a UTF-8 locale the range
[A-C] also matches lower-case b and c. There's an errata package for
RHEL 3, but by the looks of it the fix hasn't been propagated to HEAD.

[charlieb@charlieb SOURCES]$ cat /tmp/test
a log
b log
c log
d log
A log
B log
C log
[charlieb@charlieb SOURCES]$ grep '[A-C]' /tmp/test
b log
c log
A log
B log
C log
[charlieb@charlieb SOURCES]$ echo $LANG
en_AU.UTF-8
[charlieb@charlieb SOURCES]$ rpm -q grep
grep-2.5.1-31
[charlieb@charlieb SOURCES]$

Hmmm, the errata package seems to be broken as well:

[root@charlieb charlieb]# rpm -Uhv --oldpackage grep-2.5.1-24.1.i386.rpm
Preparing...                ########################################### [100%]
   1:grep                   ########################################### [100%]
[root@charlieb charlieb]# grep '[A-C]' /tmp/test
b log
c log
A log
B log
C log
[root@charlieb charlieb]# unset LANG
[root@charlieb charlieb]# grep '[A-C]' /tmp/test
A log
B log
C log
[root@charlieb charlieb]#

Comment 1 Tim Waugh 2005-02-08 10:56:51 UTC

The behaviour you cite is correct.  For matching upper-case letters, you need to
use [[:upper:]], or list them explicitly in a class such as [ABC].

ISO 14651, which is the sorting standard, specifies this behaviour.  You can
also find some information in the strcoll documentation.

IEEE Std 1003.1, 2003 Edition says that grep uses the current locale as the
"locale for the behavior of ranges".
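
To make the advice above concrete, here is a minimal sketch of the three ways to get only the upper-case lines from the reporter's test data. The transcript in the description (b, c, A, B and C matched, a not) is consistent with a collation order of a A b B c C in en_AU.UTF-8, which is why the range [A-C] picks up b and c there. The temporary file and the variable names are made up for illustration. Newer grep and glibc releases may treat ranges in UTF-8 locales differently from grep-2.5.1, so only the C-locale results are stated here.

```shell
# Recreate the reporter's test file in a temporary location (hypothetical path).
tmpfile=$(mktemp)
printf '%s\n' 'a log' 'b log' 'c log' 'd log' 'A log' 'B log' 'C log' > "$tmpfile"

# 1. POSIX character class: matches upper-case letters without using a range.
upper_out=$(grep '[[:upper:]]' "$tmpfile")

# 2. Explicit list: no range, so the locale's collation order is irrelevant.
list_out=$(grep '[ABC]' "$tmpfile")

# 3. Force the C locale for this one command: ranges follow byte order.
c_out=$(LC_ALL=C grep '[A-C]' "$tmpfile")

# For comparison, the C locale sorts all upper-case letters before lower-case.
c_order=$(printf '%s\n' a A b B c C | LC_ALL=C sort | tr -d '\n')

printf '%s\n' "$upper_out" "$list_out" "$c_out" "$c_order"
rm -f "$tmpfile"
```

Each of the three grep commands prints only "A log", "B log" and "C log". The last line printed is ABCabc, the byte order that makes [A-C] behave as the reporter expected once LANG is unset.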
