683535 – regular expressions when using '-' in alpha range will match out of the given range


Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	grep
Sub Component:
Version:	13
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Jaroslav Škarvada
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-03-09 16:35 UTC by Mel
Modified:	2011-03-10 21:55 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2011-03-10 21:55:00 UTC
Type:	---
Embargoed:
Dependent Products:

Description Mel 2011-03-09 16:35:53 UTC

Description of problem:
regular expressions when using '-' in alpha range will match out of the given range
for example: [a-z] will match [A-Ya-z]

Version-Release number of selected component (if applicable):
pcre-7.8.3.fc12.i686 (yes, this is on Fedora 13)

How reproducible:
cat << MEL >foo
AA
aa
BB
bb
CC
cc
XX
xx
YY
yy
ZZ
zz
MEL
echo ----- grep UPPER -----
grep '[A-Z]' foo
echo ----- grep lower -----
grep '[a-z]' foo


Steps to Reproduce:
1. grep with an alphabetic range using '-' to match only the upper case lines, i.e. '[A-Z]'
2. grep with an alphabetic range using '-' to match only the lower case lines, i.e. '[a-z]'
  
Actual results:
----- grep [A-Z] -----
AA
BB
bb
CC
cc
XX
xx
YY
yy
ZZ
zz
----- grep [a-z] -----
AA
aa
BB
bb
CC
cc
XX
xx
YY
yy
zz


Expected results:
----- grep [A-Z] -----
AA
BB
CC
XX
YY
ZZ
----- grep [a-z] -----
aa
bb
cc
xx
yy
zz


Additional info:
'[BC]' works correctly, but '[B-C]' behaves as if it were '[B-Cc]'
[A-Z] acts as if it is [A-Zb-z]
[a-z] acts as if it is [A-Ya-z]
Behaves identically to the gawk bug:
https://bugzilla.redhat.com/show_bug.cgi?id=683519

Comment 1 Petr Pisar 2011-03-10 21:55:00 UTC

Thank you for the report, however:

(1) You call grep. grep does not use libpcre unless asked to by the `-P' option. Otherwise POSIX basic (the default) or extended regular expression matching is performed by the regex(3) functions of the standard library:

$ printf 'A\na\nB\nb\nC\nc\n' | grep '[a-z]'
a
B
b
C
c

$ printf 'A\na\nB\nb\nC\nc\n' | grep -P '[a-z]'
a
b
c

(2) If you interpret the expression as PCRE (grep -P, pcregrep), you get the expected behaviour:

$ printf 'A\na\nB\nb\nC\nc\n' | pcregrep '[a-z]'
a
b
c

(3) The results of your grep command depend on the locale. `[a-z]' does not mean all lower case letters. It means the characters whose ordinal numbers lie between the ordinals of `a' and `z' in the locale's collating sequence. That is equivalent to the lower case letters in the C locale:

$ printf 'A\na\nB\nb\nC\nc\n' | LANG=C grep '[a-z]'
a
b
c

But it does not have to be in any other locale, e.g.:

$ printf 'A\na\nB\nb\nC\nc\n' | LANG=cs_CZ.UTF-8 grep '[a-z]'
a
B
b
C
c
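To see which letters a range covers in a given locale, each letter can be tested one at a time. The following is only a sketch: `range_members' is a hypothetical helper name, the list of probe letters is arbitrary, and the locale passed as the first argument is assumed to be installed. Only the C locale result is predictable here; other locales will give whatever their collating sequence dictates.

```shell
# Hypothetical helper: print the probe letters that a bracket expression
# matches under a given locale.
#   $1: locale name (assumed to be installed on the system)
#   $2: bracket expression, e.g. '[a-z]'
range_members() {
    for ch in A a B b C c Y y Z z; do
        # LC_ALL applies to this single grep invocation only.
        if printf '%s\n' "$ch" | LC_ALL="$1" grep -q -- "$2"; then
            printf '%s ' "$ch"
        fi
    done
    echo
}

range_members C '[a-z]'
range_members C '[A-Z]'
```

Calling it with another locale name, for example the one from the command above, shows how far the range extends there.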

Use `[[:lower:]]' to match lower case letters regardless of the current locale's collating order:

$ printf 'A\na\nB\nb\nC\nc\n' | LANG=cs_CZ.UTF-8 grep '[[:lower:]]'
a
b
c
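Applied to the test data from the original report, the character-class approach might look like the sketch below. The helper names `sample', `upper_lines' and `lower_lines' are made up for this illustration, and the data is piped in rather than written to the file `foo'.

```shell
# Test data from the original report, one entry per line.
sample() { printf '%s\n' AA aa BB bb CC cc XX xx YY yy ZZ zz; }

# Hypothetical helpers: select lines by character class instead of by range.
upper_lines() { grep '[[:upper:]]'; }
lower_lines() { grep '[[:lower:]]'; }

echo '----- grep [[:upper:]] -----'
sample | upper_lines
echo '----- grep [[:lower:]] -----'
sample | lower_lines
```

For these ASCII letters the two classes should give the split the reporter expected, since class membership does not depend on where a letter sorts.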

As you can see, the result depends on the collation of the locale, and the behaviour of the range operator `-' is unspecified outside the C (POSIX) locale. regex(7):

       If two characters in the list are separated by '-', this is
       shorthand for the full range of characters between those two
       (inclusive) in the collating sequence, for example, "[0-9]" in
       ASCII matches any decimal digit. [...] Ranges are very
       collating-sequence-dependent, and portable programs should avoid
       relying on them.
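If a script must keep a range such as `[a-z]', one option is to pin the locale for that single command. The sketch below assumes that only ASCII letters matter. It uses LC_ALL rather than LANG because LC_ALL takes precedence over both LANG and LC_COLLATE, so a value already set in the environment cannot override it. The function name `ascii_lower_lines' is hypothetical.

```shell
# Hypothetical helper: match ASCII lower case letters with a range by forcing
# the C locale for this one grep call. LC_ALL overrides LANG and LC_COLLATE.
ascii_lower_lines() { LC_ALL=C grep '[a-z]'; }

printf 'A\na\nB\nb\nC\nc\n' | ascii_lower_lines
```

The trade-off is that under the C locale non-ASCII letters are not treated as letters at all, so `[[:lower:]]' remains the better choice when the input may contain them.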

See POSIX / Single UNIX Specification for more details. There is also a heavily commented bug in this bug tracker about this issue, which I cannot find right now.

Reassigning to the grep component and closing as not a bug.
