Bug 683535

Summary: regular expressions when using '-' in alpha range will match out of the given range
Product: [Fedora] Fedora
Reporter: Mel <mel>
Component: grep
Assignee: Jaroslav Škarvada <jskarvad>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 13
CC: jskarvad, lkundrak, ppisar
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2011-03-10 21:55:00 UTC

Description Mel 2011-03-09 16:35:53 UTC
Description of problem:
regular expressions when using '-' in alpha range will match out of the given range
for example: [a-z] will match [A-Ya-z]

Version-Release number of selected component (if applicable):
pcre-7.8.3.fc12.i686 (yes, this is Fedora 13)

How reproducible:
cat << MEL >foo
AA
aa
BB
bb
CC
cc
XX
xx
YY
yy
ZZ
zz
MEL
echo ----- grep UPPER -----
grep '[A-Z]' foo
echo ----- grep lower -----
grep '[a-z]' foo


Steps to Reproduce:
1. grep a regular expression alpha range with '-' to match just upper case lines, i.e. '[A-Z]'
2. grep a regular expression alpha range with '-' to match just lower case lines, i.e. '[a-z]'
  
Actual results:
----- grep [A-Z] -----
AA
BB
bb
CC
cc
XX
xx
YY
yy
ZZ
zz
----- grep [a-z] -----
AA
aa
BB
bb
CC
cc
XX
xx
YY
yy
zz


Expected results:
----- grep [A-Z] -----
AA
BB
CC
XX
YY
ZZ
----- grep [a-z] -----
aa
bb
cc
xx
yy
zz


Additional info:
'[BC]' works correctly, but '[B-C]' behaves as if it were '[B-Cc]'
[A-Z] acts as if it were [A-Zb-z]
[a-z] acts as if it were [A-Ya-z]
Behaves identically to the gawk bug:
https://bugzilla.redhat.com/show_bug.cgi?id=683519

Comment 1 Petr Pisar 2011-03-10 21:55:00 UTC
Thank you for the report, however:

(1) You call grep. grep does not use libpcre unless asked to by the `-P' option. Otherwise POSIX basic (the default) or extended regular expression matching is performed through the regex(3) interface of the standard library:

$ printf 'A\na\nB\nb\nC\nc\n' | grep '[a-z]'
a
B
b
C
c

$ printf 'A\na\nB\nb\nC\nc\n' | grep -P '[a-z]'
a
b
c

(2) If you interpret the expression as PCRE (grep -P, pcregrep), you get the expected behaviour:

$ printf 'A\na\nB\nb\nC\nc\n' | pcregrep '[a-z]'
a
b
c

(3) The results of your grep command depend on the locale. `[a-z]' does not mean all lowercase letters. It means the characters that sort between `a' and `z' in the locale's collating sequence. In the C locale the two are equivalent:

$ printf 'A\na\nB\nb\nC\nc\n' | LANG=C grep '[a-z]'
a
b
c
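A minimal sketch of pinning the locale for a single command, assuming a POSIX shell. It uses LC_ALL rather than LANG because LC_ALL takes precedence over LC_COLLATE and LANG, so the result does not change if the caller already has one of those set. The variable name lower_only is only illustrative:

```shell
# LC_ALL overrides LC_COLLATE and LANG, so setting it for this one command
# pins the collating sequence to the C locale for grep only.
lower_only=$(printf 'A\na\nB\nb\nC\nc\n' | LC_ALL=C grep '[a-z]')

# Prints a, b and c, one per line.
printf '%s\n' "$lower_only"
```

The assignment applies only to that grep invocation; the rest of the shell session keeps its locale.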

But it does not have to be equivalent in other locales, e.g.:

$ printf 'A\na\nB\nb\nC\nc\n' | LANG=cs_CZ.UTF-8 grep '[a-z]'
a
B
b
C
c

Use `[[:lower:]]' to match lowercase letters regardless of the current locale's collating order:

$ printf 'A\na\nB\nb\nC\nc\n' | LANG=cs_CZ.UTF-8 grep '[[:lower:]]'
a
b
c
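Applied to the reporter's test case, the same idea might look like the sketch below. It uses a shortened copy of the sample file, written to a temporary directory created with mktemp; the variable names are illustrative and not part of the original report:

```shell
# Recreate a shortened version of the reporter's sample file.
tmpdir=$(mktemp -d)
sample="$tmpdir/foo"
printf '%s\n' AA aa BB bb ZZ zz > "$sample"

# Character classes instead of the ranges '[A-Z]' and '[a-z]'.
upper_lines=$(grep '[[:upper:]]' "$sample")
lower_lines=$(grep '[[:lower:]]' "$sample")

echo "----- grep upper -----"
printf '%s\n' "$upper_lines"    # AA, BB, ZZ
echo "----- grep lower -----"
printf '%s\n' "$lower_lines"    # aa, bb, zz

rm -rf "$tmpdir"
```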

As you can see, the result depends on the collating sequence of the locale, and the behaviour of the range operator `-' is undefined outside the C locale. From regex(7):

       If two characters in the list are separated by '-', this is
       shorthand for the full range of characters between those two
       (inclusive) in the collating sequence, for example, "[0-9]" in
       ASCII matches any decimal digit. [...] Ranges are very
       collating-sequence-dependent, and portable programs should avoid
       relying on them.
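Another way to avoid ranges, consistent with the reporter's observation that '[BC]' works correctly, is to list the characters explicitly. This is only a sketch; the variable names are illustrative:

```shell
# An explicit bracket list matches exactly the listed characters,
# so the collating sequence does not come into play.
listed_upper=$(printf 'A\na\nB\nb\nC\nc\n' | grep '[BC]')
listed_lower=$(printf 'A\na\nB\nb\nC\nc\n' | grep '[abcdefghijklmnopqrstuvwxyz]')

printf '%s\n' "$listed_upper"   # B, C
printf '%s\n' "$listed_lower"   # a, b, c
```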

See POSIX / Single UNIX Specification for more details. There is also a heavily commented bug about this issue in this bug tracking system, which I cannot find right now.

Reassigning to the grep component and closing as not a bug.