Bug 683535 - regular expressions when using '-' in alpha range will match out of the given range
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: grep
Version: 13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Jaroslav Škarvada
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-03-09 16:35 UTC by Mel
Modified: 2011-03-10 21:55 UTC (History)
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-03-10 21:55:00 UTC
Type: ---



Description Mel 2011-03-09 16:35:53 UTC
Description of problem:
A regular expression that uses '-' in an alphabetic range matches characters outside the given range.
For example: [a-z] matches as if it were [A-Ya-z]

Version-Release number of selected component (if applicable):
pcre-7.8.3.fc12.i686 (yes, this is Fedora 13)

How reproducible:
cat << MEL >foo
AA
aa
BB
bb
CC
cc
XX
xx
YY
yy
ZZ
zz
MEL
echo ----- grep UPPER -----
grep '[A-Z]' foo
echo ----- grep lower -----
grep '[a-z]' foo


Steps to Reproduce:
1. grep with a regular expression alpha range using '-' to match only upper-case lines, i.e. '[A-Z]'
2. grep with a regular expression alpha range using '-' to match only lower-case lines, i.e. '[a-z]'
  
Actual results:
----- grep [A-Z] -----
AA
BB
bb
CC
cc
XX
xx
YY
yy
ZZ
zz
----- grep [a-z] -----
AA
aa
BB
bb
CC
cc
XX
xx
YY
yy
zz


Expected results:
----- grep [A-Z] -----
AA
BB
CC
XX
YY
ZZ
----- grep [a-z] -----
aa
bb
cc
xx
yy
zz


Additional info:
'[BC]' works correctly, but '[B-C]' behaves as if it were '[B-Cc]'
[A-Z] acts as if it were [A-Zb-z]
[a-z] acts as if it were [A-Ya-z]
Behaves identically to the gawk bug:
https://bugzilla.redhat.com/show_bug.cgi?id=683519

Comment 1 Petr Pisar 2011-03-10 21:55:00 UTC
Thank you for the report; however:

(1) You call grep. grep does not use libpcre unless asked to with the `-P' option. Otherwise POSIX basic (the default) or extended regular expression matching is performed through the regex(3) interface of the standard library:

$ printf 'A\na\nB\nb\nC\nc\n' | grep '[a-z]'
a
B
b
C
c

$ printf 'A\na\nB\nb\nC\nc\n' | grep -P '[a-z]'
a
b
c

(2) If you interpret the expression as PCRE (grep -P, pcregrep), you get the expected behaviour:

$ printf 'A\na\nB\nb\nC\nc\n' | pcregrep '[a-z]'
a
b
c

(3) The results of your grep command depend on the locale. `[a-z]' does not mean all lower-case letters. It means the characters whose ordinal number (position in the locale's collating sequence) lies between the ordinals of `a' and `z'. The two are equivalent in the C locale:

$ printf 'A\na\nB\nb\nC\nc\n' | LANG=C grep '[a-z]'
a
b
c

But they do not have to be equivalent in any other locale, e.g.:

$ printf 'A\na\nB\nb\nC\nc\n' | LANG=cs_CZ.UTF-8 grep '[a-z]'
a
B
b
C
c

Use `[[:lower:]]' to express lower-case letters regardless of the current locale:

$ printf 'A\na\nB\nb\nC\nc\n' | LANG=cs_CZ.UTF-8 grep '[[:lower:]]'
a
b
c
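
A minimal sketch of the C-locale workaround, applied to a shortened version of the reporter's test file. It assumes a POSIX shell with mktemp available. LC_ALL takes precedence over LC_COLLATE and LANG, so setting it should be more robust than LANG=C when other locale variables are already exported. The temporary file and the variable names `tmp', `upper' and `lower' are illustrative only and do not come from the report:

```shell
#!/bin/sh
# Recreate a shortened version of the reporter's test file (hypothetical temp file).
tmp=$(mktemp)
printf 'AA\naa\nBB\nbb\nZZ\nzz\n' > "$tmp"

# LC_ALL overrides LC_COLLATE and LANG, so the range is evaluated
# with C collation even if the environment exports another locale.
upper=$(LC_ALL=C grep '[A-Z]' "$tmp")
lower=$(LC_ALL=C grep '[a-z]' "$tmp")

echo '----- grep UPPER -----'
echo "$upper"
echo '----- grep lower -----'
echo "$lower"

rm -f "$tmp"
```

In the C locale the two listings should contain only the upper-case and only the lower-case lines respectively, which is what the reporter expected.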

As you can see, the result depends on the collating order of the locale, and the behaviour of the range operator `-' is undefined outside the C locale. From regex(7):

       If two characters in the list are separated by '-', this is
       shorthand for the full range of characters between those two
       (inclusive) in the collating sequence, for example, "[0-9]" in
       ASCII matches any decimal digit. [...] Ranges are very
       collating-sequence-dependent, and portable programs should avoid
       relying on them.
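
One way to follow that advice without touching the locale at all is to avoid `-' ranges entirely. The sketch below spells the letters out in the bracket expression, so no range has to be resolved against the collating sequence. It covers only the 26 unaccented ASCII letters, and the variable names (`ascii_lower', `ascii_upper', `lower_hits', `upper_hits') are hypothetical, chosen for illustration:

```shell
#!/bin/sh
# Explicit bracket lists contain no '-' range, so there is no range
# for the locale's collating sequence to reinterpret.
ascii_lower='[abcdefghijklmnopqrstuvwxyz]'
ascii_upper='[ABCDEFGHIJKLMNOPQRSTUVWXYZ]'

# Same style of input as in the examples above; runs in whatever locale is current.
lower_hits=$(printf 'A\na\nB\nb\nZ\nz\n' | grep "$ascii_lower")
upper_hits=$(printf 'A\na\nB\nb\nZ\nz\n' | grep "$ascii_upper")

echo '----- lower -----'
echo "$lower_hits"
echo '----- upper -----'
echo "$upper_hits"
```

This is more verbose than `[[:lower:]]' and `[[:upper:]]', but it may be preferable when only ASCII letters should match.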

See POSIX / Single UNIX Specification for more details. There is also a heavily commented bug about this issue in this bug tracking system, which I cannot find right now.

Reassigning to the grep component and closing as not a bug.

