| Summary: | regular expressions when using '-' in alpha range will match out of the given range | ||
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Mel <mel> |
| Component: | grep | Assignee: | Jaroslav Škarvada <jskarvad> |
| Status: | CLOSED NOTABUG | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 13 | CC: | jskarvad, lkundrak, ppisar |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2011-03-10 21:55:00 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Mel
2011-03-09 16:35:53 UTC
Thank you for the report, however:
(1) You call grep. grep does not use libpcre unless asked to with the `-P' option. Otherwise, POSIX basic (the default) or extended regular expression matching is performed by a regex(3) call to the standard library:
$ printf 'A\na\nB\nb\nC\nc\n' | grep '[a-z]'
a
B
b
C
c
$ printf 'A\na\nB\nb\nC\nc\n' | grep -P '[a-z]'
a
b
c
(2) If you interpret the expression as PCRE (grep -P, pcregrep), you get the expected behaviour:
$ printf 'A\na\nB\nb\nC\nc\n' | pcregrep '[a-z]'
a
b
c
(3) The results of your grep command depend on the locale. `[a-z]' does not mean all lowercase letters. It means the characters whose ordinal numbers in the locale's collating sequence lie between those of `a' and `z'. In the C locale the two are equivalent:
$ printf 'A\na\nB\nb\nC\nc\n' | LANG=C grep '[a-z]'
a
b
c
But they do not have to be in any other locale, e.g.:
$ printf 'A\na\nB\nb\nC\nc\n' | LANG=cs_CZ.UTF-8 grep '[a-z]'
a
B
b
C
c
Use `[[:lower:]]' to express lowercase letters regardless of the current locale:
$ printf 'A\na\nB\nb\nC\nc\n' | LANG=cs_CZ.UTF-8 grep '[[:lower:]]'
a
b
c
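In a script, the same character class can be wrapped in a small helper. This is only a minimal sketch, assuming a POSIX shell; the function name `lower_lines' is made up for illustration and is not part of grep:

```shell
#!/bin/sh
# lower_lines is a hypothetical helper name, not part of grep.
# It prints the input lines that contain at least one lowercase letter,
# using the POSIX character class instead of a range expression.
lower_lines() {
    grep '[[:lower:]]'
}

printf 'A\na\nB\nb\nC\nc\n' | lower_lines
```

Like grep itself, the helper returns a non-zero exit status when no line matches.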
As you can see, the result depends on the collation of the locale, and the behaviour of the range operator `-' is undefined outside the C locale. From regex(7):
If two characters in the list are separated by '-', this is
shorthand for the full range of characters between those two
(inclusive) in the collating sequence, for example, "[0-9]" in
ASCII matches any decimal digit. [...] Ranges are very collating-
sequence-dependent, and portable programs should avoid relying
on them.
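If a script only deals with ASCII input, one way to follow that advice is either to pin the locale for the single grep call or to avoid the range operator by listing the characters. The sketch below assumes a POSIX shell, and both function names are hypothetical. It sets LC_ALL rather than LANG because LC_ALL takes precedence over LANG and over any LC_COLLATE value already present in the environment:

```shell
#!/bin/sh
# Both function names are hypothetical, chosen only for this example.

# Pin the locale for this one grep call, so that [a-z] is evaluated
# against the C locale's collating sequence.
ascii_lower_range() {
    LC_ALL=C grep '[a-z]'
}

# Avoid the range operator entirely by listing the characters.
ascii_lower_list() {
    grep '[abcdefghijklmnopqrstuvwxyz]'
}

printf 'A\na\nB\nb\nC\nc\n' | ascii_lower_range
printf 'A\na\nB\nb\nC\nc\n' | ascii_lower_list
```

Neither variant matches non-ASCII lowercase letters; for those, `[[:lower:]]' in the appropriate locale is still the better fit.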
See POSIX / the Single UNIX Specification for more details. There is also a heavily commented bug about this issue in this bug tracking system, which I cannot find right now.
Reassigning to the grep component and closing as not a bug.