Red Hat Bugzilla – Bug 76328
case insensitive pattern matching behavior is wrong
Last modified: 2016-11-24 10:14:03 EST
Description of problem:
Since upgrading to Redhat 8.0, we have noticed that our CAD software
installation scripts (Synopsys, Inc.) and in-house Perl scripts are now
broken. Some debugging has narrowed down the cause to pattern-matching. In
Redhat 8.0, pattern matches seem to be incorrectly case-insensitive, where they
are case-sensitive on other linux/UNIX installations (including earlier Redhat
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. echo testing | grep "[A-Z]"
Actual Results: testing
Expected Results: <null result>
Note that the 'grep' command above should return any matches on the uppercase
set of alpha characters. For some reason, Redhat 8.0 is matching the string.
Other things I have tried:
* downloaded the source for the latest stable grep (2.5) and recompiled from
scratch on the Redhat 8.0 box
* reinstalled RH 8.0 from scratch, after checksum-verifying all media
* installed latest kernel update (2.4.18-17.8.0)
The test command above returns nothing, as expected, on the following boxes:
* Redhat 6.2 on i686
* HP-UX 11.0
* FreeBSD 4.4
* Solaris 5.8
Note that I have set the severity of this bug at high, as it is preventing us
from installing CAD software, running our own in-house software, etc. I
understand that this may be re-prioritized, depending on the root cause.
Software that uses grep and expects '[A-Z]' to match in the POSIX locale but
doesn't set the locale to POSIX is buggy and should be fixed.
It should use (for example): LANG=C grep '[A-Z]'
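A minimal sketch of the same fix using LC_ALL rather than LANG. The assumption is that the caller's environment may already set LC_ALL or LC_COLLATE, either of which takes precedence over LANG. The variable name `result` is only illustrative.

```shell
# LC_ALL overrides LC_COLLATE and LANG, so this forces POSIX collation for
# this one grep invocation no matter what the environment has set.
if echo testing | LC_ALL=C grep -q '[A-Z]'; then
    result="match"
else
    result="no match"
fi
echo "$result"   # prints: no match
```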
*** Bug 85532 has been marked as a duplicate of this bug. ***
POSIX compatibility should be the default for Red Hat Linux, NOT Microsoft DOS
This locale-dependent behaviour *is* POSIX compliant. (Read the spec.)
This should really be called a "regex" problem (possibly) because it happens
with awk and sed also.
What's this about Linux not being POSIX compliant on a default installation?
Books have been written about grep showing that it's case sensitive. While
grep may depend on the character chart used, shouldn't the default installation
of Linux work as books have said? Why "break" grep, awk, and sed?
That being said, my FreeBSD box at home works just FINE when I try:
echo "abcd" | grep [A-Z]
my FreeBSD box doesn't display anything (as I'd expect). But Linux does.
Huh? I don't even have the $LANG variable set on my FreeBSD box, and it wasn't
set by default.
SO, I decided to see for myself. I tested a default installed RedHat 8.0
Here are some test cases that I came up with. First I'll show them with notes
off to the side, then I'll show them without the notes:
$ echo "abcxyz" | grep [Y-Z] NOTE: OK, so it gets multi-case characters
$ echo $LANG NOTE: Yep, that infamous "$LANG"
$ echo "abcxy" | grep [Y-Z] NOTE: no response -- strange
$ echo "abcxz" | grep [Y-Z] NOTE: There *IS* a response here, though
$ echo "abcxz" | grep [A-A] NOTE: Strange (similar to above, though)
$ echo "axz" | grep [A-B] NOTE: OK, same as the first Y-Z example
$ echo "axz" | grep [a-b] NOTE: WTHeck? It does this for LOWER case, but
$ echo "axz" | grep [a-a] NOTE: Again, lower but not upper?!
$ echo "axz" | grep [A] NOTE: "Wow."
_and WITHOUT the notes:_
$ echo "abcxyz" | grep [Y-Z]
$ echo $LANG
$ echo "abcxy" | grep [Y-Z]
$ echo "abcxz" | grep [Y-Z]
$ echo "abcxz" | grep [A-A]
$ echo "axz" | grep [A-B]
$ echo "axz" | grep [a-b]
$ echo "axz" | grep [a-a]
$ echo "axz" | grep [a-a]
$ echo "axz" | grep [A]
SO, this must mean that Linux is sometimes case sensitive, sometimes not.
In actuality, what it really means is that the order of the character chart
for "en_US.UTF-8" must be in some order like:
"a A b B ..."
To verify this, I tried:
$ echo "Axy" | grep [a-b]
Which supports my conclusion.
Idly wondering what the Unicode people were smoking, I went looking around for
their character chart.
All that I could find was:
This shows that A-Z is listed from 0041 to 005A and a-z is from 0061 to 007A.
Meaning that it is NOT mixed up like "a A b B ..."!
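The code values quoted from the chart can be checked locally. A small sketch, assuming a POSIX `od` is available; the variable name `codes` is made up for illustration:

```shell
# Dump the byte values of A, Z, a, z as hex. In ASCII (and therefore UTF-8)
# these are 41, 5a, 61, 7a: two separate contiguous blocks.
codes=$(printf 'AZaz' | od -An -tx1 | tr -d ' \n')
echo "$codes"   # prints: 415a617a
```

This only shows the encoding order. It says nothing about the collation order, which is a separate property of the locale.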
SO, my conclusion is one of the following:
1. I have the wrong Unicode character chart and the Unicode people really were
2. The way that Linux does its regex processing is wrong (since the problem
manifests in awk and sed also).
3. The way that Linux does its unicode processing is wrong.
I think it's either 2 or 3. Since when you set $LANG to C, it works, I might
lean slightly to 3.
You guys seem to say all of this is POSIX compliant, though, but I don't
understand -- how is this POSIX compliant?
Enough of this bitching, please. This POSIX requirement has been beaten to death
on the appropriate forums and is not going to change.
> In the POSIX locale, a range expression represents the set of collating
> elements that fall between two elements in the collation sequence, inclusive.
> In other locales, a range expression has unspecified behavior: strictly
> conforming applications shall not rely on whether the range expression is
> valid, or on the set of collating elements matched. A range expression shall be
> expressed as the starting point and the ending point separated by a hyphen (
So, in regexps you can NEVER portably use range expressions unless locale is
POSIX or C. And the rule is using LC_COLLATE, which is "dictionary order",
that is aAbB or AaBb in most locales.
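If forcing the locale is not an option, POSIX character classes avoid ranges altogether. A hedged sketch; `has_upper` is a hypothetical helper name, not something from this report:

```shell
# Hypothetical helper: succeeds if the argument contains an uppercase letter.
# [[:upper:]] is a POSIX character class, so it does not depend on where the
# locale's collation order places A-Z relative to a-z.
has_upper() {
    printf '%s\n' "$1" | grep -q '[[:upper:]]'
}

has_upper "testing" && echo "testing: match" || echo "testing: no match"
has_upper "Testing" && echo "Testing: match" || echo "Testing: no match"
```

Note that in a non-C locale `[[:upper:]]` may also match uppercase letters outside ASCII, which may or may not be what a given script wants.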
As for globs:
> The description of basic regular expression bracket expressions in the Base
> Definitions volume of IEEE Std 1003.1-2001, Section 9.3.5, RE Bracket
> Expression shall also apply to the pattern bracket expression, except that the
> exclamation mark character ( '!' ) shall replace the circumflex character ( '^' )
> in its role in a "non-matching list" in the regular expression notation. A
> bracket expression starting with an unquoted circumflex character produces
> unspecified results.
i.e. same rules apply.
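Since shell patterns follow the same bracket rules, the same workaround applies to `case` and filename globs. A sketch, assuming a POSIX shell; `first_char_kind` is an illustrative name only:

```shell
# Hypothetical helper: classify the first character of a string with a shell
# pattern. The function body is a subshell, so the LC_ALL assignment is
# confined to it and does not leak into the caller.
first_char_kind() (
    LC_ALL=C
    case $1 in
        [A-Z]*) echo upper ;;
        [a-z]*) echo lower ;;
        *)      echo other ;;
    esac
)

first_char_kind testing   # prints: lower
first_char_kind Testing   # prints: upper
```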
James has hit the nail on the head for me. Books have been written,
demonstrating and exploiting the case-sensitive nature of ranges in regexps.
User scripts use this. Retail software depends on this. Regardless of the
technical reasons, Redhat sells a product. The product has changed and it
breaks stuff. That's bad.
You're right -- you shouldn't have to look online in order to get your scripts
to work properly. They should work as expected and defined by default. They
don't. That's the issue.
Miloslav, I'm afraid you misunderstand our point. I'll try to make it slightly
The problem is that BY DEFAULT, your locale is set so that grep, awk, and sed
DO NOT work as specified for the POSIX locale. By default Linux is NOT the
POSIX locale! Does this mean Linux is not POSIX compliant by default? I'd say
yes. Technically, though, perhaps just NOT being in the POSIX locale means
that you're POSIX compliant -- after all, you work in an "unspecified" way.
Could you then say that Microsoft is POSIX compliant because it is not the
POSIX locale? :D (Of course Microsoft doesn't really offer an option.)
Sigh. I thought the reason we were using *nix and not Microsoft was because of
crap like this.
Additionally, from your statement, it appears that my conclusion number 1 was
correct: that the Unicode people were smoking crack. Of course, that's
assuming that you're right about your character map. Does anyone have a
character map for "en_US.UTF-8?"
WELL, looking in: /usr/share/i18n/charmaps/UTF-8.gz
I found it. I gunzip'ed it and guess what? The range from A-Z is 0041 to 005A
(as I said above), and the range from a-z is 0061 to 007A (as I said above). Whoops!
This isn't what you said. The Unicode people *ARE* right.
So which is it? Is the regular expression processing for range contiguous? Or
do grep, awk, and sed JUST NOT WORK without the POSIX locale in Linux?
1. The Unicode people were right. Their chart is a good one.
2. The way Linux handles range expressions for regular expressions processing
is not proper and does not conform to the way it was intended.
3. Linux may have a bug in the way that it handles Unicode processing.
4. By default, Linux RedHat 8.0 is not set up to use a POSIX environment.
5. The UTF-8 character set should not be used on Linux 8.0 because range
expressions work in an "unspecified manner" and your basic scripts will not
6. RedHat should avoid breaking legacy code left and right in order to bring
out a new character set that is not used for POSIX compliance.
Damn, you guys almost had a Linux convert in me. I'll stick to FreeBSD until
Linux sorts out their "unspecified behavior" problems. God knows I don't want
to have to "hack" my environment in order to get my scripts to work.
> You're right -- you shouldn't have to look online in order to get your scripts
> to work properly. They should work as expected and defined by default. They
> don't. That's the issue.
No. The issue is your scripts are WRONG and not guaranteed to work on any
POSIX-compliant system, just happened to work on whatever you were using before.
> The problem is that BY DEFAULT, your locale is set so that grep, awk, and sed
> DO NOT work as specified for the POSIX locale.
Which is *right*, because *users* don't live in POSIX, but in USA,
China, Czech Republic.
> By default Linux is NOT the POSIX locale! Does this mean Linux is not POSIX
> compliant by default?
NO. ISO C and POSIX have the notion of locales and the "implementation" (i.e.
Linux OS) can provide many locales, also the "implementation" either is or is
not conforming to the standard, regardless of currently used locale.
"POSIX" is just *one locale* the standard requires to exist, which has well-
defined behavior analogous to what you are used to and think POSIX requires.
It does NOT, unless the locale is "POSIX" or, equivalently, "C".
> I'd say yes.
I'd say you should read the standard.
> Technically, though, perhaps just NOT being in the POSIX locale means
> that you're POSIX compliant -- after all, you work in an "unspecified" way.
That's exactly what the excerpt from the standard regarding range expression
> Could you then say that Microsoft is POSIX compliant because it is not the
> POSIX locale? :D (Of course Microsoft doesn't really offer an option.)
No. The "implementation" is required to provide many things *regardless* of
locale. BTW, Microsoft did have a POSIX compliant interface in NT 3.x.
> Sigh. I thought the reason we were using *nix and not Microsoft was because of
> crap like this.
You mean crap like programs relying on POSIX-unspecified behavior?
> So which is it? Is the regular expression processing for range contiguous?
It is "you don't understand range expressions". Range expressions are EXPLICITLY
defined to be ranges in *collating sequence*, that is LC_COLLATE-dependent
dictionary searching order. They are NOT necessarily ranges of character codes
in any encoding you might think of.
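The collating sequence can be observed directly with `sort`, which uses the same LC_COLLATE ordering. A minimal sketch; the en_US.UTF-8 line assumes that locale is installed, so no output is claimed for it:

```shell
# In the C locale, sort compares byte values, so every uppercase ASCII letter
# sorts before every lowercase one.
c_order=$(printf '%s\n' b A a B | LC_ALL=C sort | tr -d '\n')
echo "$c_order"   # prints: ABab

# With a dictionary-order locale the same input usually comes back with the
# cases interleaved. If en_US.UTF-8 is not installed, the result may simply
# match the C ordering above.
printf '%s\n' b A a B | LC_ALL=en_US.UTF-8 sort | tr -d '\n'; echo
```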
> Or do grep, awk, and sed JUST NOT WORK without the POSIX locale in Linux?
They work exactly as they are supposed to. It just is not how you expect
them to work.
> 1. The Unicode people were right. Their chart is a good one.
Character codes are irrelevant in range expressions.
> 2. The way Linux handles range expressions for regular expressions processing
> is not proper and does not conform to the way it was intended.
It is exactly what POSIX mandates. You should have complained *before* it was
mandated, but I bet you never noticed.
> 4. By default, Linux RedHat 8.0 is not set up to use a POSIX environment.
Correction: By default, Red Hat Linux 8.0 is not set up to use "POSIX" locale.
Red Hat Linux 8.0 is (in this respect) a conforming POSIX "implementation"
> 5. The UTF-8 character set should not be used on Linux 8.0 because range
> expressions work in an "unspecified manner" and your basic scripts will not
Again, this has no relation to UTF-8. You get the same behavior when
locale is set just to "en_US" (using ISO-8859-1).
> 6. RedHat should avoid breaking legacy code left and right in order to bring
> out a new character set that is not used for POSIX compliance.
Based on wrong assumptions you can only arrive at a wrong conclusion.
> I'll stick to FreeBSD until Linux sorts out their "unspecified behavior" problems.
YOUR scripts depend on unspecified behavior, YOU have things to sort out.
> God knows I don't want to have to "hack" my environment in order to get my
> scripts to work.
Disclaimer: Just in case someone misinterprets this (I admit I'm a bit
frustrated by now), I don't work for Red Hat and this can therefore in
no way be interpreted as an official statement of Red Hat, Inc.
Ok, I give up. Apparently all the books written about how awk, sed, and grep
work are wrong and Linux RedHat 8.0 is right.
Apparently RedHat versions before 8.0 and every other Unix variant that I've
seen just happened to use the POSIX compliant locale, but RedHat 8.0 does
it "right" and doesn't necessarily follow along with the previous way of
SO, if my scripts don't conform to POSIX, how do I make them conform?
What other commands do I need to add to the top of my 2 line scripts in order
to get them to work in the new Linux environment like they did in the previous
No grep bug here.
There is just one POSIX (C) locale, the rest are locales for the languages and
territories people live in.
If your script relies on POSIX locale collation, you just have to say so,
whether by doing export LC_ALL=C for the whole script or just for the commands
where it matters, say:
echo testing | LC_ALL=C grep "[A-Z]"
something | LC_ALL=C sort
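For the whole-script variant, a sketch of what the top of such a script might look like. The helper name `count_upper_lines` is hypothetical and only there to give the script something to do:

```shell
#!/bin/sh
# Pin the POSIX locale for every command this script runs.
LC_ALL=C
export LC_ALL

# Hypothetical helper: count stdin lines containing an ASCII uppercase letter.
# grep -c exits non-zero when the count is 0, so "|| true" keeps that case
# from being treated as a failure.
count_upper_lines() {
    grep -c '[A-Z]' || true
}

printf '%s\n' testing Testing TESTING | count_upper_lines   # prints: 2
```

Exporting once keeps individual commands uncluttered, at the cost of also switching messages, sorting, and character classification to the C locale for everything the script calls.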