76328 – case insensitive pattern matching behavior is wrong

Summary: case insensitive pattern matching behavior is wrong

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	glibc
Sub Component:
Version:	8.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Jakub Jelinek
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	85532
Depends On:
Blocks:

Reported:	2002-10-20 03:16 UTC by Need Real Name
Modified:	2016-11-24 15:14 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2003-03-07 20:27:31 UTC
Embargoed:

Description Need Real Name 2002-10-20 03:16:23 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Description of problem:
Since upgrading to Red Hat 8.0, we have noticed that our CAD software 
installation scripts (Synopsys, Inc.) and in-house Perl scripts are now 
broken.  Some debugging has narrowed down the cause to pattern matching.  In 
Red Hat 8.0, pattern matches seem to be incorrectly case-insensitive, where they 
are case-sensitive on other Linux/UNIX installations (including earlier Red Hat 
versions).

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. echo testing | grep "[A-Z]"
	

Actual Results:  testing

Expected Results:  <null result>

Additional info:

Note that the 'grep' command above should print only lines that contain an 
uppercase letter.  For some reason, Red Hat 8.0 matches the all-lowercase string.  
Other things I have tried:

* downloaded the source for the latest stable grep (2.5) and recompiled from 
scratch on the Redhat 8.0 box
* reinstalled RH 8.0 from scratch, after checksum-verifying all media
* installed latest kernel update (2.4.18-17.8.0)

The test command above returns nothing, as expected, on the following boxes:

* Redhat 6.2 on i686
* HP-UX 11.0 
* FreeBSD 4.4
* Solaris 5.8

Note that I have set the severity of this bug at high, as it is preventing us 
from installing CAD software, running our own in-house software, etc.  I 
understand that this may be re-prioritized, depending on the root cause.

Comment 1 Tim Waugh 2002-10-20 10:34:27 UTC

Software that uses grep and expects '[A-Z]' to match in the POSIX locale but
doesn't set the locale to POSIX is buggy and should be fixed.

It should use (for example): LANG=C grep '[A-Z]'
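
A minimal sketch of this workaround, assuming a POSIX shell and any POSIX grep. It sets LC_ALL rather than LANG, because LC_ALL takes precedence over the individual LC_* variables and over LANG; a plain LANG=C has no effect on ranges if the environment already exports LC_ALL or LC_COLLATE. The variable name `status` is only illustrative.

```shell
# Force the POSIX locale for this one grep invocation only.
# In the C locale, [A-Z] covers exactly the uppercase ASCII letters.
status=0
echo testing | LC_ALL=C grep '[A-Z]' || status=$?

# grep exits with status 1 when no line matched.
echo "exit status: $status"
```

The setting applies only to the command it prefixes, so the rest of the script keeps the user's locale.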

Comment 2 Mike McLean 2003-03-03 23:08:09 UTC

*** Bug 85532 has been marked as a duplicate of this bug. ***

Comment 3 Need Real Name 2003-03-04 05:57:42 UTC

POSIX compatibility should be the default for Red Hat Linux, NOT Microsoft DOS
compatibility!!!

Comment 4 Tim Waugh 2003-03-04 09:43:28 UTC

This locale-dependent behaviour *is* POSIX compliant. (Read the spec.)

Comment 5 James Turnbull 2003-03-06 22:33:02 UTC

This should really be called a "regex" problem (possibly) because it happens 
with awk and sed also.

What's this about Linux not being POSIX compliant on a default installation?

Books have been written about grep showing that it's case sensitive.  While 
grep may depend on the character chart used, shouldn't the default installation 
of Linux work as books have said?  Why "break" grep, awk, and sed?  

That being said, my FreeBSD box at home works just FINE when I try:

     echo "abcd" | grep [A-Z]

my FreeBSD box doesn't display anything (as I'd expect).  But Linux does.  
Huh?  I don't even have the $LANG variable set on my FreeBSD box, and it wasn't 
set by default.

SO, I decided to see for myself.  I tested a default installed RedHat 8.0 
system.
Here are some test cases that I came up with.  First I'll show them with notes 
off to the side, then I'll show them without the notes:

$ echo "abcxyz" | grep [Y-Z] NOTE:  OK, so it gets multi-case characters
abcxyz
$ echo $LANG                 NOTE:  Yep, that infamous "$LANG"
en_US.UTF-8
$ echo "abcxy" | grep [Y-Z]  NOTE:  no response -- strange
$ echo "abcxz" | grep [Y-Z]  NOTE:  There *IS* a response here, though
abcxz

$ echo "abcxz" | grep [A-A]  NOTE:  Strange (similar to above, though)
$ echo "axz" | grep [A-B]    NOTE:  OK, same as the first Y-Z example
$ echo "axz" | grep [a-b]    NOTE:  WTHeck?  It does this for LOWER case, but not upper?
axz
$ echo "axz" | grep [a-a]    NOTE:  Again, lower but not upper?!
axz
$ echo "axz" | grep [A]      NOTE:  "Wow."
$

_and WITHOUT the notes:_
$ echo "abcxyz" | grep [Y-Z] 
abcxyz
$ echo $LANG
en_US.UTF-8
$ echo "abcxy" | grep [Y-Z]  
$ echo "abcxz" | grep [Y-Z]
abcxz

$ echo "abcxz" | grep [A-A]
$ echo "axz" | grep [A-B]
$ echo "axz" | grep [a-b]
axz
$ echo "axz" | grep [a-a]
axz

$ echo "axz" | grep [a-a]
axz
$ echo "axz" | grep [A]
$

SO, this must mean that Linux is sometimes case sensitive, sometimes not.

In actuality, what it really means is that the order of the character chart 
for "en_US.UTF-8" must be in some order like:
     "a A b B ..."
To verify this, I tried:
$ echo "Axy" | grep [a-b]
Axy
$

Which supports my conclusion.
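
One way to look at the collation order directly, rather than inferring it from grep, is to let sort arrange a few letters. This is a sketch assuming a POSIX shell with sort and tr; the variable names are illustrative. Only the C locale result is fixed; the second result depends on whichever locale the session uses and on the locale data installed.

```shell
# Sort four letters under the C locale, where collation follows byte values:
# every uppercase ASCII letter sorts before every lowercase one.
c_order=$(printf '%s\n' b A a B | LC_ALL=C sort | tr '\n' ' ')
echo "C locale:       $c_order"

# Sort the same letters under whatever locale the session is using.
# In a locale with dictionary-style collation the two cases are interleaved.
cur_order=$(printf '%s\n' b A a B | sort | tr '\n' ' ')
echo "current locale: $cur_order"
```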

Idly wondering what the Unicode people were smoking, I went looking around for 
their character chart.
All that I could find was:
http://www.unicode.org/charts/PDF/U0000.pdf

This shows that A-Z is listed from 0041 to 005A and a-z is from 0061 to 007A.  
Meaning that it is NOT mixed up like "a A b B ..."!

SO, my conclusion is one of the following:

1.  I have the wrong Unicode character chart and the Unicode people really were 
smoking something.

2.  The way that Linux does its regex processing is wrong (since the problem 
manifests in awk and sed also).

3.  The way that Linux does its Unicode processing is wrong.

I think it's either 2 or 3.  Since when you set $LANG to C, it works, I might 
lean slightly to 3.

You guys seem to say all of this is POSIX compliant, though, but I don't 
understand -- how is this POSIX compliant?

Comment 6 Miloslav Trmac 2003-03-07 11:41:53 UTC

Enough of this bitching, please. This POSIX requirement has been beaten to death
on the appropriate forums and is not going to change.

http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html
> In the POSIX locale, a range expression represents the set of collating
> elements that fall between two elements in the collation sequence, inclusive.
> In other locales, a range expression has unspecified behavior: strictly
> conforming applications shall not rely on whether the range expression is
> valid, or on the set of collating elements matched. A range expression shall be
> expressed as the starting point and the ending point separated by a hyphen
> ( '-' ).

So, in regexps you can NEVER portably use range expressions unless the locale is
POSIX or C. And the range is evaluated using LC_COLLATE, which is "dictionary
order", that is aAbB or AaBb in most locales.
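
For scripts that must keep the user's locale, the POSIX character classes are an alternative to ranges. The sketch below assumes a POSIX grep; the variable names are illustrative. Note that [[:upper:]] follows LC_CTYPE, so in a non-C locale it may also match uppercase letters outside A-Z.

```shell
# [[:upper:]] is a character class, not a range, so it does not depend on
# where uppercase letters fall in the locale's collating sequence.
# grep -c prints the count but exits non-zero on 0 matches, hence "|| true".
lower_hits=$(echo testing | grep -c '[[:upper:]]' || true)
upper_hits=$(echo Testing | grep -c '[[:upper:]]' || true)

echo "testing: $lower_hits matching line(s)"
echo "Testing: $upper_hits matching line(s)"
```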

As for globs:
http://www.opengroup.org/onlinepubs/007904975/utilities/xcu_chap02.html
> The description of basic regular expression bracket expressions in the Base
> Definitions volume of IEEE Std 1003.1-2001, Section 9.3.5, RE Bracket
> Expression shall also apply to the pattern bracket expression, except that the
> exclamation mark character ( '!' ) shall replace the circumflex character ( '^' )
> in its role in a "non-matching list" in the regular expression notation. A
> bracket expression starting with an unquoted circumflex character produces
> unspecified results.

i.e. same rules apply.
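
Since shell patterns use the same bracket expression rules, the same care applies to `case` statements and filename globs. A small sketch, assuming a POSIX shell whose pattern matching supports character classes; the function name `first_char_kind` is hypothetical.

```shell
# Classify a word by its first character using a shell pattern (glob).
# The character classes avoid any dependence on LC_COLLATE range ordering.
first_char_kind() {
    case $1 in
        [[:upper:]]*) echo upper ;;
        [[:lower:]]*) echo lower ;;
        *)            echo other ;;
    esac
}

first_char_kind testing
first_char_kind Testing
first_char_kind 42
```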

Comment 7 Need Real Name 2003-03-07 17:04:04 UTC

James has hit the nail on the head for me.  Books have been written, 
demonstrating and exploiting the case-sensitive nature of ranges in regexps.  
User scripts use this.  Retail software depends on this.  Regardless of the 
technical reasons, Redhat sells a product.  The product has changed and it 
breaks stuff.  That's bad.

 - Chris

Comment 8 James Turnbull 2003-03-07 18:31:09 UTC

Thanks, Chris.
You're right -- you shouldn't have to look online in order to get your scripts 
to work properly.  They should work as expected and defined by default.  They 
don't.  That's the issue.

Miloslav, I'm afraid you misunderstand our point.  I'll try to make it slightly 
more clear.

The problem is that BY DEFAULT, your locale is set so that grep, awk, and sed 
DO NOT work as specified for the POSIX locale.  By default Linux is NOT the 
POSIX locale!  Does this mean Linux is not POSIX compliant by default?  I'd say 
yes.  Technically, though, perhaps just NOT being in the POSIX locale means 
that you're POSIX compliant -- after all, you work in an "unspecified" way.  
Could you then say that Microsoft is POSIX compliant because it is not the 
POSIX locale?  :D (Of course Microsoft doesn't really offer an option.)
Sigh.  I thought the reason we were using *nix and not Microsoft was because of 
crap like this.

Additionally, from your statement, it appears that my conclusion number 1 was 
correct:  that the Unicode people were smoking crack.  Of course, that's 
assuming that you're right about your character map.  Does anyone have a 
character map for "en_US.UTF-8?"  

WELL, looking in:  /usr/share/i18n/charmaps/UTF-8.gz
I found it.  I gunzip'ed it and guess what?  The range from A-Z is 0041 to 005A 
(as I said above), and the range from a-z is 0061 to 007A (as I said above).  
Whoops!  This isn't what you said.  The Unicode people *ARE* right.

So which is it?  Is the regular expression processing for range contiguous?  Or 
do grep, awk, and sed JUST NOT WORK without the POSIX locale in Linux?

My conclusions:

1.  The Unicode people were right.  Their chart is a good one.

2.  The way Linux handles range expressions for regular expressions processing 
is not proper and does not conform to the way it was intended.

3.  Linux may have a bug in the way that it handles Unicode processing.

4.  By default, Linux RedHat 8.0 is not set up to use a POSIX environment.

5.  The UTF-8 character set should not be used on Linux 8.0 because range 
expressions work in an "unspecified manner" and your basic scripts will not 
work.

6.  RedHat should avoid breaking legacy code left and right in order to bring 
out a new character set that is not used for POSIX compliance.


Damn, you guys almost had a Linux convert in me.  I'll stick to FreeBSD until 
Linux sorts out their "unspecified behavior" problems.  God knows I don't want 
to have to "hack" my environment in order to get my scripts to work.

Comment 9 Miloslav Trmac 2003-03-07 19:18:18 UTC

> You're right -- you shouldn't have to look online in order to get your scripts
> to work properly.  They should work as expected and defined by default.  They
> don't.  That's the issue.
No. The issue is that your scripts are WRONG and not guaranteed to work on any
POSIX-compliant system; they just happened to work on whatever you were using before.

> The problem is that BY DEFAULT, your locale is set so that grep, awk, and sed 
> DO NOT work as specified for the POSIX locale.
Which is *right*, because *users* don't live in POSIX, but in USA,
China, Czech Republic.

> By default Linux is NOT the POSIX locale!  Does this mean Linux is not POSIX 
> compliant by default?
NO. ISO C and POSIX have the notion of locales, and the "implementation" (i.e.
the Linux OS) can provide many locales. The "implementation" either is or is not
conforming to the standard, regardless of the currently used locale.
"POSIX" is just *one locale* the standard requires to exist, which has
well-defined behavior analogous to what you are used to and think POSIX requires.
POSIX does NOT require it, unless the locale is "POSIX" or, equivalently, "C".

> I'd say yes.
I'd say you should read the standard.

> Technically, though, perhaps just NOT being in the POSIX locale means 
> that you're POSIX compliant -- after all, you work in an "unspecified" way.
That's exactly what the excerpt from the standard regarding range expressions
says.

> Could you then say that Microsoft is POSIX compliant because it is not the
> POSIX locale?  :D (Of course Microsoft doesn't really offer an option.)
No. The "implementation" is required to provide many things *regardless* of
locale. BTW, Microsoft did have a POSIX compliant interface in NT 3.x.

> Sigh.  I thought the reason we were using *nix and not Microsoft was because
> of crap like this.
You mean crap like programs relying on POSIX-unspecified behavior?

> So which is it?  Is the regular expression processing for range contiguous?
It is "you don't understand range expressions". Range expressions are EXPLICITLY
defined to be ranges in *collating sequence*, that is LC_COLLATE-dependent
dictionary searching order. They are NOT necessarily ranges of character codes
in any encoding you might think of.
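
Because only the range form consults the collating sequence, spelling the set out avoids the question entirely. This matches the earlier observation in this report that `grep [A]` did not match "axz". A sketch, assuming a POSIX grep; the variable names are illustrative.

```shell
# Spell out the set instead of writing a range: matching a listed character
# does not involve its position in the collating sequence.
upper='[ABCDEFGHIJKLMNOPQRSTUVWXYZ]'

# grep -c prints the count but exits non-zero on 0 matches, hence "|| true".
hits_lower=$(echo axz | grep -c "$upper" || true)
hits_upper=$(echo Axy | grep -c "$upper" || true)

echo "axz: $hits_lower matching line(s)"
echo "Axy: $hits_upper matching line(s)"
```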

> Or do grep, awk, and sed JUST NOT WORK without the POSIX locale in Linux?
They work exactly as they are supposed to. It just is not how you expect
them to work.

> 1.  The Unicode people were right.  Their chart is a good one.
Character codes are irrelevant in range expressions.

> 2.  The way Linux handles range expressions for regular expressions processing
> is not proper and does not conform to the way it was intended.
It is exactly what POSIX mandates. You should have complained *before* it was
mandated, but I bet you never noticed.

> 4.  By default, Linux RedHat 8.0 is not set up to use a POSIX environment.
Correction: By default, Red Hat Linux 8.0 is not set up to use the "POSIX" locale.
Red Hat Linux 8.0 is (in this respect) a conforming POSIX "implementation".

> 5.  The UTF-8 character set should not be used on Linux 8.0 because range 
> expressions work in an "unspecified manner" and your basic scripts will not 
> work.
Again, this has no relation to UTF-8. You get the same behavior when
locale is set just to "en_US" (using ISO-8859-1).

> 6.  RedHat should avoid breaking legacy code left and right in order to bring
> out a new character set that is not used for POSIX compliance.
Based on wrong assumptions you can only arrive at a wrong conclusion.

> I'll stick to FreeBSD until Linux sorts out their "unspecified behavior"
> problems.
YOUR scripts depend on unspecified behavior, YOU have things to sort out.

> God knows I don't want to have to "hack" my environment in order to get my
> scripts to work.
Tough.


Disclaimer: Just in case someone misinterprets this (I admit I'm a bit
frustrated by now), I don't work for Red Hat and this can therefore in
no way be interpreted as an official statement of Red Hat, Inc.

Comment 10 James Turnbull 2003-03-07 19:48:37 UTC

Ok, I give up.  Apparently all the books written about how awk, sed, and grep 
work are wrong and Linux RedHat 8.0 is right.

Apparently RedHat versions before 8.0 and every other Unix variant that I've 
seen just happened to use the POSIX compliant locale, but RedHat 8.0 does 
it "right" and doesn't necessarily follow along with the previous way of 
operating.

SO, if my scripts don't conform to POSIX, how do I make them conform?
What other commands do I need to add to the top of my 2 line scripts in order 
to get them to work in the new Linux environment like they did in the previous 
ones?

Sigh.

Comment 11 Tim Waugh 2003-03-07 20:09:13 UTC

No grep bug here.

Comment 12 Jakub Jelinek 2003-03-07 20:27:31 UTC

There is just one POSIX (C) locale, the rest are locales for the languages and
territories people live in.
If your script relies on POSIX locale collation, you just have to say so,
whether by doing export LC_ALL=C for the whole script or just for the commands
where it matters, say:
echo testing | LC_ALL=C grep "[A-Z]"
something | LC_ALL=C sort
etc.
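
As a sketch of the whole-script variant mentioned above, assuming a POSIX shell with grep, sort and tr (the variable names are illustrative):

```shell
#!/bin/sh
# Pin the whole script to the POSIX locale so that ranges and sort order
# follow byte values regardless of the caller's LANG or LC_* settings.
LC_ALL=C
export LC_ALL

# grep -c prints the count but exits non-zero on 0 matches, hence "|| true".
matches=$(echo testing | grep -c '[A-Z]' || true)
sorted=$(printf '%s\n' b A a B | sort | tr '\n' ' ')

echo "lines matching [A-Z]: $matches"
echo "sort order: $sorted"
```

Exporting LC_ALL=C also switches character classification and messages to the C locale for every command the script runs, so the per-command form may be preferable when the script has to handle non-ASCII text.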
