Bug 26355 - regex code "forgets" characters in character classes
Status: CLOSED RAWHIDE
Product: Red Hat Linux
Classification: Retired
Component: glibc
Version: 7.1
Hardware: i386 Linux
Priority: medium Severity: medium
Assigned To: Jakub Jelinek
Aaron Brown
Florence RC-1
Reported: 2001-02-06 15:44 EST by Michal Jaegermann
Modified: 2016-11-24 10:23 EST
Doc Type: Bug Fix
Last Closed: 2001-02-06 19:26:03 EST

Description Michal Jaegermann 2001-02-06 15:44:14 EST
There is a nasty bug in glibc regex code.  It requires the following
patch (whitespace likely mangled by "user-friendly" interfaces):

--- glibc-2.2.1/posix/regex.c~  Mon Oct 30 01:02:56 2000
+++ glibc-2.2.1/posix/regex.c   Tue Feb  6 12:18:40 2001
@@ -5209,7 +5209,7 @@
                  {
                    int not = (re_opcode_t) p1[3] == charset_not;
 
-                   if (c < (unsigned char) (p1[4] * BYTEWIDTH)
+                   if (c < (unsigned) (p1[4] * BYTEWIDTH)
                        && p1[5 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
                      not = !not;
 
In the code in question 'p1[4]' holds the length, in bytes, of the bitmap
for a character class in a compiled regular expression (the next clause
checks whether the bit corresponding to 'c' is set, but only if 'c' falls
within that bitmap).  If the character class under consideration
includes anything from the range 248-255 then 'p1[4]' equals 32 and the
product 'p1[4] * BYTEWIDTH' is 256.  The cast to 'unsigned char'
truncates that, so the whole expression
'(unsigned char) (p1[4] * BYTEWIDTH)' evaluates to 0 and the test
'c < 0' always fails.

This bug is filed against glibc-2.2.1, but it actually goes back to
time immemorial. :-)  Because the same regexp code is used by currently
released versions of gawk (and who knows what else), this bug
affects gawk as well and can be easily demonstrated by the following
simple gawk script (a slight modification of a test case supplied
by Jorge Stolfi <stolfi@ic.unicamp.br>):

#! /usr/bin/gawk -f
BEGIN {
    for (c = 247; c < 256; c += 8)
        { do_test(c); }
}

function do_test(char,   pat,s,t)
{
    pat = sprintf("[an%c]*n", char);
    s = "bananas and ananases in canaan";
    t = s;
    gsub(pat, "AN", t);
    printf "%c %-8s  %s\n", char, pat, t;
}

   Michal Jaegermann
   michal@harddata.com

P.S. The same report was also mailed to bugs@gnu.org
Comment 1 Glen Foster 2001-02-06 19:09:25 EST
This defect is considered MUST-FIX for Florence Release-Candidate #1
Comment 2 Jakub Jelinek 2001-02-06 19:25:57 EST
Submitted to libc-hacker together with a C testcase, though since
2001-02-02 this should not trigger in glibc, because regex now
supports multibyte character sets and has this broken code
ifdefed out (but people taking regex from glibc will probably
not have multibyte support enabled, so it will matter for them).
Comment 3 Jakub Jelinek 2001-02-12 04:43:28 EST
This should be fixed in glibc-2.2.1-7 (together with memory leaks in
the new multibyte charset regex code).
Comment 4 Michal Jaegermann 2001-02-12 11:55:59 EST
Just a note that although the bug was filed against the regex code in glibc,
its presence is more pervasive. 'gawk', which was used to demonstrate the
bug, uses a _copy_ of this code and not a shared library.  The same
is likely true of 'sed', and possibly 'grep', 'tar', ... and who knows
what else.  I am afraid that I did not check all the possibilities.
