Bug 26355 - regex code "forgets" characters in character classes
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: glibc
Version: 7.1
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jakub Jelinek
QA Contact: Aaron Brown
URL:
Whiteboard: Florence RC-1
Depends On:
Blocks:
 
Reported: 2001-02-06 20:44 UTC by Michal Jaegermann
Modified: 2016-11-24 15:23 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2001-02-07 00:26:03 UTC
Embargoed:



Description Michal Jaegermann 2001-02-06 20:44:14 UTC
There is a nasty bug in the glibc regex code.  It needs the following
patch (whitespace likely mangled by "user-friendly" interfaces):

--- glibc-2.2.1/posix/regex.c~  Mon Oct 30 01:02:56 2000
+++ glibc-2.2.1/posix/regex.c   Tue Feb  6 12:18:40 2001
@@ -5209,7 +5209,7 @@
                  {
                    int not = (re_opcode_t) p1[3] == charset_not;
 
-                   if (c < (unsigned char) (p1[4] * BYTEWIDTH)
+                   if (c < (unsigned) (p1[4] * BYTEWIDTH)
                        && p1[5 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
                      not = !not;
 
In the code in question, 'p1[4]' holds the length in bytes of the bitmap
for a character class in a compiled regular expression.  The first clause
checks that 'c' falls within that bitmap, and the next clause checks
whether the bit corresponding to 'c' is really on.  If the character
class under consideration includes something from the range 248-255,
then 'p1[4]' equals 32 and the whole expression
'(unsigned char) (p1[4] * BYTEWIDTH)' evaluates to 0, because
32 * 8 = 256 does not fit in eight bits.  The test then always fails,
for every 'c'.

This bug is filed against glibc-2.2.1, but it actually goes back to
times immemorial. :-)  Because the same regexp code is used by currently
released versions of gawk (and who knows what else), this bug
affects gawk as well and can be easily demonstrated by the following
simple gawk script (a slight modification of a test case supplied
by Jorge Stolfi <stolfi.br>):

#! /usr/bin/gawk -f
BEGIN {
    for (c = 247; c < 256; c += 8)
        { do_test(c); }
}

function do_test(char,   pat,s,t)
{
    pat = sprintf("[an%c]*n", char);
    s = "bananas and ananases in canaan";
    t = s;
    gsub(pat, "AN", t);
    printf "%c %-8s  %s\n", char, pat, t;
}

   Michal Jaegermann
   michal

P.S. The same report was also mailed to bugs

Comment 1 Glen Foster 2001-02-07 00:09:25 UTC
This defect is considered MUST-FIX for Florence Release-Candidate #1

Comment 2 Jakub Jelinek 2001-02-07 00:25:57 UTC
Submitted to libc-hacker together with a C testcase.  Since 2001-02-02
this should no longer trigger in glibc itself, because regex now
supports multibyte character sets and has this broken code ifdefed out.
People taking regex from glibc will probably not have multibyte support
enabled, though, so it will still matter for them.

Comment 3 Jakub Jelinek 2001-02-12 09:43:28 UTC
This should be fixed in glibc-2.2.1-7 (together with memory leaks in
the new multibyte charset regex code).

Comment 4 Michal Jaegermann 2001-02-12 16:55:59 UTC
Just a note that although the bug was filed against the regex code in
glibc, its presence is more pervasive. 'gawk', which was used to
demonstrate the bug, uses a _copy_ of this code and not a shared library.
The same is likely true of 'sed', and possibly 'grep', 'tar', ... and who
knows what else.  I am afraid that I did not check all possibilities.

