Bug 26355 - regex code "forgets" characters in character classes
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: glibc
Version: 7.1
Hardware: i386
OS: Linux
Assignee: Jakub Jelinek
QA Contact: Aaron Brown
Whiteboard: Florence RC-1
Reported: 2001-02-06 20:44 UTC by Michal Jaegermann
Modified: 2016-11-24 15:23 UTC (History)

Doc Type: Bug Fix
Last Closed: 2001-02-07 00:26:03 UTC

Description Michal Jaegermann 2001-02-06 20:44:14 UTC
There is a nasty bug in glibc regex code.  It requires the following
patch (whitespace likely mangled by "user-friendly" interfaces):

--- glibc-2.2.1/posix/regex.c~  Mon Oct 30 01:02:56 2000
+++ glibc-2.2.1/posix/regex.c   Tue Feb  6 12:18:40 2001
@@ -5209,7 +5209,7 @@
                    int not = (re_opcode_t) p1[3] == charset_not;
-                   if (c < (unsigned char) (p1[4] * BYTEWIDTH)
+                   if (c < (unsigned) (p1[4] * BYTEWIDTH)
                        && p1[5 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
                      not = !not;

In the code in question 'p1[4]' holds the length, in bytes, of the bitmap
for a character class in a compiled regular expression.  The next clause
checks whether the bit corresponding to 'c' is set, but only if there is
something to check.  If the character class under consideration
includes anything from the range 248-255 then 'p1[4]' equals 32, so
'p1[4] * BYTEWIDTH' is 256.  The cast '(unsigned char)' truncates that
to 0, the comparison becomes 'c < 0', and the test always fails.

This bug is filed against glibc-2.2.1 but actually it goes back to
times immemorial. :-)  Because the same regexp code is used by currently
released versions of gawk (and who knows what else yet) this bug
affects it as well and can be easily demonstrated by the following
simple gawk script (a slight modification of a test case supplied
by Jorge Stolfi <stolfi@ic.unicamp.br>):

#! /usr/bin/gawk -f
BEGIN {
    # 247 is just below the affected range 248-255; 255 is inside it.
    for (c = 247; c < 256; c += 8)
        { do_test(c); }
}

function do_test(char,   pat,s,t)
{
    pat = sprintf("[an%c]*n", char);
    s = "bananas and ananases in canaan";
    t = s;
    gsub(pat, "AN", t);
    printf "%c %-8s  %s\n", char, pat, t;
}

   Michal Jaegermann

p.s. The same report also mailed to bugs@gnu.org

Comment 1 Glen Foster 2001-02-07 00:09:25 UTC
This defect is considered MUST-FIX for Florence Release-Candidate #1

Comment 2 Jakub Jelinek 2001-02-07 00:25:57 UTC
Submitted to libc-hacker together with a C testcase.  Since
2001-02-02 this should no longer trigger in glibc itself, because
regex now supports multibyte character sets and has this broken code
ifdefed out.  People taking regex from glibc will probably not have
multibyte support enabled, though, so it will still matter for them.

Comment 3 Jakub Jelinek 2001-02-12 09:43:28 UTC
This should be fixed in glibc-2.2.1-7 (together with memory leaks in
the new multibyte charset regex code).

Comment 4 Michal Jaegermann 2001-02-12 16:55:59 UTC
Just a note that although the bug was filed against the regex code in
glibc, its presence is more pervasive.  'gawk', which was used to
demonstrate the bug, uses a _copy_ of this code and not a shared library.
The same is likely true of 'sed', and possibly 'grep', 'tar', and who
knows what else.  I am afraid that I did not check all possibilities.
