There is a nasty bug in the glibc regex code. It requires the following patch (whitespace likely mangled by "user-friendly" interfaces):

--- glibc-2.2.1/posix/regex.c~ Mon Oct 30 01:02:56 2000
+++ glibc-2.2.1/posix/regex.c Tue Feb 6 12:18:40 2001
@@ -5209,7 +5209,7 @@
 {
 int not = (re_opcode_t) p1[3] == charset_not;

- if (c < (unsigned char) (p1[4] * BYTEWIDTH)
+ if (c < (unsigned) (p1[4] * BYTEWIDTH)
 && p1[5 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
 not = !not;

In the code in question 'p1[4]' holds the length of the bitmap for a character class in a compiled regular expression. The first clause checks that 'c' falls within that bitmap, and the next clause checks whether the bit which corresponds to 'c' is really on, IFF there is something to check. If the character class under consideration includes something from the range 248-255 then 'p1[4]' equals 32, the product 'p1[4] * BYTEWIDTH' is 256, the whole expression '(unsigned char) (p1[4] * BYTEWIDTH)' evaluates to 0, and the test always fails.

This bug is filed against glibc-2.2.1 but actually it goes back to times immemorial. :-)

Because the same regexp code is used by currently released versions of gawk (and who knows what else yet), this bug affects it as well and can be easily demonstrated by the following simple gawk script (a slight modification of a test case supplied by Jorge Stolfi <stolfi.br>):

#! /usr/bin/gawk -f
BEGIN {
    for (c = 247; c < 256; c += 8) {
        do_test(c);
    }
}

function do_test(char, pat,s,t) {
    pat = sprintf("[an%c]*n", char);
    s = "bananas and ananases in canaan";
    t = s;
    gsub(pat, "AN", t);
    printf "%c %-8s %s\n", char, pat, t;
}

Michal Jaegermann
michal

p.s. The same report also mailed to bugs
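The truncation described above can be reproduced outside regex.c. The following is only a minimal sketch, not code from glibc: the helper names 'bound_ok_before', 'bound_ok_after' and 'check_bound_examples' are hypothetical, BYTEWIDTH is assumed to be 8, and the bitmap length is assumed to be stored in an unsigned char as 'p1[4]' appears to be.

```c
#include <assert.h>

#define BYTEWIDTH 8   /* assumed: bits per byte, as in regex.c */

/* Bound check in the pre-patch form: the product is truncated to 8 bits. */
static int bound_ok_before(unsigned c, unsigned char bitmap_len)
{
    return c < (unsigned char) (bitmap_len * BYTEWIDTH);
}

/* Bound check in the patched form: the product keeps its full value. */
static int bound_ok_after(unsigned c, unsigned char bitmap_len)
{
    return c < (unsigned) (bitmap_len * BYTEWIDTH);
}

/* Hypothetical self-check covering the two cases from the report. */
static void check_bound_examples(void)
{
    /* 31-byte bitmap: 31 * 8 = 248 fits in an unsigned char, both forms agree. */
    assert(bound_ok_before(247, 31) == 1);
    assert(bound_ok_after(247, 31) == 1);

    /* 32-byte bitmap: 32 * 8 = 256 truncates to 0, so the old form rejects every c. */
    assert(bound_ok_before(255, 32) == 0);
    assert(bound_ok_before(0, 32) == 0);
    assert(bound_ok_after(255, 32) == 1);
}
```

Calling 'check_bound_examples()' from a test program should pass silently. Note that with a 32-byte bitmap the pre-patch form rejects even c = 0, which matches the observation that the test always fails once the class reaches into 248-255.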
This defect is considered MUST-FIX for Florence Release-Candidate #1
Submitted to libc-hacker together with a C testcase. Since 2001-02-02 this should no longer trigger in glibc itself, because regex now supports multibyte character sets and has this broken code ifdefed out. People taking regex from glibc will probably not have multibyte support enabled, though, so the fix will still matter for them.
This should be fixed in glibc-2.2.1-7 (together with memory leaks in the new multibyte charset regex code).
Just a note that although the bug was filed against the regex code in glibc, its presence is more pervasive. 'gawk', which was used to demonstrate the bug, uses a _copy_ of this code and not a shared library. The same is likely true of 'sed', and possibly 'grep', 'tar', and who knows what else. I am afraid that I did not check all possibilities.