Red Hat Bugzilla – Bug 67734
glibc regexp submatching behaviour seems to have changed
Last modified: 2016-11-24 10:01:13 EST
Description of Problem:
The following well-known regexp from RFC 2396:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
used to work fine with glibc-2.2 for splitting a URL into
components (scheme, authority, path, query, fragment), but no longer
seems to work with the glibc-2.3 regex code.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Match the above regexp against "http://www.redhat.com/rhn/"
With glibc-2.3, the whole URL matches as the path component.
With glibc-2.2, the url is split correctly in scheme, authority,
The same results with both versions.
I'm not sure whether this change is intentional.
If so, perhaps it's already documented somewhere?
I guess you need to provide more details on exactly how you match against that
regexp and what you see, because I've tried to reproduce this with:
#include <locale.h>
#include <regex.h>
#include <stdio.h>
int main (void)
{
  regex_t re;
  regmatch_t mat[20];   /* whole match plus up to 19 subexpressions */
  int i;
  if (setlocale (LC_ALL, "en_US.UTF-8") == NULL)
    puts ("cannot set locale");
  else if (regcomp (&re, "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?", REG_EXTENDED) != 0)
    puts ("cannot compile expression");
  else if (regexec (&re, "http://www.redhat.com/rhn/", 20, mat, 0) == REG_NOMATCH)
    puts ("no match");
  else
    for (i = 0; i < 20; ++i)
      if (mat[i].rm_so != -1)   /* -1 marks a group that did not take part in the match */
        printf ("%d: %.*s\n", i, (int) (mat[i].rm_eo - mat[i].rm_so), "http://www.redhat.com/rhn/" + mat[i].rm_so);
  return 0;
}
and get the expected result both on glibc-2.2.5-34 and glibc-2.2.90-14.
Thanks for your fast response. It seems the problem is locale
related. Why is it necessary to set the locale to get this to work?
Also, it seems only some locales work. E.g. setting the locale to ja_JP,
de_DE.utf-8, or en_US.utf-8 works, but C, POSIX, en_US, de_DE, etc. do not.
It is a bug in the regex code. I've spent half a day on it but have not managed
to fix it yet; I hope the new regex code's author will have a look at it too.
The difference between MBS and non-MBS locales is just in the internal representation
of the regular expression (in the non-MBS case it is just a bitset for the first
range; in the MBS case it is an alternation of a bitset and all MBS chars above 128).
Confirmed fixed in glibc-2.2.90-15. Thanks :-)