Description of Problem:
The following well-known regexp from rfc2396:

"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"

used to work fine with glibc-2.2 for splitting a URL into components (scheme, authority, path, query, fragment), but no longer seems to work with the glibc-2.3 regex code.

Version-Release number of selected component (if applicable):
glibc-2.2.90-14

How Reproducible:
Every time

Steps to Reproduce:
1. Match the above regexp against "http://www.redhat.com/rhn/"

Actual Results:
With glibc-2.3, the URL matches as a path component. With glibc-2.2, the URL is split correctly into scheme, authority, and path.

Expected Results:
The same results with both versions.

Additional Information:
I'm not sure whether this change is intentional. If so, perhaps it's already documented somewhere?
I guess you need to provide more details on exactly how you match against that regexp and what you see, because I've tried to reproduce this with:

#include <sys/types.h>
#include <regex.h>
#include <locale.h>
#include <stdio.h>

int
main (int argc, char *argv[])
{
  regex_t re;
  regmatch_t mat[20];
  int i;

  if (setlocale (LC_ALL, "en_US.UTF-8") == NULL)
    puts ("cannot set locale");
  else if (regcomp (&re, "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?", REG_EXTENDED) != REG_NOERROR)
    puts ("cannot compile expression");
  else if (regexec (&re, "http://www.redhat.com/rhn/", 20, mat, 0) == REG_NOMATCH)
    puts ("no match");
  else
    {
      /* Print every subexpression that took part in the match.  */
      for (i = 0; i < 20; ++i)
        if (mat[i].rm_so != -1)
          printf ("%d: %.*s\n", i, (int) (mat[i].rm_eo - mat[i].rm_so),
                  "http://www.redhat.com/rhn/" + mat[i].rm_so);
    }
  return 0;
}

and get the expected result on both glibc-2.2.5-34 and glibc-2.2.90-14.
Thanks for your fast response. It seems the problem is locale related. Why is it necessary to set the locale to get this to work? Also, it seems only some locales work. E.g. setting the locale to ja_JP, de_DE.utf-8, or en_US.utf-8 works, but not C, POSIX, en_US, de_DE, etc.
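For anyone who wants to check the locale dependence directly, here is a minimal sketch. It compiles the pattern after switching locale and reports whether subexpression 2 (the scheme, per the rfc2396 numbering) took part in the match. The names URL_PATTERN, NGROUPS, split_url and scheme_found_in_locale are made up for this illustration, and which locales are installed will vary from system to system.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>
#include <sys/types.h>

/* Pattern from rfc2396; subexpressions 2, 4, 5, 7 and 9 are the
   scheme, authority, path, query and fragment.  */
const char URL_PATTERN[] =
  "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?";

/* Whole match plus nine subexpressions.  */
enum { NGROUPS = 10 };

/* Hypothetical helper: compile the pattern in the current locale and
   match it against URL.  Returns 0 on a match, -1 otherwise.  */
int
split_url (const char *url, regmatch_t groups[NGROUPS])
{
  regex_t re;
  int rc;

  assert (url != NULL);
  if (regcomp (&re, URL_PATTERN, REG_EXTENDED) != 0)
    return -1;
  rc = regexec (&re, url, NGROUPS, groups, 0);
  regfree (&re);
  return rc == 0 ? 0 : -1;
}

/* Hypothetical helper: returns 1 if the scheme subexpression matched
   under LOCALE, 0 if it did not, and -1 if the locale is not
   available on this system.  */
int
scheme_found_in_locale (const char *locale, const char *url)
{
  regmatch_t groups[NGROUPS];

  if (setlocale (LC_ALL, locale) == NULL)
    return -1;
  if (split_url (url, groups) != 0)
    return 0;
  return groups[2].rm_so != -1;
}
```

Calling scheme_found_in_locale with "C", "en_US" and "en_US.UTF-8" in turn on "http://www.redhat.com/rhn/" should return 1 for every installed locale on a glibc without this bug. On an affected build, going by the report above, the non-UTF-8 locales would be expected to return 0.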
It is a bug in the regex code. I've spent half a day on it but have not managed to fix it yet; I hope the author of the new regex code will have a look at it too. The difference between MBS and non-MBS locales is just in the internal representation of the regular expression: in the non-MBS case it is just a bitset for the first range, while in the MBS case it is an alternative of a bitset and all MBS chars above 128.
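To illustrate what "just a bitset" means for a bracket expression such as [^:/?#], here is a rough sketch of a 256-bit set indexed by byte value. This is not glibc's internal code; byte_set, byte_set_negated and byte_set_has are hypothetical names chosen for the example. It models only the single-byte (non-MBS) case described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One bit per possible byte value: 256 bits in 32 bytes.  */
typedef struct
{
  unsigned char bits[32];
} byte_set;

/* Build the set for a negated bracket expression: start with every
   byte allowed, then clear the bit of each excluded character.  */
void
byte_set_negated (byte_set *s, const char *excluded)
{
  assert (s != NULL && excluded != NULL);
  memset (s->bits, 0xff, sizeof s->bits);
  for (; *excluded != '\0'; ++excluded)
    {
      unsigned char c = (unsigned char) *excluded;
      s->bits[c >> 3] &= (unsigned char) ~(1u << (c & 7));
    }
}

/* Return 1 if byte C is a member of the set, 0 otherwise.  */
int
byte_set_has (const byte_set *s, unsigned char c)
{
  return (s->bits[c >> 3] >> (c & 7)) & 1;
}
```

After byte_set_negated (&s, ":/?#"), a lookup for 'h' succeeds and a lookup for ':' fails, which is all a single-byte matcher needs. In a multibyte locale, bytes above 128 are no longer characters on their own, which is presumably why that case needs the extra alternative mentioned above.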
Confirmed fixed in glibc-2.2.90-15. Thanks :-)