Bug 67734
Summary: | glibc regexp submatching behaviour seems to have changed | ||
---|---|---|---|
Product: | [Retired] Red Hat Raw Hide | Reporter: | Jens Petersen <petersen> |
Component: | glibc | Assignee: | Jakub Jelinek <jakub> |
Status: | CLOSED RAWHIDE | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 1.0 | CC: | fweimer |
Hardware: | i386 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2002-07-12 08:54:06 UTC | Type: | --- |
Description
Jens Petersen
2002-07-01 09:16:55 UTC
Guess you need to provide more details on how exactly you match against that regexp and what you see, because I've tried to reproduce this with:

    #include <sys/types.h>
    #include <regex.h>
    #include <locale.h>
    #include <stdio.h>

    int main (int argc, char *argv[])
    {
      regex_t re;
      regmatch_t mat[20];
      int i;

      if (setlocale (LC_ALL, "en_US.UTF-8") == NULL)
        puts ("cannot set locale");
      else if (regcomp (&re, "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?", REG_EXTENDED) != REG_NOERROR)
        puts ("cannot compile expression \"[a-f]*\"");
      else if (regexec (&re, "http://www.redhat.com/rhn/", 20, mat, 0) == REG_NOMATCH)
        puts ("no match");
      else
        {
          for (i = 0; i < 20; ++i)
            if (mat[i].rm_so != -1)
              printf ("%d: %.*s\n", i, mat[i].rm_eo - mat[i].rm_so,
                      "http://www.redhat.com/rhn/" + mat[i].rm_so);
        }
      return 0;
    }

and get the expected result both on glibc-2.2.5-34 and glibc-2.2.90-14.

Thanks for your fast response. It seems the problem is locale related. Why is it necessary to set the locale to get this to work? Also, it seems only some locales work. E.g. setting the locale to ja_JP, de_DE.utf-8, or en_US.utf-8 works, but C, POSIX, en_US, de_DE, etc. do not.

It is a bug in the regex code. I've spent half a day on it but have not managed to fix it yet; I hope the new regex code author will have a look at it too. The difference between MBS and non-MBS locales is just in the internal representation of the regular expression (in the second case it is just a bitset for the first range, in the first case it is an alternative of a bitset and all MBS chars above 128).

Confirmed fixed in glibc-2.2.90-15. Thanks :-)
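Since the failure depended on the active locale, a regression check could run the same expression under each locale of interest and compare the submatches. The following is only a sketch under stated assumptions: the helper names `match_uri_in_locale` and `copy_group` are hypothetical and not part of glibc or this report, and the expression is the one from the test program above (it appears to be the URI-splitting pattern from Appendix B of RFC 2396/3986).

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <string.h>

/* Expression copied from the reproduction program in this report. */
#define URI_RE "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"

/* Hypothetical helper: switch to `locale`, compile URI_RE and match it
   against `subject`, filling `mat[0..nmat-1]`.
   Returns 0 on a match, -1 if the locale is unavailable,
   -2 if compilation fails, -3 if there is no match. */
int match_uri_in_locale(const char *locale, const char *subject,
                        regmatch_t *mat, size_t nmat)
{
    regex_t re;
    int rc;

    if (setlocale(LC_ALL, locale) == NULL)
        return -1;
    if (regcomp(&re, URI_RE, REG_EXTENDED) != 0)
        return -2;
    rc = regexec(&re, subject, nmat, mat, 0);
    regfree(&re);
    return rc == 0 ? 0 : -3;
}

/* Hypothetical helper: copy the text of one submatch into `buf`
   (truncating if needed). Returns 1 if the group took part in the
   match, 0 if it did not (rm_so == -1). */
int copy_group(const char *subject, const regmatch_t *m,
               char *buf, size_t buflen)
{
    size_t len;

    assert(buflen > 0);
    if (m->rm_so == -1)
        return 0;
    len = (size_t)(m->rm_eo - m->rm_so);
    if (len >= buflen)
        len = buflen - 1;
    memcpy(buf, subject + m->rm_so, len);
    buf[len] = '\0';
    return 1;
}
```

Calling `match_uri_in_locale` once with `"C"` and once with a UTF-8 locale such as `"en_US.UTF-8"` (if installed), then comparing groups 2, 4 and 5 via `copy_group`, would show whether the two internal representations mentioned above agree. On a fixed glibc both locales should yield the same submatches for `http://www.redhat.com/rhn/`.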