Bug 67734 - glibc regexp submatching behaviour seems to have changed
Status: CLOSED RAWHIDE
Product: Red Hat Raw Hide
Classification: Retired
Component: glibc
Version: 1.0
Hardware: i386 Linux
Priority: medium
Severity: medium
Assigned To: Jakub Jelinek
QA Contact: Brian Brock
Reported: 2002-07-01 05:16 EDT by Jens Petersen
Modified: 2016-11-24 10:01 EST
Doc Type: Bug Fix
Last Closed: 2002-07-12 04:54:06 EDT
Description Jens Petersen 2002-07-01 05:16:55 EDT
Description of Problem:
The following well-known regexp from RFC 2396:

"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"

used to work fine with glibc-2.2 for splitting a URL into
components (scheme, authority, path, query, fragment), but no longer
seems to work with the glibc-2.3 regex code.

Version-Release number of selected component (if applicable):
glibc-2.2.90-14

How Reproducible:
every time

Steps to Reproduce:
1. Match the above regexp against "http://www.redhat.com/rhn/".

Actual Results:
With glibc-2.3, the whole URL matches as the path component.
With glibc-2.2, the URL is split correctly into scheme, authority,
and path.

Expected Results:
The same results with both versions.

Additional Information:
I'm not sure whether this change is intentional.
If so, perhaps it's already documented somewhere?
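
A minimal sketch of how the subexpressions of this regexp map onto URL components, assuming the group numbering given in RFC 2396 Appendix B (2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment). The helper names split_url and url_part and the URL_* constants are hypothetical and only serve the illustration; they are not part of glibc.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Group numbers as listed in RFC 2396 Appendix B.  */
enum
{
  URL_SCHEME = 2, URL_AUTHORITY = 4, URL_PATH = 5,
  URL_QUERY = 7, URL_FRAGMENT = 9, URL_NGROUPS = 10
};

/* Match URL against the RFC 2396 regexp and fill MAT (URL_NGROUPS
   entries).  Returns 0 on a match, nonzero on compile or match failure.  */
static int
split_url (const char *url, regmatch_t mat[URL_NGROUPS])
{
  static const char pattern[] =
    "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?";
  regex_t re;
  int rc;

  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  rc = regexec (&re, url, URL_NGROUPS, mat, 0);
  regfree (&re);
  return rc;
}

/* Copy group IDX of URL into BUF (size LEN, truncating if needed).
   A group that did not take part in the match yields "".  */
static const char *
url_part (const char *url, const regmatch_t *mat, int idx,
          char *buf, size_t len)
{
  size_t n = 0;

  if (mat[idx].rm_so != -1)
    {
      n = (size_t) (mat[idx].rm_eo - mat[idx].rm_so);
      if (n >= len)
        n = len - 1;
      memcpy (buf, url + mat[idx].rm_so, n);
    }
  buf[n] = '\0';
  return buf;
}
```

For "http://www.redhat.com/rhn/", a correct regex implementation should yield scheme "http", authority "www.redhat.com", and path "/rhn/" from url_part. The failure reported here would instead leave the scheme and authority groups unset and put the whole URL into the path group.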
Comment 1 Jakub Jelinek 2002-07-01 06:48:06 EDT
Guess you need to provide more details on how exactly you match against that
regexp and what you see, because I've tried to reproduce this with:
#include <sys/types.h>
#include <regex.h>
#include <locale.h>
#include <stdio.h>

int
main (void)
{
  /* RFC 2396 URL-splitting regexp and the URL from the report.  */
  const char *pattern
    = "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?";
  const char *url = "http://www.redhat.com/rhn/";
  regex_t re;
  regmatch_t mat[20];
  int i;

  if (setlocale (LC_ALL, "en_US.UTF-8") == NULL)
    puts ("cannot set locale");
  else if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    puts ("cannot compile expression");
  else
    {
      if (regexec (&re, url, 20, mat, 0) != 0)
        puts ("no match");
      else
        /* Print every subexpression that took part in the match.  */
        for (i = 0; i < 20; ++i)
          if (mat[i].rm_so != -1)
            printf ("%d: %.*s\n", i, (int) (mat[i].rm_eo - mat[i].rm_so),
                    url + mat[i].rm_so);
      regfree (&re);
    }
  return 0;
}

and get the expected result both on glibc-2.2.5-34 and glibc-2.2.90-14.
Comment 2 Jens Petersen 2002-07-03 07:08:39 EDT
Thanks for your fast response.  It seems the problem is locale
related.  Why is it necessary to set the locale to get this to work?
Also, it seems only some locales work: setting the locale to ja_JP,
de_DE.utf-8, or en_US.utf-8 works, but C, POSIX, en_US, de_DE, etc. do not.
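
One way to check the locale dependence described in this comment is to run the same match under several locales and see whether the scheme group gets set. The following is only a sketch: the function names scheme_found_in_locale and report_locales are hypothetical, the locale list is the one mentioned above, and a locale that is not installed on the system is simply reported as unavailable.

```c
#include <locale.h>
#include <regex.h>
#include <stdio.h>

/* Returns 1 if group 2 (the scheme) is set when URL is matched under
   LOCALE, 0 if it is unset (the failure described in this report), and
   -1 if the locale is unavailable or the regexp fails to compile or
   match.  Note: this changes the process-wide locale.  */
static int
scheme_found_in_locale (const char *locale, const char *url)
{
  regex_t re;
  regmatch_t mat[10];
  int rc;

  if (setlocale (LC_ALL, locale) == NULL)
    return -1;
  /* Compile only after setlocale, so the regexp is built for LOCALE.  */
  if (regcomp (&re,
               "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?",
               REG_EXTENDED) != 0)
    return -1;
  rc = regexec (&re, url, 10, mat, 0);
  regfree (&re);
  if (rc != 0)
    return -1;
  return mat[2].rm_so != -1;
}

/* Print one line per locale: "split", "not split", or "unavailable".  */
static void
report_locales (const char *url)
{
  static const char *const locales[] =
    { "C", "POSIX", "en_US", "de_DE", "ja_JP", "de_DE.utf-8", "en_US.utf-8" };
  size_t i;

  for (i = 0; i < sizeof locales / sizeof locales[0]; ++i)
    {
      int r = scheme_found_in_locale (locales[i], url);
      printf ("%-12s %s\n", locales[i],
              r < 0 ? "unavailable" : r ? "split" : "not split");
    }
}
```

Calling report_locales ("http://www.redhat.com/rhn/") on a fixed glibc should print "split" for every installed locale; on the affected build the single-byte locales would be expected to show "not split".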
Comment 3 Jakub Jelinek 2002-07-03 11:13:07 EDT
It is a bug in the regex code. I've spent half a day on it but have not managed
to fix it yet; I hope the author of the new regex code will have a look at it too.
The difference between MBS (multibyte) and non-MBS locales is just in the internal
representation of the regular expression: in the non-MBS case it is just a bitset
for the first range, while in the MBS case it is an alternative of a bitset and
all MBS chars above 128.
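
To illustrate the two representations mentioned above, here is a loose sketch of a bitset for a negated range such as [^:/?#], plus a multibyte variant that falls back to "accept everything at or above 128". This is not glibc's actual data structure or code; the type byte_set and all function names are hypothetical and only model the idea.

```c
#include <limits.h>
#include <string.h>

/* One bit per possible byte value: enough for a single-byte locale.  */
typedef struct
{
  unsigned char bits[(UCHAR_MAX + 1) / CHAR_BIT];
} byte_set;

static void
byte_set_add (byte_set *s, unsigned char c)
{
  s->bits[c / CHAR_BIT] |= (unsigned char) (1u << (c % CHAR_BIT));
}

static int
byte_set_has (const byte_set *s, unsigned char c)
{
  return (s->bits[c / CHAR_BIT] >> (c % CHAR_BIT)) & 1;
}

/* Build the set for a negated range: every byte except NUL and the
   characters listed in EXCLUDED.  */
static void
byte_set_negated (byte_set *s, const char *excluded)
{
  int c;

  memset (s->bits, 0, sizeof s->bits);
  for (c = 1; c <= UCHAR_MAX; ++c)
    if (strchr (excluded, c) == NULL)
      byte_set_add (s, (unsigned char) c);
}

/* Multibyte model: code points below 128 are looked up in the bitset;
   for a negated range, anything at or above 128 is accepted by the
   second alternative.  */
static int
mb_negated_has (const byte_set *s, unsigned long code_point)
{
  if (code_point < 128)
    return byte_set_has (s, (unsigned char) code_point);
  return 1;
}
```

With a set built by byte_set_negated (&s, ":/?#"), byte_set_has accepts 'h' and rejects ':', while mb_negated_has additionally accepts any code point at or above 128 without consulting the bitset.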
Comment 4 Jens Petersen 2002-07-12 04:54:01 EDT
Confirmed fixed in glibc-2.2.90-15.  Thanks :-)
