Bug 67734

Summary: glibc regexp submatching behaviour seems to have changed
Product: [Retired] Red Hat Raw Hide
Component: glibc
Version: 1.0
Hardware: i386
OS: Linux
Status: CLOSED RAWHIDE
Severity: medium
Priority: medium
Reporter: Jens Petersen <petersen>
Assignee: Jakub Jelinek <jakub>
QA Contact: Brian Brock <bbrock>
CC: fweimer
Doc Type: Bug Fix
Last Closed: 2002-07-12 08:54:06 UTC

Description Jens Petersen 2002-07-01 09:16:55 UTC
Description of Problem:
The following well-known regexp from RFC 2396:

"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"

used to work fine with glibc-2.2 for splitting a URL into its
components (scheme, authority, path, query, fragment), but no longer
seems to work with the glibc-2.3 regex code.

Version-Release number of selected component (if applicable):
glibc-2.2.90-14

How Reproducible:
every time

Steps to Reproduce:
1. Match the above regexp against "http://www.redhat.com/rhn/".

Actual Results:
With glibc-2.3, the whole URL is matched as the path component.
With glibc-2.2, the URL is split correctly into scheme, authority,
and path.

Expected Results:
The same results with both versions.

Additional Information:
I'm not sure whether this change is intentional.
If so, perhaps it's already documented somewhere?
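
For reference, RFC 2396 (Appendix B) maps the numbered groups of this expression to the components as scheme = $2, authority = $4, path = $5, query = $7 and fragment = $9. A minimal sketch of a splitter built on that mapping might look like the following. The names `struct uri_parts`, `copy_group` and `split_uri` are hypothetical, chosen for illustration only (they are not part of glibc), and the buffer sizes are arbitrary.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper type, not part of glibc: one buffer per component.  */
struct uri_parts
{
  char scheme[64];
  char authority[256];
  char path[256];
  char query[256];
  char fragment[256];
};

/* Copy submatch M of S into DST (DSTLEN bytes); an unset group yields "".  */
static void
copy_group (char *dst, size_t dstlen, const char *s, const regmatch_t *m)
{
  size_t len = 0;

  if (m->rm_so != -1)
    len = (size_t) (m->rm_eo - m->rm_so);
  if (len >= dstlen)
    len = dstlen - 1;           /* truncate rather than overflow */
  if (len > 0)
    memcpy (dst, s + m->rm_so, len);
  dst[len] = '\0';
}

/* Split URI using the RFC 2396 Appendix B expression.
   Returns 0 on success, nonzero if compiling or matching fails.  */
int
split_uri (const char *uri, struct uri_parts *out)
{
  static const char pattern[] =
    "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?";
  regex_t re;
  regmatch_t mat[10];           /* whole match plus nine groups */
  int rc;

  assert (uri != NULL && out != NULL);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  rc = regexec (&re, uri, 10, mat, 0);
  if (rc == 0)
    {
      copy_group (out->scheme, sizeof out->scheme, uri, &mat[2]);
      copy_group (out->authority, sizeof out->authority, uri, &mat[4]);
      copy_group (out->path, sizeof out->path, uri, &mat[5]);
      copy_group (out->query, sizeof out->query, uri, &mat[7]);
      copy_group (out->fragment, sizeof out->fragment, uri, &mat[9]);
    }
  regfree (&re);
  return rc;
}
```

With a regex implementation that behaves as expected, calling split_uri on "http://www.redhat.com/rhn/" should leave scheme as "http", authority as "www.redhat.com" and path as "/rhn/", with query and fragment empty.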

Comment 1 Jakub Jelinek 2002-07-01 10:48:06 UTC
You will need to provide more details on exactly how you match against that
regexp and what you see, because I've tried to reproduce this with:
#include <sys/types.h>
#include <regex.h>
#include <locale.h>
#include <stdio.h>

/* RFC 2396 Appendix B expression and the URL from the report.  */
static const char pattern[] =
  "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?";
static const char url[] = "http://www.redhat.com/rhn/";

int
main (int argc, char *argv[])
{
  regex_t re;
  regmatch_t mat[20];
  int i;

  if (setlocale (LC_ALL, "en_US.UTF-8") == NULL)
    puts ("cannot set locale");
  else if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    puts ("cannot compile expression");
  else
    {
      if (regexec (&re, url, 20, mat, 0) == REG_NOMATCH)
        puts ("no match");
      else
        /* Print every subexpression that took part in the match.  */
        for (i = 0; i < 20; ++i)
          if (mat[i].rm_so != -1)
            printf ("%d: %.*s\n", i, (int) (mat[i].rm_eo - mat[i].rm_so),
                    url + mat[i].rm_so);
      regfree (&re);
    }
  return 0;
}

and get the expected result on both glibc-2.2.5-34 and glibc-2.2.90-14.

Comment 2 Jens Petersen 2002-07-03 11:08:39 UTC
Thanks for your fast response.  It seems the problem is locale
related.  Why is it necessary to set the locale to get this to work?
Also, it seems only some locales work.  E.g. setting the locale to ja_JP,
de_DE.utf-8, or en_US.utf-8 works, but C, POSIX, en_US, de_DE, etc. do not.
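
One way to narrow down the locale dependence is to run the same match under several locales and check whether the scheme, authority and path groups come out as expected. The following is only a sketch: `group_is` and `url_split_ok` are hypothetical helper names, and which locales are installed depends on the system, so an unavailable locale is reported separately from a wrong split.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <string.h>

/* Return nonzero if submatch M of S is set and equals EXPECTED.  */
static int
group_is (const char *s, const regmatch_t *m, const char *expected)
{
  size_t len = strlen (expected);

  return m->rm_so != -1
         && (size_t) (m->rm_eo - m->rm_so) == len
         && strncmp (s + m->rm_so, expected, len) == 0;
}

/* Hypothetical check: returns 1 if the test URL is split into scheme,
   authority and path as expected under LOCALE_NAME, 0 if it is not,
   and -1 if the locale cannot be set or the pattern does not compile.  */
int
url_split_ok (const char *locale_name)
{
  static const char pattern[] =
    "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?";
  static const char url[] = "http://www.redhat.com/rhn/";
  regex_t re;
  regmatch_t mat[10];
  int ok;

  assert (locale_name != NULL);
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;
  /* Compile after setlocale: the compiled form depends on the locale.  */
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  ok = regexec (&re, url, 10, mat, 0) == 0
       && group_is (url, &mat[2], "http")
       && group_is (url, &mat[4], "www.redhat.com")
       && group_is (url, &mat[5], "/rhn/");
  regfree (&re);
  return ok;
}
```

Calling url_split_ok for each of "C", "POSIX", "en_US", "de_DE", "ja_JP", "de_DE.utf-8" and "en_US.utf-8" and printing the result next to the locale name gives a quick table of which locales are affected; on a glibc without the bug every installed locale should yield 1.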


Comment 3 Jakub Jelinek 2002-07-03 15:13:07 UTC
It is a bug in the regex code. I've spent half a day on it but have not managed
to fix it yet; I hope the author of the new regex code will have a look at it too.
The difference between MBS (multibyte) and non-MBS locales is just in the internal
representation of the regular expression: in the non-MBS case it is just a bitset
for the first range, while in the MBS case it is an alternative of a bitset and
all MBS chars above 128.
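
To make "just a bitset" concrete: in a single-byte locale, a bracket expression such as [^:/?#] can be represented with one bit per byte value. The following is an illustrative sketch only. `struct byte_set` and its two functions are hypothetical and are not glibc's actual internal data structures. It covers only the single-byte case; it does not model the extra alternative for multibyte characters used in MBS locales.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical type for illustration; not glibc's internal structure.  */
#define WORD_BITS (sizeof (unsigned long) * CHAR_BIT)
#define SET_WORDS ((UCHAR_MAX + 1 + WORD_BITS - 1) / WORD_BITS)

struct byte_set
{
  unsigned long words[SET_WORDS];   /* one bit per possible byte value */
};

/* Build the set for a negated bracket expression such as [^:/?#]:
   start with every byte value present, then remove the listed ones.  */
void
byte_set_init_negated (struct byte_set *set, const char *excluded)
{
  size_t i;

  assert (set != NULL && excluded != NULL);
  for (i = 0; i < SET_WORDS; ++i)
    set->words[i] = ~0UL;
  for (; *excluded != '\0'; ++excluded)
    {
      unsigned char c = (unsigned char) *excluded;
      set->words[c / WORD_BITS] &= ~(1UL << (c % WORD_BITS));
    }
}

/* Membership is a single word lookup and mask; returns 0 or 1.  */
int
byte_set_contains (const struct byte_set *set, unsigned char c)
{
  return (int) ((set->words[c / WORD_BITS] >> (c % WORD_BITS)) & 1UL);
}
```

After byte_set_init_negated (&set, ":/?#"), byte_set_contains reports ordinary letters and bytes above 127 as members, and ':', '/', '?' and '#' as non-members.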

Comment 4 Jens Petersen 2002-07-12 08:54:01 UTC
Confirmed fixed in glibc-2.2.90-15.  Thanks :-)