635748 – regex: [A-z] detected as empty range with en_US.UTF-8

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	grep
Sub Component:
Version:	6.0
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Jaroslav Škarvada
QA Contact:	BaseOS QE - Apps
Docs Contact:
URL:
Whiteboard:
Depends On:	583011
Blocks:

Reported:	2010-09-20 15:55 UTC by Jaroslav Škarvada
Modified:	2010-11-24 11:25 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	583011
Environment:
Last Closed:	2010-11-24 11:25:50 UTC
Target Upstream Version:
Embargoed:

Description Jaroslav Škarvada 2010-09-20 15:55:56 UTC

+++ This bug was initially created as a clone of Bug #583011 +++

"make pdf" in binutils is broken:

[hjl@gnu-6 bfd]$ make pdf
Making pdf in doc
make[1]: Entering directory `/export/build/gnu/binutils/build-x86_64-linux/bfd/doc'
TEXINPUTS="/export/gnu/import/git/binutils/bfd/../texinfo:$TEXINPUTS" \
	MAKEINFO='makeinfo --split-size=5000000   -I /export/gnu/import/git/binutils/bfd/doc' \
	`if test -f /export/gnu/import/git/binutils/bfd/../texinfo/util/texi2dvi; then echo /export/gnu/import/git/binutils/bfd/../texinfo/util/texi2dvi; else echo texi2dvi; fi` --pdf --batch -o bfd.pdf `test -f 'bfd.texinfo' || echo '/export/gnu/import/git/binutils/bfd/doc/'`bfd.texinfo
egrep: Invalid range end
/usr/bin/texi2dvi: cannot read .//export/gnu/import/git/binutils/bfd/doc/bfd.texinfo, skipping.
make[1]: *** [bfd.pdf] Error 1
make[1]: Leaving directory `/export/build/gnu/binutils/build-x86_64-linux/bfd/doc'
make: *** [pdf-recursive] Error 1
[hjl@gnu-6 bfd]$ 

/usr/bin/texi2dvi has

 # If the COMMAND_LINE_FILENAME is not absolute (e.g., --debug.tex),
  # prepend `./' in order to avoid that the tools take it as an option.
  echo "$command_line_filename" | $EGREP '^(/|[A-z]:/)' >&6 \
  || command_line_filename="./$command_line_filename"

and it no longer works:

[hjl@gnu-6 bfd]$ echo foo | egrep '^(/|[A-z]:/)'
egrep: Invalid range end
[hjl@gnu-6 bfd]$

--- Additional comment from hongjiu.lu on 2010-04-16 14:52:35 CEST ---

It doesn't like "[A-z]"

[hjl@gnu-6 bfd]$ echo foo | egrep '[A-z]'
egrep: Invalid range end
[hjl@gnu-6 bfd]$ echo foo | egrep '[A-Z]'
foo
[hjl@gnu-6 bfd]$

--- Additional comment from hongjiu.lu on 2010-04-16 14:58:47 CEST ---

I am not sure what "[A-z]" means. Shouldn't it be "[A-Z]"?

--- Additional comment from jskarvad on 2010-04-16 15:02:22 CEST ---

Hmm, it works perfectly for me:

$ make pdf
...
$ echo $?
0

$ echo foo | egrep '^(/|[A-z]:/)'
$ echo foo | egrep '[A-z]'
foo
$ echo foo | egrep '[A-Z]'
foo
$ egrep --version
GNU grep 2.6.3
...

What is your locale?

--- Additional comment from jskarvad on 2010-04-16 15:03:53 CEST ---

Ad comment 2:
[A-z] means A-Z, some special chars and a-z inclusive.
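
For reference, in ASCII the characters between 'Z' and 'a' are [ \ ] ^ _ and `, so under byte-value (C locale) collation [A-z] accepts those as well as the letters. A minimal sketch of how this could be checked, assuming a POSIX shell and a grep that honours LC_ALL=C; the helper name in_A_z is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper (not part of grep or texi2dvi): succeeds if the
# single character in $1 is matched by [A-z] under byte-value (C) collation.
in_A_z() {
  printf '%s\n' "$1" | LC_ALL=C grep -q '[A-z]'
}

# Probe two letters, two of the punctuation characters between 'Z' and 'a',
# and two characters that lie outside the byte range 'A'..'z'.
for ch in A z _ '^' 1 :; do
  if in_A_z "$ch"; then
    echo "${ch}: inside [A-z]"
  else
    echo "${ch}: outside [A-z]"
  fi
done
```

Because LC_ALL=C is set only for the grep call, the result should not depend on the caller's locale.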

--- Additional comment from hongjiu.lu on 2010-04-16 15:09:36 CEST ---

(In reply to comment #3)
> 
> What is your locale?    

I have

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

--- Additional comment from jskarvad on 2010-04-16 15:13:18 CEST ---

Confirmed, I will investigate. It works with the C and cs_CZ.UTF-8 locales but not with the en_US[.UTF-8] locale.

--- Additional comment from hongjiu.lu on 2010-04-16 15:16:16 CEST ---

(In reply to comment #4)
> Ad comment 2:
> [A-z] means A-Z, some special chars and a-z inclusive.    

From:

http://www.gnu.org/software/grep/manual/grep.html#Character-Classes-and-Bracket-Expressions

A-Z and a-z aren't in the same class. I am not sure if [A-z]
has any specific meaning. If I am reading

---
 # If the COMMAND_LINE_FILENAME is not absolute (e.g., --debug.tex),
  # prepend `./' in order to avoid that the tools take it as an option.
  echo "$command_line_filename" | $EGREP '^(/|[A-z]:/)' >&6 \
  || command_line_filename="./$command_line_filename"
---

correctly, '^(/|[A-z]:/)' is used to check for an absolute path
and [A-z] is meant to match DOS/Windows drive letters. I don't think
the special chars are allowed there. Drive letters may be limited to A-Z.
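
One way the absolute-path test could be written so that it does not depend on the locale's collation is sketched below. This is only an illustration, not the texi2dvi fix: the helper name is_absolute_path is invented, and the choice of [A-Za-z] for drive letters is an assumption (plain [A-Z] may be enough, as suggested above).

```shell
#!/bin/sh
# Hypothetical helper: succeeds if $1 looks like an absolute path, either
# POSIX style (/...) or with a DOS/Windows drive letter (C:/...).
# LC_ALL=C pins the ranges to ASCII byte order, and [A-Za-z] leaves out the
# punctuation characters that [A-z] would also accept.
is_absolute_path() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^(/|[A-Za-z]:/)'
}

# Mirror the texi2dvi logic: relative names would get a ./ prefix.
for name in /usr/share/doc c:/texmf bfd.texinfo --debug.tex '_:/odd'; do
  if is_absolute_path "$name"; then
    echo "${name} -> absolute, used as is"
  else
    echo "${name} -> relative, would become ./${name}"
  fi
done
```

Spelling out both letter ranges avoids relying on what lies between 'Z' and 'a' in any particular collating sequence.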

--- Additional comment from jskarvad on 2010-04-16 15:37:21 CEST ---

AFAIK, the first 128 codes (0-127) should be the same for all locales, so you can consult man ascii to see what A-z should mean. If you have only upper-case letters, A-Z should be enough. I am not familiar with your case, but if the path cannot contain underscores "_" or lower-case letters, then A-Z should be OK.

I will discuss the [A-z] case with grep upstream. I think it should return the same result in all locales.

--- Additional comment from jskarvad on 2010-04-20 17:02:24 CEST ---

It happens only if grep is compiled --without-included-regex, so maybe it is a glibc issue? I will investigate it more deeply.

--- Additional comment from jskarvad on 2010-04-23 14:42:46 CEST ---

I wasn't correct in my previous comments: the ordering may change according to the LC_COLLATE setting. But according to my tests, I think there shouldn't be a range error with glibc regex. I will investigate.

--- Additional comment from jskarvad on 2010-09-20 12:57:16 CEST ---

I suggest you rewrite the code or use LC_COLLATE=C, e.g.:

echo foo | LC_COLLATE=C egrep '^(/|[A-z]:/)'

Otherwise it will not match what you want even if the 'invalid range' problem is fixed (e.g. 'a' and 'Z' wouldn't match).

--- Additional comment from jskarvad on 2010-09-20 12:58:42 CEST ---

Other packages are also affected (e.g. sed), so it looks like a glibc bug. The failing code:

#define _GNU_SOURCE 1
#include <stdio.h>
#include <string.h>
#include <regex.h>
#include <locale.h>

int main(int argc, char *argv[])
{
  char re[] = "[A-z]";
  struct re_pattern_buffer buf = {0};
  const char *err;

  setlocale(LC_ALL, "en_US.UTF-8");
  re_set_syntax(RE_NO_EMPTY_RANGES);
  if ((err = re_compile_pattern(re, strlen(re), &buf)))
    printf("%s\n", err);

  return 0;
}

While:

$ echo -e "a\nb\nc\nz\nA\nB\nC\nZ" | LC_ALL=en_US.UTF-8 sort 
a
A
b
B
c
C
z
Z

Thus reassigning to glibc.

--- Additional comment from schwab on 2010-09-20 13:40:48 CEST ---

If you want ASCII collation use LC_COLLATE=C.

--- Additional comment from jskarvad on 2010-09-20 16:53:37 CEST ---

> If you want ASCII collation use LC_COLLATE=C.
This is not about ASCII collation. This is about inconsistency between collation and regex ranges.

Consider the following two programs, p1 and p2:

p1:
---
#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(int argc, char *argv[])
{
  setlocale(LC_ALL, "");
  printf("%d\n", strcoll(argv[1], argv[2]));
  return 0;
}


p2:
---
#define _GNU_SOURCE 1
#include <stdio.h>
#include <string.h>
#include <regex.h>
#include <locale.h>

int main(int argc, char *argv[])
{
  struct re_pattern_buffer buf = {0};
  const char *err;

  setlocale(LC_ALL, "");
  re_set_syntax(RE_NO_EMPTY_RANGES);
  if ((err = re_compile_pattern(argv[1], strlen(argv[1]), &buf)))
    printf("%s\n", err);

  return 0;
}


And the following test case:

$ LC_ALL=en_US.UTF-8 ./p1 A b
-1

$ LC_ALL=en_US.UTF-8 ./p2 [A-b]
Invalid range end

$ LC_ALL=cs_CZ.UTF-8 ./p1 A b
-1

$ LC_ALL=cs_CZ.UTF-8 ./p2 [A-b]

This behaviour seems inconsistent. Please note that the regex code built into grep works as expected but the glibc regex doesn't. 

From 'man 7 regex':
...If two characters in the list are separated by '-', this is shorthand for the full range of characters between those two (inclusive) in the collating sequence...
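
To make that reading of the man page concrete, here is a rough sketch that decides range membership purely from the order sort(1) produces in the current locale. It is an assumption-laden illustration, not how glibc computes ranges (the whole point of this report is that the two can disagree); the helper name in_collation_range is invented.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if character $2 lies between $1 and $3
# (inclusive) in the order that sort(1) produces for the current locale.
in_collation_range() {
  lo=$1 ch=$2 hi=$3
  # Endpoints given in the wrong order describe an empty range.
  [ "$(printf '%s\n' "$lo" "$hi" | sort | sed -n 1p)" = "$lo" ] || return 1
  # Sort all three characters; ch is in range if it can take the middle slot.
  [ "$(printf '%s\n' "$lo" "$ch" "$hi" | sort | sed -n 2p)" = "$ch" ]
}

# Ask whether a few characters fall inside A..z under the caller's locale.
for ch in a Z _ 1; do
  if in_collation_range A "$ch" z; then
    echo "${ch}: inside A..z in this locale"
  else
    echo "${ch}: outside A..z in this locale"
  fi
done
```

With the en_US.UTF-8 ordering shown in the sort output above (a A b B ... z Z), 'a' and 'Z' would be expected to fall outside A..z, while A..z itself is clearly not empty.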

--- Additional comment from schwab on 2010-09-20 17:17:51 CEST ---

Range expressions are only defined for the POSIX locale.

--- Additional comment from pbonzini on 2010-09-20 17:46:19 CEST ---

Undefined behavior is not a random reason for closing bugs.  By your argument you could just rip off the range expression code and it would still be POSIX-compliant.

At least explain why [A-z] is an empty range, since it obviously is not.

Comment 2 Andreas Schwab 2010-09-20 16:17:56 UTC

Not a bug.

Comment 4 Andreas Schwab 2010-09-21 12:45:07 UTC

Nothing to fix here.

Comment 5 Paolo Bonzini 2010-09-21 13:04:30 UTC

See bug 583011 comment #19 and bug 583011 comment #21.

Comment 8 Paolo Bonzini 2010-09-21 14:48:18 UTC

Upon further analysis, there is a bug in grep too:

$ sed '/[A-Z]/p'
z
z
$ grep '/[A-Z]/p'
z
$

The problem here is that grep's DFA matcher is trying to use strcoll for single-byte matches, instead of glibc's own rules (whatever they are).  At the same time, grep relies on glibc to ascertain the validity of regular expressions, thus giving the inconsistent behavior.

I'm reassigning this to grep.  Any changes in glibc regex, such as the ones suggested in bug 583011 comment #24, are anyway too wide in scope for RHEL6.
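
A small sweep along these lines might help when hunting for a reproducer. It is only a sketch: the function name range_disagreements and the character list are made up, and it assumes grep and sed are both on the PATH and that the bracket expression contains no '/'.

```shell
#!/bin/sh
# Hypothetical reproducer sweep: print every character from a fixed list for
# which grep and sed disagree on whether it matches the bracket expression
# in $1. Both tools run under whatever locale is set in the environment.
range_disagreements() {
  bracket=$1
  for ch in a b y z A B Y Z _ 0; do
    if printf '%s\n' "$ch" | grep -q "$bracket"; then
      g=match
    else
      g=nomatch
    fi
    # sed -n ... p prints the line only when the address matches.
    if [ -n "$(printf '%s\n' "$ch" | sed -n "/$bracket/p")" ]; then
      s=match
    else
      s=nomatch
    fi
    [ "$g" = "$s" ] || echo "${ch}: grep=${g} sed=${s}"
  done
}

range_disagreements '[A-Z]'
```

Running it once per locale of interest (for example with LC_ALL exported to en_US.UTF-8 and then cs_CZ.UTF-8) lists only the characters where the two tools differ. Empty output means no disagreement was seen for the listed characters.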

Comment 10 Ondrej Hudlicky 2010-09-21 16:50:24 UTC

Re comment 8: sed '/[A-Z]/p' prints the current pattern space anyway (sed prints every line by default unless -n is given), so that output does not show a match.

I observe the same regexp matching in sed and grep:
$ echo z | sed 's/[A-Z]/1/'
z
$ echo Z | sed 's/[A-Z]/1/'
1

Comment 11 Paolo Bonzini 2010-09-21 16:58:12 UTC

Sorry, what I meant is:

$ echo z | sed -n '/[A-Z]/p'
$ echo z | grep '[A-Z]'                 # (2)
z

It is a bug even considering what is documented in the Migration Guide; see this:

$ echo 00z | egrep '(.)\1[A-Z]'
$

which is inconsistent with (2) above.
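
As far as I understand, grep's DFA matcher cannot handle backreferences, so a pattern containing one is handed to the regex matcher instead, which would explain why the two commands can give different answers. The comparison could be scripted roughly as follows. This is a sketch: check_paths_agree is an invented name, and the second pattern is the BRE spelling of the egrep pattern above.

```shell
#!/bin/sh
# Hypothetical helper: run the plain bracket expression and the variant with
# a leading backreference against $1 and succeed only if both agree.
# $1 should start with a doubled character, e.g. 00z.
check_paths_agree() {
  if printf '%s\n' "$1" | grep -q '[A-Z]'; then
    plain=match
  else
    plain=nomatch
  fi
  if printf '%s\n' "$1" | grep -q '\(.\)\1[A-Z]'; then
    backref=match
  else
    backref=nomatch
  fi
  echo "${1}: plain=${plain} backref=${backref}"
  [ "$plain" = "$backref" ]
}

check_paths_agree 00z || echo "inconsistent result for 00z"
check_paths_agree 00Z || echo "inconsistent result for 00Z"
```

Since the digits in the test strings cannot match [A-Z] themselves, any difference between the two results comes down to how the final character is treated.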

Comment 13 Paolo Bonzini 2010-09-23 12:03:57 UTC

Fixed by upstream commit 99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee.

Comment 14 Jaroslav Škarvada 2010-10-06 10:30:38 UTC

Re comment 11: 

I am unable to reproduce with grep-2.6.3-2.el6 (currently in RHEL-6):

$ echo z | sed -n '/[A-Z]/p'
$ echo z | grep '[A-Z]'
$ echo 00z | egrep '(.)\1[A-Z]'
$

Comment 15 Paolo Bonzini 2010-10-06 12:09:44 UTC

Are you trying both en_US.UTF-8 and cs_CZ.UTF-8?

Comment 16 Jaroslav Škarvada 2010-10-06 12:25:55 UTC

Yes, both variants:

$ echo z | LC_ALL=en_US.UTF-8 sed -n '/[A-Z]/p'
$ echo z | LC_ALL=en_US.UTF-8 grep '[A-Z]'
$ echo 00z | LC_ALL=en_US.UTF-8 egrep '(.)\1[A-Z]'
$

$ echo z | LC_ALL=cs_CZ.UTF-8 sed -n '/[A-Z]/p'
$ echo z | LC_ALL=cs_CZ.UTF-8 grep '[A-Z]'
$ echo 00z | LC_ALL=cs_CZ.UTF-8 egrep '(.)\1[A-Z]'
$

Comment 17 Jaroslav Škarvada 2010-11-24 11:25:50 UTC

According to comments 8 and 14-16, it seems there is nothing to fix, so I am closing this. If another reproducer exists, feel free to reopen.
