Bug 635748

Summary:	regex: [A-z] detected as empty range with en_US.UTF-8
Product:	Red Hat Enterprise Linux 6
Reporter:	Jaroslav Škarvada <jskarvad>
Component:	grep
Assignee:	Jaroslav Škarvada <jskarvad>
Status:	CLOSED WORKSFORME
QA Contact:	BaseOS QE - Apps <qe-baseos-apps>
Severity:	medium
Priority:	low
Version:	6.0
CC:	hongjiu.lu, jakub, jskarvad, lkundrak, pbonzini, schwab, syeghiay
Target Milestone:	rc
Keywords:	Reopened
Hardware:	All
OS:	Linux
Doc Type:	Bug Fix
Clone Of:	583011
Last Closed:	2010-11-24 11:25:50 UTC
Bug Depends On:	583011

Description Jaroslav Škarvada 2010-09-20 15:55:56 UTC

+++ This bug was initially created as a clone of Bug #583011 +++

"make pdf" in binutils is broken:

[hjl@gnu-6 bfd]$ make pdf
Making pdf in doc
make[1]: Entering directory `/export/build/gnu/binutils/build-x86_64-linux/bfd/doc'
TEXINPUTS="/export/gnu/import/git/binutils/bfd/../texinfo:$TEXINPUTS" \
	MAKEINFO='makeinfo --split-size=5000000   -I /export/gnu/import/git/binutils/bfd/doc' \
	`if test -f /export/gnu/import/git/binutils/bfd/../texinfo/util/texi2dvi; then echo /export/gnu/import/git/binutils/bfd/../texinfo/util/texi2dvi; else echo texi2dvi; fi` --pdf --batch -o bfd.pdf `test -f 'bfd.texinfo' || echo '/export/gnu/import/git/binutils/bfd/doc/'`bfd.texinfo
egrep: Invalid range end
/usr/bin/texi2dvi: cannot read .//export/gnu/import/git/binutils/bfd/doc/bfd.texinfo, skipping.
make[1]: *** [bfd.pdf] Error 1
make[1]: Leaving directory `/export/build/gnu/binutils/build-x86_64-linux/bfd/doc'
make: *** [pdf-recursive] Error 1
[hjl@gnu-6 bfd]$ 

/usr/bin/texi2dvi has

 # If the COMMAND_LINE_FILENAME is not absolute (e.g., --debug.tex),
  # prepend `./' in order to avoid that the tools take it as an option.
  echo "$command_line_filename" | $EGREP '^(/|[A-z]:/)' >&6 \
  || command_line_filename="./$command_line_filename"

and it no longer works:

[hjl@gnu-6 bfd]$ echo foo | egrep '^(/|[A-z]:/)'
egrep: Invalid range end
[hjl@gnu-6 bfd]$

--- Additional comment from hongjiu.lu on 2010-04-16 14:52:35 CEST ---

It doesn't like "[A-z]"

[hjl@gnu-6 bfd]$ echo foo | egrep '[A-z]'
egrep: Invalid range end
[hjl@gnu-6 bfd]$ echo foo | egrep '[A-Z]'
foo
[hjl@gnu-6 bfd]$

--- Additional comment from hongjiu.lu on 2010-04-16 14:58:47 CEST ---

I am not sure what "[A-z]" means. Shouldn't it be "[A-Z]"?

--- Additional comment from jskarvad on 2010-04-16 15:02:22 CEST ---

Hmm, it works perfectly for me:

$ make pdf
...
$ echo $?
0

$ echo foo | egrep '^(/|[A-z]:/)'
$ echo foo | egrep '[A-z]'
foo
$ echo foo | egrep '[A-Z]'
foo
$ egrep --version
GNU grep 2.6.3
...

What is your locale?

--- Additional comment from jskarvad on 2010-04-16 15:03:53 CEST ---

Ad comment 2:
[A-z] means A-Z, some special chars and a-z inclusive.
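For reference, a small sketch that lists what lies between "A" and "z" in plain ASCII order. It assumes a POSIX shell; the helper name ascii_range is hypothetical and exists only for this illustration. It says nothing about how a non-C locale orders these characters.

```shell
#!/bin/sh
# ascii_range FIRST LAST: print the ASCII characters whose decimal codes
# run from FIRST to LAST inclusive (hypothetical helper for illustration).
ascii_range() {
  code=$1
  while [ "$code" -le "$2" ]; do
    # Convert the code to a three-digit octal escape and let printf emit it.
    printf "\\$(printf '%03o' "$code")"
    code=$((code + 1))
  done
  printf '\n'
}

ascii_range 65 122   # everything from 'A' (65) to 'z' (122)
ascii_range 91 96    # only the six characters between 'Z' and 'a'
```

In ASCII the range therefore contains the 26 upper-case letters, the six characters [ \ ] ^ _ ` and the 26 lower-case letters.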

--- Additional comment from hongjiu.lu on 2010-04-16 15:09:36 CEST ---

(In reply to comment #3)
> 
> What is your locale?    

I have

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

--- Additional comment from jskarvad on 2010-04-16 15:13:18 CEST ---

Confirmed, I will investigate it. It works with the C and cs_CZ.UTF-8 locales but not with the en_US[.UTF-8] locale.
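One way to repeat this comparison across locales is sketched below. It assumes a POSIX shell and a grep that accepts -E; the function name probe_range is hypothetical. A locale that is not installed silently behaves like the C locale, so the output is only meaningful for locales listed by "locale -a".

```shell
#!/bin/sh
# probe_range LOCALE PATTERN: report whether grep accepts PATTERN under
# LOCALE (hypothetical helper). grep exits with 0 or 1 when the pattern
# compiles (match / no match) and with 2 or more on an error such as
# "Invalid range end".
probe_range() {
  status=0
  printf 'x\n' | LC_ALL="$1" grep -E "$2" >/dev/null 2>&1 || status=$?
  if [ "$status" -le 1 ]; then
    echo "accepted"
  else
    echo "rejected"
  fi
}

# The locales named in this report; results depend on the installed
# glibc and grep versions, so no expected output is shown here.
for loc in C en_US.UTF-8 cs_CZ.UTF-8; do
  printf '%s: %s\n' "$loc" "$(probe_range "$loc" '[A-z]')"
done
```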

--- Additional comment from hongjiu.lu on 2010-04-16 15:16:16 CEST ---

(In reply to comment #4)
> Ad comment 2:
> [A-z] means A-Z, some special chars and a-z inclusive.    

From:

http://www.gnu.org/software/grep/manual/grep.html#Character-Classes-and-Bracket-Expressions

A-Z and a-z aren't in the same class. I am not sure if [A-z]
has any specific meaning. If I am reading

---
 # If the COMMAND_LINE_FILENAME is not absolute (e.g., --debug.tex),
  # prepend `./' in order to avoid that the tools take it as an option.
  echo "$command_line_filename" | $EGREP '^(/|[A-z]:/)' >&6 \
  || command_line_filename="./$command_line_filename"
---

correctly, '^(/|[A-z]:/)' is used to check for an absolute path
and [A-z] is for DOS/Windows drive letters. I don't think
any special chars are allowed there. Drive letters may be limited to A-Z.

--- Additional comment from jskarvad on 2010-04-16 15:37:21 CEST ---

AFAIK, the first 128 codes (0-127) should be the same for all locales, so you can consult man ascii to see what A-z should mean. If you have only upper-case letters, A-Z should be enough. I am not familiar with your case; if there can't be underscores "_" and lower-case letters in the path, then it should be OK.

I will discuss the [A-z] case with grep upstream. I think it should return the same result for all locales.

--- Additional comment from jskarvad on 2010-04-20 17:02:24 CEST ---

It happens when grep is compiled with --without-included-regex, so maybe it is a glibc issue? I will investigate it more deeply.

--- Additional comment from jskarvad on 2010-04-23 14:42:46 CEST ---

I wasn't correct in my previous comments: the ordering may change according to the LC_COLLATE setting. But according to my tests, I think there shouldn't be a range error with the glibc regex. I will investigate.

--- Additional comment from jskarvad on 2010-09-20 12:57:16 CEST ---

I suggest that you rewrite the code or use LC_COLLATE=C, e.g.:

echo foo | LC_COLLATE=C egrep '^(/|[A-z]:/)'

Otherwise it will not match what you want even if the 'invalid range' problem is fixed (e.g. 'a' and 'Z' wouldn't match).
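A minimal sketch of such a rewrite, assuming a POSIX shell and a grep that accepts -E. The function names is_absolute and normalize_filename are hypothetical and are not part of texi2dvi. The sketch spells the drive letter as [A-Za-z] and pins LC_ALL=C rather than LC_COLLATE=C, because a non-empty LC_ALL in the caller's environment takes precedence over LC_COLLATE.

```shell
#!/bin/sh
# is_absolute NAME: succeed if NAME starts with "/" or with a DOS-style
# drive prefix such as "C:/" (hypothetical helper, not texi2dvi code).
is_absolute() {
  # LC_ALL=C makes the bracket ranges follow byte order in every locale.
  printf '%s\n' "$1" | LC_ALL=C grep -E '^(/|[A-Za-z]:/)' >/dev/null
}

# normalize_filename NAME: prepend "./" to relative names so that tools
# do not mistake a name such as "--debug.tex" for an option.
normalize_filename() {
  if is_absolute "$1"; then
    printf '%s\n' "$1"
  else
    printf './%s\n' "$1"
  fi
}

normalize_filename --debug.tex
normalize_filename /tmp/bfd.texinfo
normalize_filename C:/doc/bfd.texinfo
```

Unlike [A-z], this pattern also rejects names such as "_:/x", which would fall inside the ASCII range between "Z" and "a".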

--- Additional comment from jskarvad on 2010-09-20 12:58:42 CEST ---

Other packages are also affected (e.g. sed), so it looks like a glibc bug. The failing code:

#define _GNU_SOURCE 1
#include <stdio.h>
#include <string.h>
#include <regex.h>
#include <locale.h>

int main(int argc, char *argv[])
{
  char re[] = "[A-z]";
  struct re_pattern_buffer buf = {0};
  const char *err;

  setlocale(LC_ALL, "en_US.UTF-8");
  re_set_syntax(RE_NO_EMPTY_RANGES);
  if ((err = re_compile_pattern(re, strlen(re), &buf)))
    printf("%s\n", err);

  return 0;
}

While:

$ echo -e "a\nb\nc\nz\nA\nB\nC\nZ" | LC_ALL=en_US.UTF-8 sort 
a
A
b
B
c
C
z
Z

Thus reassigning to glibc.
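For contrast, the same letters can be sorted under an explicitly chosen locale. This is only a sketch, assuming a POSIX shell; the function name sorted_letters is hypothetical. In the C locale the order is plain byte order, so all upper-case letters come before all lower-case ones, unlike the interleaved en_US.UTF-8 order shown above.

```shell
#!/bin/sh
# sorted_letters LOCALE: sort the eight letters from the test above under
# LOCALE and print them on one line (hypothetical helper).
sorted_letters() {
  printf '%s\n' a b c z A B C Z | LC_ALL="$1" sort | tr -d '\n'
  printf '\n'
}

sorted_letters C             # prints ABCZabcz (byte order)
sorted_letters en_US.UTF-8   # depends on the installed locale data
```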

--- Additional comment from schwab on 2010-09-20 13:40:48 CEST ---

If you want ASCII collation use LC_COLLATE=C.

--- Additional comment from jskarvad on 2010-09-20 16:53:37 CEST ---

> If you want ASCII collation use LC_COLLATE=C.
This is not about ASCII collation. This is about inconsistency between collation and regex ranges.

Consider the following two programs, p1 and p2:

p1:
---
#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(int argc, char *argv[])
{
  setlocale(LC_ALL, "");
  printf("%d\n", strcoll(argv[1], argv[2]));
  return 0;
}


p2:
---
#define _GNU_SOURCE 1
#include <stdio.h>
#include <string.h>
#include <regex.h>
#include <locale.h>

int main(int argc, char *argv[])
{
  struct re_pattern_buffer buf = {0};
  const char *err;

  setlocale(LC_ALL, "");
  re_set_syntax(RE_NO_EMPTY_RANGES);
  if ((err = re_compile_pattern(argv[1], strlen(argv[1]), &buf)))
    printf("%s\n", err);

  return 0;
}


And the following test case:

$ LC_ALL=en_US.UTF-8 ./p1 A b
-1

$ LC_ALL=en_US.UTF-8 ./p2 [A-b]
Invalid range end

$ LC_ALL=cs_CZ.UTF-8 ./p1 A b
-1

$ LC_ALL=cs_CZ.UTF-8 ./p2 [A-b]

This behaviour seems inconsistent. Please note that the regex code built into grep works as expected but the glibc regex doesn't. 

From 'man 7 regex':
...If two characters in the list are separated by '-', this is shorthand for the full range of characters between those two (inclusive) in the collating sequence...
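The reading of the man page quoted above can be approximated from the shell by asking sort, which orders whole lines by the locale's collation, whether a character lies between the two range endpoints. This is a sketch under that assumption; the function name in_collation_range is hypothetical, and the result describes what the collation order says, not what glibc's regex code actually does.

```shell
#!/bin/sh
# in_collation_range LOCALE LOW CHAR HIGH: succeed if LOW, CHAR, HIGH are
# already in non-decreasing collation order under LOCALE, i.e. CHAR lies
# within LOW..HIGH (hypothetical helper).
in_collation_range() {
  # "sort -c" only checks the order: exit 0 if sorted, non-zero if not.
  printf '%s\n' "$2" "$3" "$4" | LC_ALL="$1" sort -c 2>/dev/null
}

# In the C locale the order is byte order, so "_" and "a" fall inside
# A..z, while "a" falls outside A..Z.
for ch in a _; do
  if in_collation_range C A "$ch" z; then
    echo "$ch: inside A..z"
  else
    echo "$ch: outside A..z"
  fi
done
if in_collation_range C A a Z; then
  echo "a: inside A..Z"
else
  echo "a: outside A..Z"
fi
```

Replacing C with en_US.UTF-8 or cs_CZ.UTF-8 gives the collation-based answer for those locales, provided they are installed.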

--- Additional comment from schwab on 2010-09-20 17:17:51 CEST ---

Range expressions are only defined for the POSIX locale.

--- Additional comment from pbonzini on 2010-09-20 17:46:19 CEST ---

Undefined behavior is not a random reason for closing bugs.  By your argument you could just rip off the range expression code and it would still be POSIX-compliant.

At least explain why [A-z] is an empty range, since it obviously is not.

Comment 2 Andreas Schwab 2010-09-20 16:17:56 UTC

Not a bug.

Comment 4 Andreas Schwab 2010-09-21 12:45:07 UTC

Nothing to fix here.

Comment 5 Paolo Bonzini 2010-09-21 13:04:30 UTC

See bug 583011 comment #19 and bug 583011 comment #21.

Comment 8 Paolo Bonzini 2010-09-21 14:48:18 UTC

Upon further analysis, there is a bug in grep too:

$ sed '/[A-Z]/p'
z
z
$ grep '/[A-Z]/p'
z
$

The problem here is that grep's DFA matcher is trying to use strcoll for single-byte matches, instead of glibc's own rules (whatever they are).  At the same time, grep relies on glibc to ascertain the validity of regular expressions, thus giving the inconsistent behavior.

I'm reassigning this to grep.  Any changes in glibc regex, such as the ones suggested in bug 583011 comment #24, are anyway too wide in scope for RHEL6.

Comment 10 Ondrej Hudlicky 2010-09-21 16:50:24 UTC

Re comment 8: sed '/[A-Z]/p' prints the current pattern space; without -n every input line is printed anyway, so the output alone does not show a match.

I observe the same regexp matching in sed and grep:
$ echo z | sed 's/[A-Z]/1/'
z
$ echo Z | sed 's/[A-Z]/1/'
1

Comment 11 Paolo Bonzini 2010-09-21 16:58:12 UTC

Sorry, what I meant is:

$ echo z | sed -n '/[A-Z]/p'
$ echo z | grep '[A-Z]'                 # (2)
z

It is a bug even considering what is documented in the Migration Guide; see this:

$ echo 00z | egrep '(.)\1[A-Z]'
$

which is inconsistent with (2) above.
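The two checks above can be run side by side for any locale and character. This is a sketch, assuming a POSIX shell and a grep that supports back-references with -E; the function name compare_matchers is hypothetical. The assumption behind it is that a pattern containing a back-reference cannot be handled by grep's DFA matcher and is passed to the regex matcher instead, so differing answers point at a disagreement between the two.

```shell
#!/bin/sh
# compare_matchers LOCALE CHAR: print two yes/no answers (hypothetical
# helper). The first is whether CHAR matches the plain pattern [A-Z]; the
# second is whether "00CHAR" matches the back-reference pattern
# (.)\1[A-Z], where "00" only satisfies the (.)\1 prefix.
compare_matchers() {
  plain=no
  backref=no
  printf '%s\n' "$2" | LC_ALL="$1" grep '[A-Z]' >/dev/null && plain=yes
  printf '00%s\n' "$2" | LC_ALL="$1" grep -E '(.)\1[A-Z]' >/dev/null && backref=yes
  echo "$plain $backref"
}

compare_matchers C z             # prints "no no"
compare_matchers C Z             # prints "yes yes"
compare_matchers en_US.UTF-8 z   # depends on the grep and glibc versions
```

A consistent grep prints the same answer twice on every line.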

Comment 13 Paolo Bonzini 2010-09-23 12:03:57 UTC

Fixed by upstream commit 99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee.

Comment 14 Jaroslav Škarvada 2010-10-06 10:30:38 UTC

Re comment 11: 

I am unable to reproduce this with grep-2.6.3-2.el6 (currently in RHEL-6):

$ echo z | sed -n '/[A-Z]/p'
$ echo z | grep '[A-Z]'
$ echo 00z | egrep '(.)\1[A-Z]'
$

Comment 15 Paolo Bonzini 2010-10-06 12:09:44 UTC

Are you trying both en_US.UTF-8 and cs_CZ.UTF-8?

Comment 16 Jaroslav Škarvada 2010-10-06 12:25:55 UTC

Yes, both variants:

$ echo z | LC_ALL=en_US.UTF-8 sed -n '/[A-Z]/p'
$ echo z | LC_ALL=en_US.UTF-8 grep '[A-Z]'
$ echo 00z | LC_ALL=en_US.UTF-8 egrep '(.)\1[A-Z]'
$

$ echo z | LC_ALL=cs_CZ.UTF-8 sed -n '/[A-Z]/p'
$ echo z | LC_ALL=cs_CZ.UTF-8 grep '[A-Z]'
$ echo 00z | LC_ALL=cs_CZ.UTF-8 egrep '(.)\1[A-Z]'
$

Comment 17 Jaroslav Škarvada 2010-11-24 11:25:50 UTC

According to comments 8 and 14-16, it seems there is nothing to fix, so I am closing this. If another reproducer exists, feel free to reopen.