583011 – grep: inconsistency with range expressions

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	grep
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Jaroslav Škarvada
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	635748

Reported:	2010-04-16 12:42 UTC by H.J. Lu
Modified:	2010-10-03 20:59 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Clones:	635748
Environment:
Last Closed:	2010-10-03 20:59:02 UTC
Type:	---
Embargoed:
Dependent Products:

Description H.J. Lu 2010-04-16 12:42:42 UTC

"make pdf" in binutils is broken:

[hjl@gnu-6 bfd]$ make pdf
Making pdf in doc
make[1]: Entering directory `/export/build/gnu/binutils/build-x86_64-linux/bfd/doc'
TEXINPUTS="/export/gnu/import/git/binutils/bfd/../texinfo:$TEXINPUTS" \
	MAKEINFO='makeinfo --split-size=5000000   -I /export/gnu/import/git/binutils/bfd/doc' \
	`if test -f /export/gnu/import/git/binutils/bfd/../texinfo/util/texi2dvi; then echo /export/gnu/import/git/binutils/bfd/../texinfo/util/texi2dvi; else echo texi2dvi; fi` --pdf --batch -o bfd.pdf `test -f 'bfd.texinfo' || echo '/export/gnu/import/git/binutils/bfd/doc/'`bfd.texinfo
egrep: Invalid range end
/usr/bin/texi2dvi: cannot read .//export/gnu/import/git/binutils/bfd/doc/bfd.texinfo, skipping.
make[1]: *** [bfd.pdf] Error 1
make[1]: Leaving directory `/export/build/gnu/binutils/build-x86_64-linux/bfd/doc'
make: *** [pdf-recursive] Error 1
[hjl@gnu-6 bfd]$ 

/usr/bin/texi2dvi has

  # If the COMMAND_LINE_FILENAME is not absolute (e.g., --debug.tex),
  # prepend `./' in order to avoid that the tools take it as an option.
  echo "$command_line_filename" | $EGREP '^(/|[A-z]:/)' >&6 \
  || command_line_filename="./$command_line_filename"

and it no longer works:

[hjl@gnu-6 bfd]$ echo foo | egrep '^(/|[A-z]:/)'
egrep: Invalid range end
[hjl@gnu-6 bfd]$

Comment 1 H.J. Lu 2010-04-16 12:52:35 UTC

It doesn't like "[A-z]"

[hjl@gnu-6 bfd]$ echo foo | egrep '[A-z]'
egrep: Invalid range end
[hjl@gnu-6 bfd]$ echo foo | egrep '[A-Z]'
foo
[hjl@gnu-6 bfd]$

Comment 2 H.J. Lu 2010-04-16 12:58:47 UTC

I am not sure what "[A-z]" means. Shouldn't it be "[A-Z]"?

Comment 3 Jaroslav Škarvada 2010-04-16 13:02:22 UTC

Hmm, it works perfectly for me:

$ make pdf
...
$ echo $?
0

$ echo foo | egrep '^(/|[A-z]:/)'
$ echo foo | egrep '[A-z]'
foo
$ echo foo | egrep '[A-Z]'
foo
$ egrep --version
GNU grep 2.6.3
...

What is your locale?

Comment 4 Jaroslav Škarvada 2010-04-16 13:03:53 UTC

Ad comment 2:
[A-z] means A-Z, some special chars and a-z inclusive.
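To make "some special chars" concrete: when ranges follow plain ASCII order (as in the C locale), A..z spans codes 65 to 122. A minimal sketch that lists the non-letter characters in that span; the function name is made up for illustration and is not part of grep or texi2dvi.

```shell
# Hypothetical helper: print the ASCII characters between 'A' (65) and
# 'z' (122) that are not letters, i.e. what [A-z] matches besides A-Z and
# a-z when ranges follow ASCII order.
ascii_extras_in_A_to_z() {
    i=65
    out=
    while [ "$i" -le 122 ]; do
        # Skip the upper-case (65-90) and lower-case (97-122) letters.
        if [ "$i" -gt 90 ] && [ "$i" -lt 97 ]; then
            # Turn the code into a character via an octal printf escape.
            out=$out$(printf "\\$(printf '%03o' "$i")")
        fi
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

ascii_extras_in_A_to_z
```

The six characters printed are codes 91 to 96: the two square brackets, backslash, caret, underscore and backtick.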

Comment 5 H.J. Lu 2010-04-16 13:09:36 UTC

(In reply to comment #3)
> 
> What is your locale?    

I have

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
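In the output above, the quoted values are the ones locale(1) reports as implied rather than set explicitly, so collation here comes from LANG. A rough sketch of the usual POSIX precedence, assuming an empty string means "unset"; the helper name and the "POSIX" fallback label are illustrative assumptions (the real default is implementation-defined).

```shell
# Hypothetical helper: given the values of LC_ALL, LC_COLLATE and LANG
# (empty string = unset), print the locale that governs collation.
# Precedence: LC_ALL first, then LC_COLLATE, then LANG, else the default.
effective_collate() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"
    else
        printf '%s\n' POSIX
    fi
}

# The reporter's settings: LC_ALL empty, LC_COLLATE not set explicitly.
effective_collate "" "" "en_US.UTF-8"
# An LC_COLLATE override has no effect once LC_ALL is set.
effective_collate "en_US.UTF-8" "C" "en_US.UTF-8"
```

Both calls print en_US.UTF-8, which is why a per-command override has to account for LC_ALL as well as LC_COLLATE.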

Comment 6 Jaroslav Škarvada 2010-04-16 13:13:18 UTC

Confirmed, I will investigate it. It works with C and cs_CZ.UTF-8 locales but not with en_US[.UTF-8] locale.

Comment 7 H.J. Lu 2010-04-16 13:16:16 UTC

(In reply to comment #4)
> Ad comment 2:
> [A-z] means A-Z, some special chars and a-z inclusive.    

From:

http://www.gnu.org/software/grep/manual/grep.html#Character-Classes-and-Bracket-Expressions

A-Z and a-z aren't in the same class. I am not sure if [A-z]
has any specific meaning. If I am reading

---
  # If the COMMAND_LINE_FILENAME is not absolute (e.g., --debug.tex),
  # prepend `./' in order to avoid that the tools take it as an option.
  echo "$command_line_filename" | $EGREP '^(/|[A-z]:/)' >&6 \
  || command_line_filename="./$command_line_filename"
---

correctly, '^(/|[A-z]:/)' is used to check for an absolute path,
and [A-z] is for DOS/Windows drive letters. I don't think
special chars are allowed there. They may be limited to A-Z.
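If the intent really is "a slash, or a drive letter followed by :/", one possible rewrite restricts the bracket expression to letters and pins the locale for that one grep call. This is only a sketch: the function name is hypothetical, it is not the actual texi2dvi fix, and accepting lower-case drive letters is an assumption.

```shell
# Hypothetical rewrite of the check: accept "/..." or "X:/..." where X is an
# ASCII letter. LC_ALL=C pins the ranges to ASCII order, so the result does
# not depend on the caller's locale.
prefix_if_relative() {
    name=$1
    printf '%s\n' "$name" | LC_ALL=C grep -E '^(/|[A-Za-z]:/)' >/dev/null \
        || name="./$name"
    printf '%s\n' "$name"
}

prefix_if_relative /usr/share/bfd.texinfo
prefix_if_relative c:/docs/bfd.texinfo
prefix_if_relative --debug.tex
prefix_if_relative _:/odd
```

The first two names come back unchanged; the last two get a "./" prefix, including "_:/odd", which the original [A-z] range would have treated as absolute under ASCII ordering.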

Comment 8 Jaroslav Škarvada 2010-04-16 13:37:21 UTC

AFAIK, the first 128 codes (ASCII) should be the same for all locales, so you can consult man ascii to see what A-z should mean. If you only have upper-case letters, A-Z should be enough. I am not familiar with your case; if there can't be underscores "_" or lower-case letters in the path, then it should be OK.

I will raise the [A-z] case with grep upstream. I think it should return the same result for all locales.

Comment 9 Jaroslav Škarvada 2010-04-20 15:02:24 UTC

It happens when grep is compiled with --without-included-regex, so it may be a glibc issue. I will investigate further.

Comment 10 Jaroslav Škarvada 2010-04-23 12:42:46 UTC

I was wrong in my previous comments: the ordering may change according to the LC_COLLATE setting. But according to my tests, there shouldn't be a range error with the glibc regex. I will investigate.

Comment 11 Jaroslav Škarvada 2010-09-20 10:57:16 UTC

I suggest you rewrite the code or use LC_COLLATE=C, e.g.:

echo foo | LC_COLLATE=C egrep '^(/|[A-z]:/)'

Otherwise it will not match what you want even if the 'invalid range' problem is fixed (e.g. 'a' and 'Z' wouldn't match).
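A small sketch of what the C-collation workaround buys, assuming a grep that follows ASCII order in the C locale. It uses LC_ALL rather than LC_COLLATE because LC_ALL, when set in the caller's environment, takes precedence; the helper name is made up for illustration.

```shell
# Hypothetical helper: does a single character match [A-z] under C collation?
in_A_to_z_c_locale() {
    printf '%s\n' "$1" | LC_ALL=C grep -E '^[A-z]$' >/dev/null
}

for ch in a Z _ 1; do
    if in_A_to_z_c_locale "$ch"; then
        echo "$ch: match"
    else
        echo "$ch: no match"
    fi
done
```

With ASCII ordering, 'a', 'Z' and '_' all match and '1' does not, so the range is valid and covers both letter cases, at the cost of also accepting the punctuation between them.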

Comment 12 Jaroslav Škarvada 2010-09-20 10:58:42 UTC

Other packages are also affected (e.g. sed); it looks like a glibc bug. The failing code:

#define _GNU_SOURCE 1
#include <stdio.h>
#include <string.h>
#include <regex.h>
#include <locale.h>

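int main(int argc, char *argv[])
{
  char re[] = "[A-z]";
  struct re_pattern_buffer buf = {0};
  const char *err;

  /* Without the en_US.UTF-8 locale installed, setlocale() fails and the
     test would silently run in the C locale, where [A-z] is accepted. */
  if (!setlocale(LC_ALL, "en_US.UTF-8"))
    fprintf(stderr, "warning: en_US.UTF-8 locale not available\n");
  /* Treat a range whose end collates before its start as an error. */
  re_set_syntax(RE_NO_EMPTY_RANGES);
  /* On the affected glibc this prints "Invalid range end". */
  if ((err = re_compile_pattern(re, strlen(re), &buf)))
    printf("%s\n", err);

  return 0;
}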

While:

$ echo -e "a\nb\nc\nz\nA\nB\nC\nZ" | LC_ALL=en_US.UTF-8 sort 
a
A
b
B
c
C
z
Z

Thus reassigning to glibc.

Comment 13 Andreas Schwab 2010-09-20 11:40:48 UTC

If you want ASCII collation use LC_COLLATE=C.

Comment 14 Jaroslav Škarvada 2010-09-20 14:53:37 UTC

> If you want ASCII collation use LC_COLLATE=C.
This is not about ASCII collation. This is about inconsistency between collation and regex ranges.

Consider the following two programs, p1 and p2:

p1:
---
#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(int argc, char *argv[])
{
  setlocale(LC_ALL, "");
  printf("%d\n", strcoll(argv[1], argv[2]));
  return 0;
}


p2:
---
#define _GNU_SOURCE 1
#include <stdio.h>
#include <string.h>
#include <regex.h>
#include <locale.h>

int main(int argc, char *argv[])
{
  struct re_pattern_buffer buf = {0};
  const char *err;

  setlocale(LC_ALL, "");
  re_set_syntax(RE_NO_EMPTY_RANGES);
  if ((err = re_compile_pattern(argv[1], strlen(argv[1]), &buf)))
    printf("%s\n", err);

  return 0;
}


And the following test case:

$ LC_ALL=en_US.UTF-8 ./p1 A b
-1

$ LC_ALL=en_US.UTF-8 ./p2 [A-b]
Invalid range end

$ LC_ALL=cs_CZ.UTF-8 ./p1 A b
-1

$ LC_ALL=cs_CZ.UTF-8 ./p2 [A-b]

This behaviour seems inconsistent. Please note that the regex code built into grep works as expected but the glibc regex doesn't. 

From 'man 7 regex':
...If two characters in the list are separated by '-', this is shorthand for the full range of characters between those two (inclusive) in the collating sequence...

Comment 15 Andreas Schwab 2010-09-20 15:17:51 UTC

Range expressions are only defined for the POSIX locale.

Comment 16 Paolo Bonzini 2010-09-20 15:46:19 UTC

Undefined behavior is not a random reason for closing bugs.  By your argument you could just rip off the range expression code and it would still be POSIX-compliant.

At least explain why [A-z] is an empty range, since it obviously is not.

Comment 17 Andreas Schwab 2010-09-20 16:10:17 UTC

Undefined behaviour means undefined behaviour.

Comment 18 Paolo Bonzini 2010-09-20 16:46:27 UTC

Please start making sense.

Comment 19 Paolo Bonzini 2010-09-20 17:22:27 UTC

BTW, range expressions are _not_ undefined, in any locale:

"A range expression represents the set of collating elements that fall between two elements in the current collation sequence, inclusively. It is expressed as the starting point and the ending point separated by a hyphen (-)."

All POSIX says is that "range expressions must not be used in portable applications because their behaviour is dependent on the collating sequence".

Please at the very least enlighten people as to the meaning of "collation sequence" in glibc regex, how it is (not) related to strcoll/wcscoll, and why [A-z] fails but [a-Z] works.

Comment 20 Andreas Schwab 2010-09-21 12:44:27 UTC

Only in the POSIX locale.

Comment 21 Paolo Bonzini 2010-09-21 13:01:50 UTC

This is starting to be _really_ ridiculous.  Please back up your assertions with proper citations from the POSIX documents.

All I can see is this in the "Regular Expressions" chapter.

    Ranges will be treated according to the current collating sequence, and 
    include such characters that fall within the range based on that collating 
    sequence, regardless of character values.

and this in "sort":

    Comparisons [...] shall be performed using the collating sequence of the 
    current locale.

which suggests that there is one and only one collating sequence defined for any given locale, and that Jaroslav's experiments were correct.

Comment 22 Andreas Schwab 2010-09-21 13:17:24 UTC

Nothing to fix here.

Comment 23 Paolo Bonzini 2010-09-21 13:41:08 UTC

I'll ask the Austin Group for a clarification then.  I agree that leaving this closed until then is the best course of action.

Comment 24 Eric Blake 2010-09-21 14:12:21 UTC

POSIX 2008 (http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html section 9.3.5 bullet 7) states:
"In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched."

The behavior of [A-z] in en_US.UTF-8 is "unspecified", but _not_ "undefined".  A compliant app cannot guarantee what the behavior will be, but the behavior should at least be explainable, and as a QoI point, glibc should document and define this behavior as an extension to POSIX, so that apps relying on glibc can take advantage of this extension for known behavior.

Moreover, there's _nothing_ in POSIX that requires [A-Z] to match all collation elements that collate between A and Z when outside the POSIX locale, and it would be _just as equally valid_ for [A-Z] to have the same meaning in both POSIX and en_US.UTF-8, instead of glibc's current behavior that [A-Z] behaves more like [AbBcC...zZ], and in fact, it would be _more_ useful to users, given the number of "bug" reports against bash, sed, grep, gawk, ... all complaining about the effects of locales on range expressions.

However, if you insist that glibc will continue to represent range expressions as the sequence of collation elements between the beginning and end collation element, for all locales, then for QoI you should also fix things to use the same locale collation sequencing as strcoll.  That is, [A-z] is well-defined in the POSIX locale, and in all other locales where A collates before z (which includes en_US.UTF-8), it should also be valid by either formulation for the behavior of ranges outside the POSIX locale (the set of collation symbols that fall between A and z for the given locale, or matching the same set of collation symbols as would be selected in the POSIX locale).  Given the example in comment 14, I think we have proof that glibc is doing neither behavior, and I for one would love to see glibc documentation explaining why this is acceptable.

As a parting note, it was recently suggested on the grep list that maybe glibc should consider documenting the following behavior:
[A-Z] - the same range as would be selected in the POSIX locale, for all locales
[[.A.]-[.Z.]] - the range of collation elements that fall between A and Z for the given locale
That way, users would be able to select between which of two sane interpretations they would like for non-POSIX locale range expressions, while at the same time aiding the large number of scripts that mistakenly used range expressions outside the POSIX locale while assuming POSIX locale semantics.

Comment 25 Paolo Bonzini 2010-09-21 14:47:02 UTC

Upon further analysis, there is a bug in grep too:

$ sed '/[A-Z]/p'
z
z
$ grep '/[A-Z]/p'
z
$

The problem here is that grep's DFA matcher is trying to use strcoll for single-byte matches, instead of glibc's own rules (whatever they are).  At the same time, grep relies on glibc to ascertain the validity of regular expressions, thus giving the inconsistent behavior.

I'm reassigning this to grep and suggest that the text in comment #24 is taken to the upstream glibc bug tracker.

Comment 26 Paolo Bonzini 2010-09-21 14:56:19 UTC

$ echo z | sed -n '/[A-Z]/p'
$ echo z | grep '[A-Z]'
z
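The comparison above can be wrapped in a small check that reports whether grep and sed agree on a bracket expression in the current locale. This is a sketch only: the function name is hypothetical, and the pattern is spliced into a sed address, so it must not contain "/".

```shell
# Hypothetical check: do grep and sed agree on whether INPUT matches the
# bracket expression PATTERN? Returns 0 when both report the same verdict.
grep_sed_agree() {
    pattern=$1
    input=$2
    if printf '%s\n' "$input" | grep -e "$pattern" >/dev/null; then
        g=match
    else
        g=nomatch
    fi
    if [ -n "$(printf '%s\n' "$input" | sed -n "/$pattern/p")" ]; then
        s=match
    else
        s=nomatch
    fi
    echo "grep: $g, sed: $s"
    [ "$g" = "$s" ]
}

grep_sed_agree '[A-Z]' z || echo "inconsistent range handling"
```

What it prints depends on the locale and on the grep and sed versions in use; running it under LC_ALL=C gives a baseline where both tools should report no match for "z".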

Comment 27 Paolo Bonzini 2010-09-23 12:15:13 UTC

Fixed by upstream commit 99d3c7e1308beb1ce9a3c535ca4b6581ebd653ee.
