Bug 108484 - [:alpha:] character class wrong for (some?) UTF-8 locales
Summary: [:alpha:] character class wrong for (some?) UTF-8 locales
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: grep
Version: 9
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Tim Waugh
QA Contact: Mike McLean
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2003-10-29 23:14 UTC by Malcolm Tredinnick
Modified: 2007-04-18 16:58 UTC (History)
CC List: 0 users

Fixed In Version: 2.5.1-22
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2003-12-08 12:52:10 UTC
Embargoed:


Attachments
grep-2.5.1-bracket.patch (343 bytes, patch)
2003-10-30 16:22 UTC, Tim Waugh


Links
System: Red Hat Product Errata
ID: RHBA-2004:079
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated grep package speeds UTF-8 searching
Last Updated: 2004-09-01 04:00:00 UTC

Description Malcolm Tredinnick 2003-10-29 23:14:26 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6a) Gecko/20031014 Firebird/0.7+

Description of problem:
In at least some UTF-8 locales, the open bracket character ('[') is included in
the set of alphabetic characters. This breaks matching on word boundaries, for
example.

The closing bracket is *not* included in the set of alphabetic characters, and
neither are parentheses or braces.


Version-Release number of selected component (if applicable):
glibc-2.3.2-27.9

How reproducible:
Always

Steps to Reproduce:
1. echo [ | LANG=en_AU.UTF-8 grep -E "[[:alpha:]]" -


Actual Results:  The echoed string is matched (so '[' is returned).

Expected Results:  Nothing should have been matched.

Additional info:

Replace en_AU.UTF-8 with just en_AU and nothing is matched.
Replace en_AU.UTF-8 with de_DE.UTF-8 and the match happens.
Replace en_AU.UTF-8 with de_DE and nothing is matched.

Replace the '[' with ']' in all cases and nothing is matched.
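
The combinations above can be run in one go. This is only a sketch: check_alpha is a made-up helper name, and it uses LC_ALL instead of LANG so that no other LC_* variable in the environment overrides the locale under test. If a locale is not installed, grep will typically fall back to the C locale, so a "no match" for that locale does not prove anything.

```sh
#!/bin/sh
# check_alpha is a hypothetical helper for this report, not part of grep.
# $1 = locale name, $2 = single character to test against [[:alpha:]]
check_alpha() {
    if printf '%s\n' "$2" | LC_ALL="$1" grep -E '[[:alpha:]]' >/dev/null 2>&1; then
        echo "$1 '$2': matched"
    else
        echo "$1 '$2': no match"
    fi
}

# The four locales and two characters mentioned in this report.
for loc in en_AU.UTF-8 en_AU de_DE.UTF-8 de_DE; do
    for ch in '[' ']'; do
        check_alpha "$loc" "$ch"
    done
done
```

On an affected grep, the report above says only the '[' lines for the two UTF-8 locales come back as matched.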

The situation where we originally discovered this was when we were running a
search like

       echo "offset" | grep -w "a[offset]"

and it would only work in some locales.

Comment 1 Malcolm Tredinnick 2003-10-29 23:29:59 UTC
Hmm ... the last example was overly simplified and does not work in any locale.
But put something like "a[offset] = 6;" into a file called foo and run

    grep -w offset foo

and it doesn't work in en_AU.UTF-8 (my default locale), but does work in C and
en_AU, etc.
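
A rough way to compare locales for this case, as a sketch only: word_match is a made-up helper, and it pipes the line into grep rather than creating the file foo, which should be equivalent for this test. LC_ALL is used so that the locale under test is not overridden by other LC_* settings.

```sh
#!/bin/sh
# word_match is a hypothetical helper for this report.
# $1 = locale name; prints the line if "offset" is found as a whole word.
word_match() {
    printf '%s\n' 'a[offset] = 6;' | LC_ALL="$1" grep -w offset
}

# Locales named in this comment; an uninstalled locale usually behaves like C.
for loc in C en_AU en_AU.UTF-8; do
    if word_match "$loc" >/dev/null 2>&1; then
        echo "$loc: found"
    else
        echo "$loc: not found"
    fi
done
```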

Comment 2 Jakub Jelinek 2003-10-30 13:50:41 UTC
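Given that the following test program:
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>

int main (void)
{
  regex_t re;
  regmatch_t rm[2];
  /* Pick up the locale from the environment (LANG, LC_ALL, ...).  */
  setlocale (LC_ALL, "");
  /* Abort if the C library classifies '[' as alphabetic.  */
  if (isalpha ('['))
    abort ();
  /* Abort if the pattern fails to compile, or if it matches "["
     (regexec returns 0 on a match).  */
  if (regcomp (&re, "[[:alpha:]]", 0) || !regexec (&re, "[", 2, rm, 0))
    abort ();
  return 0;
}
doesn't abort under LC_ALL=en_AU.UTF-8 or in any other locale that I've tried,
I'd say this has nothing to do with glibc, but with grep. Also,
echo [ | LANG=en_AU.UTF-8 sed -n "/[[:alpha:]]/p"
doesn't print anything either.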
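
To see the two tools side by side under the same locale, something like the following could be used. It is a sketch: compare_tools is a made-up helper name, and LC_ALL stands in for LANG so that the chosen locale cannot be overridden by other LC_* variables.

```sh
#!/bin/sh
# compare_tools is a hypothetical helper for this report.
# $1 = locale name; shows what grep and sed each print for the input "[".
compare_tools() {
    g=$(echo '[' | LC_ALL="$1" grep -E '[[:alpha:]]')
    s=$(echo '[' | LC_ALL="$1" sed -n '/[[:alpha:]]/p')
    echo "grep: '$g' sed: '$s'"
}

compare_tools en_AU.UTF-8
```

If only the grep field is non-empty, the difference is between the two tools and not in the locale data they share.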

Comment 3 Tim Waugh 2003-10-30 15:50:36 UTC
This is a bug in grep's DFA matcher.

Comment 4 Tim Waugh 2003-10-30 16:22:38 UTC
Created attachment 95604
grep-2.5.1-bracket.patch

Here is a potential fix.

Comment 5 Jay Turner 2004-09-02 02:13:26 UTC
An erratum has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-079.html


