Bug 108484 - [:alpha:] character class wrong for (some?) UTF-8 locales
Status: CLOSED ERRATA
Product: Red Hat Linux
Classification: Retired
Component: grep
Version: 9
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: Tim Waugh
QA Contact: Mike McLean
Reported: 2003-10-29 18:14 EST by Malcolm Tredinnick
Modified: 2007-04-18 12:58 EDT
Fixed In Version: 2.5.1-22
Doc Type: Bug Fix
Last Closed: 2003-12-08 07:52:10 EST

Attachments
grep-2.5.1-bracket.patch (343 bytes, patch)
2003-10-30 11:22 EST, Tim Waugh

Description Malcolm Tredinnick 2003-10-29 18:14:26 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6a) Gecko/20031014
Firebird/0.7+

Description of problem:
In at least some UTF-8 locales, the opening bracket character ('[') is treated
as a member of the set of alphabetic characters. Among other things, this
breaks matching on word boundaries.

The closing bracket is *not* included in the set of alphabetic characters,
nor are parentheses or braces.


Version-Release number of selected component (if applicable):
glibc-2.3.2-27.9

How reproducible:
Always

Steps to Reproduce:
1. echo [ | LANG=en_AU.UTF-8 grep -E "[[:alpha:]]" -


Actual Results:  The echoed string is matched (so '[' is returned).

Expected Results:  Nothing should have been matched.

Additional info:

Replace en_AU.UTF-8 with just en_AU and nothing is matched.
Replace en_AU.UTF-8 with de_DE.UTF-8 and the match happens.
Replace en_AU.UTF-8 with de_DE and nothing is matched.

Replace the '[' with ']' in all cases and nothing is matched.
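
The locale and character combinations listed above can be checked in one pass. The following is only a sketch: `alpha_match` is a helper name invented here, the locales have to be installed (see `locale -a`) for the comparison to mean anything, and `LC_ALL` is used instead of `LANG` because it takes precedence over it.

```shell
#!/bin/sh
# alpha_match LOCALE CHAR (hypothetical helper for this sketch):
# report whether grep's [[:alpha:]] matches CHAR under LOCALE.
alpha_match() {
    if printf '%s\n' "$2" | env LC_ALL="$1" grep -E '[[:alpha:]]' >/dev/null 2>&1; then
        echo match
    else
        echo "no match"
    fi
}

# Locales and characters taken from the report, plus C as a baseline.
for loc in en_AU.UTF-8 en_AU de_DE.UTF-8 de_DE C; do
    for ch in '[' ']' '(' ')' '{' '}'; do
        printf '%-12s %s  %s\n' "$loc" "$ch" "$(alpha_match "$loc" "$ch")"
    done
done
```

Going by the results reported above, an affected grep should print "match" only on the '[' rows of the two UTF-8 locales.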

We originally discovered this while running a search like

       echo "offset" | grep -w "a[offset]"

and it would only work in some locales.
Comment 1 Malcolm Tredinnick 2003-10-29 18:29:59 EST
Hmm ... the last example was overly simplified and does not work in any locale.
But put something like "a[offset] = 6;" into a file called foo and run

    grep -w offset foo

and it doesn't work in en_AU.UTF-8 (my default locale), but does work in C and
en_AU, etc.
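
The file-based case can be reproduced along these lines. This is a sketch under assumptions: `word_hit` and `approx_hit` are made-up helper names, the line is piped in rather than stored in a file called foo, and the explicit pattern in `approx_hit` only approximates what `-w` means (the match must not sit next to a letter, digit, or underscore); it is not grep's actual implementation. It is there to show why a '[' that counts as alphabetic would hide the match.

```shell
#!/bin/sh
line='a[offset] = 6;'

# word_hit LOCALE TEXT: does "grep -w offset" find the word in TEXT?
word_hit() {
    if printf '%s\n' "$2" | env LC_ALL="$1" grep -w offset >/dev/null 2>&1; then
        echo found
    else
        echo "not found"
    fi
}

# approx_hit LOCALE TEXT: the same search with the word boundaries spelled out.
approx_hit() {
    if printf '%s\n' "$2" | env LC_ALL="$1" \
        grep -E '(^|[^[:alnum:]_])offset([^[:alnum:]_]|$)' >/dev/null 2>&1; then
        echo found
    else
        echo "not found"
    fi
}

for loc in C en_AU en_AU.UTF-8; do
    printf '%-12s -w: %-10s explicit: %s\n' \
        "$loc" "$(word_hit "$loc" "$line")" "$(approx_hit "$loc" "$line")"
done
```

If '[' were wrongly counted as a letter, the character before "offset" would no longer qualify as a boundary, so the search would come back empty.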
Comment 2 Jakub Jelinek 2003-10-30 08:50:41 EST
Given that:
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>

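int main (void)
{
  regex_t re;
  regmatch_t rm[2];
  /* Pick up the locale from the environment (LANG, LC_ALL, ...).  */
  setlocale (LC_ALL, "");
  /* Abort if the C library itself classifies '[' as alphabetic.  */
  if (isalpha ('['))
    abort ();
  /* Abort if the pattern fails to compile, or if [[:alpha:]] matches "["
     (regexec returns 0 on a match).  */
  if (regcomp (&re, "[[:alpha:]]", 0) || !regexec (&re, "[", 2, rm, 0))
    abort ();
  return 0;
}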
doesn't abort under LC_ALL=en_AU.UTF-8 or any other locale that I've tried,
I'd say this has nothing to do with glibc and is a grep problem.
echo [ | LANG=en_AU.UTF-8 sed -n "/[[:alpha:]]/p"
doesn't print anything either.
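
The same comparison can be scripted: run one bracket expression through both grep and sed under the same locale and see which tool disagrees. This is a sketch only; `tool_alpha` is a hypothetical helper name, and the locale list is taken from the report.

```shell
#!/bin/sh
# tool_alpha TOOL LOCALE CHAR: does TOOL's [[:alpha:]] match CHAR in LOCALE?
tool_alpha() {
    case "$1" in
        # grep exits non-zero when nothing matches, hence the "|| true".
        grep) out=$(printf '%s\n' "$3" | env LC_ALL="$2" grep -E '[[:alpha:]]' 2>/dev/null || true) ;;
        sed)  out=$(printf '%s\n' "$3" | env LC_ALL="$2" sed -n '/[[:alpha:]]/p' 2>/dev/null || true) ;;
        *)    echo "unknown tool: $1" >&2; return 2 ;;
    esac
    # Both commands print the input line only when the pattern matched.
    if [ -n "$out" ]; then echo match; else echo "no match"; fi
}

for loc in C en_AU.UTF-8 de_DE.UTF-8; do
    printf '%-12s grep: %-9s sed: %s\n' \
        "$loc" "$(tool_alpha grep "$loc" '[')" "$(tool_alpha sed "$loc" '[')"
done
```

If grep reports a match where sed does not, the difference suggests the problem lies in grep rather than in the C library, which is the conclusion drawn above.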
Comment 3 Tim Waugh 2003-10-30 10:50:36 EST
This is a bug in grep's dfa code.
Comment 4 Tim Waugh 2003-10-30 11:22:38 EST
Created attachment 95604
grep-2.5.1-bracket.patch

Here is a potential fix.
Comment 5 Jay Turner 2004-09-01 22:13:26 EDT
An erratum has been issued that should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-079.html
