Bug 108484 - [:alpha:] character class wrong for (some?) UTF-8 locales
Product: Red Hat Linux
Classification: Retired
Component: grep
Hardware: i686 Linux
Priority: medium, Severity: medium
Assigned To: Tim Waugh
QA Contact: Mike McLean
Reported: 2003-10-29 18:14 EST by Malcolm Tredinnick
Modified: 2007-04-18 12:58 EDT (History)
Fixed In Version: 2.5.1-22
Doc Type: Bug Fix
Last Closed: 2003-12-08 07:52:10 EST

Attachments
grep-2.5.1-bracket.patch (343 bytes, patch)
2003-10-30 11:22 EST, Tim Waugh
Description Malcolm Tredinnick 2003-10-29 18:14:26 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6a) Gecko/20031014

Description of problem:
In at least some UTF-8 locales, the open bracket character ('[') is included in
the set of alphabetic characters. Among other things, this breaks matching on
word boundaries.

The closing bracket is *not* included in the set of alphabetic characters, and
neither are parentheses or braces.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. echo [ | LANG=en_AU.UTF-8 grep -E "[[:alpha:]]" -

Actual Results:  The echoed string is matched (so '[' is returned).

Expected Results:  Nothing should have been matched.

Additional info:

Replace en_AU.UTF-8 with just en_AU and nothing is matched.
Replace en_AU.UTF-8 with de_DE.UTF-8 and the match happens.
Replace en_AU.UTF-8 with de_DE and nothing is matched.

Replace the '[' with ']' in all cases and nothing is matched.

We originally discovered this while running a search like

       echo "offset" | grep -w "a[offset]"

and it would only work in some locales.
Comment 1 Malcolm Tredinnick 2003-10-29 18:29:59 EST
Hmm ... the last example was overly simplified and does not work in any locale.
But put something like "a[offset] = 6;" into a file called foo and run

    grep -w offset foo

and it doesn't work in en_AU.UTF-8 (my default locale), but does work in C and
en_AU, etc.
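
For illustration, here is why a misclassified '[' breaks "grep -w". The grep
documentation defines a whole-word match as one that is neither preceded nor
followed by a word constituent (a letter, digit, or underscore). The sketch
below is not grep's code; every name in it (has_whole_word, word_char_ok,
word_char_buggy) is hypothetical, and word_char_buggy only imitates the symptom
reported here by treating '[' as a letter.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Classifier type: nonzero if C counts as a word constituent.  */
typedef int (*word_char_fn) (unsigned char c);

/* Intended behaviour: letters, digits and underscore.  */
int
word_char_ok (unsigned char c)
{
  return isalnum (c) || c == '_';
}

/* Deliberately wrong classifier (an assumption made for illustration):
   it also treats '[' as a letter, imitating the reported symptom.  */
int
word_char_buggy (unsigned char c)
{
  return c == '[' || word_char_ok (c);
}

/* Returns 1 if WORD occurs in LINE with no word constituent directly
   before or after it, else 0.  */
int
has_whole_word (const char *line, const char *word, word_char_fn is_word)
{
  size_t len;
  const char *p;

  assert (line != NULL && word != NULL && is_word != NULL);
  len = strlen (word);
  if (len == 0)
    return 0;
  for (p = strstr (line, word); p != NULL; p = strstr (p + 1, word))
    {
      int before = p > line && is_word ((unsigned char) p[-1]);
      int after = p[len] != '\0' && is_word ((unsigned char) p[len]);
      if (!before && !after)
        return 1;
    }
  return 0;
}
```

With word_char_ok, has_whole_word ("a[offset] = 6;", "offset", ...) finds the
word, because '[' and ']' are not word constituents. With word_char_buggy the
'[' in front of "offset" counts as a letter, so the line is rejected, which is
the behaviour seen in en_AU.UTF-8.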
Comment 2 Jakub Jelinek 2003-10-30 08:50:41 EST
Given that:
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>

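int main (void)
{
  regex_t re;
  regmatch_t rm[2];

  /* Use the locale from the environment (LANG, LC_ALL, ...).  */
  setlocale (LC_ALL, "");
  /* libc's own classification: '[' must not be alphabetic.  */
  if (isalpha ('['))
    abort ();
  /* libc's regex: abort on a compile error or if "[" matches.  */
  if (regcomp (&re, "[[:alpha:]]", 0) || !regexec (&re, "[", 2, rm, 0))
    abort ();
  return 0;
}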
doesn't abort under LC_ALL=en_AU.UTF-8 or any other locale that I've tried,
I'd say this has nothing to do with glibc; the problem is in grep.
Likewise,
echo [ | LANG=en_AU.UTF-8 sed -n "/[[:alpha:]]/p"
doesn't print anything.
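
The same libc-level check can be repeated for each locale and character listed
in the description. The following is a minimal sketch, assuming a POSIX
<regex.h>; the helper name alpha_class_matches is hypothetical. It exercises
only the C library's matcher, not grep's own matching code, so it can help
rule libc in or out but says nothing about grep itself.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if TEXT contains a character that the
   C library's regex matcher accepts as [[:alpha:]] under LOCALE_NAME,
   0 if it does not, and -1 if the locale cannot be selected or the
   regex calls fail.  Note that it changes the process-wide locale.  */
int
alpha_class_matches (const char *locale_name, const char *text)
{
  regex_t re;
  int rc;

  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;                  /* locale not installed */
  if (regcomp (&re, "[[:alpha:]]", REG_NOSUB) != 0)
    return -1;                  /* pattern failed to compile */
  rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  if (rc == 0)
    return 1;                   /* matched */
  return rc == REG_NOMATCH ? 0 : -1;
}
```

Calling it with "en_AU.UTF-8", "en_AU", "de_DE.UTF-8" and "de_DE" against "[",
"]", "(" and "{" covers the cases from the description. A result of -1 most
likely means the locale is not installed on the test machine.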
Comment 3 Tim Waugh 2003-10-30 10:50:36 EST
dfa bug.
Comment 4 Tim Waugh 2003-10-30 11:22:38 EST
Created attachment 95604

Here is a potential fix.
Comment 5 Jay Turner 2004-09-01 22:13:26 EDT
An erratum has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

