Bug 166478 - glibc or perl incorrect locale LC_CTYPE data
Status: CLOSED NOTABUG
Product: Fedora
Classification: Fedora
Component: perl
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Jason Vas Dias
QA Contact: David Lawrence
Reported: 2005-08-22 06:52 EDT by Jan "Yenya" Kasprzak
Modified: 2007-11-30 17:11 EST
CC: 2 users

Doc Type: Bug Fix
Last Closed: 2005-11-02 11:45:24 EST


Attachments: None
Description Jan "Yenya" Kasprzak 2005-08-22 06:52:08 EDT
Description of problem:
perl incorrectly classifies some alphanumeric characters when setlocale(LC_ALL,
"") (i.e. "use locale;") is called.

Version-Release number of selected component (if applicable):
glibc-2.3.5-10
perl-5.8.6-15

How reproducible:
100%


Steps to Reproduce:
1. Create /tmp/test.pl script with the following contents:
----------------
#!/usr/bin/perl -w -C

use strict;
use utf8;
use locale;
use Encode qw(decode);

my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
        # U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)

$str =~ s/\W/-/g;
print $str, "\n";
-------------
2. chmod +x /tmp/test.pl
3. LC_ALL=en_US.UTF-8 /tmp/test.pl
  
Actual results:
A hyphen and the capital C with caron are printed.

Expected results:
A with acute and C with caron should be printed, as both are letters and thus
should not match the \W wildcard.

Additional info:
When the "use locale;" line is commented out, the script works correctly, so
I think it is a locale bug.

I have tried characters other than the two above - all the letters
used in the Czech language. The ones classified incorrectly are those with
an acute accent (i.e. those which are present in ISO-8859-1 and have Unicode
values between U+0080 and U+00FF). The other letters (US-ASCII, and letters
with a caron) are classified correctly by Perl even with "use locale;".

I am actually not sure whether this is a Perl bug or a glibc bug. However,
the same perl-5.8.6 on FreeBSD 5.4 works correctly, and the same perl-5.8.6 on
SUSE SLES9 doesn't. The FC3 system is also affected (glibc-2.3.5-0.fc3.1 and
perl-5.8.5-14.FC3). So I guess this is a glibc or locale bug rather than a Perl bug.
Comment 1 Jakub Jelinek 2005-08-22 12:50:15 EDT
I don't think this looks like a glibc or locale bug, given that:
#include <locale.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wctype.h>

int main (void)
{
  setlocale (LC_ALL, "en_US.UTF-8");
  printf ("%d\n", (int) iswalpha (L'\xc1'));
  printf ("%d\n", (int) iswalpha (L'\x10c'));
  regex_t re;
  if (regcomp (&re, "\\W", REG_EXTENDED) != 0)
    abort ();
  regmatch_t rm;
  const char *str = "abc\xc3\x81\xc4\x8c ";
  if (regexec (&re, str, 1, &rm, 0) != 0)
    abort ();
  if (rm.rm_so != strlen (str) - 1
      || rm.rm_eo != rm.rm_so + 1)
    abort ();
  return 0;
}

works.
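[Editorial note: the iswalpha() half of the program above can be compiled and run standalone; this is a sketch with an illustrative /tmp file name, guarded in case the en_US.UTF-8 locale is not installed.]

```shell
# Standalone check that glibc classifies both U+00C1 and U+010C as
# alphabetic under en_US.UTF-8. Prints "1 1" when the locale is
# available, or "locale unavailable" otherwise.
cat > /tmp/ctype-check.c <<'EOF'
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

int main(void)
{
  if (setlocale(LC_ALL, "en_US.UTF-8") == NULL) {
    puts("locale unavailable");
    return 0;
  }
  printf("%d %d\n", iswalpha(L'\xc1') != 0, iswalpha(L'\x10c') != 0);
  return 0;
}
EOF
cc -o /tmp/ctype-check /tmp/ctype-check.c && /tmp/ctype-check
```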
Comment 2 Jason Vas Dias 2005-11-01 20:29:21 EST
One has to carefully analyse the perl man-pages to find out what is going on here.

I don't particularly agree with the way the upstream perl maintainers have done
this, but this is not a bug - it is the way perl is meant to behave.

The point is that /\w/ matches any ASCII word char, and /\W/ matches any ASCII
non-word char.

To match a UTF-8 word character, you have to use \p{IsWord} .

The \w wildcard is a synonym for the POSIX character class [:word:]. 

So this version of your program:
---
#!/usr/bin/perl -w -C

use strict;
use utf8;
use locale;
use Encode qw(decode);

my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
        # U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)

print 'Is UTF-8:',utf8::is_utf8($str),
      ' is word:', $str =~ /^\w+$/,
      ' is UTF-8 word: ', $str =~ /^\p{IsWord}+$/,
      ' str:',$str, "\n";

Comment 3 Jan "Yenya" Kasprzak 2005-11-02 05:30:11 EST
Well, then tell me why the same Perl works on FreeBSD. Is it that Linux UTF-8
locales do not consider U+00C1 to be a word character, while U+010C is one?

Your statement that "\w matches any ASCII word char" is not true. See perlre(1):

       [...] If "use locale" is in effect, the list of alphabetic
       characters generated by "\w" is taken from the current locale.
       See perllocale.  [...]

So the question is why Perl (or libc) on FreeBSD does consider U+00C1 to be a
word character under the UTF-8 locale, while the same perl with glibc on Linux
doesn't.
Comment 4 Jason Vas Dias 2005-11-02 11:45:24 EST
Sorry I submitted my previous comment before finishing it - 
then my machine rebooted (that's another story).

As I was saying in Comment #2 :

This version of your program shows the issue:
---
#!/usr/bin/perl -w -C

use strict;
use utf8;
use locale;
use Encode qw(decode);

my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
        # U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)

print 'Is UTF-8:',utf8::is_utf8($str), 
      ' is word:', $str =~ /^\w+$/,
      ' is UTF-8 word: ', $str =~ /^\p{IsWord}+$/,
      ' str:',$str, "\n";
---

With the "en_US.UTF-8" locale in effect (the default on Red Hat systems),
this prints:
$ ./test.pl 
Is UTF-8:1 is word: is posix word: 1 is UTF-8 word: 1 str:ÁČ

The point is that \p{IsWord} or [[:word:]] matches UTF-8 word characters, 
while \w / \W do not.

As the perlre man-page states:

"
       The following equivalences to Unicode \p{} constructs and equivalent
       backslash character classes (if available), will hold:

           [:...:]     \p{...}         backslash
       ...
           word        IsWord
       ...
"
i.e. the [:word:] / \p{IsWord} classes are NOT equivalent to \w.

As I said, I don't particularly agree with the way the upstream perl
developers have done this, but this is intended behaviour.

RE: your comment #3: 
>  Your statement that "\w matches any ASCII word char" is not true. 
>  See perlre(1):
>       [...] If "use locale" is in effect, the list of alphabetic characters 
>             generated by "\w" is taken from the current locale.

Yes, that refers to alphabetic characters, not Unicode sequences.

To match Unicode characters in the word class, you must use \p{IsWord} or
[:word:] .

> So the question is why Perl (or libc) in FreeBSD does consider U+00C1 to be a
> character under the UTF-8 locale, while the same perl with glibc on Linux 
> doesn't.

Possibly because the default locale for Red Hat systems is UTF-8 enabled?
