Description of problem:
perl incorrectly classifies some alphanumeric characters when setlocale(LC_ALL, "") (i.e. "use locale;") is in effect.

Version-Release number of selected component (if applicable):
glibc-2.3.5-10
perl-5.8.6-15

How reproducible:
100%

Steps to Reproduce:
1. Create a /tmp/test.pl script with the following contents:
----------------
#!/usr/bin/perl -w -C

use strict;
use utf8;
use locale;
use Encode qw(decode);

# U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)
my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
$str =~ s/\W/-/g;
print $str, "\n";
----------------
2. chmod +x /tmp/test.pl
3. LC_ALL=en_US.UTF-8 /tmp/test.pl

Actual results:
A hyphen and the capital C with caron are printed.

Expected results:
A with acute and C with caron should be printed, as both are letters and therefore should not match the \W character class.

Additional info:
When the "use locale;" line is commented out, the script works correctly, so I think it is a locale bug. I have tried characters other than the two above - all the letters used in the Czech language. The incorrectly classified ones are those with an acute accent (i.e. those which are present in ISO-8859-1, and which have Unicode values between U+0080 and U+00FF). The other letters (US-ASCII, and letters with a caron) are classified correctly by Perl even with "use locale;".

I am actually not sure whether this is a Perl bug or a glibc bug. However, the same perl-5.8.6 on FreeBSD 5.4 works correctly, and the same perl-5.8.6 on SUSE SLES9 doesn't. The FC3 system is also buggy (glibc-2.3.5-0.fc3.1 and perl-5.8.5-14.FC3). So I guess this is a glibc or locale bug rather than a Perl bug.
I don't think this looks like a glibc or locale bug, given that the following program works:

----------------
#include <locale.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wctype.h>

int
main (void)
{
  setlocale (LC_ALL, "en_US.UTF-8");
  printf ("%d\n", (int) iswalpha (L'\xc1'));
  printf ("%d\n", (int) iswalpha (L'\x10c'));
  regex_t re;
  if (regcomp (&re, "\\W", REG_EXTENDED) != 0)
    abort ();
  regmatch_t rm;
  const char *str = "abc\xc3\x81\xc4\x8c ";
  if (regexec (&re, str, 1, &rm, 0) != 0)
    abort ();
  if (rm.rm_so != strlen (str) - 1 || rm.rm_eo != rm.rm_so + 1)
    abort ();
  return 0;
}
----------------

(Note: <wctype.h> is needed for iswalpha.)
One has to carefully analyse the perl man pages to find out what is going on here. I don't particularly agree with the way the upstream perl maintainers have done this, but this is not a bug - it is the way perl is meant to behave.

The point is that /\w/ matches any ASCII word character, and /\W/ matches any ASCII non-word character. To match a UTF-8 word character, you have to use \p{IsWord}. The \w class is a synonym for the POSIX character class [:word:].

So this version of your program:
---
#!/usr/bin/perl -w -C

use strict;
use utf8;
use locale;
use Encode qw(decode);

# U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)
my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
print 'Is UTF-8:', utf8::is_utf8($str),
      ' is word:', $str =~ /^\w+$/,
      ' is UTF-8 word:', $str =~ /^\p{IsWord}+$/,
      ' str:', $str, "\n";
Well, then tell me why the same Perl works on FreeBSD. Is it that Linux UTF-8 locales do not consider U+00C1 to be a letter, while U+010C is one?

Your statement that "\w matches any ASCII word char" is not true. See perlre(1):

    [...] If "use locale" is in effect, the list of alphabetic
    characters generated by "\w" is taken from the current locale.
    See perllocale. [...]

So the question is why Perl (or libc) on FreeBSD does consider U+00C1 to be a letter under the UTF-8 locale, while the same perl with glibc on Linux doesn't.
Sorry, I submitted my previous comment before finishing it - then my machine rebooted (that's another story). As I was saying in Comment #2:

This version of your program shows the issue:
---
#!/usr/bin/perl -w -C

use strict;
use utf8;
use locale;
use Encode qw(decode);

# U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)
my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
print 'Is UTF-8:', utf8::is_utf8($str),
      ' is word:', $str =~ /^\w+$/,
      ' is UTF-8 word: ', $str =~ /^\p{IsWord}+$/,
      ' str:', $str, "\n";
---

With the "en_US.UTF-8" locale in effect (the default on Red Hat systems) this prints:

$ ./test.pl
Is UTF-8:1 is word: is posix word: 1 is UTF-8 word: 1 str:ÁČ

The point is that \p{IsWord} or [[:word:]] matches UTF-8 word characters, while \w / \W do not. As the perlre man page states:

    The following equivalences to Unicode \p{} constructs and
    equivalent backslash character classes (if available), will hold:

        [:...:]    \p{...}    backslash
        ...
        word       IsWord
        ...

i.e. the [:word:] / \p{IsWord} classes are NOT equivalent to \w.

As I said, I don't particularly agree with the way the upstream perl developers have done this, but this is intended behaviour.

RE: your comment #3:

> Your statement that "\w matches any ASCII word char" is not true.
> See perlre(1):
> [...] If "use locale" is in effect, the list of alphabetic characters
> generated by "\w" is taken from the current locale.

Yes, that's alphabetic characters, not Unicode sequences. To match Unicode sequences with the word class, you must use \p{IsWord} or [:word:].

> So the question is why Perl (or libc) in FreeBSD does consider U+00C1 to be a
> character under the UTF-8 locale, while the same perl with glibc on Linux
> doesn't.

Possibly because the default locale for Red Hat systems is UTF-8 enabled?