Description of problem:
perl incorrectly classifies some alphanumeric characters when setlocale(LC_ALL, "") (i.e. "use locale;") is in effect.

Version-Release number of selected component (if applicable):
glibc-2.3.5-10
perl-5.8.6-15

How reproducible:
100%

Steps to Reproduce:
1. Create a /tmp/test.pl script with the following contents:
----------------
#!/usr/bin/perl -w -C

use strict;
use utf8;
use locale;
use Encode qw(decode);

# U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)
my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
$str =~ s/\W/-/g;
print $str, "\n";
----------------
2. chmod +x /tmp/test.pl
3. LC_ALL=en_US.UTF-8 /tmp/test.pl

Actual results:
A hyphen and the capital C with caron are printed.

Expected results:
A with acute and C with caron should be printed, as both are letters and therefore should not match the \W character class.

Additional info:
When the "use locale;" line is commented out, the script works correctly, so I think it is a locale bug. I have tried characters other than the two above - all the letters used in the Czech language. The incorrectly classified ones are those with an acute accent (i.e. those which are present in ISO-8859-1, and which have Unicode values between U+0080 and U+00FF). The other letters (US-ASCII, and letters with a caron) are classified correctly by Perl even with "use locale;".

I am actually not sure whether this is a Perl bug or a glibc bug. However, the same perl-5.8.6 on FreeBSD 5.4 works correctly, and the same perl-5.8.6 on SUSE SLES9 doesn't. The FC3 system is also buggy (glibc-2.3.5-0.fc3.1 and perl-5.8.5-14.FC3). So I guess this is a glibc or locale bug rather than a Perl bug.
I don't think this looks like a glibc or locale bug, given that the following program works:

----------------
#include <locale.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wctype.h>

int
main (void)
{
  setlocale (LC_ALL, "en_US.UTF-8");
  printf ("%d\n", (int) iswalpha (L'\xc1'));
  printf ("%d\n", (int) iswalpha (L'\x10c'));
  regex_t re;
  if (regcomp (&re, "\\W", REG_EXTENDED) != 0)
    abort ();
  regmatch_t rm;
  const char *str = "abc\xc3\x81\xc4\x8c ";
  if (regexec (&re, str, 1, &rm, 0) != 0)
    abort ();
  if (rm.rm_so != strlen (str) - 1 || rm.rm_eo != rm.rm_so + 1)
    abort ();
  return 0;
}
----------------

(Note: <wctype.h> is needed for iswalpha.)
One has to carefully analyse the perl man pages to find out what is going on here. I don't particularly agree with the way the upstream perl maintainers have done this, but this is not a bug - it is the way perl is meant to behave.

The point is that /\w/ matches any ASCII word character, and /\W/ matches any ASCII non-word character. To match a UTF-8 word character, you have to use \p{IsWord}. The \w class is a synonym for the POSIX character class [:word:].

So this version of your program:
---
#!/usr/bin/perl -w -C

use strict;
use utf8;
use locale;
use Encode qw(decode);

# U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)
my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
print 'Is UTF-8:', utf8::is_utf8($str),
      ' is word:', $str =~ /^\w+$/,
      ' is UTF-8 word:', $str =~ /^\p{IsWord}+$/,
      ' str:', $str, "\n";
Well, then tell me why the same Perl works on FreeBSD. Is it that Linux UTF-8 locales do not consider U+00C1 to be a letter, while U+010C is one?

Your statement that "\w matches any ASCII word char" is not true. See perlre(1):

    [...] If "use locale" is in effect, the list of alphabetic
    characters generated by "\w" is taken from the current locale.
    See perllocale. [...]

So the question is why Perl (or libc) on FreeBSD does consider U+00C1 to be a letter under the UTF-8 locale, while the same perl with glibc on Linux doesn't.
Sorry, I submitted my previous comment before finishing it - then my machine rebooted (that's another story). As I was saying in Comment #2:

This version of your program shows the issue:
---
#!/usr/bin/perl -w -C

use strict;
use utf8;
use locale;
use Encode qw(decode);

# U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)
my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
print 'Is UTF-8:', utf8::is_utf8($str),
      ' is word:', $str =~ /^\w+$/,
      ' is UTF-8 word: ', $str =~ /^\p{IsWord}+$/,
      ' str:', $str, "\n";
---

With the "en_US.UTF-8" locale in effect (the default on Red Hat systems) this prints:

$ ./test.pl
Is UTF-8:1 is word: is posix word: 1 is UTF-8 word: 1 str:ÁČ

The point is that \p{IsWord} or [[:word:]] matches UTF-8 word characters, while \w / \W do not. As the perlre man page states:

    The following equivalences to Unicode \p{} constructs and
    equivalent backslash character classes (if available), will hold:

        [:...:]    \p{...}    backslash
        ...
        word       IsWord
        ...

i.e. the [:word:] / \p{IsWord} classes are NOT equivalent to \w.

As I said, I don't particularly agree with the way the upstream perl developers have done this, but this is intended behaviour.

RE: your comment #3:

> Your statement that "\w matches any ASCII word char" is not true.
> See perlre(1):
> [...] If "use locale" is in effect, the list of alphabetic characters
> generated by "\w" is taken from the current locale.

Yes, that's alphabetic characters, not Unicode sequences. To match Unicode sequences with the word class, you must use \p{IsWord} or [:word:].

> So the question is why Perl (or libc) in FreeBSD does consider U+00C1 to be a
> character under the UTF-8 locale, while the same perl with glibc on Linux
> doesn't.

Possibly because the default locale for Red Hat systems is UTF-8 enabled?