Bug 104540

Summary: UTF8 breaks [^\w] regexp matches
Product: [Retired] Red Hat Linux Reporter: Jamie Zawinski <jwz>
Component: perl Assignee: Jason Vas Dias <jvdias>
Status: CLOSED CURRENTRELEASE QA Contact: David Lawrence <dkl>
Severity: medium Docs Contact:
Priority: medium    
Version: 9   
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Fixed In Version: ALL Doc Type: Bug Fix
Last Closed: 2005-11-11 23:43:50 UTC Type: ---

Description Jamie Zawinski 2003-09-16 21:58:52 UTC
If $LANG contains "utf8", then [^\w] fails to match non-word characters in text read from stdin:

      setenv LANG en_US
      echo -n "foo.bar" | \
      perl -e '$_ = <>; print join (" | ", split (/([^\w]+)/)) . "\n";'

            ===> "foo | . | bar" (right)


      setenv LANG en_US.utf8
      echo -n "foo.bar" | \
      perl -e '$_ = <>; print join (" | ", split (/([^\w]+)/)) . "\n";'

            ===> "foo.bar" (wrong!)


It works fine in both cases if you do $_ = "foo.bar" instead of reading
the text from stdin.

    This is perl, v5.8.0 built for i386-linux-thread-multi
    (with 1 registered patch, see perl -V for more detail)

    perl-5.8.0-88
    Red Hat Linux release 9 (Shrike)
    Linux 2.4.20-8smp #1 SMP Thu Mar 13 16:43:01 EST 2003 i686 athlon i386

This may be a duplicate of bug 102106; I can't tell.

Comment 1 Jason Vas Dias 2005-11-11 23:43:50 UTC
Very sorry for the long delay in processing this bug report.
This bug no longer occurs with the perl shipped in any current Red Hat OS release.