Bug 1400267

Summary: PCRE 8.32 fails to recognize non-ASCII printable characters
Product: Red Hat Enterprise Linux 7
Reporter: kieran
Component: pcre
Assignee: Petr Pisar <ppisar>
Status: CLOSED ERRATA
QA Contact: Martin Kyral <mkyral>
Severity: high
Docs Contact: Lenka Špačková <lkuprova>
Priority: unspecified
Version: 7.2
CC: fedora, isenfeld, mkyral, ovasik, ppisar, rcollet
Target Milestone: rc
Keywords: Patch
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: pcre-8.32-17.el7
Doc Type: Release Note
Doc Text:
The PCRE library now correctly recognizes non-ASCII printable characters as required by Unicode

When matching a Unicode string with non-ASCII printable characters using the Perl Compatible Regular Expressions (PCRE) library, the library was previously unable to correctly recognize printable non-ASCII characters. A patch has been applied, and the PCRE library now recognizes printable non-ASCII characters in UTF-8 mode.
Story Points: ---
Clone Of:
: 1402288 1434486
Environment:
Last Closed: 2017-08-01 12:20:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1393865
Attachments:
Description                                    Flags
Fix ported to 8.32, first part                 none
Fix ported to 8.32, second part                none
Upstream fix applicable to 8.32, third part    none

Description kieran 2016-11-30 18:31:06 UTC
Description of problem:
preg_replace with the unicode modifier in PHP 7.0.13 with PCRE 8.32 (from the CentOS 7 updates repo) does not work as expected in the use case below.

Version-Release number of selected component (if applicable):
PHP 7.0.13 has been installed from cPanel (http://www.cpanel.com/) EasyApache 4 (ea-php70-7.0.13-1.1.1.cpanel.x86_64)

# php --version
PHP 7.0.13 (cli) (built: Nov 14 2016 15:24:31) ( NTS )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
    with the ionCube PHP Loader (enabled) + Intrusion Protection from ioncube24.com (unconfigured) v6.0.4, Copyright (c) 2002-2016, by ionCube Ltd.

# yum info pcre-devel
Loaded plugins: fastestmirror, universal-hooks
Loading mirror speeds from cached hostfile
 * EA4: 85.13.201.2
 * base: mirror.vorboss.net
 * extras: mirror.vorboss.net
 * updates: mirror.vorboss.net
Installed Packages
Name        : pcre-devel
Arch        : x86_64
Version     : 8.32
Release     : 15.el7_2.1
Size        : 1.4 M
Repo        : installed
From repo   : updates
Summary     : Development files for pcre
URL         : http://www.pcre.org/
Licence     : BSD
Description : Development files (Headers, libraries for dynamic linking, etc)
            : for pcre.

How reproducible:
Always - 100%

Steps to Reproduce:
1. php -r "var_dump(preg_replace('/[\\x{0000}\\x{200B}-\\x{200D}\\x{FEFF}]|\\r?\\n|\\r/u', '', 'test'));"

Actual results:
string(0) ""

Expected results:
string(4) "test"

Additional info:

Comment 1 Remi Collet 2016-11-30 18:53:47 UTC
Note: rh-php70 packages in RHSCL 2.3 are also affected.

Another example, run using RHEL / RHSCL official packages:

$ php -r "var_dump(PHP_VERSION, preg_replace('/[^[:print:]]/u', '', 'ČEZ'));"
string(6) "5.4.16"
string(2) "EZ"

$ scl enable rh-php56 bash
$ php -r "var_dump(PHP_VERSION, preg_replace('/[^[:print:]]/u', '', 'ČEZ'));"
string(6) "5.6.25"
string(2) "EZ"

$ scl enable rh-php70 bash
$ php -r "var_dump(PHP_VERSION, preg_replace('/[^[:print:]]/u', '', 'ČEZ'));"
string(6) "7.0.10"
string(2) "EZ"

With the Fedora package (pcre 8.39), the result is correct:

$ php -r "var_dump(PHP_VERSION, preg_replace('/[^[:print:]]/u', '', 'ČEZ'));"
string(6) "7.0.13"
string(4) "ČEZ"

Comment 3 Petr Pisar 2016-12-01 16:06:38 UTC
(In reply to Remi Collet from comment #1)
> $ php -r "var_dump(PHP_VERSION, preg_replace('/[^[:print:]]/u', '', 'ČEZ'));"
> string(6) "5.4.16"
> string(2) "EZ"
> 
If I understand the PHP code correctly, you want to replace all non-printable characters in the "ČEZ" string with an empty string. In other words, delete them.

And your issue is that PCRE thinks "Č" is not a printable character. Let me show the problem with the pcretest tool:

$ printf '/[[:print:]]/8W\nČ\n' | pcretest
PCRE version 8.32 2012-11-30

  re> data> No match
data> 

That looks really bad. And indeed, it works in Fedora:

$ printf '/[[:print:]]/8W\nČ\n' | pcretest
PCRE version 8.39 2016-06-14

  re> data>  0: \x{10c}
data> 

I remember I fixed a related bug in matching POSIX classes against non-ASCII characters in Fedora. I will try to locate the fix.

Comment 4 Petr Pisar 2016-12-01 17:00:40 UTC
In the end, it was not what I thought. Your issue was fixed upstream between versions 8.33 and 8.34 by this commit:

commit fa3832825e3fe0d49f93658882775cdd6c26129e
Author: ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15>
Date:   Sat Nov 2 18:29:05 2013 +0000

    Update POSIX class handling in UCP mode.
    
    
    git-svn-id: svn://vcs.exim.org/pcre/code/trunk@1387 2f5784b3-3f2a-0410-8824-cb99058d5e15

I will try to port it back.

Comment 5 Petr Pisar 2016-12-06 13:16:35 UTC
It also requires another commit that fixes it in the JIT implementation:

commit 9885cc24e4771dbe6daadd2107e4552bb92aafa2
Author: zherczeg <zherczeg@2f5784b3-3f2a-0410-8824-cb99058d5e15>
Date:   Fri Nov 15 12:04:55 2013 +0000

    Add support for PT_PXGRAPH, PT_PXPRINT, and PT_PXPUNCT in JIT.
    
    git-svn-id: svn://vcs.exim.org/pcre/code/trunk@1402 2f5784b3-3f2a-0410-8824-cb99058d5e15

Comment 6 Petr Pisar 2016-12-06 14:46:57 UTC
Created attachment 1228557 [details]
Fix ported to 8.32, first part

Comment 7 Petr Pisar 2016-12-06 14:47:27 UTC
Created attachment 1228571 [details]
Fix ported to 8.32, second part

Comment 8 Petr Pisar 2016-12-06 15:05:03 UTC
The reported behavior is a documented feature, pcrepattern(3):

       By  default,  in  UTF modes, characters with values greater than 128 do
       not match any of the POSIX character classes. However, if the  PCRE_UCP
       option  is passed to pcre_compile(), some of the classes are changed so
       that Unicode character properties are used. This is achieved by replac‐
       ing the POSIX classes by other sequences, as follows:

         [:alnum:]  becomes  \p{Xan}
         [:alpha:]  becomes  \p{L}
         [:blank:]  becomes  \h
         [:digit:]  becomes  \p{Nd}
         [:lower:]  becomes  \p{Ll}
         [:space:]  becomes  \p{Xps}
         [:upper:]  becomes  \p{Lu}
         [:word:]   becomes  \p{Xwd}

       Negated  versions,  such  as [:^alpha:] use \P instead of \p. The other
       POSIX classes are unchanged, and match only characters with code points
       less than 128.

The reported case enables the PCRE_UCP option (the /u modifier in PHP, the /W modifier in pcretest) and uses the [:print:] class, which is not on the list. Thus the last sentence applies: the other POSIX classes match only characters with code points less than 128.
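
To make the consequence concrete, here is a minimal Python sketch of the pre-patch rule. It is not PCRE code: the function names are made up for illustration, and it assumes that the ASCII-only [:print:] class covers U+0020 through U+007E, as in the C locale.

```python
def ascii_only_is_print(ch: str) -> bool:
    """Assumed pre-patch [:print:]: printable ASCII only (U+0020..U+007E)."""
    return 0x20 <= ord(ch) <= 0x7E


def strip_non_printable(text: str) -> str:
    """Rough stand-in for preg_replace('/[^[:print:]]/u', '', text) under the old rule."""
    return "".join(ch for ch in text if ascii_only_is_print(ch))


# "\u010c" is Č (LATIN CAPITAL LETTER C WITH CARON), code point above 127.
print(strip_non_printable("\u010cEZ"))  # prints "EZ"
```

Under this rule "Č" is dropped because its code point is above 127, which is consistent with the "EZ" result reported in comment #1.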

But upstream decided that it was a mistake because Perl started to recognize Unicode characters in this case:

$ perl -e 'use utf8; print qq{MATCH\n} if q{Č} =~ /[[:print:]]/'
MATCH

Therefore I believe it makes sense to change the behavior in RHEL-7 as well, so that it matches RHEL-7 perl. The attached patches implement the change and amend the documentation:

       Negated versions, such as [:^alpha:] use \P instead of \p. Three  other
       POSIX classes are handled specially in UCP mode:

       [:graph:] This  matches  characters that have glyphs that mark the page
                 when printed. In Unicode property terms, it matches all char‐
                 acters with the L, M, N, P, S, or Cf properties, except for:

                   U+061C           Arabic Letter Mark
                   U+180E           Mongolian Vowel Separator
                   U+2066 - U+2069  Various "isolate"s

       [:print:] This  matches  the  same  characters  as [:graph:] plus space
                 characters that are not controls, that  is,  characters  with
                 the Zs property.

       [:punct:] This matches all characters that have the Unicode P (punctua‐
                 tion) property, plus those characters whose code  points  are
                 less than 128 that have the S (Symbol) property.

       The  other  POSIX classes are unchanged, and match only characters with
       code points less than 128.
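
The amended definitions can be sketched in Python with the standard unicodedata module. This is only an illustration of the documented rules, not the patched PCRE code: the helper names are hypothetical, and Python's Unicode tables may be a different version from the ones PCRE 8.32 ships with.

```python
import unicodedata

# Code points the documentation excludes from [:graph:] in UCP mode.
GRAPH_EXCEPTIONS = {0x061C, 0x180E, 0x2066, 0x2067, 0x2068, 0x2069}


def is_graph(ch: str) -> bool:
    """L, M, N, P, S or Cf property, minus the listed exceptions."""
    if ord(ch) in GRAPH_EXCEPTIONS:
        return False
    cat = unicodedata.category(ch)
    return cat[0] in "LMNPS" or cat == "Cf"


def is_print(ch: str) -> bool:
    """Same as [:graph:] plus non-control spaces (Zs property)."""
    return is_graph(ch) or unicodedata.category(ch) == "Zs"


def is_punct(ch: str) -> bool:
    """P property, plus S property for code points below 128."""
    cat = unicodedata.category(ch)
    return cat[0] == "P" or (ord(ch) < 128 and cat[0] == "S")


# With these rules every character of "ČEZ" counts as printable.
print(all(is_print(ch) for ch in "\u010cEZ"))  # prints True
```

With these definitions "Č" (category Lu) is printable, so a replacement of /[^[:print:]]/u on "ČEZ" would leave the string unchanged, as it does with pcre 8.39.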

Comment 9 Petr Pisar 2016-12-06 15:17:52 UTC
(In reply to kieran from comment #0)
> Description of problem:
> preg_replace with unicode modifier in PHP 7.0.13 with PCRE 8.32 (from Centos
> 7 updates repo) does not work as expected in the below use case.
[...]
> Steps to Reproduce:
> 1. php -r
> "var_dump(preg_replace('/[\\x{0000}\\x{200B}-\\x{200D}\\x{FEFF}]|\\r?\\n|\\r/
> u', '', 'test'));"
> 
> Actual results:
> string(0) ""
> 
> Expected results:
> string(4) "test"
> 
Excuse me, I cannot reproduce your issue with PCRE. If I'm not mistaken, the code deletes matching characters from the string "test". A matching character must be one of seven: U+0000, U+200B, U+200C, U+200D, U+FEFF, carriage return, or newline. None of these characters occur in the ASCII string "test", so nothing should be deleted. And this is exactly how PCRE behaves now:

$ printf '%s\n%s\n' '/[\x{0000}\x{200B}-\x{200D}\x{FEFF}]|\r?\n|\r/8W' 'test' | pcretest
PCRE version 8.32 2012-11-30

  re> data> No match
data> 

If PHP behaves differently, then it's probably a glitch in PHP.
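
As an independent cross-check of the expected result, the same pattern can be transcribed for Python's re module. This is a sketch with a different regex engine, so it only confirms what the pattern should do, not what PCRE 8.32 does; the \x{...} escapes are rewritten as \x00 and \uXXXX because Python does not accept the PCRE syntax.

```python
import re

# Transcription of /[\x{0000}\x{200B}-\x{200D}\x{FEFF}]|\r?\n|\r/u
PATTERN = re.compile(r"[\x00\u200B-\u200D\uFEFF]|\r?\n|\r")


def strip_invisible(text: str) -> str:
    """Delete NUL, zero-width characters, BOM and line breaks."""
    return PATTERN.sub("", text)


print(repr(strip_invisible("test")))              # prints 'test'
print(repr(strip_invisible("te\u200bst\r\n")))    # prints 'test'
```

A plain ASCII string passes through untouched, and only the listed characters are removed from the second input.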

Comment 12 kieran 2016-12-06 18:27:58 UTC
(In reply to Petr Pisar from comment #9)
> If PHP behaves differently, then it's some glitch in PHP probably.

I can't replicate using pcretest either, but that's only half the picture as pcretest does not perform any replacement. The issue in my snippet definitely exists under PHP 7.0.13 and the system PCRE implementation (8.32).

http://php.net/manual/en/pcre.installation.php suggests that 8.37+ is used with PHP 7

The issue doesn't exist if PHP 7.0.13 is compiled with PCRE 8.38 instead of the backport (8.32)

Comment 13 Petr Pisar 2016-12-07 07:53:36 UTC
(In reply to kieran from comment #12)
> I can't replicate using pcretest either, but that's only half the picture as
> pcretest does not perform any replacement. The issue in my snippet
> definitely exists under PHP 7.0.13 and the system PCRE implementation (8.32).
> 
I found it. It's because of JIT. If I request pcretest to use JIT, it matches:

$ printf '%s\n%s\n' '/[\x{0000}\x{200B}-\x{200D}\x{FEFF}]|\r?\n|\r/8W' 'test' | pcretest -s++
PCRE version 8.32 2012-11-30

  re> data>  0: t (JIT)
data>

Comment 14 Petr Pisar 2016-12-07 07:59:46 UTC
(In reply to Petr Pisar from comment #13)
> I found it. It's because of JIT. If I request pcretest to use JIT, it
> matches:
> 
> $ printf '%s\n%s\n' '/[\x{0000}\x{200B}-\x{200D}\x{FEFF}]|\r?\n|\r/8W'
> 'test' | pcretest -s++
> PCRE version 8.32 2012-11-30
> 
>   re> data>  0: t (JIT)
> data>

I cloned this independent issue as bug #1402288.

Comment 16 Petr Pisar 2016-12-07 09:10:50 UTC
Created attachment 1228940 [details]
Upstream fix applicable to 8.32, third part

This fixes JIT compilation on 32-bit PowerPC.

Comment 23 errata-xmlrpc 2017-08-01 12:20:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1909