Description of problem:
preg_replace with unicode modifier in PHP 7.0.13 with PCRE 8.32 (from Centos 7 updates repo) does not work as expected in the below use case.

Version-Release number of selected component (if applicable):
PHP 7.0.13 has been installed from cPanel (http://www.cpanel.com/) EasyApache 4 (ea-php70-7.0.13-1.1.1.cpanel.x86_64)

# php --version
PHP 7.0.13 (cli) (built: Nov 14 2016 15:24:31) ( NTS )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
    with the ionCube PHP Loader (enabled) + Intrusion Protection from ioncube24.com (unconfigured) v6.0.4, Copyright (c) 2002-2016, by ionCube Ltd.

# yum info pcre-devel
Loaded plugins: fastestmirror, universal-hooks
Loading mirror speeds from cached hostfile
 * EA4: 85.13.201.2
 * base: mirror.vorboss.net
 * extras: mirror.vorboss.net
 * updates: mirror.vorboss.net
Installed Packages
Name        : pcre-devel
Arch        : x86_64
Version     : 8.32
Release     : 15.el7_2.1
Size        : 1.4 M
Repo        : installed
From repo   : updates
Summary     : Development files for pcre
URL         : http://www.pcre.org/
Licence     : BSD
Description : Development files (Headers, libraries for dynamic linking, etc)
            : for pcre.

How reproducible:
Always - 100%

Steps to Reproduce:
1. php -r "var_dump(preg_replace('/[\\x{0000}\\x{200B}-\\x{200D}\\x{FEFF}]|\\r?\\n|\\r/u', '', 'test'));"

Actual results:
string(0) ""

Expected results:
string(4) "test"

Additional info:
Notice: rh-php70 packages in RHSCL 2.3 are also affected.

Another example, run using RHEL / RHSCL official packages:

$ php -r "var_dump(PHP_VERSION, preg_replace('/[^[:print:]]/u', '', 'ČEZ'));"
string(6) "5.4.16"
string(2) "EZ"

$ scl enable rh-php56 bash
$ php -r "var_dump(PHP_VERSION, preg_replace('/[^[:print:]]/u', '', 'ČEZ'));"
string(6) "5.6.25"
string(2) "EZ"

$ scl enable rh-php70 bash
$ php -r "var_dump(PHP_VERSION, preg_replace('/[^[:print:]]/u', '', 'ČEZ'));"
string(6) "7.0.10"
string(2) "EZ"

While with fedora package (pcre 8.39):

$ php -r "var_dump(PHP_VERSION, preg_replace('/[^[:print:]]/u', '', 'ČEZ'));"
string(6) "7.0.13"
string(4) "ČEZ"
(In reply to Remi Collet from comment #1)
> $ php -r "var_dump(PHP_VERSION, preg_replace('/[^[:print:]]/u', '', 'ČEZ'));"
> string(6) "5.4.16"
> string(2) "EZ"
>
If I understand the PHP code correctly, you want to replace all non-printable characters in the "ČEZ" string with an empty string. In other words delete them. And your issue is that PCRE thinks "Č" is not a printable character.

Let me show the problem with pcretest tool:

$ printf '/[[:print:]]/8W\nČ\n' | pcretest
PCRE version 8.32 2012-11-30

  re> data> No match
data>

That looks really bad. And indeed, it works in Fedora:

$ printf '/[[:print:]]/8W\nČ\n' | pcretest
PCRE version 8.39 2016-06-14

  re> data>  0: \x{10c}
data>

I remember I fixed a related bug in matching POSIX classes against non-ASCII characters in Fedora. I will try to locate the fix.
In the end, it was not what I thought. Your issue was fixed upstream between versions 8.33 and 8.34 with this commit:

commit fa3832825e3fe0d49f93658882775cdd6c26129e
Author: ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15>
Date:   Sat Nov 2 18:29:05 2013 +0000

    Update POSIX class handling in UCP mode.

    git-svn-id: svn://vcs.exim.org/pcre/code/trunk@1387 2f5784b3-3f2a-0410-8824-cb99058d5e15

I will try to port it back.
It also requires another commit that fixes it in the JIT implementation:

commit 9885cc24e4771dbe6daadd2107e4552bb92aafa2
Author: zherczeg <zherczeg@2f5784b3-3f2a-0410-8824-cb99058d5e15>
Date:   Fri Nov 15 12:04:55 2013 +0000

    Add support for PT_PXGRAPH, PT_PXPRINT, and PT_PXPUNCT in JIT.

    git-svn-id: svn://vcs.exim.org/pcre/code/trunk@1402 2f5784b3-3f2a-0410-8824-cb99058d5e15
Created attachment 1228557 [details] Fix ported to 8.32, first part
Created attachment 1228571 [details] Fix ported to 8.32, second part
The reported behavior is a documented feature, pcrepattern(3):

    By default, in UTF modes, characters with values greater than 128 do
    not match any of the POSIX character classes. However, if the PCRE_UCP
    option is passed to pcre_compile(), some of the classes are changed so
    that Unicode character properties are used. This is achieved by
    replacing the POSIX classes by other sequences, as follows:

      [:alnum:]  becomes  \p{Xan}
      [:alpha:]  becomes  \p{L}
      [:blank:]  becomes  \h
      [:digit:]  becomes  \p{Nd}
      [:lower:]  becomes  \p{Ll}
      [:space:]  becomes  \p{Xps}
      [:upper:]  becomes  \p{Lu}
      [:word:]   becomes  \p{Xwd}

    Negated versions, such as [:^alpha:] use \P instead of \p. The other
    POSIX classes are unchanged, and match only characters with code points
    less than 128.

The reported case enables the PCRE_UCP option (PHP //u flag, pcretest //W flag) and uses the [:print:] class, which is not on the list. Thus the last sentence applies (The other POSIX classes [...] match only characters with code points less than 128).

But upstream decided that it was a mistake because Perl started to recognize Unicode characters in this case:

$ perl -e 'use utf8; print qq{MATCH\n} if q{Č} =~ /[[:print:]]/'
MATCH

Therefore I believe it makes sense to change the behavior in RHEL-7 too to match RHEL-7 perl.

The attached patches implement the change and amend the documentation:

    Negated versions, such as [:^alpha:] use \P instead of \p. Three other
    POSIX classes are handled specially in UCP mode:

    [:graph:] This matches characters that have glyphs that mark the page
              when printed. In Unicode property terms, it matches all
              characters with the L, M, N, P, S, or Cf properties, except
              for:

                U+061C           Arabic Letter Mark
                U+180E           Mongolian Vowel Separator
                U+2066 - U+2069  Various "isolate"s

    [:print:] This matches the same characters as [:graph:] plus space
              characters that are not controls, that is, characters with
              the Zs property.

    [:punct:] This matches all characters that have the Unicode P
              (punctuation) property, plus those characters whose code
              points are less than 128 that have the S (Symbol) property.

    The other POSIX classes are unchanged, and match only characters with
    code points less than 128.
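The amended [:graph:] and [:print:] rules quoted above can be sketched in Python with the standard unicodedata module. This is only an illustration of the documented property rules, not PCRE's implementation: the helper names ucp_graph, ucp_print and strip_non_print are made up for this example, and Python's Unicode tables may be a different version from the ones bundled with PCRE.

```python
import unicodedata

# Code points the amended pcrepattern(3) text excludes from [:graph:]:
# U+061C, U+180E and U+2066 - U+2069.
GRAPH_EXCEPTIONS = {0x061C, 0x180E, *range(0x2066, 0x206A)}


def ucp_graph(ch):
    """Approximate [:graph:] in UCP mode: L, M, N, P, S or Cf, minus the exceptions."""
    if ord(ch) in GRAPH_EXCEPTIONS:
        return False
    cat = unicodedata.category(ch)  # two-letter general category, e.g. "Lu"
    return cat[0] in "LMNPS" or cat == "Cf"


def ucp_print(ch):
    """Approximate [:print:] in UCP mode: [:graph:] plus Zs (space separators)."""
    return ucp_graph(ch) or unicodedata.category(ch) == "Zs"


def strip_non_print(s):
    """Hypothetical stand-in for preg_replace('/[^[:print:]]/u', '', s) under the new rules."""
    return "".join(ch for ch in s if ucp_print(ch))
```

Under these rules "Č" (U+010C, general category Lu) counts as printable, so strip_non_print("ČEZ") leaves the string unchanged, which is the behavior the Fedora package (pcre 8.39) shows.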
(In reply to kieran from comment #0)
> Description of problem:
> preg_replace with unicode modifier in PHP 7.0.13 with PCRE 8.32 (from Centos
> 7 updates repo) does not work as expected in the below use case.
[...]
> Steps to Reproduce:
> 1. php -r
> "var_dump(preg_replace('/[\\x{0000}\\x{200B}-\\x{200D}\\x{FEFF}]|\\r?\\n|\\r/
> u', '', 'test'));"
>
> Actual results:
> string(0) ""
>
> Expected results:
> string(4) "test"
>
Excuse me, I cannot reproduce your issue with PCRE. If I'm not mistaken, it deletes matching characters from string "test". The matching characters should be one of seven U+0000, U+200B, U+200C, U+200D, U+FEFF or carriage-return or new-line characters. None of these characters occur in the ASCII string "test". And this is exactly how PCRE behaves now:

$ printf '%s\n%s\n' '/[\x{0000}\x{200B}-\x{200D}\x{FEFF}]|\r?\n|\r/8W' 'test' | pcretest
PCRE version 8.32 2012-11-30

  re> data> No match
data>

If PHP behaves differently, then it's some glitch in PHP probably.
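The expected result can be cross-checked with an engine other than PCRE. Below is a minimal sketch using Python's re module, assuming the pattern translates one-to-one into Python escapes; the names INVISIBLE and strip_invisible are hypothetical. It says nothing about PCRE or PHP internals and only shows what the pattern is supposed to match.

```python
import re

# The reported pattern, rewritten with Python escapes: U+0000,
# U+200B - U+200D, U+FEFF, or a CR/LF line ending.
# Python's re engine is not PCRE; this is an independent cross-check only.
INVISIBLE = re.compile(r"[\x00\u200B-\u200D\uFEFF]|\r?\n|\r")


def strip_invisible(s):
    """Delete every match of the pattern, as preg_replace(..., '', s) is expected to."""
    return INVISIBLE.sub("", s)
```

With this sketch, strip_invisible("test") returns "test" unchanged, matching the "Expected results" in the report, while a string such as "te\u200bst\r\n" is reduced to "test".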
(In reply to Petr Pisar from comment #9)
> [...]
> Excuse me, I cannot reproduce your issue with PCRE. If I'm not mistaken, it
> deletes matching characters from string "test". The matching characters
> should be one of seven U+0000, U+200B, U+200C, U+200D, U+FEFF or
> carriage-return or new-line characters. None of these characters occur in
> the ASCII string "test". And this is exactly how PCRE behaves now:
>
> $ printf '%s\n%s\n' '/[\x{0000}\x{200B}-\x{200D}\x{FEFF}]|\r?\n|\r/8W'
> 'test' | pcretest
> PCRE version 8.32 2012-11-30
>
>   re> data> No match
> data>
>
> If PHP behaves differently, then it's some glitch in PHP probably.

I can't replicate using pcretest either, but that's only half the picture as pcretest does not perform any replacement. The issue in my snippet definitely exists under PHP 7.0.13 and the system PCRE implementation (8.32).

http://php.net/manual/en/pcre.installation.php suggests that 8.37+ is used with PHP 7.

The issue doesn't exist if PHP 7.0.13 is compiled with PCRE 8.38 instead of the backport (8.32).
(In reply to kieran from comment #12)
> [...]
> I can't replicate using pcretest either, but that's only half the picture as
> pcretest does not perform any replacement. The issue in my snippet
> definitely exists under PHP 7.0.13 and the system PCRE implementation (8.32).
>
I found it. It's because of JIT. If I request pcretest to use JIT, it matches:

$ printf '%s\n%s\n' '/[\x{0000}\x{200B}-\x{200D}\x{FEFF}]|\r?\n|\r/8W' 'test' | pcretest -s++
PCRE version 8.32 2012-11-30

  re> data>  0: t (JIT)
data>
(In reply to Petr Pisar from comment #13)
> I found it. It's because of JIT. If I request pcretest to use JIT, it
> matches:
>
> $ printf '%s\n%s\n' '/[\x{0000}\x{200B}-\x{200D}\x{FEFF}]|\r?\n|\r/8W'
> 'test' | pcretest -s++
> PCRE version 8.32 2012-11-30
>
>   re> data>  0: t (JIT)
> data>

I cloned this independent issue as bug #1402288.
Created attachment 1228940 [details]
Upstream fix applicable to 8.32, third part

This fixes JIT compilation on 32-bit PowerPC.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1909