Bug 1193524

Summary: Backport features required by grep: PCRE_ERROR_BADOFFSET, error details in case of PCRE_ERROR_BADUTF8
Product: Red Hat Enterprise Linux 6 Reporter: Jaroslav Škarvada <jskarvad>
Component: pcreAssignee: Petr Pisar <ppisar>
Status: CLOSED ERRATA QA Contact: Jan Kepler <jkejda>
Severity: unspecified Docs Contact: Lenka Špačková <lkuprova>
Priority: unspecified    
Version: 6.6CC: ant, jkejda, jorton, jskarvad, optak, qe-baseos-apps, tlavigne
Target Milestone: rcKeywords: FutureFeature, Patch, Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pcre-7.8-7.el6 Doc Type: Enhancement
Doc Text:
Improvements in the PCRE library To allow the grep utility to recover from PCRE matching failures if the binary file is not a valid UTF-8 sequence, the following features have been backported to the PCRE library: * The pcre_exec() function now checks for out-of-range starting offset values and reports PCRE_ERROR_BADOFFSET errors instead of reporting PCRE_ERROR_NOMATCH errors or looping infinitely. * If the pcre_exec() function is called to perform a UTF-8 match on an invalid UTF-8 subject string and the ovector array argument is large enough, the offset of the first subject string in the invalid UTF-8 byte, as well as the detailed reason code, are returned in the ovector array element. In addition, the pcretest utility can now be used to display these details. Note that with this update, the pcre_compile() function reports first invalid UTF-8 byte instead of the last byte. Also note that the signature of the pcre_valid_utf8() function, which is not intended for public use, has been changed. Finally, note that the pcretest utility now appends human-readable error messages to error codes.
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-07-22 07:06:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1193030, 1217080    
Attachments:
Description Flags
ABI check results
none
Backport of 2) (without tests)
none
Introduction of PCRE_ERROR_BADOFFSET ported to pcre-7.8
none
Error details in case off PCRE_ERROR_BADUTF8 ported to pcre-7.8 (maximal variant) none

Description Jaroslav Škarvada 2015-02-17 14:31:39 UTC
Description of problem:
Grep has been recently rebased to 2.20 but there is problem with pcre-7.8-6, i.e. it can enter infinite loop on s390(x) when using pcre matcher.

Version-Release number of selected component (if applicable):
pcre-7.8-6.el6

How reproducible:
Always

Steps to Reproduce:
1. Rebuild grep-2.21 from Fedora 21 on s390(x) machine running RHEL-6

Actual results:
Failure

Expected results:
No failure

Additional info:
It seem pcre-7.8 and pcre-8.13 are ABI compatible, thus such rebase shouldn't break deps. It may be also possible to just backport the fix (I haven't checked).

Comment 1 Jaroslav Škarvada 2015-02-17 14:32:13 UTC
Created attachment 992723 [details]
ABI check results

Comment 2 Petr Pisar 2015-02-17 14:39:04 UTC
There is no way for rebasing PCRE. There are changes in behaviour, Unicode tables etc. which would disrupt user's expectations.

If you pin-point a specific feature or a bug-fix you are interested in, I could look if porting it to the pcre-7.8 is feasible.

Comment 3 Jaroslav Škarvada 2015-02-24 15:01:18 UTC
(In reply to Petr Pisar from comment #2)
> There is no way for rebasing PCRE. There are changes in behaviour, Unicode
> tables etc. which would disrupt user's expectations.
> 
Understood, changed to backport request.

> If you pin-point a specific feature or a bug-fix you are interested in, I
> could look if porting it to the pcre-7.8 is feasible.
>
The following functionality from the PCREAPI [1] spec is required by grep:

# PCRE_ERROR_BADUTF8        (-10)
#
# A string that contains an invalid UTF-8 byte sequence was passed
# as a subject, and the PCRE_NO_UTF8_CHECK option was not set.
# If the size of the output vector (ovecsize) is at least 2, the byte
# offset to the start of the the invalid # UTF-8 character is placed
# in the first element, and a reason code is placed in the second element.

Feature probably introduced by upstream commit: 253cc5a6c3f231e54aed2d47aed15291f4823e8f (unchecked)
Returning two status bytes in ovector which wasn't touched before should be harmless and it shouldn't break backward compatibility.

Also backporting the check for the valid pcre_exec offset would be great, and I think it also shouldn't break anything - better to return error than crash, introduced by upstream commit: d0526b89dd4529e725bcb51761b275d35b8b740e

[1] http://www.pcre.org/original/doc/html/pcreapi.htm

Comment 5 Petr Pisar 2015-02-25 11:56:27 UTC
I have no idea what commits you talk about because upstream uses SVN. So there are no hashes.

There are many changes regarding checking UTF-8 validity. Could you quote the requested Changelog <http://www.pcre.org/original/changelog.txt> entries?

Comment 6 Jaroslav Škarvada 2015-02-25 13:02:11 UTC
(In reply to Petr Pisar from comment #5)
> I have no idea what commits you talk about because upstream uses SVN. So
> there are no hashes.
> 
Ups sorry:
https://gitorious.org/pcre

just noted: 

Unofficial git mirror of the PCRE SVN repository.

> There are many changes regarding checking UTF-8 validity. Could you quote
> the requested Changelog <http://www.pcre.org/original/changelog.txt> entries?

1)
Probably (not checked to apply the patch) "Pass back detailed info when UTF-8 check fails at runtime."
http://vcs.pcre.org/viewvc?view=revision&revision=598

changelog entry:
5.  Comprehensive information about what went wrong is now returned by
    pcre_exec() and pcre_dfa_exec() when the UTF-8 string check fails, as long
    as the output vector has at least 2 elements. The offset of the start of
    the failing character and a reason code are placed in the vector.

2)
Test for ridiculous values of starting offsets; tidy UTF-8 code.
http://vcs.pcre.org/viewvc?view=revision&revision=567
Checked this patch, attaching backported version.

changelog entry:
14. Neither pcre_exec() nor pcre_dfa_exec() was checking that the value given
    as a starting offset was within the subject string. There is now a new
    error, PCRE_ERROR_BADOFFSET, which is returned if the starting offset is
    negative or greater than the length of the string. In order to test this,
    pcretest is extended to allow the setting of negative starting offsets.

Comment 7 Jaroslav Škarvada 2015-02-25 13:05:00 UTC
Created attachment 995159 [details]
Backport of 2) (without tests)

Comment 8 Jaroslav Škarvada 2015-02-25 13:05:46 UTC
(In reply to Jaroslav Škarvada from comment #7)
> Created attachment 995159 [details]
> Backport of 2) (without tests)

and without docs

Comment 9 Jaroslav Škarvada 2015-02-25 13:11:04 UTC
(In reply to Jaroslav Škarvada from comment #6) 
> changelog entry:
> 5.  Comprehensive information about what went wrong is now returned by
>     pcre_exec() and pcre_dfa_exec() when the UTF-8 string check fails, as
> long
>     as the output vector has at least 2 elements. The offset of the start of
>     the failing character and a reason code are placed in the vector.

Need only the offset information of the first failing character.

Comment 10 Petr Pisar 2015-02-26 09:57:02 UTC
Created attachment 995509 [details]
Introduction of PCRE_ERROR_BADOFFSET ported to pcre-7.8

Complete patch with documentation and tests. Only relevant parts were ported from upstream.

Comment 11 Petr Pisar 2015-02-26 10:33:41 UTC
I have three reservations to the detailed error reporting in case of PCRE_ERROR_BADUTF8 (the request #1):

(1) It changes signature of exported symbol _pcre_valid_utf8(). This is function is not intended for public use, but who knows who can use it. I could introduce the new _pcre_valid_utf8() under different name, but I'm not sure if this is worth of it.

(2) It changes output of pcretest(1) tool. It adds appends few words to the error messages:

    re> /xyz/
  data> xyz\>4
  Error -24

becomes:
    re> /xyz/
  data> xyz\>4
  Error -24 (bad offset value)

This is a prerequisite for demonstrating the requested detailed reporting:

 /abc/8
     <C3>]
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=6

I can skip this change, but I don't know how QA people will be happy with testing this feature because they will have to write C tests. On the other hand, with this changed pcretest output, they will maybe need to adjust all other tests that harness the pcretest tool. This could also affect customers.

(3) The pcre_compile() changes reporting of bad UTF-8 character from last byte to first byte of the bad character. As far as I know, this fact (first byte) was not documented before, but applications could get used to it ("bad UTF-8 sequence before `%s'"). This is my biggest worry.

Comment 12 Jaroslav Škarvada 2015-02-26 11:24:24 UTC
(In reply to Petr Pisar from comment #11)
> I have three reservations to the detailed error reporting in case of
> PCRE_ERROR_BADUTF8 (the request #1):
> 
> (1) It changes signature of exported symbol _pcre_valid_utf8(). This is
> function is not intended for public use, but who knows who can use it. I
> could introduce the new _pcre_valid_utf8() under different name, but I'm not
> sure if this is worth of it.
>
My personal view is that it is OK. ABI according to ABI check stays compatible. If it is not documented for public use, people using it (if there are any) do it on their own risk.

> (2) It changes output of pcretest(1) tool. It adds appends few words to the
> error messages:
> 
>     re> /xyz/
>   data> xyz\>4
>   Error -24
> 
> becomes:
>     re> /xyz/
>   data> xyz\>4
>   Error -24 (bad offset value)
> 
> This is a prerequisite for demonstrating the requested detailed reporting:
> 
>  /abc/8
>      <C3>]
> -Error -10
> +Error -10 (bad UTF-8 string) offset=0 reason=6
> 
> I can skip this change, but I don't know how QA people will be happy with
> testing this feature because they will have to write C tests. On the other
> hand, with this changed pcretest output, they will maybe need to adjust all
> other tests that harness the pcretest tool. This could also affect customers.
>
No idea, in grep we don't need it, let's wait for QA reply.
 
> (3) The pcre_compile() changes reporting of bad UTF-8 character from last
> byte to first byte of the bad character. As far as I know, this fact (first
> byte) was not documented before, but applications could get used to it ("bad
> UTF-8 sequence before `%s'"). This is my biggest worry.
>
Again if it is not documented, it is not guaranteed and I wouldn't be afraid about it too much. Also I have no problem w\wo this change. We only need error reporting for pcre_exec, so pcre_compile error reporting may stay as is.

Comment 13 Petr Pisar 2015-02-26 12:08:32 UTC
(In reply to Jaroslav Škarvada from comment #12)
> My personal view is that it is OK. ABI according to ABI check stays
> compatible. If it is not documented for public use, people using it (if
> there are any) do it on their own risk.
[...]
> Again if it is not documented, it is not guaranteed and I wouldn't be afraid
> about it too much.

I do not share your optimism. Similar approach lead to canceling zlib rebase because libxml2 used internal zlib functions.

Comment 14 Petr Pisar 2015-02-26 14:42:44 UTC
Created attachment 995651 [details]
Error details in case off PCRE_ERROR_BADUTF8 ported to pcre-7.8 (maximal variant)

This is the first request ported to pcre-7.8. It introduces all the changes (pcretest tool, pcre_compile() offset, changed private function).

Comment 17 Petr Pisar 2015-02-26 15:15:24 UTC
How to test:

Introduction of PCRE_ERROR_BADOFFSET feature

(1) Call pcre_exec(3) with startoffset argument whose value is out of subject string array. I.e. less then zero and greater then subject string.
Before: The call signals no match.
After: The call signals PCRE_ERROR_BADOFFSET error.

You can use pcretest tool instead. The startoffset is specified in the subject string as \>INTEGER suffix:

$ printf '%s\n%s\n' '/.+/' 'abc\>4' | pcretest
Before:
PCRE version 7.8 2008-09-05

  re> data> No match
data>
After:
PCRE version 7.8 2008-09-05

  re> data> Error -24 (bad offset value)
data> 

The same happens with negative startoffset:

$ printf '%s\n%s\n' '/.+/' 'abc\>-1' | ./pcretest 
PCRE version 7.8 2008-09-05

  re> data> Error -24 (bad offset value)
data> 


Error details in case off PCRE_ERROR_BADUTF8 feature:

(1) Pass invalid UTF-8 string as subject to pcre_exec().
Before: The overctor content is undefined.
After: The ovector array members will contain detailed error code and offset to
the wrong byte.

You can use pcretest tool instead:

$ printf '%s\n%s\n' '/abc/8' 'a\xc3]' | pcretest 
Before:
PCRE version 7.8 2008-09-05

  re> data> Error -10
data> 
After:
PCRE version 7.8 2008-09-05

  re> data> Error -10 (bad UTF-8 string) offset=1 reason=6
data> 

The reason codes are document in pcreapi(3) in section "Reason codes for invalid UTF-8 strings".

Comment 18 Petr Pisar 2015-02-26 16:08:31 UTC
The only occurrence of _pcre_valid_utf8 I found in RHEL server repository is pcre itself and php-devel-5.3.3-40.el6_6.

The php-devel package list the symbol in usr/include/php/main/php_compat.h which just just a definition of an C macro symbol (without any function prototype) which is defined only if PHP is bundles PCRE library. And that's not our case. Even so, the php-devel package does not list the symbol anywhere else.

So I believe none of RHEL packages use the private _pcre_valid_utf8() function.

Comment 19 Jan Kepler 2015-02-27 09:37:32 UTC
All tests PASSED on scratch build pcre-7.8-7.el6.

(x86_64 and s390x architectures, rest will be tested in erratum)

Comment 23 Anthony Symons 2015-07-02 03:47:34 UTC
I have encountered this in our RHEL 6 production environment, and reduced the test case down to:

[a1671492@grumpy ~]$ touch test.log
[a1671492@grumpy ~]$ gzip test.log 
[a1671492@grumpy ~]$ grep -P 'test' test.log.gz 
Aborted (core dumped)

Checking the backtrace it appears to trigger on a null start_ptr in frame 2 as described above.

(gdb) bt
#0  0x0000003083c32625 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003083c33e05 in abort () at abort.c:92
#2  0x0000000000403855 in Pexecute (buf=0x1edf000 "\037\213\b\b\225\201\310S\002\003mod_jk.log.1_2014_07_20_log", 
    size=<value optimized out>, match_size=0x7fff36926740, start_ptr=0x0) at pcresearch.c:160
#3  0x0000000000405211 in do_execute (beg=<value optimized out>, lim=0x1edf031 "") at main.c:973
#4  grepbuf (beg=<value optimized out>, lim=0x1edf031 "") at main.c:1008
#5  0x0000000000405a33 in grep (file=0x7fff36927740 "/data/bb/logs/httpd/mod_jk.log.1_2014_07_20_log.gz", 
    stats=0x619a40) at main.c:1169
#6  grepfile (file=0x7fff36927740 "/data/bb/logs/httpd/mod_jk.log.1_2014_07_20_log.gz", stats=0x619a40)
    at main.c:1258
#7  0x0000000000405fe9 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:2187
(gdb) frame 2
#2  0x0000000000403855 in Pexecute (buf=0x1edf000 "\037\213\b\b\225\201\310S\002\003mod_jk.log.1_2014_07_20_log", 
    size=<value optimized out>, match_size=0x7fff36926740, start_ptr=0x0) at pcresearch.c:160
160		  abort ();

Comparibly on RHEL7 we get:

[a1671492@ant ~]$ touch test.log
[a1671492@ant ~]$ gzip test.log 
[a1671492@ant ~]$ grep -P 'test' test.log.gz
grep: invalid UTF-8 byte sequence in input

I also request that you push this backport through to resolve this issue.

Comment 25 errata-xmlrpc 2015-07-22 07:06:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-1374.html