Bug 1193524
Summary: | Backport features required by grep: PCRE_ERROR_BADOFFSET, error details in case of PCRE_ERROR_BADUTF8 | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Jaroslav Škarvada <jskarvad> |
Component: | pcre | Assignee: | Petr Pisar <ppisar> |
Status: | CLOSED ERRATA | QA Contact: | Jan Kepler <jkejda> |
Severity: | unspecified | Docs Contact: | Lenka Špačková <lkuprova> |
Priority: | unspecified | ||
Version: | 6.6 | CC: | ant, jkejda, jorton, jskarvad, optak, qe-baseos-apps, tlavigne |
Target Milestone: | rc | Keywords: | FutureFeature, Patch, Reopened |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | pcre-7.8-7.el6 | Doc Type: | Enhancement |
Doc Text: |
Improvements in the PCRE library
To allow the grep utility to recover from PCRE matching failures if the binary file is not a valid UTF-8 sequence, the following features have been backported to the PCRE library:
* The pcre_exec() function now checks for out-of-range starting offset values and reports PCRE_ERROR_BADOFFSET errors instead of reporting PCRE_ERROR_NOMATCH errors or looping infinitely.
* If the pcre_exec() function is called to perform a UTF-8 match on an invalid UTF-8 subject string and the ovector array argument is large enough, the offset of the first subject string in the invalid UTF-8 byte, as well as the detailed reason code, are returned in the ovector array element. In addition, the pcretest utility can now be used to display these details. Note that with this update, the pcre_compile() function reports first invalid UTF-8 byte instead of the last byte. Also note that the signature of the pcre_valid_utf8() function, which is not intended for public use, has been changed. Finally, note that the pcretest utility now appends human-readable error messages to error codes.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2015-07-22 07:06:55 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1193030, 1217080 | ||
Attachments: |
Description
Jaroslav Škarvada
2015-02-17 14:31:39 UTC
Created attachment 992723 [details]
ABI check results
There is no way for rebasing PCRE. There are changes in behaviour, Unicode tables etc. which would disrupt user's expectations. If you pin-point a specific feature or a bug-fix you are interested in, I could look if porting it to the pcre-7.8 is feasible. (In reply to Petr Pisar from comment #2) > There is no way for rebasing PCRE. There are changes in behaviour, Unicode > tables etc. which would disrupt user's expectations. > Understood, changed to backport request. > If you pin-point a specific feature or a bug-fix you are interested in, I > could look if porting it to the pcre-7.8 is feasible. > The following functionality from the PCREAPI [1] spec is required by grep: # PCRE_ERROR_BADUTF8 (-10) # # A string that contains an invalid UTF-8 byte sequence was passed # as a subject, and the PCRE_NO_UTF8_CHECK option was not set. # If the size of the output vector (ovecsize) is at least 2, the byte # offset to the start of the the invalid # UTF-8 character is placed # in the first element, and a reason code is placed in the second element. Feature probably introduced by upstream commit: 253cc5a6c3f231e54aed2d47aed15291f4823e8f (unchecked) Returning two status bytes in ovector which wasn't touched before should be harmless and it shouldn't break backward compatibility. Also backporting the check for the valid pcre_exec offset would be great, and I think it also shouldn't break anything - better to return error than crash, introduced by upstream commit: d0526b89dd4529e725bcb51761b275d35b8b740e [1] http://www.pcre.org/original/doc/html/pcreapi.htm I have no idea what commits you talk about because upstream uses SVN. So there are no hashes. There are many changes regarding checking UTF-8 validity. Could you quote the requested Changelog <http://www.pcre.org/original/changelog.txt> entries? (In reply to Petr Pisar from comment #5) > I have no idea what commits you talk about because upstream uses SVN. So > there are no hashes. > Ups sorry: https://gitorious.org/pcre just noted: Unofficial git mirror of the PCRE SVN repository. > There are many changes regarding checking UTF-8 validity. Could you quote > the requested Changelog <http://www.pcre.org/original/changelog.txt> entries? 1) Probably (not checked to apply the patch) "Pass back detailed info when UTF-8 check fails at runtime." http://vcs.pcre.org/viewvc?view=revision&revision=598 changelog entry: 5. Comprehensive information about what went wrong is now returned by pcre_exec() and pcre_dfa_exec() when the UTF-8 string check fails, as long as the output vector has at least 2 elements. The offset of the start of the failing character and a reason code are placed in the vector. 2) Test for ridiculous values of starting offsets; tidy UTF-8 code. http://vcs.pcre.org/viewvc?view=revision&revision=567 Checked this patch, attaching backported version. changelog entry: 14. Neither pcre_exec() nor pcre_dfa_exec() was checking that the value given as a starting offset was within the subject string. There is now a new error, PCRE_ERROR_BADOFFSET, which is returned if the starting offset is negative or greater than the length of the string. In order to test this, pcretest is extended to allow the setting of negative starting offsets. Created attachment 995159 [details]
Backport of 2) (without tests)
(In reply to Jaroslav Škarvada from comment #7) > Created attachment 995159 [details] > Backport of 2) (without tests) and without docs (In reply to Jaroslav Škarvada from comment #6) > changelog entry: > 5. Comprehensive information about what went wrong is now returned by > pcre_exec() and pcre_dfa_exec() when the UTF-8 string check fails, as > long > as the output vector has at least 2 elements. The offset of the start of > the failing character and a reason code are placed in the vector. Need only the offset information of the first failing character. Created attachment 995509 [details]
Introduction of PCRE_ERROR_BADOFFSET ported to pcre-7.8
Complete patch with documentation and tests. Only relevant parts were ported from upstream.
I have three reservations to the detailed error reporting in case of PCRE_ERROR_BADUTF8 (the request #1): (1) It changes signature of exported symbol _pcre_valid_utf8(). This is function is not intended for public use, but who knows who can use it. I could introduce the new _pcre_valid_utf8() under different name, but I'm not sure if this is worth of it. (2) It changes output of pcretest(1) tool. It adds appends few words to the error messages: re> /xyz/ data> xyz\>4 Error -24 becomes: re> /xyz/ data> xyz\>4 Error -24 (bad offset value) This is a prerequisite for demonstrating the requested detailed reporting: /abc/8 <C3>] -Error -10 +Error -10 (bad UTF-8 string) offset=0 reason=6 I can skip this change, but I don't know how QA people will be happy with testing this feature because they will have to write C tests. On the other hand, with this changed pcretest output, they will maybe need to adjust all other tests that harness the pcretest tool. This could also affect customers. (3) The pcre_compile() changes reporting of bad UTF-8 character from last byte to first byte of the bad character. As far as I know, this fact (first byte) was not documented before, but applications could get used to it ("bad UTF-8 sequence before `%s'"). This is my biggest worry. (In reply to Petr Pisar from comment #11) > I have three reservations to the detailed error reporting in case of > PCRE_ERROR_BADUTF8 (the request #1): > > (1) It changes signature of exported symbol _pcre_valid_utf8(). This is > function is not intended for public use, but who knows who can use it. I > could introduce the new _pcre_valid_utf8() under different name, but I'm not > sure if this is worth of it. > My personal view is that it is OK. ABI according to ABI check stays compatible. If it is not documented for public use, people using it (if there are any) do it on their own risk. > (2) It changes output of pcretest(1) tool. It adds appends few words to the > error messages: > > re> /xyz/ > data> xyz\>4 > Error -24 > > becomes: > re> /xyz/ > data> xyz\>4 > Error -24 (bad offset value) > > This is a prerequisite for demonstrating the requested detailed reporting: > > /abc/8 > <C3>] > -Error -10 > +Error -10 (bad UTF-8 string) offset=0 reason=6 > > I can skip this change, but I don't know how QA people will be happy with > testing this feature because they will have to write C tests. On the other > hand, with this changed pcretest output, they will maybe need to adjust all > other tests that harness the pcretest tool. This could also affect customers. > No idea, in grep we don't need it, let's wait for QA reply. > (3) The pcre_compile() changes reporting of bad UTF-8 character from last > byte to first byte of the bad character. As far as I know, this fact (first > byte) was not documented before, but applications could get used to it ("bad > UTF-8 sequence before `%s'"). This is my biggest worry. > Again if it is not documented, it is not guaranteed and I wouldn't be afraid about it too much. Also I have no problem w\wo this change. We only need error reporting for pcre_exec, so pcre_compile error reporting may stay as is. (In reply to Jaroslav Škarvada from comment #12) > My personal view is that it is OK. ABI according to ABI check stays > compatible. If it is not documented for public use, people using it (if > there are any) do it on their own risk. [...] > Again if it is not documented, it is not guaranteed and I wouldn't be afraid > about it too much. I do not share your optimism. Similar approach lead to canceling zlib rebase because libxml2 used internal zlib functions. Created attachment 995651 [details]
Error details in case off PCRE_ERROR_BADUTF8 ported to pcre-7.8 (maximal variant)
This is the first request ported to pcre-7.8. It introduces all the changes (pcretest tool, pcre_compile() offset, changed private function).
How to test: Introduction of PCRE_ERROR_BADOFFSET feature (1) Call pcre_exec(3) with startoffset argument whose value is out of subject string array. I.e. less then zero and greater then subject string. Before: The call signals no match. After: The call signals PCRE_ERROR_BADOFFSET error. You can use pcretest tool instead. The startoffset is specified in the subject string as \>INTEGER suffix: $ printf '%s\n%s\n' '/.+/' 'abc\>4' | pcretest Before: PCRE version 7.8 2008-09-05 re> data> No match data> After: PCRE version 7.8 2008-09-05 re> data> Error -24 (bad offset value) data> The same happens with negative startoffset: $ printf '%s\n%s\n' '/.+/' 'abc\>-1' | ./pcretest PCRE version 7.8 2008-09-05 re> data> Error -24 (bad offset value) data> Error details in case off PCRE_ERROR_BADUTF8 feature: (1) Pass invalid UTF-8 string as subject to pcre_exec(). Before: The overctor content is undefined. After: The ovector array members will contain detailed error code and offset to the wrong byte. You can use pcretest tool instead: $ printf '%s\n%s\n' '/abc/8' 'a\xc3]' | pcretest Before: PCRE version 7.8 2008-09-05 re> data> Error -10 data> After: PCRE version 7.8 2008-09-05 re> data> Error -10 (bad UTF-8 string) offset=1 reason=6 data> The reason codes are document in pcreapi(3) in section "Reason codes for invalid UTF-8 strings". The only occurrence of _pcre_valid_utf8 I found in RHEL server repository is pcre itself and php-devel-5.3.3-40.el6_6. The php-devel package list the symbol in usr/include/php/main/php_compat.h which just just a definition of an C macro symbol (without any function prototype) which is defined only if PHP is bundles PCRE library. And that's not our case. Even so, the php-devel package does not list the symbol anywhere else. So I believe none of RHEL packages use the private _pcre_valid_utf8() function. All tests PASSED on scratch build pcre-7.8-7.el6. (x86_64 and s390x architectures, rest will be tested in erratum) I have encountered this in our RHEL 6 production environment, and reduced the test case down to: [a1671492@grumpy ~]$ touch test.log [a1671492@grumpy ~]$ gzip test.log [a1671492@grumpy ~]$ grep -P 'test' test.log.gz Aborted (core dumped) Checking the backtrace it appears to trigger on a null start_ptr in frame 2 as described above. (gdb) bt #0 0x0000003083c32625 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x0000003083c33e05 in abort () at abort.c:92 #2 0x0000000000403855 in Pexecute (buf=0x1edf000 "\037\213\b\b\225\201\310S\002\003mod_jk.log.1_2014_07_20_log", size=<value optimized out>, match_size=0x7fff36926740, start_ptr=0x0) at pcresearch.c:160 #3 0x0000000000405211 in do_execute (beg=<value optimized out>, lim=0x1edf031 "") at main.c:973 #4 grepbuf (beg=<value optimized out>, lim=0x1edf031 "") at main.c:1008 #5 0x0000000000405a33 in grep (file=0x7fff36927740 "/data/bb/logs/httpd/mod_jk.log.1_2014_07_20_log.gz", stats=0x619a40) at main.c:1169 #6 grepfile (file=0x7fff36927740 "/data/bb/logs/httpd/mod_jk.log.1_2014_07_20_log.gz", stats=0x619a40) at main.c:1258 #7 0x0000000000405fe9 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:2187 (gdb) frame 2 #2 0x0000000000403855 in Pexecute (buf=0x1edf000 "\037\213\b\b\225\201\310S\002\003mod_jk.log.1_2014_07_20_log", size=<value optimized out>, match_size=0x7fff36926740, start_ptr=0x0) at pcresearch.c:160 160 abort (); Comparibly on RHEL7 we get: [a1671492@ant ~]$ touch test.log [a1671492@ant ~]$ gzip test.log [a1671492@ant ~]$ grep -P 'test' test.log.gz grep: invalid UTF-8 byte sequence in input I also request that you push this backport through to resolve this issue. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2015-1374.html |