Bug 1878754
| Summary: | Segfault in __pmGetArchiveEnd_ctx due to setting 'f' to null and then using it | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Paulo Andrade <pandrade> | |
| Component: | pcp | Assignee: | Nathan Scott <nathans> | |
| Status: | CLOSED ERRATA | QA Contact: | Jan Kurik <jkurik> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 7.6 | CC: | agerstmayr, jijia, jkurik, mgoodwin, nathans, patrickm | |
| Target Milestone: | rc | Keywords: | Bugfix, Triaged, ZStream | |
| Target Release: | --- | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | pcp-4.3.2-13.el7_9 | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1880403 (view as bug list) | Environment: | ||
| Last Closed: | 2020-12-15 11:18:17 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1880403 | |||
Hi Paulo, Do you have a reproducible test case for this problem that we can use for regression testing the fix? (looks like they're invoking pmlogextract there - could we get access to the problematic archive?) Thanks! It's been reproduced and fixed upstream now - commit below - no need for that archive now.
Paulo, this bug has been reported against RHEL7 but there are no future el7 ystream updates planned (i.e. no 7.10) - engineering focus is nowadays on RHEL8.
Does the customer need a zstream update for RHEL7? (urgent? a business case will need to be made)
If not, I'll move this BZ over to RHEL8 and ensure its included in the next available release.
thanks!
commit 9020728d5d0ee7507ab96baf673c4d3c4ece98a5
Author: Ken McDonell <kenj.au>
Date: Thu Sep 17 14:29:38 2020 +1000
src/libpcp/src/logutil.c: refactor __pmGetArchiveEnd_ctx()
GOTO considered helpful.
Rework to deal with customer-reported problem via Nathan. We were
dereferencing a NULL FILE pointer.
The logic is a bit convoluted here, but it is all to do with the
various ways in which an archive can be corrupted. The common code
path is unchanged.
Also qa/1280 (new) reproduced the original problem, and qa/417 was
remade because it deals with corrupted archives and now has slightly
different output for one test case.
Hi Nathan, The cu need a zstream update for RHEL7, let you know. Thanks. Best regards, Jian Jia Removing needinfo, as we were waiting for reproducer data from customer, and Jia Jian already provided the extra information :) Thanks folks - please follow the process for requesting a zstream fix next, and we'll chat again when the BZ clones are created. Paulo - hmm, it looks like you've created a clone of this BZ yourself? That's likely to hinder the process at this stage, unfortunately - you'll want to close that clone in order to progress this fix further. The zstream process is described here: https://mojo.redhat.com/docs/DOC-1021938 The clone(s) are the final stage of approval, and are created automatically by release engineering team folks. There several steps that must be done before that, however (see above doc, it needs to be followed carefully). cheers. Nathan, thanks for letting me know about my error. Jia Jian, can you help us with the justification? I only did check the coredump, and did not directly talk with the customer, but you might add more data before actually making the request. Below is initial data. --- > Business justification (customer impact; how would the defect impact customers’ business? why is it not possible to wait for a fix in Y-Stream?): There is no planned Y-Stream. > How many customers are impacted by this defect? Currently only one known case. > When was the defect introduced in this major version of RHEL? Is it a regression? It is a logical bug on special conditions or corrupted files that lead to a segfault. > What is the minimal patch set to fix this bug? (link, git commit,...) https://github.com/performancecopilot/pcp/commit/9020728d5d0ee7507ab96baf673c4d3c4ece98a5 Request zstream update. Customer is running rhel 7.6 plus updates. > Business justification (customer impact; how would the defect impact customers’ business? why is it not possible to wait for a fix in Y-Stream?): There is no planned Y-Stream. Customer is a hardware vendor and does extensive disk tests using pcp tools before selling their products. > How many customers are impacted by this defect? Currently only one known case. > When was the defect introduced in this major version of RHEL? Is it a regression? It is a logical bug on special conditions or corrupted files that lead to a segfault. > What is the minimal patch set to fix this bug? (link, git commit,...) https://github.com/performancecopilot/pcp/commit/9020728d5d0ee7507ab96baf673c4d3c4ece98a5 Jian, Paulo, I'm having a hard time seeing this BZ as satisfying the requirements of a zstream - single customer, not alot in the way of business justification, and for a relatively obscure situation (corrupt archive) - its unlikely many other customers are going to run into this issue. Also with this being requested for 7.6 that means we need to create and test zstream updates for each subsequent release as well (i.e. this is indirectly a request for zstreams from 7.6 thru 7.9, which is additional work for the PCP dev & QE team), because customers can and do upgrade in which case they still need the fix. Rather than close this request outright - perhaps we can try a different tack and piggyback on other efforts to reduce the costs. We are likely to produce a 7.9 zstream update for an unrelated issue in PCP (although that one also is not yet approved) - could the customer be convinced to move up to 7.9 in order that we can back-port and test a single zstream update with both fixes? thanks. OK, thanks Jian - please keep us posted and I'll move it forward with the other planned BZ for rhel-7.9.z. Note this fix will be included in rhel-8.4.0 as well. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pcp bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5440 |
The segfault is with pcp-4.1.0-4.el7, but reading the code should also happen at least with pcp-4.3.2-6.el7. The problem is in the pattern: found = 0; sts = PM_ERR_LOGREC; /* default error condition */ f = NULL; for (vol = lcp->l_maxvol; vol >= lcp->l_minvol; vol--) { ... /* * failure at the physical end of file may be related to a truncted * block flush for a growing archive. Scan temporal index, and use * last entry at or before end of physical file for this volume */ logend = (int)sizeof(__pmLogLabel) + 2*(int)sizeof(int); for (i = lcp->l_numti - 1; i >= 0; i--) { if (lcp->l_ti[i].ti_vol != vol) { if (f != acp->ac_mfp) { __pmFclose(f); f = NULL; } continue; } if (lcp->l_ti[i].ti_log <= physend) { logend = lcp->l_ti[i].ti_log; break; } } ... /* * Now chase it forwards from the last index entry ... * * BUG 357003 - pmchart can't read archive file * turns out the index may point to the _end_ of the last * valid record, so if not at start of volume, back up one * record, then scan forwards. */ __pmFseek(f, (long)logend, SEEK_SET); ... Above it is clear the 'continue' in 'for (i = lcp->l_numti - 1; i >= 0; i--)' might have set 'f' to NULL, and because it continues only the inner for loop, it will call __pmFseek with a NULL pointer. Maybe this condition can only happen with pcp-4.1.0-4.el7, or with corrupted data, but still, it should have some guard against this NULL pointer dereference.