Bug 1818710

Summary: pcp-atop is crashing due to an uninitialized value within a sort comparison routine
Product: Red Hat Enterprise Linux 7 Reporter: Nitin Kumar Bansal <nbansal>
Component: pcp Assignee: Nathan Scott <nathans>
Status: CLOSED ERRATA QA Contact: Jan Kurik <jkurik>
Severity: high Docs Contact:
Priority: high    
Version: 7.7 CC: agerstmayr, alanm, dbasant, jkurik, mgoodwin, nathans, patrickm
Target Milestone: rc Keywords: Bugfix, Triaged, ZStream
Target Release: 7.9   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: pcp-4.3.2-12 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1851849 Environment:
Last Closed: 2020-09-29 19:25:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1851849    

Comment 7 Nathan Scott 2020-04-03 04:34:28 UTC
OK, thanks Divya - I'll continue to look into it.

Comment 10 Nathan Scott 2020-04-07 04:29:52 UTC
Status: I've been unable to reproduce the problem locally since those earlier changes, which makes finding a fix much more difficult.  Do they/you have any insight into what might trigger the problem?  All working fine here.  :(

Comment 11 Nathan Scott 2020-04-09 07:09:03 UTC
Divya,

Are you absolutely certain your build included that and the other atop fixes from the previous bug?  For me, with all updates applied my one test case that intermittently tripped the issue has not triggered it since.  Our QE folk have also been trying without success to reproduce the problem.  Valgrind is reporting all memory accesses are safe, and after auditing the code again I cannot see a way we'd be able to incorrectly access memory there.

cheers.

Comment 14 Nathan Scott 2020-04-30 02:00:38 UTC
Hi Divya,

In the absence of valgrind output so far, I've audited the code paths in pcp-atop once more today.  I think I can see another set of code paths that could be causing the crashes we've seen: the comparison routines (compcpu, compdsk, etc) can be presented with elements from two differently sized arrays, where the smaller one has NULL'd task pointers.  In this case we'd see a NULL pointer passed into the comparison routine, and we'd crash with segv at the points described.

I've pushed an upstream commit to tackle this aspect (details below) - could you prepare a build for the customer with this and see if it resolves the issue?  Thanks!

commit 9e14d91e012fd2f5b395cb83ba2353a1ec4a7e3f
Author: Nathan Scott <nathans>
Date:   Thu Apr 30 11:56:42 2020 +1000

    pcp-atop: resolve other potential null task pointer dereferences
    
    Additional defensive counter measures in sort routines where we
    could potentially dereference null pointers.  Aiming to tackle a
    customer reported issue, which qa/1080 intermittently reproduces.
    
    Related to Red Hat BZ #1818710.
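
For illustration only, here is a minimal sketch of the kind of defensive check the commit message describes, in a qsort comparison routine over an array of task pointers. It is not the code from the upstream commit. The cut-down struct tstat, and the names compcpu_safe and sort_by_cpu, are assumptions made for this example; only the cpu.stime field and the compcpu name appear in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, cut-down stand-in for atop's struct tstat. */
struct tstat {
    struct {
        long long utime;   /* user-mode CPU time (assumed field) */
        long long stime;   /* system-mode CPU time */
    } cpu;
};

/*
 * Order by descending CPU use.  Each array element is a pointer to a
 * task, so a and b point at pointers.  NULL task pointers are never
 * dereferenced and sort to the end of the array.
 */
static int compcpu_safe(const void *a, const void *b)
{
    const struct tstat *ta = *(const struct tstat * const *)a;
    const struct tstat *tb = *(const struct tstat * const *)b;

    if (ta == NULL || tb == NULL)
        return (ta == NULL) - (tb == NULL);

    long long acpu = ta->cpu.stime + ta->cpu.utime;
    long long bcpu = tb->cpu.stime + tb->cpu.utime;

    /* Returns 1 when a uses less CPU than b, so larger values come first. */
    return (acpu < bcpu) - (acpu > bcpu);
}

/* Hypothetical wrapper: sort n task pointers in place. */
static void sort_by_cpu(struct tstat **list, size_t n)
{
    assert(list != NULL || n == 0);
    if (n > 1)
        qsort(list, n, sizeof(*list), compcpu_safe);
}
```

A guard like this only protects against NULL entries. It cannot detect a pointer that is non-NULL but uninitialized or stale.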

Comment 17 Divya 2020-05-11 07:37:40 UTC
Hello Nathan

Bad news! Even with the recent set of patches included, it still seems to crash at the same point, with the backtrace below:

Program terminated with signal 11, Segmentation fault.
#0  compcpu (a=0x19ac838, b=0x19ac840) at showlinux.c:2045
2045		bcpu = (*(struct tstat **)b)->cpu.stime +
(gdb) bt
#0  compcpu (a=0x19ac838, b=0x19ac840) at showlinux.c:2045
#1  0x00007fb4868dde59 in msort_with_tmp (p=0x7ffd4b0d5730, b=0x19ac838, n=2) at msort.c:83
#2  0x00007fb4868ddbc8 in msort_with_tmp (n=2, b=0x19ac838, p=0x7ffd4b0d5730) at msort.c:45
#3  msort_with_tmp (p=0x7ffd4b0d5730, b=0x19ac830, n=3) at msort.c:54
#4  0x00007fb4868ddbc8 in msort_with_tmp (n=3, b=0x19ac830, p=0x7ffd4b0d5730) at msort.c:45
#5  msort_with_tmp (p=0x7ffd4b0d5730, b=0x19ac820, n=5) at msort.c:54
#6  0x00007fb4868ddbc8 in msort_with_tmp (n=5, b=0x19ac820, p=0x7ffd4b0d5730) at msort.c:45
#7  msort_with_tmp (p=0x7ffd4b0d5730, b=0x19ac7f8, n=10) at msort.c:54
#8  0x00007fb4868ddbc8 in msort_with_tmp (n=10, b=0x19ac7f8, p=0x7ffd4b0d5730) at msort.c:45
#9  msort_with_tmp (p=0x7ffd4b0d5730, b=0x19ac7a8, n=20) at msort.c:54
#10 0x00007fb4868ddbc8 in msort_with_tmp (n=20, b=0x19ac7a8, p=0x7ffd4b0d5730) at msort.c:45
#11 msort_with_tmp (p=0x7ffd4b0d5730, b=0x19ac708, n=40) at msort.c:54
#12 0x00007fb4868ddbc8 in msort_with_tmp (n=40, b=0x19ac708, p=0x7ffd4b0d5730) at msort.c:45
#13 msort_with_tmp (p=0x7ffd4b0d5730, b=0x19ac5d0, n=79) at msort.c:54
#14 0x00007fb4868de14c in msort_with_tmp (n=79, b=0x19ac5d0, p=0x7ffd4b0d5730) at msort.c:45
#15 __GI___qsort_r (b=b@entry=0x19ac5d0, n=n@entry=79, s=s@entry=8, cmp=0x419e30 <compcpu>, arg=arg@entry=0x0) at msort.c:297
#16 0x00007fb4868de1f8 in __GI_qsort (b=b@entry=0x19ac5d0, n=n@entry=79, s=s@entry=8, cmp=<optimized out>) at msort.c:308
#17 0x0000000000413dff in generic_samp (curtime=<optimized out>, nsecs=<optimized out>, devtstat=<optimized out>, sstat=<optimized out>, nexit=<optimized out>, noverflow=<optimized out>, flag=<optimized out>)
    at showgeneric.c:645
#18 0x000000000040396f in engine () at atop.c:684
#19 0x0000000000402fc3 in main (argc=4, argv=<optimized out>) at atop.c:477
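
In the backtrace, qsort is called on 79 elements of 8 bytes each, i.e. an array of pointers, and the fault occurs when compcpu dereferences one of them. One way to check whether NULL entries are reaching qsort at all is to compact the array before sorting and sort only the populated prefix. The sketch below is an assumption-laden illustration, not code from pcp-atop: compact_tasks and the one-field struct tstat are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for atop's struct tstat. */
struct tstat {
    int pid;   /* assumed field, unused by the helper */
};

/*
 * Move all non-NULL task pointers to the front of the array, keeping
 * their relative order, and NULL out the tail.  Returns the number of
 * non-NULL entries, which is the count that is safe to hand to qsort.
 */
static size_t compact_tasks(struct tstat **list, size_t n)
{
    size_t kept = 0;

    assert(list != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        if (list[i] != NULL)
            list[kept++] = list[i];

    for (size_t i = kept; i < n; i++)
        list[i] = NULL;

    return kept;
}
```

If the returned count equals the original element count and the crash still occurs, the offending pointer is non-NULL, which would point toward an uninitialized or stale entry instead.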

Comment 18 Nathan Scott 2020-05-11 07:40:58 UTC
Thanks Divya - I'll keep looking.  :(  Were they able to reproduce it with valgrind?

Comment 19 Divya 2020-05-11 07:44:22 UTC
(In reply to Nathan Scott from comment #18)
> Thanks Divya - I'll keep looking.  :(  Were they able to reproduce it with
> valgrind?

Unfortunately no

Comment 23 Jan Kurik 2020-05-17 07:44:22 UTC
All the regression tests have passed.
Switching to VERIFIED and setting flag SanityOnly as I am unable to reproduce this issue.

Comment 24 Nathan Scott 2020-06-02 02:10:30 UTC
It wasn't mentioned here, which led to some accidental confusion, but there is one other relevant commit (already in 7.9)...

commit c22151f463e3e2494850210444a69948dc0fbdd6
Author: Nathan Scott <nathans>
Date:   Tue May 19 14:50:37 2020 +1000

    pcp-atop: resolve a new task pointer segv qa/1080 has encountered
    
    Related to Red Hat BZ #1818710

Comment 25 Nathan Scott 2020-06-02 02:10:45 UTC
*** Bug 1842480 has been marked as a duplicate of this bug. ***

Comment 34 errata-xmlrpc 2020-09-29 19:25:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: pcp security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3869