Bug 1373590

Summary: pcp-atop killed by SIGFPE
Product: Red Hat Enterprise Linux 6
Reporter: Deepu K S <dkochuka>
Component: pcp
Assignee: Nathan Scott <nathans>
Status: CLOSED ERRATA
QA Contact: Miloš Prchlík <mprchlik>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 6.8
CC: brolley, dkochuka, fche, lberk, mbenitez, mcermak, mgoodwin, mprchlik
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-21 11:20:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description                                          Flags
  ABRT captured problem directory (coredump included)  none
  pcp atop log                                         none

Description Deepu K S 2016-09-06 16:29:21 UTC
Description of problem:
Process /usr/libexec/pcp/bin/pcp-atop was killed by signal 8 (SIGFPE)

It looks like pcp-atop crashed on a divide-by-zero: sstat->cpu.nrcpu is 0 at showlinux.c:1241, as the gdb session below shows.

Core was generated by `/usr/libexec/pcp/bin/pcp-atop'.
Program terminated with signal 8, Arithmetic exception.
#0  0x0000000000417ef0 in prisyst (sstat=0x83d5e0, curline=2, nsecs=1, avgval=0, fixedhead=0, selp=0x63bd20, highorderp=0x7fff3f7e590e "C", maxcpulines=999, maxdsklines=999, 
    maxmddlines=999, maxlvmlines=999, maxintlines=999, maxnfslines=999, maxcontlines=999) at showlinux.c:1241
1241	        extra.percputot = extra.cputot / sstat->cpu.nrcpu;
(gdb) bt
#0  0x0000000000417ef0 in prisyst (sstat=0x83d5e0, curline=2, nsecs=1, avgval=0, fixedhead=0, selp=0x63bd20, highorderp=0x7fff3f7e590e "C", maxcpulines=999, maxdsklines=999, 
    maxmddlines=999, maxlvmlines=999, maxintlines=999, maxnfslines=999, maxcontlines=999) at showlinux.c:1241
#1  0x00000000004141ef in generic_samp (curtime=1472711047.4024861, delta=1.0000000000000291, sstat=0x83d5e0, tstat=0x83c210, proclist=0x83c350, ndeviat=0, ntask=0, nactproc=0, 
    totproc=0, totrun=0, totslpi=0, totslpu=0, totzomb=0, nexit=0, noverflow=0, flags=1) at showgeneric.c:294
#2  0x000000000040535d in engine () at atop.c:671
#3  0x00000000004056f9 in main (argc=1, argv=<value optimized out>) at atop.c:449
(gdb) f 0
#0  0x0000000000417ef0 in prisyst (sstat=0x83d5e0, curline=2, nsecs=1, avgval=0, fixedhead=0, selp=0x63bd20, highorderp=0x7fff3f7e590e "C", maxcpulines=999, maxdsklines=999, 
    maxmddlines=999, maxlvmlines=999, maxintlines=999, maxnfslines=999, maxcontlines=999) at showlinux.c:1241
1241	        extra.percputot = extra.cputot / sstat->cpu.nrcpu;
(gdb) l
1236	        }
1237	
1238	        if (extra.cputot == 0)
1239	                extra.cputot = 1;             /* avoid divide-by-zero */
1240	
1241	        extra.percputot = extra.cputot / sstat->cpu.nrcpu;
1242	
1243	        if (extra.percputot == 0)
1244	                extra.percputot = 1;          /* avoid divide-by-zero */
1245	
(gdb) p *sstat
$1 = {stamp = {tv_sec = 0, tv_usec = 0}, cpu = {nrcpu = 0, devint = 0, csw = 0, nprocs = 0, lavg1 = -1, lavg5 = -1, lavg15 = -1, all = {cpunr = 0, stime = 0, utime = 0, ntime = 0, 
      itime = 0, wtime = 0, Itime = 0, Stime = 0, steal = 0, guest = 0, freqcnt = {maxfreq = 0, cnt = 0, ticks = 0}}, cpu = 0x83dc10}, mem = {physmem = 0, freemem = 0, buffermem = 0, 
    slabmem = 0, cachemem = 0, cachedrt = 0, totswap = 0, freeswap = 0, pgscans = 0, pgsteal = 0, allocstall = 0, swouts = 0, swins = 0, commitlim = 0, committed = 0, shmem = 0, 
    shmrss = 0, shmswp = 0, slabreclaim = 0, tothugepage = 0, freehugepage = 0, hugepagesz = 0, vmwballoon = 0}, net = {ipv4 = {Forwarding = 0, DefaultTTL = 0, InReceives = 0, 
      InHdrErrors = 0, InAddrErrors = 0, ForwDatagrams = 0, InUnknownProtos = 0, InDiscards = 0, InDelivers = 0, OutRequests = 0, OutDiscards = 0, OutNoRoutes = 0, ReasmTimeout = 0, 
      ReasmReqds = 0, ReasmOKs = 0, ReasmFails = 0, FragOKs = 0, FragFails = 0, FragCreates = 0}, icmpv4 = {InMsgs = 0, InErrors = 0, InDestUnreachs = 0, InTimeExcds = 0, 
      InParmProbs = 0, InSrcQuenchs = 0, InRedirects = 0, InEchos = 0, InEchoReps = 0, InTimestamps = 0, InTimestampReps = 0, InAddrMasks = 0, InAddrMaskReps = 0, OutMsgs = 0, 
      OutErrors = 0, OutDestUnreachs = 0, OutTimeExcds = 0, OutParmProbs = 0, OutSrcQuenchs = 0, OutRedirects = 0, OutEchos = 0, OutEchoReps = 0, OutTimestamps = 0, 
      OutTimestampReps = 0, OutAddrMasks = 0, OutAddrMaskReps = 0}, udpv4 = {InDatagrams = 0, NoPorts = 0, InErrors = 0, OutDatagrams = 0}, ipv6 = {Ip6InReceives = 0, 
      Ip6InHdrErrors = 0, Ip6InTooBigErrors = 0, Ip6InNoRoutes = 0, Ip6InAddrErrors = 0, Ip6InUnknownProtos = 0, Ip6InTruncatedPkts = 0, Ip6InDiscards = 0, Ip6InDelivers = 0, 
      Ip6OutForwDatagrams = 0, Ip6OutRequests = 0, Ip6OutDiscards = 0, Ip6OutNoRoutes = 0, Ip6ReasmTimeout = 0, Ip6ReasmReqds = 0, Ip6ReasmOKs = 0, Ip6ReasmFails = 0, Ip6FragOKs = 0, 
      Ip6FragFails = 0, Ip6FragCreates = 0, Ip6InMcastPkts = 0, Ip6OutMcastPkts = 0}, icmpv6 = {Icmp6InMsgs = 0, Icmp6InErrors = 0, Icmp6InDestUnreachs = 0, Icmp6InPktTooBigs = 0, 
      Icmp6InTimeExcds = 0, Icmp6InParmProblems = 0, Icmp6InEchos = 0, Icmp6InEchoReplies = 0, Icmp6InGroupMembQueries = 0, Icmp6InGroupMembResponses = 0, 
      Icmp6InGroupMembReductions = 0, Icmp6InRouterSolicits = 0, Icmp6InRouterAdvertisements = 0, Icmp6InNeighborSolicits = 0, Icmp6InNeighborAdvertisements = 0, Icmp6InRedirects = 0, 
      Icmp6OutMsgs = 0, Icmp6OutDestUnreachs = 0, Icmp6OutPktTooBigs = 0, Icmp6OutTimeExcds = 0, Icmp6OutParmProblems = 0, Icmp6OutEchoReplies = 0, Icmp6OutRouterSolicits = 0, 
      Icmp6OutNeighborSolicits = 0, Icmp6OutNeighborAdvertisements = 0, Icmp6OutRedirects = 0, Icmp6OutGroupMembResponses = 0, Icmp6OutGroupMembReductions = 0}, udpv6 = {
      Udp6InDatagrams = 0, Udp6NoPorts = 0, Udp6InErrors = 0, Udp6OutDatagrams = 0}, tcp = {RtoAlgorithm = 0, RtoMin = 0, RtoMax = 0, MaxConn = 0, ActiveOpens = 0, PassiveOpens = 0, 
      AttemptFails = 0, EstabResets = 0, CurrEstab = 0, InSegs = 0, OutSegs = 0, RetransSegs = 0, InErrs = 0, OutRsts = 0}}, intf = {nrintf = 0, intf = 0x83dc80}, dsk = {ndsk = 0, 
    nmdd = 0, nlvm = 0, dsk = 0x83dd40, mdd = 0x83de00, lvm = 0x83dda0}, nfs = {server = {netcnt = 0, netudpcnt = 0, nettcpcnt = 0, nettcpcon = 0, rpccnt = 0, rpcbadfmt = 0, 
      rpcbadaut = 0, rpcbadcln = 0, rpcread = 0, rpcwrite = 0, rchits = 0, rcmiss = 0, rcnoca = 0, nrbytes = 0, nwbytes = 0}, client = {rpccnt = 0, rpcretrans = 0, rpcautrefresh = 0, 
      rpcread = 0, rpcwrite = 0}, nrmounts = 0, nfsmnt = 0x83de60}, cfs = {nrcontainer = 0, cont = 0x0}, www = {accesses = 0, totkbytes = 0, uptime = 0, bworkers = 0, iworkers = 0}}
(gdb) p sstat->cpu.nrcpu
$2 = 0
(gdb)
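
For illustration only: the listing above guards cputot and percputot against zero, but not the divisor nrcpu. The following is a minimal sketch of guarding the divisor as well. It is not the upstream fix, which is not shown in this report. The helper name per_cpu_total and the count_t typedef are assumptions made for the example, not names confirmed by the source.

```c
/* Stand-in for the counter type used in the struct dump (assumption). */
typedef long long count_t;

/*
 * Hypothetical helper: return the per-CPU share of cputot.
 * The result is never zero, so it is safe to use as a divisor later.
 * If nrcpu is zero or negative (as in the core dump, where no CPU
 * metrics were available), fall back to treating it as one CPU.
 */
static count_t per_cpu_total(count_t cputot, count_t nrcpu)
{
    count_t percputot;

    if (cputot == 0)
        cputot = 1;             /* avoid divide-by-zero later on */

    if (nrcpu <= 0)
        nrcpu = 1;              /* the guard missing at line 1241 */

    percputot = cputot / nrcpu;

    if (percputot == 0)
        percputot = 1;          /* avoid divide-by-zero later on */

    return percputot;
}
```

Falling back to one CPU only prevents the SIGFPE. The displayed numbers would still be meaningless when no metrics were fetched, so reporting the fetch failure to the user is arguably the better remedy.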
Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux 6.8
pcp-3.10.9-6.el6.x86_64
pcp-system-tools-3.10.9-6.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Run:
   # pcp atop

2. The command exits immediately with a "Floating point exception" error.


Actual results:
# cat var_log_messages 
Aug 30 10:31:10 lbnss22 kernel: pcp-atop[11688] trap divide error ip:417ef0 sp:7fff003a84a0 error:0 in pcp-atop[400000+34000]
Aug 30 10:33:44 lbnss22 kernel: pcp-atop[19345] trap divide error ip:417ef0 sp:7fffe01ac290 error:0 in pcp-atop[400000+34000]
Aug 30 10:36:21 lbnss22 kernel: pcp-atop[22592] trap divide error ip:417ef0 sp:7fffc225a930 error:0 in pcp-atop[400000+34000]
Aug 30 10:39:18 lbnss22 kernel: pcp-atop[25979] trap divide error ip:417ef0 sp:7ffe930e49d0 error:0 in pcp-atop[400000+34000]
Aug 30 11:20:37 lbnss22 kernel: pcp-atop[16194] trap divide error ip:417ef0 sp:7ffd64137240 error:0 in pcp-atop[400000+34000]
Sep  1 08:23:21 lbnss22 kernel: pcp-atop[13381] trap divide error ip:417ef0 sp:7ffc1f575630 error:0 in pcp-atop[400000+34000]
Sep  1 08:24:07 lbnss22 kernel: pcp-atop[14493] trap divide error ip:417ef0 sp:7fff3f7e5550 error:0 in pcp-atop[400000+34000]
Sep  1 08:24:07 lbnss22 abrt[14506]: Saved core dump of pid 14493 (/usr/libexec/pcp/bin/pcp-atop) to /var/spool/abrt/ccpp-2016-09-01-08:24:07-14493 (1458176 bytes)


Expected results:
No crashes.

Additional info:

Comment 1 Deepu K S 2016-09-06 16:33:21 UTC
Created attachment 1198357 [details]
ABRT captured problem directory (coredump included)

Comment 3 Nathan Scott 2016-09-08 01:43:12 UTC
Hi Deepu,

What does

$ pminfo -f hinv.ncpu

report on this system?  (I'm expecting some kind of error, just curious as to which one)

So far, I've been unable to reproduce the problem locally (with/without pmcd running, with/without pmdalinux running).

Thanks!

Comment 4 Frank Ch. Eigler 2016-09-12 13:49:23 UTC
(In reply to Deepu K S from comment #0)
> Description of problem:
> Process /usr/libexec/pcp/bin/pcp-atop was killed by signal 8 (SIGFPE)
> It looks like pcp-atop crashed due to a divide by zero condition 

Were you able to collect $PCP_DEBUG level traces?
% env PCP_DEBUG=2 pcp atop  2>/tmp/LOGFILE

Comment 5 Deepu K S 2016-09-12 14:37:05 UTC
(In reply to Nathan Scott from comment #3)
> Hi Deepu,
> 
> What does
> 
> $ pminfo -f hinv.ncpu
> 
> report on this system?  (I'm expecting some kind of error, just curious as
> to which one)
> 
> So far, I've been unable to reproduce the problem locally (with/without pmcd
> running, with/without pmdalinux running).
> 
> Thanks!

Sorry for the delay. I now have the output collected.

# pminfo -f hinv.ncpu
hinv.ncpu: pmLookupDesc: No PMCD agent for domain of request

# service pmcd status
Checking for pmcd: running


The output of

# env PCP_DEBUG=10 pcp atop 2>pcp-atop.log

is attached.

The crash happens every time the command is run, and it happens immediately.

Most lines in the log file show:
  PM_ID_NULL (<noname>): No PMCD agent for domain of request

Comment 6 Deepu K S 2016-09-12 14:38:28 UTC
Created attachment 1200232 [details]
pcp atop log

Comment 7 Frank Ch. Eigler 2016-09-12 14:52:00 UTC
pmFetch returns ...
pmResult dump from 0x83c2e0 timestamp: 1473338279.564024 14:37:59.564 numpmid: 11
  PM_ID_NULL (<noname>): No PMCD agent for domain of request
  PM_ID_NULL (<noname>): No PMCD agent for domain of request
  PM_ID_NULL (<noname>): No PMCD agent for domain of request


Oh, dear.  That suggests that pmdalinux and/or pmdaproc crashed or were taken out of service, and that automatic restarting (if any) was not successful.  (What version of PCP was this?)  A

 # service pmcd restart

should bring them back to life.  It is a bug in pcp-atop that it fails to report the problem and advise the user.
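
A rough sketch of the kind of check that would let the tool report this condition instead of crashing. It uses simplified, hypothetical stand-in structs (metric_values, fetch_result) rather than the real PCP types, and the function names are invented for the example. The one assumption borrowed from PCP is that a negative numval in a value set carries an error code rather than a value count.

```c
#include <stdio.h>

/* Hypothetical, simplified stand-ins for a fetch result; not the PCP types. */
struct metric_values {
    int numval;                     /* < 0 means an error code, not a count */
};

struct fetch_result {
    int numpmid;                    /* number of metrics requested */
    struct metric_values **vset;    /* one entry per requested metric */
};

/* Count the metrics whose value set is missing or carries an error. */
static int count_failed_metrics(const struct fetch_result *res)
{
    int i, failed = 0;

    for (i = 0; i < res->numpmid; i++)
        if (res->vset[i] == NULL || res->vset[i]->numval < 0)
            failed++;
    return failed;
}

/*
 * Return 0 if at least one metric was fetched (or none were requested).
 * If every metric failed, print advice to err and return -1 so the
 * caller can stop before doing arithmetic on all-zero statistics.
 */
static int check_fetch(const struct fetch_result *res, FILE *err)
{
    int failed = count_failed_metrics(res);

    if (res->numpmid > 0 && failed == res->numpmid) {
        fprintf(err, "none of the %d requested metrics could be fetched; "
                     "check that the agents are running "
                     "(try: service pmcd restart)\n", res->numpmid);
        return -1;
    }
    return 0;
}
```

The sketch treats only a total failure as fatal, on the assumption that a few unavailable metrics can be tolerated. A real tool might want a stricter policy for metrics it cannot do without, such as the CPU count.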

Comment 8 Nathan Scott 2016-09-12 22:44:41 UTC
Thanks Deepu, I understand what's happening now and know how to reproduce it. A fix will follow shortly.

Comment 9 Nathan Scott 2016-09-26 03:38:17 UTC
This is fixed in upstream PCP via git commit 7157edb93 and will make its way into the next available RHEL6 PCP update from there.

Comment 12 Miloš Prchlík 2017-01-18 08:26:51 UTC
Verified with build pcp-3.10.9-8.el6.

Comment 14 errata-xmlrpc 2017-03-21 11:20:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0735.html