Bug 858384

Summary: pmcd segv during linux-pmda query
Product: [Fedora] Fedora
Reporter: Frank Ch. Eigler <fche>
Component: pcp
Assignee: Nathan Scott <nscott>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 16
CC: fche, mgoodwin, nathans
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2012-12-20 10:14:04 EST
Type: Bug

Description Frank Ch. Eigler 2012-09-18 16:16:42 EDT

# service pmcd start
# gdb .../pmcd $PID
(gdb) continue

---- meanwhile, from another window ----

% pmval kernel.pernode.cpu.sys

pmval: pmGetInDom(60.19): Timeout waiting for a response from PMCD

---- pmcd suffers segv ----

Program received signal SIGSEGV, Segmentation fault.
linux_table_scan (fp=0x2b01f0140d00, table=0x0) at linux_table.c:80
80		for (t=table; t->field; t++) {
(gdb) bt
#0  linux_table_scan (fp=0x2b01f0140d00, table=0x0) at linux_table.c:80
#1  0x00002b01f0bca715 in refresh_numa_meminfo (numa_meminfo=0x2b01f0dd6fb0)
    at numa_meminfo.c:121
#2  0x00002b01f0bc1bf4 in linux_refresh (pmda=0x2b01f013f6d0, 
    need_refresh=0x7fff6b2b9a80) at pmda.c:3787
#3  0x00002b01f0bc1f6d in linux_instance (indom=251658259, inst=-1, name=0x0, 
    result=0x7fff6b2b9bb0, pmda=<optimized out>) at pmda.c:3911
#4  0x00002b01eed90bc9 in DoInstance (cp=0x2b01f0140810, pb=0x2b01f0141000)
    at dopdus.c:315
#5  0x00002b01eed89e9c in HandleClientInput (fdsPtr=0x7fff6b2b9cc0) at pmcd.c:495
#6  0x00002b01eed887f1 in ClientLoop () at pmcd.c:869
#7  main (argc=<optimized out>, argv=<optimized out>) at pmcd.c:1161

(gdb) frame 2
(gdb) p numa_meminfo->node_info[0]
$3 = {meminfo = 0x0, memstat = 0x2b8724958f80}
(gdb) p numa_meminfo->node_info[0]->memstat
$4 = (struct linux_table *) 0x2b8724958f80
(gdb) p *numa_meminfo->node_info[0]->memstat
$5 = {field = 0x0, maxval = 97, val = 47859434300912, this = 47859414894504, 
  prev = 47859434420404, field_len = 613841155, valid = 11143}

You see the null meminfo ptr that causes the segv. The memstat also seems to be trash.  It's as though the struct just wasn't initialized.
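
For illustration only, here is a minimal sketch of how a table walk of the same shape as linux_table.c:80 could fail softly when handed a NULL table instead of dereferencing it. The struct and the function name table_lookup are hypothetical stand-ins, not the PMDA's real definitions; only the NULL-terminated `field` sentinel is taken from the gdb output above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down table entry: only the `field` sentinel comes
 * from the gdb dump; the real struct linux_table has more members. */
struct table_entry {
    const char *field;      /* NULL terminates the table */
    unsigned long val;
};

/*
 * Look up `name` in `table`.  Returns 0 and stores the value on success,
 * -1 if the table was never set up or the name is absent.
 */
int table_lookup(const struct table_entry *table, const char *name,
                 unsigned long *out)
{
    const struct table_entry *t;

    assert(name != NULL && out != NULL);    /* caller bugs, not runtime state */
    if (table == NULL)                      /* uninitialised table: fail softly */
        return -1;
    for (t = table; t->field; t++) {        /* same loop shape as line 80 */
        if (strcmp(t->field, name) == 0) {
            *out = t->val;
            return 0;
        }
    }
    return -1;
}
```

A check like this would only turn the segv into an error return; it would not explain why node_info[0] was left uninitialised in the first place.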

There is also a larger issue.  If these .so pmda's are not rock solid, they should be invoked by pmcd via a pipe connection rather than the .so linkage.

Comment 1 Frank Ch. Eigler 2012-09-18 16:30:29 EDT

Interestingly, if preceded by 

% pminfo -f kernel.pernode

a subsequent

% pmval kernel.pernode.cpu.sys

will not crash.

Comment 2 Nathan Scott 2012-09-18 22:22:15 EDT

I have a fix, and pcp/qa test 286 will exercise this.  mgoodwin is just reviewing, will then commit.

It has to do with the ordering of code execution: there are several one-trip guards in the linux pmda, and if metric values/instances are fetched in a specific order, there's a case where some expected initialisation has not yet been performed.
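
A rough sketch of that class of bug and one way to avoid it, assuming a design where every entry point calls the same idempotent setup routine. All names here (numa_state, numa_setup, numa_instance, numa_fetch) are hypothetical, and this is not necessarily how the committed fix works.

```c
#include <assert.h>
#include <stdlib.h>

/* All names below are hypothetical; they only mirror the shape of the bug. */
struct numa_state {
    long *meminfo;      /* per-node values, NULL until setup has run */
    int   nnodes;       /* number of NUMA nodes, set by the caller */
    int   setup_done;   /* one-trip guard */
};

/* Idempotent setup: the guard makes repeat calls cheap and harmless. */
static int numa_setup(struct numa_state *s)
{
    if (s->setup_done)
        return 0;
    if (s->nnodes <= 0)
        return -1;
    s->meminfo = calloc((size_t)s->nnodes, sizeof(*s->meminfo));
    if (s->meminfo == NULL)
        return -1;
    s->setup_done = 1;
    return 0;
}

/* Instance-request path: returns the instance count, or -1 on error. */
int numa_instance(struct numa_state *s)
{
    if (numa_setup(s) < 0)
        return -1;
    return s->nnodes;
}

/* Fetch path: does not assume the instance path ran first. */
int numa_fetch(struct numa_state *s, int node, long *out)
{
    assert(out != NULL);
    if (numa_setup(s) < 0 || node < 0 || node >= s->nnodes)
        return -1;
    *out = s->meminfo[node];
    return 0;
}
```

The point of the sketch is that neither path relies on the other having run first, so the order in which a client issues instance and fetch requests no longer matters.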

A simpler test case (which the qa test uses) is to run pmval in local context mode, using: pmval -s 1 @:kernel.pernode.cpu.sys

The test also uses "pmprobe -L -i" which exercises the other unusual path (instance PDU only, no fetch).  These commands are run for every kernel metric.

Comment 3 Nathan Scott 2012-09-18 22:26:36 EDT

Oh, regarding the other issue...

| There is also a larger issue.  If these .so pmda's are not rock solid, they
| should be invoked by pmcd via a pipe connection rather than the .so linkage.

some discussion happened on IRC...

<fche> what about the concern that .so pmda's should be dispreferred?
<nathans> they are dispreferred in general, but not for the kernel agents (since they are used vastly more than any other)
<nathans> when its a separate process, its additional context switches, additional syscalls
<nathans> traditionally, we've leaned toward keeping kernel pmdas in-process, and most others left up to sysadmin to choose (but generally defaulting to separate process)
<fche> does this bug make you reconsider the balance of performance vs stability ?
<nathans> a little, for sure
<nathans> certainly makes me take up my axe (to grind) about shovelling more and more metrics into pmdalinux
<nathans> kernel metrics that is
<nathans> which could go out, like pmdakvm and such

Historically, the interaction between the CPU indom and the NUMA indom has been problematic in the Linux kernel PMDA.  These are a fair bit more complex and intertwined than any of the other instance domains unfortunately.

Comment 4 Nathan Scott 2012-09-20 21:15:27 EDT

Fix is committed upstream.

Comment 5 Fedora Update System 2012-10-25 18:17:54 EDT

pcp-3.6.9-1.el5 has been submitted as an update for Fedora EPEL 5.

Comment 6 Fedora Update System 2012-10-25 18:18:30 EDT

pcp-3.6.9-1.fc18 has been submitted as an update for Fedora 18.

Comment 7 Fedora Update System 2012-10-25 18:18:57 EDT

pcp-3.6.9-1.fc16 has been submitted as an update for Fedora 16.

Comment 8 Fedora Update System 2012-10-25 18:19:25 EDT

pcp-3.6.9-1.el6 has been submitted as an update for Fedora EPEL 6.

Comment 9 Fedora Update System 2012-10-25 18:19:55 EDT

pcp-3.6.9-1.fc17 has been submitted as an update for Fedora 17.

Comment 10 Fedora Update System 2012-10-26 14:34:01 EDT

Package pcp-3.6.9-1.el5:
* should fix your issue,
* was pushed to the Fedora EPEL 5 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing pcp-3.6.9-1.el5'
as soon as you are able to.
Please go to the following url:
then log in and leave karma (feedback).

Comment 11 Fedora Update System 2012-12-20 10:14:06 EST

pcp-3.6.9-1.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.