Bug 858384 - pmcd segv during linux-pmda query
pmcd segv during linux-pmda query
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: pcp (Show other bugs)
16
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Nathan Scott
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-09-18 16:16 EDT by Frank Ch. Eigler
Modified: 2012-12-20 10:14 EST (History)
3 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-12-20 10:14:04 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Frank Ch. Eigler 2012-09-18 16:16:42 EDT
pcp-debuginfo-3.6.8-1.fc16.x86_64
pcp-gui-1.5.5-1.fc16.x86_64
python-pcp-3.6.8-1.fc16.x86_64
pcp-libs-3.6.8-1.fc16.x86_64
pcp-3.6.8-1.fc16.x86_64
perl-PCP-PMDA-3.6.8-1.fc16.x86_64

# service pmcd start
# gdb .../pmcd $PID
(gdb) continue

---- meanwhile, from another window ----

% pmval kernel.pernode.cpu.sys

pmval: pmGetInDom(60.19): Timeout waiting for a response from PMCD

---- pmcd suffers segv ----

Program received signal SIGSEGV, Segmentation fault.
linux_table_scan (fp=0x2b01f0140d00, table=0x0) at linux_table.c:80
80		for (t=table; t->field; t++) {
(gdb) bt
#0  linux_table_scan (fp=0x2b01f0140d00, table=0x0) at linux_table.c:80
#1  0x00002b01f0bca715 in refresh_numa_meminfo (numa_meminfo=0x2b01f0dd6fb0)
    at numa_meminfo.c:121
#2  0x00002b01f0bc1bf4 in linux_refresh (pmda=0x2b01f013f6d0, 
    need_refresh=0x7fff6b2b9a80) at pmda.c:3787
#3  0x00002b01f0bc1f6d in linux_instance (indom=251658259, inst=-1, name=0x0, 
    result=0x7fff6b2b9bb0, pmda=<optimized out>) at pmda.c:3911
#4  0x00002b01eed90bc9 in DoInstance (cp=0x2b01f0140810, pb=0x2b01f0141000)
    at dopdus.c:315
#5  0x00002b01eed89e9c in HandleClientInput (fdsPtr=0x7fff6b2b9cc0) at pmcd.c:495
#6  0x00002b01eed887f1 in ClientLoop () at pmcd.c:869
#7  main (argc=<optimized out>, argv=<optimized out>) at pmcd.c:1161

(gdb) frame 2
(gdb) p numa_meminfo->node_info[0]
$3 = {meminfo = 0x0, memstat = 0x2b8724958f80}
(gdb) p numa_meminfo->node_info[0]->memstat
$4 = (struct linux_table *) 0x2b8724958f80
(gdb) p *numa_meminfo->node_info[0]->memstat
$5 = {field = 0x0, maxval = 97, val = 47859434300912, this = 47859414894504, 
  prev = 47859434420404, field_len = 613841155, valid = 11143}

You see the null meminfo ptr that causes the segv. The memstat also seems to be trash.  It's as though the struct just wasn't initialized.


There is also a larger issue.  If these .so pmda's are not rock solid, they should be invoked by pmcd via a pipe connection rather than the .so linkage.
Comment 1 Frank Ch. Eigler 2012-09-18 16:30:29 EDT
Interestingly, if preceded by 

% pminfo -f kernel.pernode

a subsequent

% pmval kernel.pernode.cpu.sys

will not crash.
Comment 2 Nathan Scott 2012-09-18 22:22:15 EDT
I have a fix, and pcp/qa test 286 will exercise this.  mgoodwin is just reviewing, will then commit.

It's to do with ordering of code execution, there are several one-trip guards in the linux pmda, and if metrics values/instances are fetched in a specific order, there's a case where some expected initialisation has not yet been performed.

A simpler test case (which the qa test uses) is to run pmval in local context mode, using: pmval -s 1 @:kernel.pernode.cpu.sys

The test also uses "pmprobe -L -i" which exercises the other unusual path (instance PDU only, no fetch).  These commands are run for every kernel metric.
Comment 3 Nathan Scott 2012-09-18 22:26:36 EDT
Oh, regarding the other issue...

| There is also a larger issue.  If these .so pmda's are not rock solid, they
| should be invoked by pmcd via a pipe connection rather than the .so linkage.

some discussion happened on IRC...

<fche> what about the concern that .so pmda's should be dispreferred?
<nathans> they are dispreferred in general, but not for the kernel agents (since they are used vastly more than any other)
<nathans> when its a separate process, its additional context switches, additional syscalls
<nathans> traditionally, we've leaned toward keeping kernel pmdas in-process, and most others left up to sysadmin to choose (but generally defaulting to separate process)
<fche> does this bug make you reconsider the balance of performance vs stability ?
<nathans> a little, for sure
<nathans> certainly makes me take up my axe (to grind) about shovelling more and more metrics into pmdalinux
<nathans> kernel metrics that is
<nathans> which could go out, like pmdakvm and such


Historically, the interaction between the CPU indom and the NUMA indom has been problematic in the Linux kernel PMDA.  These are a fair bit more complex and intertwined than any of the other instance domains unfortunately.
Comment 4 Nathan Scott 2012-09-20 21:15:27 EDT
Fix is committed upstream.
Comment 5 Fedora Update System 2012-10-25 18:17:54 EDT
pcp-3.6.9-1.el5 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.el5
Comment 6 Fedora Update System 2012-10-25 18:18:30 EDT
pcp-3.6.9-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.fc18
Comment 7 Fedora Update System 2012-10-25 18:18:57 EDT
pcp-3.6.9-1.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.fc16
Comment 8 Fedora Update System 2012-10-25 18:19:25 EDT
pcp-3.6.9-1.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.el6
Comment 9 Fedora Update System 2012-10-25 18:19:55 EDT
pcp-3.6.9-1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.fc17
Comment 10 Fedora Update System 2012-10-26 14:34:01 EDT
Package pcp-3.6.9-1.el5:
* should fix your issue,
* was pushed to the Fedora EPEL 5 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing pcp-3.6.9-1.el5'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2012-13283/pcp-3.6.9-1.el5
then log in and leave karma (feedback).
Comment 11 Fedora Update System 2012-12-20 10:14:06 EST
pcp-3.6.9-1.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.