858384 – pmcd segv during linux-pmda query

Bug 858384 - pmcd segv during linux-pmda query

Summary: pmcd segv during linux-pmda query

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	pcp
Sub Component:
Version:	16
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Nathan Scott
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2012-09-18 20:16 UTC by Frank Ch. Eigler
Modified:	2012-12-20 15:14 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-12-20 15:14:04 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Frank Ch. Eigler 2012-09-18 20:16:42 UTC

pcp-debuginfo-3.6.8-1.fc16.x86_64
pcp-gui-1.5.5-1.fc16.x86_64
python-pcp-3.6.8-1.fc16.x86_64
pcp-libs-3.6.8-1.fc16.x86_64
pcp-3.6.8-1.fc16.x86_64
perl-PCP-PMDA-3.6.8-1.fc16.x86_64

# service pmcd start
# gdb .../pmcd $PID
(gdb) continue

---- meanwhile, from another window ----

% pmval kernel.pernode.cpu.sys

pmval: pmGetInDom(60.19): Timeout waiting for a response from PMCD

---- pmcd suffers segv ----

Program received signal SIGSEGV, Segmentation fault.
linux_table_scan (fp=0x2b01f0140d00, table=0x0) at linux_table.c:80
80		for (t=table; t->field; t++) {
(gdb) bt
#0  linux_table_scan (fp=0x2b01f0140d00, table=0x0) at linux_table.c:80
#1  0x00002b01f0bca715 in refresh_numa_meminfo (numa_meminfo=0x2b01f0dd6fb0)
    at numa_meminfo.c:121
#2  0x00002b01f0bc1bf4 in linux_refresh (pmda=0x2b01f013f6d0, 
    need_refresh=0x7fff6b2b9a80) at pmda.c:3787
#3  0x00002b01f0bc1f6d in linux_instance (indom=251658259, inst=-1, name=0x0, 
    result=0x7fff6b2b9bb0, pmda=<optimized out>) at pmda.c:3911
#4  0x00002b01eed90bc9 in DoInstance (cp=0x2b01f0140810, pb=0x2b01f0141000)
    at dopdus.c:315
#5  0x00002b01eed89e9c in HandleClientInput (fdsPtr=0x7fff6b2b9cc0) at pmcd.c:495
#6  0x00002b01eed887f1 in ClientLoop () at pmcd.c:869
#7  main (argc=<optimized out>, argv=<optimized out>) at pmcd.c:1161

(gdb) frame 2
(gdb) p numa_meminfo->node_info[0]
$3 = {meminfo = 0x0, memstat = 0x2b8724958f80}
(gdb) p numa_meminfo->node_info[0]->memstat
$4 = (struct linux_table *) 0x2b8724958f80
(gdb) p *numa_meminfo->node_info[0]->memstat
$5 = {field = 0x0, maxval = 97, val = 47859434300912, this = 47859414894504, 
  prev = 47859434420404, field_len = 613841155, valid = 11143}

You see the null meminfo ptr that causes the segv. The memstat also seems to be trash.  It's as though the struct just wasn't initialized.


There is also a larger issue.  If these .so pmda's are not rock solid, they should be invoked by pmcd via a pipe connection rather than the .so linkage.

Comment 1 Frank Ch. Eigler 2012-09-18 20:30:29 UTC

Interestingly, if preceded by 

% pminfo -f kernel.pernode

a subsequent

% pmval kernel.pernode.cpu.sys

will not crash.

Comment 2 Nathan Scott 2012-09-19 02:22:15 UTC

I have a fix, and pcp/qa test 286 will exercise this.  mgoodwin is just reviewing, will then commit.

It's to do with ordering of code execution, there are several one-trip guards in the linux pmda, and if metrics values/instances are fetched in a specific order, there's a case where some expected initialisation has not yet been performed.

A simpler test case (which the qa test uses) is to run pmval in local context mode, using: pmval -s 1 @:kernel.pernode.cpu.sys

The test also uses "pmprobe -L -i" which exercises the other unusual path (instance PDU only, no fetch).  These commands are run for every kernel metric.

Comment 3 Nathan Scott 2012-09-19 02:26:36 UTC

Oh, regarding the other issue...

| There is also a larger issue.  If these .so pmda's are not rock solid, they
| should be invoked by pmcd via a pipe connection rather than the .so linkage.

some discussion happened on IRC...

<fche> what about the concern that .so pmda's should be dispreferred?
<nathans> they are dispreferred in general, but not for the kernel agents (since they are used vastly more than any other)
<nathans> when its a separate process, its additional context switches, additional syscalls
<nathans> traditionally, we've leaned toward keeping kernel pmdas in-process, and most others left up to sysadmin to choose (but generally defaulting to separate process)
<fche> does this bug make you reconsider the balance of performance vs stability ?
<nathans> a little, for sure
<nathans> certainly makes me take up my axe (to grind) about shovelling more and more metrics into pmdalinux
<nathans> kernel metrics that is
<nathans> which could go out, like pmdakvm and such


Historically, the interaction between the CPU indom and the NUMA indom has been problematic in the Linux kernel PMDA.  These are a fair bit more complex and intertwined than any of the other instance domains unfortunately.

Comment 4 Nathan Scott 2012-09-21 01:15:27 UTC

Fix is committed upstream.

Comment 5 Fedora Update System 2012-10-25 22:17:54 UTC

pcp-3.6.9-1.el5 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.el5

Comment 6 Fedora Update System 2012-10-25 22:18:30 UTC

pcp-3.6.9-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.fc18

Comment 7 Fedora Update System 2012-10-25 22:18:57 UTC

pcp-3.6.9-1.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.fc16

Comment 8 Fedora Update System 2012-10-25 22:19:25 UTC

pcp-3.6.9-1.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.el6

Comment 9 Fedora Update System 2012-10-25 22:19:55 UTC

pcp-3.6.9-1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.fc17

Comment 10 Fedora Update System 2012-10-26 18:34:01 UTC

Package pcp-3.6.9-1.el5:
* should fix your issue,
* was pushed to the Fedora EPEL 5 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing pcp-3.6.9-1.el5'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2012-13283/pcp-3.6.9-1.el5
then log in and leave karma (feedback).

Comment 11 Fedora Update System 2012-12-20 15:14:06 UTC

pcp-3.6.9-1.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.