Bug 858384
| Summary: | pmcd segv during linux-pmda query | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Frank Ch. Eigler <fche> |
| Component: | pcp | Assignee: | Nathan Scott <nscott-do-not-use> |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 16 | CC: | fche, mgoodwin, nathans |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-12-20 15:14:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Interestingly, if preceded by `% pminfo -f kernel.pernode`, a subsequent `% pmval kernel.pernode.cpu.sys` will not crash.

I have a fix, and pcp/qa test 286 will exercise this. mgoodwin is reviewing it now and will then commit. It comes down to the ordering of code execution: there are several one-trip guards in the linux PMDA, and if metric values/instances are fetched in a particular order, there is a case where some expected initialisation has not yet been performed.

A simpler test case (which the qa test uses) is to run pmval in local context mode:

```
% pmval -s 1 @:kernel.pernode.cpu.sys
```

The test also uses `pmprobe -L -i`, which exercises the other unusual path (instance PDU only, no fetch). These commands are run for every kernel metric.

Oh, regarding the other issue...

> There is also a larger issue. If these .so pmda's are not rock solid, they
> should be invoked by pmcd via a pipe connection rather than the .so linkage.

Some discussion happened on IRC:

```
<fche>    what about the concern that .so pmda's should be dispreferred?
<nathans> they are dispreferred in general, but not for the kernel agents (since they are used vastly more than any other)
<nathans> when it's a separate process, it's additional context switches, additional syscalls
<nathans> traditionally, we've leaned toward keeping kernel pmdas in-process, and most others left up to the sysadmin to choose (but generally defaulting to separate process)
<fche>    does this bug make you reconsider the balance of performance vs stability?
<nathans> a little, for sure
<nathans> certainly makes me take up my axe (to grind) about shovelling more and more metrics into pmdalinux
<nathans> kernel metrics that is
<nathans> which could go out, like pmdakvm and such
```

Historically, the interaction between the CPU indom and the NUMA indom has been problematic in the Linux kernel PMDA. These are a fair bit more complex and intertwined than any of the other instance domains, unfortunately.

The fix is committed upstream.

pcp-3.6.9-1.el5 has been submitted as an update for Fedora EPEL 5. https://admin.fedoraproject.org/updates/pcp-3.6.9-1.el5

pcp-3.6.9-1.fc18 has been submitted as an update for Fedora 18. https://admin.fedoraproject.org/updates/pcp-3.6.9-1.fc18

pcp-3.6.9-1.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/pcp-3.6.9-1.fc16

pcp-3.6.9-1.el6 has been submitted as an update for Fedora EPEL 6. https://admin.fedoraproject.org/updates/pcp-3.6.9-1.el6

pcp-3.6.9-1.fc17 has been submitted as an update for Fedora 17. https://admin.fedoraproject.org/updates/pcp-3.6.9-1.fc17

Package pcp-3.6.9-1.el5:
* should fix your issue,
* was pushed to the Fedora EPEL 5 testing repository,
* should be available at your local mirror within two days.

Update it with:

```
# su -c 'yum update --enablerepo=epel-testing pcp-3.6.9-1.el5'
```

as soon as you are able to. Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2012-13283/pcp-3.6.9-1.el5
then log in and leave karma (feedback).

pcp-3.6.9-1.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.
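To make the ordering issue described above concrete, here is a minimal sketch (hypothetical names and structure, not the actual pmdalinux code) of a one-trip guard that only runs on the fetch path, so an instances-only request reaches per-node state before it has ever been allocated:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-node state, standing in for numa_meminfo's node_info. */
struct node_info {
    long *meminfo;              /* allocated lazily by the fetch path */
};

#define NR_NODES 2
static struct node_info nodes[NR_NODES];
static int did_fetch_init;      /* one-trip guard, set only on the fetch path */

/* Fetch path: the first fetch allocates the per-node tables. */
static void refresh_for_fetch(void)
{
    if (!did_fetch_init) {
        for (int i = 0; i < NR_NODES; i++)
            nodes[i].meminfo = calloc(8, sizeof(long));
        did_fetch_init = 1;
    }
}

/* Instance path (cf. DoInstance -> linux_instance -> linux_refresh):
 * walks the per-node tables but never ran the fetch-path initialisation,
 * so nodes[i].meminfo is still NULL if no fetch has happened yet. */
static void refresh_for_instances(void)
{
    for (int i = 0; i < NR_NODES; i++)
        printf("node%d: %ld\n", i, nodes[i].meminfo[0]);   /* SIGSEGV when NULL */
}

int main(int argc, char **argv)
{
    /* "./guard fetch" mimics running pminfo -f first: no crash.
     * With no argument, the instance request comes first and dereferences
     * NULL, mirroring the pmval/pmGetInDom crash in this report. */
    if (argc > 1 && strcmp(argv[1], "fetch") == 0)
        refresh_for_fetch();
    refresh_for_instances();
    return 0;
}
```

The general cure for this class of problem is to make each refresh routine perform (or share) the required initialisation itself, regardless of which request type arrives first, rather than assuming a fetch has already run.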
```
pcp-debuginfo-3.6.8-1.fc16.x86_64
pcp-gui-1.5.5-1.fc16.x86_64
python-pcp-3.6.8-1.fc16.x86_64
pcp-libs-3.6.8-1.fc16.x86_64
pcp-3.6.8-1.fc16.x86_64
perl-PCP-PMDA-3.6.8-1.fc16.x86_64
```

```
# service pmcd start
# gdb .../pmcd $PID
(gdb) continue

---- meanwhile, from another window ----

% pmval kernel.pernode.cpu.sys
pmval: pmGetInDom(60.19): Timeout waiting for a response from PMCD

---- pmcd suffers segv ----

Program received signal SIGSEGV, Segmentation fault.
linux_table_scan (fp=0x2b01f0140d00, table=0x0) at linux_table.c:80
80          for (t=table; t->field; t++) {
(gdb) bt
#0  linux_table_scan (fp=0x2b01f0140d00, table=0x0) at linux_table.c:80
#1  0x00002b01f0bca715 in refresh_numa_meminfo (numa_meminfo=0x2b01f0dd6fb0) at numa_meminfo.c:121
#2  0x00002b01f0bc1bf4 in linux_refresh (pmda=0x2b01f013f6d0, need_refresh=0x7fff6b2b9a80) at pmda.c:3787
#3  0x00002b01f0bc1f6d in linux_instance (indom=251658259, inst=-1, name=0x0, result=0x7fff6b2b9bb0, pmda=<optimized out>) at pmda.c:3911
#4  0x00002b01eed90bc9 in DoInstance (cp=0x2b01f0140810, pb=0x2b01f0141000) at dopdus.c:315
#5  0x00002b01eed89e9c in HandleClientInput (fdsPtr=0x7fff6b2b9cc0) at pmcd.c:495
#6  0x00002b01eed887f1 in ClientLoop () at pmcd.c:869
#7  main (argc=<optimized out>, argv=<optimized out>) at pmcd.c:1161
(gdb) frame 2
(gdb) p numa_meminfo->node_info[0]
$3 = {meminfo = 0x0, memstat = 0x2b8724958f80}
(gdb) p numa_meminfo->node_info[0]->memstat
$4 = (struct linux_table *) 0x2b8724958f80
(gdb) p *numa_meminfo->node_info[0]->memstat
$5 = {field = 0x0, maxval = 97, val = 47859434300912, this = 47859414894504, prev = 47859434420404, field_len = 613841155, valid = 11143}
```

You see the null meminfo ptr that causes the segv. The memstat also seems to be trash. It's as though the struct just wasn't initialized.

There is also a larger issue. If these .so pmda's are not rock solid, they should be invoked by pmcd via a pipe connection rather than the .so linkage.
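For reference on the .so-vs-pipe point: whether a PMDA runs inside the pmcd process (DSO) or as a separate child process (pipe) is configured per agent in pmcd.conf. A rough illustration of the two styles follows; the paths and options are examples that may differ between distributions, and this is not a suggested change for this bug:

```
# Name  Id  IPC   IPC Params  File/Cmd
linux   60  dso   linux_init  /var/lib/pcp/pmdas/linux/pmda_linux.so
proc    3   pipe  binary      /var/lib/pcp/pmdas/proc/pmdaproc -d 3
```

With the dso form, a fault in the agent takes pmcd down with it, as seen in this report; with the pipe form the agent is a separate process, so a crash costs that agent's metrics but not the daemon, at the price of the extra context switches and syscalls noted in the IRC discussion above.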