Red Hat Bugzilla – Bug 1202934
pcp 3.10.3 pmmgr/pmlogconf crashes older remote pmcd servers
Last modified: 2015-12-02 12:41:56 EST
A RH machine running f20 pcp 3.10.3 from fedora/koji monitors nearby pcp servers via pmmgr (and thus pmlogconf/pmlogger). It appears that something in the connection protocol has recently changed, in that old pmcd's (such as rhel7's 3.9.10, rhel6's 3.9.4) die. (It may matter that these pcp installations may also have been running pcpqa, so in some cases their pmcd.conf included -T3.)

The symptom of the problem is a bunch of these remote pmcds just dying within minutes of the upgrade to the pmmgr-hosting pmcd server. One symptom is the affected pmcd.log containing a bunch of lines of this form (once per minute - pmmgr poll interval):

    Error: ClientLoop: error sending Conn ACK PDU to new client IPC protocol failure

... but no explicit "exiting" type message. The log just stops with pmcd dead. Many of these messages predate the 3.10.3 pcp update, so the 3.10.2 client code was apparently triggering it also.

On one affected machine (rhel7 3.9.10), there is a more interesting item in the kernel logs:

    traps: pmcd[...] trap divide error [...] in libpcp_pmda.so.3

which on a debugger indicates:

    Program received signal SIGFPE, Arithmetic exception.
    0x00002aaaae114603 in pmdaTreeName (pmns=0x555555782190, pmid=251709451, nameset=0x7fffffffe338) at tree.c:147
    147         hashchain = pmns->htab[pmid % pmns->htabsize];
    (gdb) bt
    #0  0x00002aaaae114603 in pmdaTreeName (pmns=0x555555782190, pmid=251709451, nameset=0x7fffffffe338) at tree.c:147
    #1  0x00005555555630d2 in DoPMNSIDs (cp=cp@entry=0x5555557a2790, pb=0x5555557b4000) at dopdus.c:392
    #2  0x000055555555b4a9 in HandleClientInput (fdsPtr=fdsPtr@entry=0x7fffffffe420) at pmcd.c:350
    #3  0x000055555555a2e5 in ClientLoop () at pmcd.c:697
    #4  main (argc=<optimized out>, argv=<optimized out>) at pmcd.c:887

where pmns->htabsize == 0.

"valgrind pmcd -f -T3" on an affected rhel6 box (running pcp 3.9.4) indicates an eventual crash probably in the same spot.
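The SIGFPE above comes from taking pmid modulo a bucket count of zero. A minimal sketch of a defensive lookup is shown below, purely to illustrate the failure mode. The struct and function names (table, node, lookup_name) are hypothetical stand-ins, not the libpcp_pmda types or API, and a guard like this would only mask the crash: it would not explain why htabsize ended up as 0 in the first place.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a PMNS-style hash table; these are NOT the
 * real libpcp_pmda structures, only an illustration of the pattern. */
struct node {
    unsigned int  pmid;       /* key */
    const char   *name;       /* value */
    struct node  *hash_next;  /* next entry in the same bucket */
};

struct table {
    struct node **htab;       /* bucket array, NULL if never built */
    unsigned int  htabsize;   /* number of buckets, 0 if never built */
};

/* Return the name stored for pmid, or NULL if it is absent or the
 * table is unusable. */
const char *lookup_name(const struct table *t, unsigned int pmid)
{
    const struct node *n;

    /* Integer modulo by zero is undefined in C and traps as a divide
     * error (SIGFPE) on x86, so refuse to index an empty table. */
    if (t == NULL || t->htab == NULL || t->htabsize == 0)
        return NULL;

    for (n = t->htab[pmid % t->htabsize]; n != NULL; n = n->hash_next) {
        if (n->pmid == pmid)
            return n->name;
    }
    return NULL;
}
```

With such a check, a caller handed an unbuilt table gets a "not found" result instead of a trap, which can then be logged and investigated.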
A "pmcd -Dall" trace suggests this was the last packet received before death:

    secureconnect.c:__pmRecv[secure](1026, ..., 12, 0) -> 12
    pduread(1026, ...): have 12, last read 12, still need 0
    [29777]pmGetPDU: PMNS_IDS fd=1026 len=24 from=0
    000:       18     700d        0        0  1000000  ac8000f
         fd  client connection from                    ipc ver  operations denied
         ==  ========================================  =======  =================
       1026  tofan.yyz.redhat.com                            2  store
    __pmDecodeIDList
    IDlist dump: numids = 1
      PMID[0]: 0x0f00c80a 60.50.10

(which happens to be kernel.percpu.interrupts.MCE, which one can normally read correctly, even on the affected machines. I assume some prior memory corruption happened. I have a complete 70MBish pmcd.log for this one.)
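For readers unfamiliar with the dotted form in the dump, the sketch below shows how 0x0f00c80a maps to 60.50.10. It assumes the usual PCP PMID bit layout (10-bit item, 12-bit cluster, 9-bit domain, one flag bit on top). The struct and function names are hypothetical and written out by hand rather than taken from the PCP headers.

```c
#include <stdint.h>

/* Hypothetical helper type; the field widths are an assumption about
 * the PMID layout, not copied from the PCP headers. */
struct pmid_parts {
    unsigned int domain;   /* bits 22..30 (9 bits)  */
    unsigned int cluster;  /* bits 10..21 (12 bits) */
    unsigned int item;     /* bits 0..9   (10 bits) */
};

/* Split a 32-bit PMID into its domain.cluster.item components. */
struct pmid_parts split_pmid(uint32_t pmid)
{
    struct pmid_parts p;

    p.item    = pmid & 0x3ffu;
    p.cluster = (pmid >> 10) & 0xfffu;
    p.domain  = (pmid >> 22) & 0x1ffu;
    return p;
}
```

Under that assumption, split_pmid(0x0f00c80a) gives domain 60, cluster 50, item 10, matching the dump. The pmid=251709451 value in the gdb backtrace is 0x0f00c80b, which the same split renders as 60.50.11.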
This message is a reminder that Fedora 21 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 21. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '21'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 21 reaches end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 21 changed to end-of-life (EOL) status on 2015-12-01. Fedora 21 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.