pcp-3.8.12-1.fc19 on x86-64 server, servicing a remote pmmgr, intermittently appears to crash the proc/linux pmda: % pcp Performance Co-Pilot configuration on FOO: platform: Linux FOO 3.10.5-201.fc19.x86_64 #1 SMP Wed Aug 7 16:25:24 UTC 2013 x86_64 hardware: 4 cpus, 9 disks, 1 node, 19949MB RAM timezone: EST+5 pmcd: Version 3.8.12-1, 5 agents, 2 clients pmda: pmcd proc[8] sendmail linux systemd rpm % pminfo -f proc ... proc.id.sgid_nm: pmLookupDesc: No PMCD agent for domain of request ... pmcd.log: Log for pmcd on FOO started Sun Feb 9 07:47:15 2014 active agent dom pid in out ver protocol parameters ============ === ===== === === === ======== ========== pmcd 2 2 dso i:5 lib=/var/lib/pcp/pmdas/pmcd/pmda_pmcd.so entry=pmcd_init [0x2aaaae3ab8c0] linux 60 2 dso i:4 lib=/var/lib/pcp/pmdas/linux/pmda_linux.so entry=linux_init [0x2aaaae7cd730] proc 3 18900 13 14 2 bin pipe cmd=/var/lib/pcp/pmdas/proc/pmdaproc -d 3 systemd 114 18905 15 16 2 bin pipe cmd=/var/lib/pcp/pmdas/systemd/pmdasystemd -d 114 sendmail 15 18971 17 18 2 bin pipe cmd=/var/lib/pcp/pmdas/sendmail/pmdasendmail -d 15 rpm 123 18972 19 20 2 bin pipe cmd=/var/lib/pcp/pmdas/rpm/pmdarpm -d 123 Host access list: 00 01 Cur/MaxCons host-spec host-mask lvl host-name == == =========== ======================================= ======================================= === ============== y y 0 0 192.168.1.10 255.255.255.255 0 localhost y y 0 0 / / 1 unix: n 0 0 0.0.0.0 0.0.0.0 4 .* n 0 0 :: :: 8 :* User access list empty: user-based access control turned off Group access list empty: group-based access control turned off pmcd: PID = 18892, PDU version = 2 pmcd request port(s): sts fd port family address === ==== ===== ====== ======= ok 1026 unix /var/run/pcp/pmcd.socket ok 1024 44321 inet INADDR_ANY ok 1025 44321 ipv6 INADDR_ANY [Sun Feb 9 08:27:39] pmcd(18892) Warning: pduread: timeout (after 5.000 sec) while attempting to read 12 bytes out of 12 in HDR on fd=14 [Sun Feb 9 08:28:02] pmcd(18892) Info: CleanupAgent ... Cleanup "proc" agent (dom 3): protocol failure for fd=14, exit(0) ^^^ note the pmda died about 45 minutes after it started. proc.log extract: UNKNOWN ) : 5713 (/usr/bin/spamd ) S 1 5713 5713 0 -1 4218944 15361 0 1 0 352 31 0 0 20 0 1 0 1549014777 291778560 16843 18446744073709551615 4194304 4199407 140734374877360 140734374876616 227976074547 0 0 4224 92163 18446744071580590201 0 0 17 2 0 0 87 0 0 6298944 6304288 29331456 140734374883070 140734374883147 140734374883147 140734374883305 0 4 140733360050934 140733360054251 0 40 140734363951840 140734363955164 0 0892 0 ^A UNKNOWN c : 5715 (spamd child) S 5713 5713 5713 0 -1 4219200 13521 0 1 0 2769 103 0 0 20 0 1 0 1549016009 309153792 21145 18446744073709551615 4194304 4199407 140734374877360 140734374876616 227976074547 0 0 4224 10240 18446744071580590201 0 0 17 0 0 0 2472 0 0 6298944 6304288 29331456 140734374883070 140734374883147 140734374883147 140734374883305 0 40733360050934 140733360054251 0 40 140734363951840 140734363955164 0 0892 0 ^A [...] UNKNOWN c : 5716 (spamd child) S 5713 5713 5713 0 -1 4219200 8679 0 0 0 37 8 0 0 20 0 1 0 1549016009 296628224 17993 18446744073709551615 4194304 4199407 140734374877360 140734374876616 227976074547 0 0 4224 10240 18446744071580590201 0 0 17 2 0 0 83 0 0 6298944 6304288 29331456 140734374883070 140734374883147 140734374883147 140734374883305 0 83305 0 0733360050934 140733360054251 0 840 140734363951840 140734363955164 0 80892 0 Log finished Sun Feb 9 08:28:02 2014 ^^^^ note how the pmda emitted that message before dying
> > [Sun Feb 9 08:27:39] pmcd(18892) Warning: pduread: timeout (after 5.000 sec) > while attempting to read 12 bytes out of 12 in HDR on fd=14 > [Sun Feb 9 08:28:02] pmcd(18892) Info: CleanupAgent ... > Cleanup "proc" agent (dom 3): protocol failure for fd=14, exit(0) > > ^^^ note the pmda died about 45 minutes after it started. *nod* - note also the exit(0) - indicating the PMDA chose to exit. And from proc.log... > 840 140734363951840 140734363955164 0 > 80892 0 > > Log finished Sun Feb 9 08:28:02 2014 > > ^^^^ note how the pmda emitted that message before dying This message is from the atexit handler - it appears pmdaproc either called exit(0) somewhere or just exited out of the main PMDA loop (have never seen that happen before, FWLIW) and returned zero from main. Very odd, not sure ... but those diagnostics ... ohhh, wait - proc_runq.c is reporting that for process states it does not recognise. Looks like that is because there's whitespace in the command names "(spamd )" and "(spamd child)" which must be taking something in the code by surprise. So, thats one issue - doesn't explain where the exit call is coming from though.
Failing to reproduce the problem here - specifically, trying to get that runq diagnostic to trigger with whitespace in the task_struct comm field... (in the hope that may trigger the clean exit somehow): diff --git a/qa/src/qa_test.c b/qa/src/qa_test.c index 8c0305d..5c033e6 100644 --- a/qa/src/qa_test.c +++ b/qa/src/qa_test.c @@ -1,7 +1,12 @@ #include <stdio.h> +#include <unistd.h> +int +main(int argc, char *argv[]) +{ + static char *kevin_spacey = "qa test "; -main(int argc, char *argv[]) { - printf("This program does nothing, and wastes a lot of space doing it!\n"); - exit(0); + /* modify argv[0], see how pmdaproc reacts */ + argv[0] = kevin_spacey; + return pause(); } $ ln 'qa_test' 'linked qa test ' $ ./linked\ qa\ test (pauses) ... pmdaproc seems to report the right thing for me - any proc.runq fetches cause no logged messages to appear either. Manual inspection of the /proc/<PID>/stat file suggests the problem *should* occur. Argh.
(Just updating the title to reflect our current understanding of the issue)
This message is a notice that Fedora 19 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 19. It is Fedora's policy to close all bug reports from releases that are no longer maintained. Approximately 4 (four) weeks from now this bug will be closed as EOL if it remains open with a Fedora 'version' of '19'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 19 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
This bug appears to have been reported against 'rawhide' during the Fedora 22 development cycle. Changing version to '22'. More information and reason for this action is here: https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora22
Still appears with 3.10.5.
This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle. Changing version to '23'. (As we did not run this process for some time, it could affect also pre-Fedora 23 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.) More information and reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora23
pcp-3.11.1-1.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-90112fb9ca
pcp-3.11.1-1.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-5b519318e0
pcp-3.11.1-1.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-4969de37e5
To clarify, the fix here is to auto-restart agents that are unresponsive, which is typically due to an unexpected, very large latency during PMDA sampling (and fixing the source of that latency is outside of PCP, hence intractable). This is achieved through a combination of pmdaroot starting PMDAs (i.e. set PMCD_ROOT_AGENT=1 in /etc/sysconfig/pmcd - which is now the default) and: # chkconfig pmie on # service pmie start This enables the pmie rule which checks for agents that have exited, and automates their restart (within ~5 seconds - with a holdoff of 1 minute after any such attempt). A message is also logged to syslog at the time a restart is attempted. These two components to the fix first came together in pcp-3.11.1, however the pmie rule could be used in pcp-3.11.0 as well if anyone needs that.
pcp-3.11.1-1.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-e687eabad0
> (and fixing the source of that latency is outside of PCP, hence > intractable). PCP must already tolerate high latency underlying data sources, and does in cases such as pmdarpm with background threads. This is not intractable - it just requires a well designed program.
> This is not intractable [...] Oh, my note was unclear - the part that is not solvable in PCP is getting valid values at the time point requested. As to how PCP responds and deals with that, yes, many different potential solutions exist there, and of varying levels of complexity.
pcp-3.11.1-1.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.
pcp-3.11.1-1.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.
pcp-3.11.1-1.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.
pcp-3.11.1-1.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.