Bug 1065803
Summary: | proc-pmda can timeout on fetch | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Frank Ch. Eigler <fche> |
Component: | pcp | Assignee: | Nathan Scott <nathans> |
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 23 | CC: | fche, mbenitez, mgoodwin, nathans, pcp, scox |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | pcp-3.11.1-1.fc24 pcp-3.11.1-1.fc23 pcp-3.11.1-1.fc22 pcp-3.11.1-1.el5 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-03-26 17:56:06 UTC | Type: | Bug |
Description
Frank Ch. Eigler
2014-02-17 01:44:15 UTC
>
> [Sun Feb 9 08:27:39] pmcd(18892) Warning: pduread: timeout (after 5.000 sec)
> while attempting to read 12 bytes out of 12 in HDR on fd=14
> [Sun Feb 9 08:28:02] pmcd(18892) Info: CleanupAgent ...
> Cleanup "proc" agent (dom 3): protocol failure for fd=14, exit(0)
>
> ^^^ note the pmda died about 45 minutes after it started.

*nod* - note also the exit(0) - indicating the PMDA chose to exit.

And from proc.log...

> 840 140734363951840 140734363955164 0
> 80892 0
>
> Log finished Sun Feb 9 08:28:02 2014
>
> ^^^^ note how the pmda emitted that message before dying

This message is from the atexit handler - it appears pmdaproc either called exit(0) somewhere or just exited out of the main PMDA loop (have never seen that happen before, FWLIW) and returned zero from main.

Very odd, not sure ... but those diagnostics ... ohhh, wait - proc_runq.c is reporting that for process states it does not recognise. Looks like that is because there's whitespace in the command names "(spamd )" and "(spamd child)" which must be taking something in the code by surprise.

So, that's one issue - doesn't explain where the exit call is coming from though.

Failing to reproduce the problem here - specifically, trying to get that runq diagnostic to trigger with whitespace in the task_struct comm field... (in the hope that may trigger the clean exit somehow):

diff --git a/qa/src/qa_test.c b/qa/src/qa_test.c
index 8c0305d..5c033e6 100644
--- a/qa/src/qa_test.c
+++ b/qa/src/qa_test.c
@@ -1,7 +1,12 @@
 #include <stdio.h>
+#include <unistd.h>
 
+int
+main(int argc, char *argv[])
+{
+    static char *kevin_spacey = "qa test ";
 
-main(int argc, char *argv[]) {
-    printf("This program does nothing, and wastes a lot of space doing it!\n");
-    exit(0);
+    /* modify argv[0], see how pmdaproc reacts */
+    argv[0] = kevin_spacey;
+    return pause();
 }

$ ln 'qa_test' 'linked qa test '
$ ./linked\ qa\ test
(pauses)
...
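For reference, a minimal sketch of one way to parse a /proc/<PID>/stat line so that whitespace in the comm field cannot shift the later fields. This is not the proc_runq.c code, and the thread does not confirm that whitespace was the actual cause; parse_stat_line is a hypothetical helper name. It relies only on the comm field being wrapped in parentheses, and takes the last ')' on the line as its end.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not the pmdaproc code): extract the command name
 * and the state character from one /proc/<PID>/stat line.  The comm field
 * is wrapped in parentheses and may itself contain spaces or ')', so it is
 * delimited by the first '(' and the LAST ')' rather than by whitespace.
 * Returns 0 on success, -1 if the line is malformed.
 */
static int
parse_stat_line(const char *line, char *comm, size_t commlen, char *state)
{
    const char *open, *close;
    size_t len;

    assert(line != NULL && comm != NULL && state != NULL && commlen > 0);

    open = strchr(line, '(');
    close = strrchr(line, ')');
    if (open == NULL || close == NULL || close < open)
        return -1;

    /* copy the text between the parentheses, truncating if necessary */
    len = (size_t)(close - open - 1);
    if (len >= commlen)
        len = commlen - 1;
    memcpy(comm, open + 1, len);
    comm[len] = '\0';

    /* the state field follows the closing parenthesis: ") S ..." */
    if (close[1] != ' ' || close[2] == '\0')
        return -1;
    *state = close[2];
    return 0;
}
```

With this approach a line containing "(spamd child)" yields the comm "spamd child" and the state character that follows it, instead of treating "child)" as the state field.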
pmdaproc seems to report the right thing for me - any proc.runq fetches cause no logged messages to appear either. Manual inspection of the /proc/<PID>/stat file suggests the problem *should* occur. Argh.

(Just updating the title to reflect our current understanding of the issue)

This message is a notice that Fedora 19 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 19. It is Fedora's policy to close all bug reports from releases that are no longer maintained. Approximately 4 (four) weeks from now this bug will be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 19 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

This bug appears to have been reported against 'rawhide' during the Fedora 22 development cycle. Changing version to '22'.

More information and reason for this action is here: https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora22

Still appears with 3.10.5.

This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle. Changing version to '23'.

(As we did not run this process for some time, it could affect also pre-Fedora 23 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.)
More information and reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora23

pcp-3.11.1-1.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-90112fb9ca

pcp-3.11.1-1.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-5b519318e0

pcp-3.11.1-1.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-4969de37e5

To clarify, the fix here is to auto-restart agents that are unresponsive, which is typically due to an unexpected, very large latency during PMDA sampling (and fixing the source of that latency is outside of PCP, hence intractable).

This is achieved through a combination of pmdaroot starting PMDAs (i.e. set PMCD_ROOT_AGENT=1 in /etc/sysconfig/pmcd - which is now the default) and:

# chkconfig pmie on
# service pmie start

This enables the pmie rule which checks for agents that have exited, and automates their restart (within ~5 seconds - with a holdoff of 1 minute after any such attempt). A message is also logged to syslog at the time a restart is attempted.

These two components to the fix first came together in pcp-3.11.1, however the pmie rule could be used in pcp-3.11.0 as well if anyone needs that.
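The restart-with-holdoff behaviour described above is implemented as a pmie rule, not in C. Purely as an illustration of the holdoff logic, here is a small sketch; struct agent_watch, should_restart and HOLDOFF_SECONDS are hypothetical names and not part of PCP.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define HOLDOFF_SECONDS 60   /* the "holdoff of 1 minute" mentioned above */

/* Hypothetical per-agent state; not a PCP data structure. */
struct agent_watch {
    bool   ever_attempted;   /* has a restart been attempted yet? */
    time_t last_attempt;     /* time of the most recent restart attempt */
};

/*
 * Called on each periodic check (roughly every 5 seconds in the
 * description above).  Returns true if a restart should be attempted
 * now, and records the attempt so that later checks honour the holdoff.
 */
static bool
should_restart(struct agent_watch *w, bool agent_exited, time_t now)
{
    assert(w != NULL);

    if (!agent_exited)
        return false;
    if (w->ever_attempted &&
        difftime(now, w->last_attempt) < HOLDOFF_SECONDS)
        return false;        /* still inside the holdoff window */

    w->ever_attempted = true;
    w->last_attempt = now;
    return true;
}
```

The holdoff keeps an agent that exits again straight after a restart from being restarted on every check.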
pcp-3.11.1-1.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-e687eabad0

> (and fixing the source of that latency is outside of PCP, hence
> intractable).
PCP must already tolerate high-latency underlying data sources, and does so in cases such as pmdarpm with background threads. This is not intractable - it just requires a well-designed program.
> This is not intractable [...]
Oh, my note was unclear - the part that is not solvable in PCP is getting valid values at the time point requested. As to how PCP responds and deals with that, yes, many different potential solutions exist there, and of varying levels of complexity.
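As a rough illustration of the background-thread approach mentioned for pmdarpm, the sketch below decouples a slow sample from the fetch path using a cached value. It is not taken from any PCP agent; struct sample_cache, slow_sample, sampler_thread and fetch_cached are hypothetical names, and slow_sample is a stand-in for a real data source.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical cache shared between a sampler thread and the fetch path. */
struct sample_cache {
    pthread_mutex_t lock;
    long            value;   /* most recent sample */
    bool            valid;   /* false until the first sample completes */
};

/* Stand-in for a slow data source; a real agent would read /proc here. */
static long
slow_sample(void)
{
    return 42;
}

/* Background thread: takes one (possibly slow) sample and publishes it. */
static void *
sampler_thread(void *arg)
{
    struct sample_cache *c = arg;
    long v = slow_sample();  /* the lock is NOT held while sampling */

    pthread_mutex_lock(&c->lock);
    c->value = v;
    c->valid = true;
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

/*
 * Fetch path: waits only for the short critical section, never for the
 * data source.  Returns false if no sample is available yet.
 */
static bool
fetch_cached(struct sample_cache *c, long *out)
{
    bool ok;

    assert(c != NULL && out != NULL);
    pthread_mutex_lock(&c->lock);
    ok = c->valid;
    if (ok)
        *out = c->value;
    pthread_mutex_unlock(&c->lock);
    return ok;
}
```

This keeps the fetch from timing out, but the value returned is the last one sampled rather than one taken at the time point requested, which is the limitation noted in the comment above.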
pcp-3.11.1-1.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

pcp-3.11.1-1.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

pcp-3.11.1-1.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.

pcp-3.11.1-1.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.