Picture this. pcp 3.11.2, happily steaming along, until one of its pmdas (usually proc or linux) times out. The pcp 3.11.2 pmcd code responds by restarting the pmda. Normally that's fine, but what if the restart fails, by another timeout right then? Then pmcd is unaware, and its auto-restart logic doesn't trigger until the indefinite future (since the AgentDied flag is cleared). This has been observed in the wild.

One possible cure is this patch, which passes hand-testing (running a tight killall -9 or -STOP loop against a target pmda), but needs more thought & probably proper QA:

diff --git a/src/pmcd/src/config.c b/src/pmcd/src/config.c
index 04d9db8bdb4f..ef92ce3230c0 100644
--- a/src/pmcd/src/config.c
+++ b/src/pmcd/src/config.c
@@ -1532,6 +1532,8 @@ AgentNegotiate(AgentInfo *aPtr)
     else
         fprintf(stderr, "pmcd: error at initial PDU exchange with "
                 "%s PMDA: %s\n", aPtr->pmDomainLabel, pmErrStr(sts));
+
+    AgentDied = 1; /* signal to request auto-restart */
     return PM_ERR_IPC;
 }
(In reply to Frank Ch. Eigler from comment #0)
> [...]
> restarting the pmda. Normally that's fine, but what if the restart fails,
> by another timeout right then?

pmie eventually notices and performs the restart more reliably? (*cough*)

> [...] but needs more thought & probably proper QA:

Yes, and certainly the latter - is this in progress, or are you expecting someone else to work on fixing this?

This is a regression that was introduced recently when moving away from the pmie-based solution (at your insistence), but you seem to have left this BZ assigned to me ... (as default owner? hence this followup - please assign to yourself if you intend to continue working on resolving this, thanks).

Perhaps we should be adding back the more reliable pmie solution, as a safety net to counter this class of unexpected problem.
(In reply to Nathan Scott from comment #1)
> > [...] but needs more thought & probably proper QA:
>
> Yes, and certainly the latter - is this in-progress, or are you expecting
> someone else to work on fixing this? [...]

It is a bug in reviewed, merged, shipped PCP code. Like any community contributor, I am expecting PCP maintainers to take the initiative in fixing bugs. As a courtesy, I may have time to help further polish the above fix, but it would be inappropriate to consider that my responsibility.

> Perhaps we should be adding back the more reliable pmie solution, as a
> safety net to counter this class of unexpected problem.

It was more reliable in some ways and it was proven harmful in others. No faultless solution has so far made an appearance.
(In reply to Frank Ch. Eigler from comment #2)
> As a courtesy, I may have time to help further polish the above fix,
> but it would be inappropriate to consider that my responsibility.

I'm simply asking "are you going to fix it", so no one doubles up on the work. Sounds like that's a definite maybe then?

> > Perhaps we should be adding back the more reliable pmie solution, as a
> > safety net to counter this class of unexpected problem.
>
> It was more reliable in some ways and it was proven harmful in others.

It is clearly more reliable, and the perceived issues were just idle speculation that didn't stand up to scrutiny.

If no one gets around to tackling this regression in the next release timeframe, we can just add back the pmie rule so folk at least have that fail-safe mechanism available. In fact, hmm, maybe that's the right permanent fix here - then we don't have to worry about this class of problem in the future.
> > It was more reliable in some ways and it was proven harmful in others.
>
> It is clearly more reliable, and the perceived issues were just idle
> speculation that didn't stand up to scrutiny.

This is an unfair and inaccurate characterization. The pmie-based machinery simply does not work remotely, and does harm by misdirecting signals to the central pmcd. Even when running locally, it imposes new load on its pmcd, and more so if one considers the rest of the default pmie configuration. One may quibble about the exact degrees of harm, but this is all indisputable (and observed).
This bug appears to have been reported against 'rawhide' during the Fedora 25 development cycle. Changing version to '25'.
(In reply to Frank Ch. Eigler from comment #4)
> [...] pmie based
> machinery simply does not work remotely, and harms by misdirecting
> signals to the central pmcd.

(the local mode - i.e. "primary" - of pmie operation was introduced some time back, resolving these aspects)

> Even if running locally,
> imposes new load on its pmcd, and more so if one considers the
> other pmie default configuration. One may quibble about the exact
> degrees of harm, but this is all indisputable (and observed).

FWLIW, no measurements were presented to show this perceived load, and no measurable impact is realistically expected: the kernel metrics fetched with the default pmie rules are a/ very few, b/ very cheap to sample and c/ infrequently sampled. So there really is no expectation of problems from using a local-mode pmie to provide the on-going verification for missed PMDA restarts too.

Back to the original problem - this BZ is not seen as a high priority (esp. with the pmie solution not being affected), and no one in the RH PCP team is planning to hack on this corner case. Hence, I'll reassign this one to you for now, Frank, as the author of the affected code. If this is not something you plan to hack on, please mark this one as WONTFIX and we'll move on. Thanks!
This message is a reminder that Fedora 25 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '25'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 25 reaches end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle. Changing version to '28'.
pmie(1) already offers a robust method of restarting PMDAs as described earlier, and as automated by pcp-zeroconf. No further work is planned on this issue.