Red Hat Bugzilla – Bug 1334815
pmcd pmda auto-restart fails if failure encountered during restart
Last modified: 2017-08-23 11:14:32 EDT
Picture this. pcp 3.11.2, happily steaming along, until one of its pmdas (usually proc or linux) times out. The pcp 3.11.2 pmcd code responds by restarting the pmda. Normally that's fine, but what if the restart fails, by another timeout right then? Then pmcd is unaware, and its auto-restart logic doesn't trigger until the indefinite future (since the AgentDied flag is cleared). This has been observed in the wild.

One possible cure is this patch, which passes hand-testing (running a tight killall -9 or -STOP loop against a target pmda), but needs more thought & probably proper QA:

diff --git a/src/pmcd/src/config.c b/src/pmcd/src/config.c
index 04d9db8bdb4f..ef92ce3230c0 100644
--- a/src/pmcd/src/config.c
+++ b/src/pmcd/src/config.c
@@ -1532,6 +1532,8 @@ AgentNegotiate(AgentInfo *aPtr)
     else
         fprintf(stderr, "pmcd: error at initial PDU exchange with "
                 "%s PMDA: %s\n", aPtr->pmDomainLabel, pmErrStr(sts));
+
+    AgentDied = 1; /* signal to request auto-restart */
     return PM_ERR_IPC;
 }
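To make the failure mode above concrete, here is a toy model of the flag handling. It is only a sketch, not pmcd code: ToyAgent, toy_negotiate, toy_loop_pass and toy_simulate are hypothetical names, agent_died stands in for the AgentDied flag, and it assumes the flag is cleared before the restart attempt, as the description above suggests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for pmcd's per-agent state; not the real AgentInfo. */
typedef struct {
    bool connected;   /* agent is currently usable */
    int  fail_next;   /* how many upcoming negotiations will time out */
} ToyAgent;

static int agent_died;   /* models the AgentDied flag */

/* Models a restart negotiation: 0 on success, -1 on failure.
 * With flag_on_failure set, a failed negotiation raises agent_died
 * again, which is the effect of the one-line patch. */
static int toy_negotiate(ToyAgent *a, bool flag_on_failure)
{
    if (a->fail_next > 0) {
        a->fail_next--;
        a->connected = false;
        if (flag_on_failure)
            agent_died = 1;
        return -1;
    }
    a->connected = true;
    return 0;
}

/* One pass of a simplified main loop: restart only if the flag is set.
 * Assumption: the flag is cleared before the restart is attempted. */
static void toy_loop_pass(ToyAgent *a, bool flag_on_failure)
{
    if (!agent_died)
        return;
    agent_died = 0;
    toy_negotiate(a, flag_on_failure);
}

/* The agent dies once, the next `failures` restarts time out, and the
 * loop runs `passes` times. Returns whether the agent ends up connected. */
bool toy_simulate(int failures, int passes, bool flag_on_failure)
{
    ToyAgent a = { false, failures };

    assert(failures >= 0 && passes >= 0);
    agent_died = 1;
    for (int i = 0; i < passes; i++)
        toy_loop_pass(&a, flag_on_failure);
    return a.connected;
}
```

In this model, toy_simulate(1, 5, false) leaves the agent disconnected however many passes follow, because nothing raises the flag again after the failed restart. toy_simulate(1, 5, true) recovers on the second pass. The model does not address how often a persistently failing agent would be retried, which is part of why the patch still needs thought and QA.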
(In reply to Frank Ch. Eigler from comment #0)
> [...]
> restarting the pmda. Normally that's fine, but what if the restart fails,
> by another timeout right then?

pmie eventually notices and performs the restart more reliably? (*cough*)

> [...] but needs more thought & probably proper QA:

Yes, and certainly the latter - is this in-progress, or are you expecting someone else to work on fixing this? This is a regression that was introduced recently when moving away from the pmie-based solution (at your insistence), but you seem to have left this BZ assigned to me ... (as default owner? hence this followup - please assign to yourself if you intend to continue working on resolving this, thanks).

Perhaps we should be adding back the more reliable pmie solution, as a safety net to counter this class of unexpected problem.
(In reply to Nathan Scott from comment #1)
> > [...] but needs more thought & probably proper QA:
>
> Yes, and certainly the latter - is this in-progress, or are you expecting
> someone else to work on fixing this? [...]

It is a bug in reviewed, merged, shipped PCP code. Like any community contributor, I am expecting PCP maintainers to take the initiative in fixing bugs. As a courtesy, I may have time to help further polish the above fix, but it would be inappropriate to consider that my responsibility.

> Perhaps we should be adding back the more reliable pmie solution, as a
> safety net to counter this class of unexpected problem.

It was more reliable in some ways and it was proven harmful in others. No faultless solution has so far made an appearance.
(In reply to Frank Ch. Eigler from comment #2)
> As a courtesy, I may have time to help further polish the above fix,
> but it would be inappropriate to consider that my responsibility.

I'm simply asking "are you going to fix it", so no one doubles up on the work. Sounds like that's a definite maybe then?

> > Perhaps we should be adding back the more reliable pmie solution, as a
> > safety net to counter this class of unexpected problem.
>
> It was more reliable in some ways and it was proven harmful in others.

It is clearly more reliable, and the perceived issues were just idle speculation that didn't stand up to scrutiny.

If no one gets around to tackling this regression in the next release timeframe, we can just add back the pmie rule so folk at least have that fail-safe mechanism available. In fact, hmm, maybe that's the right permanent fix here - then we don't have to worry about this class of problem in the future.
> > It was more reliable in some ways and it was proven harmful in others.
>
> It is clearly more reliable, and the perceived issues were just idle
> speculation that didn't stand up to scrutiny.

This is an unfair and inaccurate characterization. The pmie based machinery simply does not work remotely, and harms by misdirecting signals to the central pmcd. Even if running locally, it imposes new load on its pmcd, and more so if one considers the other pmie default configuration. One may quibble about the exact degrees of harm, but this is all indisputable (and observed).
This bug appears to have been reported against 'rawhide' during the Fedora 25 development cycle. Changing version to '25'.
(In reply to Frank Ch. Eigler from comment #4)
> [...] pmie based
> machinery simply does not work remotely, and harms by misdirecting
> signals to the central pmcd.

(the local mode - i.e. "primary" - of pmie operation was introduced resolving these aspects some time back)

> Even if running locally, it
> imposes new load on its pmcd, and more so if one considers the
> other pmie default configuration. One may quibble about the exact
> degrees of harm, but this is all indisputable (and observed).

FWLIW, measurements weren't presented to show this perceived load, and since no measurable impact is realistically expected (the kernel metrics fetched with the default pmie rules are a/ very few, b/ very cheap to sample and c/ infrequently sampled) ... there really is no expectation of problems from using a local mode pmie to provide the on-going verification for missed PMDA restarts too.

Back to the original problem - this BZ is not seen as a high priority (esp. with pmie solution not being affected), and no one in the RH PCP team is planning to hack on this corner case. Hence, I'll reassign this one to you for now, Frank, as the author of the affected code. If this is not something you plan to hack on, please mark this one as WONTFIX and we'll move on. Thanks!