Bug 1334815 - pmcd pmda auto-restart fails if failure encountered during restart
Status: NEW
Product: Fedora
Classification: Fedora
Component: pcp
Version: 25
Severity: unspecified
Assigned To: pcp-maint
QA Contact: Fedora Extras Quality Assurance
Depends On: 1323521
Reported: 2016-05-10 11:03 EDT by Frank Ch. Eigler
Modified: 2017-08-23 11:14 EDT
Doc Type: Bug Fix
Type: Bug
Attachments: None
Description Frank Ch. Eigler 2016-05-10 11:03:44 EDT
Picture this.  pcp 3.11.2, happily steaming along, until one of its pmdas (usually proc or linux) times out.  The pcp 3.11.2 pmcd code responds by restarting the pmda.  Normally that's fine, but what if the restart itself fails, due to another timeout right then?  Then pmcd is unaware, and its auto-restart logic doesn't trigger again until the indefinite future (since the AgentDied flag has already been cleared).  This has been observed in the wild.

One possible cure is this patch, which passes hand-testing (running a tight killall -9 or -STOP loop against a target pmda), but needs more thought & probably proper QA:


diff --git a/src/pmcd/src/config.c b/src/pmcd/src/config.c
index 04d9db8bdb4f..ef92ce3230c0 100644
--- a/src/pmcd/src/config.c
+++ b/src/pmcd/src/config.c
@@ -1532,6 +1532,8 @@ AgentNegotiate(AgentInfo *aPtr)
     else
        fprintf(stderr, "pmcd: error at initial PDU exchange with "
                "%s PMDA: %s\n", aPtr->pmDomainLabel, pmErrStr(sts));
+
+    AgentDied = 1; /* signal to request auto-restart */
     return PM_ERR_IPC;
 }
Comment 1 Nathan Scott 2016-05-16 19:16:51 EDT
(In reply to Frank Ch. Eigler from comment #0)
> [...]
> restarting the pmda.  Normally that's fine, but what if the restart fails,
> by another timeout right then?

pmie eventually notices and performs the restart more reliably?  (*cough*)

> [...] but needs more thought & probably proper QA:

Yes, and certainly the latter - is this in-progress, or are you expecting someone else to work on fixing this?

This is a regression that was introduced recently when moving away from the pmie-based solution (at your insistence), but you seem to have left this BZ assigned to me ... (as default owner?  Hence this followup - please assign this to yourself if you intend to continue working on resolving it, thanks).

Perhaps we should be adding back the more reliable pmie solution, as a safety net to counter this class of unexpected problem.
Comment 2 Frank Ch. Eigler 2016-05-17 11:15:55 EDT
(In reply to Nathan Scott from comment #1)
> > [...] but needs more thought & probably proper QA:
> 
> Yes, and certainly the latter - is this in-progress, or are you expecting
> someone else to work on fixing this?  [...]

It is a bug in reviewed, merged, shipped PCP code.  Like any
community contributor, I am expecting PCP maintainers to take
the initiative in fixing bugs.  As a courtesy, I may have time
to help further polish the above fix, but it would be inappropriate
to consider that my responsibility.

> Perhaps we should be adding back the more reliable pmie solution, as a
> safety net to counter this class of unexpected problem.

It was more reliable in some ways and it was proven harmful in others.
No faultless solution has so far made an appearance.
Comment 3 Nathan Scott 2016-05-17 18:54:15 EDT
(In reply to Frank Ch. Eigler from comment #2)
>  As a courtesy, I may have time to help further polish the above fix,
> but it would be inappropriate to consider that my responsibility.

I'm simply asking "are you going to fix it", so no one doubles up on the work.
Sounds like that's a definite maybe then?

> > Perhaps we should be adding back the more reliable pmie solution, as a
> > safety net to counter this class of unexpected problem.
> 
> It was more reliable in some ways and it was proven harmful in others.

It is clearly more reliable, and the perceived issues were just idle speculation that didn't stand up to scrutiny.

If no one gets around to tackling this regression in the next release timeframe, we can just add back the pmie rule so folk at least have that fail-safe mechanism available.  In fact, hmm, maybe that's the right permanent fix here - then we don't have to worry about this class of problem in the future.
Comment 4 Frank Ch. Eigler 2016-05-25 13:40:32 EDT
> > It was more reliable in some ways and it was proven harmful in others.
> 
> It is clearly more reliable, and the perceived issues were just idle
> speculation that didn't stand up to scrutiny.

This is an unfair and inaccurate characterization.  The pmie-based
machinery simply does not work remotely, and does harm by misdirecting
signals to the central pmcd.  Even when running locally, it imposes
new load on its pmcd, and more so if one considers the other pmie
default configuration.  One may quibble about the exact degrees of
harm, but this is all indisputable (and observed).
Comment 5 Jan Kurik 2016-07-26 00:50:01 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 25 development cycle.
Changing version to '25'.
Comment 6 Nathan Scott 2016-09-26 02:11:36 EDT
(In reply to Frank Ch. Eigler from comment #4)
> [...] pmie based
> machinery simply does not work remotely, and harms by misdirecting
> signals to the central pmcd.

(the local mode of pmie operation - i.e. "primary" - was introduced some time back, resolving these aspects)

>  Even if running running locally,
> imposes new load on its pmcd, and more so if one considers the
> other pmie default configuration.  One may quibble about the exact
> degrees of harm, but this is all indisputable (and observed).

FWLIW, measurements weren't presented to show this perceived load, and since no measurable impact is realistically expected (the kernel metrics fetched with the default pmie rules are a/ very few, b/ very cheap to sample and c/ infrequently sampled) ... there really is no expectation of problems from using a local mode pmie to provide the on-going verification for missed PMDA restarts too.


Back to the original problem - this BZ is not seen as a high priority (esp. with the pmie solution not being affected), and no one in the RH PCP team is planning to hack on this corner case.  Hence, I'll reassign this one to you for now, Frank, as the author of the affected code.  If this is not something you plan to hack on, please mark this one as WONTFIX and we'll move on.  Thanks!
