Bug 1334815 - pmcd pmda auto-restart fails if failure encountered during restart
Summary: pmcd pmda auto-restart fails if failure encountered during restart
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: pcp
Version: 28
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: pcp-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 1323521
Blocks:
Reported: 2016-05-10 15:03 UTC by Frank Ch. Eigler
Modified: 2019-03-05 04:19 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-05 04:19:39 UTC
Type: Bug
Embargoed:


Description Frank Ch. Eigler 2016-05-10 15:03:44 UTC
Picture this.  pcp 3.11.2, happily steaming along, until one of its pmdas (usually proc or linux) times out.  The pcp 3.11.2 pmcd code responds by restarting the pmda.  Normally that's fine, but what if the restart fails, by another timeout right then?  Then pmcd is unaware, and its auto-restart logic doesn't trigger until the indefinite future (since the AgentDied flag is cleared).  This has been observed in the wild.

One possible cure is this patch, which passes hand-testing (running a tight killall -9 or -STOP loop against a target pmda), but needs more thought & probably proper QA:


diff --git a/src/pmcd/src/config.c b/src/pmcd/src/config.c
index 04d9db8bdb4f..ef92ce3230c0 100644
--- a/src/pmcd/src/config.c
+++ b/src/pmcd/src/config.c
@@ -1532,6 +1532,8 @@ AgentNegotiate(AgentInfo *aPtr)
     else
        fprintf(stderr, "pmcd: error at initial PDU exchange with "
                "%s PMDA: %s\n", aPtr->pmDomainLabel, pmErrStr(sts));
+
+    AgentDied = 1; /* signal to request auto-restart */
     return PM_ERR_IPC;
 }
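
The failure mode described above can be sketched as a toy model. This is not pmcd's actual code: only the AgentDied flag comes from the report, and agent_negotiate(), main_loop_pass(), simulate() and the other names are hypothetical stand-ins. It assumes the main loop clears the flag before attempting a restart, which is what the description implies.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; every name except AgentDied is hypothetical. */
static int AgentDied;         /* nonzero: the main loop should try a restart */
static bool agent_connected;  /* true once a restart handshake succeeds */
static int failures_left;     /* simulated handshakes that will still fail */

/* Simulated initial PDU exchange with the restarted agent. */
static int agent_negotiate(bool rearm_on_failure)
{
    if (failures_left > 0) {
        failures_left--;
        if (rearm_on_failure)
            AgentDied = 1;    /* mirrors the one-line patch above */
        return -1;            /* stands in for PM_ERR_IPC */
    }
    agent_connected = true;
    return 0;
}

/* One pass of a simplified main loop. */
static void main_loop_pass(bool rearm_on_failure)
{
    if (!AgentDied)
        return;               /* nothing flagged, so no restart is tried */
    AgentDied = 0;            /* assumed: flag is cleared before the attempt */
    (void)agent_negotiate(rearm_on_failure);
}

/* Returns true if the agent is connected after `passes` loop passes,
 * given that the first `restart_failures` handshakes fail. */
bool simulate(int restart_failures, bool rearm_on_failure, int passes)
{
    assert(passes > 0);
    AgentDied = 1;            /* the agent has just timed out */
    agent_connected = false;
    failures_left = restart_failures;
    for (int i = 0; i < passes; i++)
        main_loop_pass(rearm_on_failure);
    return agent_connected;
}
```

In this model, simulate(1, false, 5) leaves the agent down no matter how many passes run, because the single failed restart never sets the flag again, whereas simulate(1, true, 5) recovers on the second pass.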

Comment 1 Nathan Scott 2016-05-16 23:16:51 UTC
(In reply to Frank Ch. Eigler from comment #0)
> [...]
> restarting the pmda.  Normally that's fine, but what if the restart fails,
> by another timeout right then?

pmie eventually notices and performs the restart more reliably?  (*cough*)

> [...] but needs more thought & probably proper QA:

Yes, and certainly the latter - is this in-progress, or are you expecting someone else to work on fixing this?

This is a regression that was introduced recently when moving away from the pmie-based solution (at your insistence), but you seem to have left this BZ assigned to me ... (as default owner?  hence this followup - please assign to yourself if you intend to continue working on resolving this, thanks).

Perhaps we should be adding back the more reliable pmie solution, as a safety net to counter this class of unexpected problem.

Comment 2 Frank Ch. Eigler 2016-05-17 15:15:55 UTC
(In reply to Nathan Scott from comment #1)
> > [...] but needs more thought & probably proper QA:
> 
> Yes, and certainly the latter - is this in-progress, or are you expecting
> someone else to work on fixing this?  [...]

It is a bug in reviewed, merged, shipped PCP code.  Like any
community contributor, I am expecting PCP maintainers to take
the initiative in fixing bugs.  As a courtesy, I may have time
to help further polish the above fix, but it would be inappropriate
to consider that my responsibility.

> Perhaps we should be adding back the more reliable pmie solution, as a
> safety net to counter this class of unexpected problem.

It was more reliable in some ways and it was proven harmful in others.
No faultless solution has so far made an appearance.

Comment 3 Nathan Scott 2016-05-17 22:54:15 UTC
(In reply to Frank Ch. Eigler from comment #2)
>  As a courtesy, I may have time to help further polish the above fix,
> but it would be inappropriate to consider that my responsibility.

I'm simply asking "are you going to fix it", so no one doubles up on the work.
Sounds like that's a definite maybe then?

> > Perhaps we should be adding back the more reliable pmie solution, as a
> > safety net to counter this class of unexpected problem.
> 
> It was more reliable in some ways and it was proven harmful in others.

It is clearly more reliable, and the perceived issues were just idle speculation that didn't stand up to scrutiny.

If no one gets around to tackling this regression in the next release timeframe, we can just add back the pmie rule so folk at least have that fail-safe mechanism available.  In fact, hmm, maybe that's the right permanent fix here - then we don't have to worry about this class of problem in the future.

Comment 4 Frank Ch. Eigler 2016-05-25 17:40:32 UTC
> > It was more reliable in some ways and it was proven harmful in others.
> 
> It is clearly more reliable, and the perceived issues were just idle
> speculation that didn't stand up to scrutiny.

This is an unfair and inaccurate characterization.  The pmie based
machinery simply does not work remotely, and harms by misdirecting
signals to the central pmcd.  Even when running locally, it
imposes new load on its pmcd, and more so if one considers the
other pmie default configuration.  One may quibble about the exact
degrees of harm, but this is all indisputable (and observed).

Comment 5 Jan Kurik 2016-07-26 04:50:01 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 25 development cycle.
Changing version to '25'.

Comment 6 Nathan Scott 2016-09-26 06:11:36 UTC
(In reply to Frank Ch. Eigler from comment #4)
> [...] pmie based
> machinery simply does not work remotely, and harms by misdirecting
> signals to the central pmcd.

(The local mode of pmie operation, i.e. "primary", was introduced some time back and resolves these aspects.)

>  Even if running locally,
> imposes new load on its pmcd, and more so if one considers the
> other pmie default configuration.  One may quibble about the exact
> degrees of harm, but this is all indisputable (and observed).

FWLIW, measurements weren't presented to show this perceived load, and since no measurable impact is realistically expected (the kernel metrics fetched with the default pmie rules are a/ very few, b/ very cheap to sample and c/ infrequently sampled) ... there really is no expectation of problems from using a local mode pmie to provide the on-going verification for missed PMDA restarts too.


Back to the original problem - this BZ is not seen as a high priority (esp. with the pmie solution not being affected), and no one in the RH PCP team is planning to hack on this corner case.  Hence, I'll reassign this one to you for now, Frank, as the author of the affected code.  If this is not something you plan to hack on, please mark this one as WONTFIX and we'll move on.  Thanks!

Comment 7 Fedora End Of Life 2017-11-16 19:00:59 UTC
This message is a reminder that Fedora 25 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 25. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '25'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 25 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 8 Fedora End Of Life 2018-02-20 15:28:25 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle.
Changing version to '28'.

Comment 9 Nathan Scott 2019-03-05 04:19:39 UTC
pmie(1) already offers a robust method of restarting PMDAs as described earlier, and as automated by pcp-zeroconf.  No further work is planned on this issue.

