Bug 1185760 - Default pmlogger config depends on pmcd but doesn't ensure it is running
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: pcp
Version: 21
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Nathan Scott
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1185740
Reported: 2015-01-26 08:53 UTC by Marius Vollmer
Modified: 2015-08-22 16:33 UTC (History)

Fixed In Version: 3.10.6-1.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-08-13 16:57:06 UTC
Type: Bug
Embargoed:



Description Marius Vollmer 2015-01-26 08:53:14 UTC
This is a guess from just looking at the files.

By default, "systemctl start pmlogger" will collect samples from pmcd running on the same machine.  However, I can't see anything in the systemd unit files that would make sure that pmcd is running and accepting connections before pmlogger tries to contact it.

If true, pmlogger might fail to start randomly during boot, depending on who wins the race.

Comment 1 Nathan Scott 2015-01-28 04:29:14 UTC
See comment #c1 in bz 1185764 - pmlogger deployments can be complex and do not necessarily actually depend on local pmcd service.

Comment 2 Marius Vollmer 2015-01-28 08:21:01 UTC
> pmlogger deployments can be complex and do not necessarily actually depend on local pmcd service.

But if they do, there is a race during boot between pmcd and pmlogger, right?

And if pmlogger loses it, a cron job will try to restart it not more than 30 minutes later, right?

I think it is OK to explicitly enable pmcd if your pmlogger configuration depends on it, but pmlogger.service should probably get an After=pmcd.service option.

Comment 3 Nathan Scott 2015-01-28 10:42:30 UTC
> --- Comment #2 from Marius Vollmer <mvollmer> ---
> > pmlogger deployments can be complex and do not necessarily actually depend
> > on local pmcd service.
> 
> But if they do, there is a race during boot between pmcd and pmlogger, right?

*nod*

> And if pmlogger loses it, a cron job will try to restart it not more than 30
> minutes later, right?

That's correct.  In the future I'm thinking we can drop that 30 minutes to
effectively zero via auto-reconnect, but for now it's up to half an hour,
yes.

> I think it is OK to explicitly enable pmcd if your pmlogger configuration
> depends on it, but pmlogger.service should probably get a After=pmcd.service
> option.

Yeah, as long as that doesn't mean it *requires* pmcd to start (we'll need
to verify that one, otherwise pmlogger start might hang - dya know?) - should be fine I expect.

cheers.

--
Nathan

Comment 4 Marius Vollmer 2015-01-28 11:19:46 UTC
> Yeah, as long as that doesn't mean it *requires* pmcd to start 

That is my understanding.
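
For reference, the ordering-only dependency discussed in comments 2 to 4 could be expressed as a systemd drop-in instead of editing pmlogger.service directly. This is a minimal sketch, not the change that shipped: the write_dropin helper and the file name after-pmcd.conf are hypothetical. After= only orders the two units when both are being started; it does not pull pmcd in the way Requires= or Wants= would.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that orders pmlogger after pmcd.
# After= affects ordering only, so pmlogger can still be started on a
# host where pmcd is disabled or not installed.

# write_dropin DIR: create DIR/pmlogger.service.d/after-pmcd.conf.
# The file name after-pmcd.conf is a hypothetical choice.
write_dropin() {
    dropin_dir="$1/pmlogger.service.d"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/after-pmcd.conf" <<'EOF'
[Unit]
After=pmcd.service
EOF
}

# Real use (as root), followed by a reload so systemd sees the change:
#   write_dropin /etc/systemd/system && systemctl daemon-reload
```

Writing to a directory passed as an argument keeps the sketch safe to try in a scratch location before touching /etc/systemd/system.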

Comment 5 Marius Vollmer 2015-02-24 14:07:05 UTC
Here is one way to test this.

- Add a "sleep 5" at the top of /usr/share/pcp/lib/pmcd.
- systemctl enable pmlogger
- systemctl stop pmcd pmlogger
- systemctl start pmcd pmlogger

This should cause pmlogger to fail.

Adding "After=pmcd.service" to pmlogger.service will make this work again.
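
The steps above can be sketched as a script. This is an illustration only: the run and reproduce helpers and the DRY_RUN switch are hypothetical, the sed invocation assumes GNU sed and that line 1 of /usr/share/pcp/lib/pmcd is the interpreter line, and the final is-active check is an addition for inspecting the result. By default nothing is executed; the commands are only printed.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps. By default every command is
# only printed; set DRY_RUN=0 and run as root to execute them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    # Delay pmcd startup by inserting "sleep 5" after the first line.
    run sed -i '1a sleep 5' /usr/share/pcp/lib/pmcd
    run systemctl enable pmlogger
    run systemctl stop pmcd pmlogger
    run systemctl start pmcd pmlogger
    # Added check: show whether pmlogger survived the start.
    run systemctl is-active pmlogger
}

reproduce
```

Remember to remove the inserted "sleep 5" line again after a real run.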

Comment 6 Frank Ch. Eigler 2015-02-24 15:08:38 UTC
Unfortunately, even "After=pmcd.service" cannot entirely solve this
race condition.  The pmcd process might not fully initialize by the
time that a subsequently-started pmlogger might start looking for it.

http://ewontfix.com/15/

Comment 7 Marius Vollmer 2015-02-25 07:32:16 UTC
(In reply to Frank Ch. Eigler from comment #6)
> Unfortunately, even "After=pmcd.service" cannot entirely solve this
> race condition.  The pmcd process might not fully initialize by the
> time that a subsequently-started pmlogger might start looking for it.

So "/usr/share/pcp/lib/pmcd start" returns before pmcd is ready to accept connections?  Let's fix that, too, then.  Maybe using socket activation for pmcd would be the best option.  What do you think?

> http://ewontfix.com/15/

FUD

Comment 8 Marius Vollmer 2015-02-25 07:36:13 UTC
(In reply to Marius Vollmer from comment #7)

> Maybe using socket activation for
> pmcd would be the best option.  What do you think?

Or the auto-reconnect feature for pmlogger?  Would that transparently apply to all users of PM_CONTEXT_HOST?

Comment 9 Marius Vollmer 2015-02-25 12:18:25 UTC
(In reply to Marius Vollmer from comment #7)

> So "/usr/share/pcp/lib/pmcd start" returns before pmcd is ready to accept
> connections?

I see that it uses pmcd_wait, so I would assume that it does indeed wait until pmcd is ready to accept connections before returning.

Am I confused? Could you clarify?
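
To illustrate what "wait until pmcd is ready" could look like from the client side, here is a generic retry helper in POSIX sh. It is a sketch under assumptions, not how pmcd_wait or the rc scripts are implemented: the retry function is hypothetical, and the pminfo probe shown in the trailing comment is only one possible readiness check.

```shell
#!/bin/sh
# retry TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES
# times, sleeping DELAY seconds between attempts.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    tries="$1"; delay="$2"; shift 2
    attempt=1
    while [ "$attempt" -le "$tries" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Do not sleep after the final failed attempt.
        [ "$attempt" -le "$tries" ] && sleep "$delay"
    done
    return 1
}

# Example probe (an assumption, not taken from the PCP scripts): ask the
# local pmcd for a metric and treat success as "ready".
#   retry 30 1 pminfo -h localhost pmcd.version >/dev/null 2>&1
```

A bounded retry like this narrows the startup race but does not cover a pmcd that restarts later, which is why auto-reconnect in the clients keeps coming up in this bug.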

Comment 10 Frank Ch. Eigler 2015-02-25 15:24:13 UTC
Marius, I don't see pmcd_wait being invoked during a
    sh -x /usr/share/pcp/lib/pmcd start
run, but I might just be missing it.  It sounds like a
reasonable addition to that script, after around line 489.

There is a bit of deferred computation after that point,
involving additional .NeedInstall PMDAs, which can cause
momentary stoppage/restarting of pmcd.  That too could 
leave pmlogger momentarily out of luck.  Perhaps that
_pmda_setup& stuff should be foregrounded, and then
followed by _start_pmcheck.  That looks like a pretty 
solid promise that after "service pmcd start", it'll
stay up awhile.
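
The ordering suggested here can be sketched with stand-in functions. The bodies of _pmda_setup and _start_pmcheck below are placeholders, not the real rc_pmcd code, and start_backgrounded / start_foregrounded are hypothetical names used only to contrast the two variants.

```shell
#!/bin/sh
# Stand-ins for the real rc_pmcd helpers; the bodies are hypothetical.
_pmda_setup()    { echo "pmda setup done"; }
_start_pmcheck() { echo "pmcd verified ready"; }

# Backgrounded variant: the start action can return while PMDA
# installation is still running, so a client may catch pmcd restarting.
start_backgrounded() {
    _pmda_setup &
    echo "start returned"
    wait    # only here so the sketch does not leave a stray job behind
}

# Foregrounded variant: setup finishes, readiness is checked, and only
# then does the start action return.
start_foregrounded() {
    _pmda_setup
    _start_pmcheck
    echo "start returned"
}

start_foregrounded
```

The trade-off is startup latency: the foregrounded variant makes "start" slower when .NeedInstall PMDAs are present, in exchange for a stronger guarantee once it returns.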

Comment 11 Marius Vollmer 2015-02-25 20:12:38 UTC
(In reply to Frank Ch. Eigler from comment #10)
> Marius, I don't see pmcd_wait being invoked during a
>     sh -x /usr/share/pcp/lib/pmcd start
> run, but I might just be missing it.

True, pmcd_wait is called by the _start_pmcheck function, which in turn is never called. This seems to be a regression introduced when splitting rc_pcp into rc_pmcd and rc_pmlogger, 855ca1137a.

Comment 12 Marius Vollmer 2015-02-26 08:08:55 UTC
(In reply to Frank Ch. Eigler from comment #10)

> There is a bit of deferred computation after that point,
> involving additional .NeedInstall PMDAs, which can cause
> momentary stoppage/restarting of pmcd.

This is an odd feature. Can we ignore and deprecate it and blame all breakage that it causes on the PMDAs that make use of it?

> That too could leave pmlogger momentarily out of luck.

A good fix would also be to make pmlogger, pmie, and maybe all users of PM_CONTEXT_HOST robust against loss of connection to pmcd.  This would remove the need to synchronize during startup as well.

> Perhaps that
> _pmda_setup& stuff should be foregrounded, and then
> followed by _start_pmcheck.

It was only recently backgrounded:

commit 739fdda0cb46c67812e3bbf5cf99c51e83f5d80c
Author: Nathan Scott <nathans>
Date:   Mon Dec 8 16:36:56 2014 +1100

    rc_pmcd: execute _pmda_setup in the background
    
    Amer reports that use of the .NeedInstall mechanism for
    PMDAs introduces longer image startup times due to the
    rc_pmcd script performing PMDA installation serially.
    There's no reason for that - this processing can be done
    in the background as soon as pmcd has started (just like
    we do with pmloggers in the rc_pmlogger script already).
    
    Test qa/300 is tweaked to give a little more time before
    verifying the .NeedInstall processing has / has not been
    done.

My first reaction is to say that the .NeedInstall processing shouldn't be done at all via rc_pmcd.  Is it a hack to get around some package system limitations?

Comment 13 Fedora Update System 2015-06-16 02:12:34 UTC
pcp-3.10.5-1.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/pcp-3.10.5-1.fc22

Comment 14 Fedora Update System 2015-06-16 02:13:46 UTC
pcp-3.10.5-1.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/pcp-3.10.5-1.fc21

Comment 15 Fedora Update System 2015-06-16 02:14:39 UTC
pcp-3.10.5-1.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/pcp-3.10.5-1.fc20

Comment 16 Fedora Update System 2015-06-16 02:16:41 UTC
pcp-3.10.5-1.el5 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/pcp-3.10.5-1.el5

Comment 17 Fedora Update System 2015-06-18 18:34:45 UTC
Package pcp-3.10.5-1.el5:
* should fix your issue,
* was pushed to the Fedora EPEL 5 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing pcp-3.10.5-1.el5'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2015-6718/pcp-3.10.5-1.el5
then log in and leave karma (feedback).

Comment 18 Fedora Update System 2015-08-04 05:37:49 UTC
pcp-3.10.6-1.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/pcp-3.10.6-1.fc22

Comment 19 Fedora Update System 2015-08-04 05:38:33 UTC
pcp-3.10.6-1.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/pcp-3.10.6-1.fc21

Comment 20 Fedora Update System 2015-08-04 05:39:16 UTC
pcp-3.10.6-1.el5 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/pcp-3.10.6-1.el5

Comment 21 Fedora Update System 2015-08-13 16:57:06 UTC
pcp-3.10.6-1.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 22 Fedora Update System 2015-08-13 16:58:43 UTC
pcp-3.10.6-1.fc22 has been pushed to the Fedora 22 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 23 Fedora Update System 2015-08-22 16:33:34 UTC
pcp-3.10.6-1.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.

