This is a guess from just looking at the files. By default, "systemctl start pmlogger" will collect samples from pmcd running on the same machine. However, I can't see anything in the systemd unit files that would make sure that pmcd is running and accepting connections before pmlogger tries to contact it. If true, pmlogger might fail to start randomly during boot, depending on who wins the race.
See comment #c1 in bz 1185764 - pmlogger deployments can be complex and do not necessarily actually depend on local pmcd service.
> pmlogger deployments can be complex and do not necessarily actually depend on local pmcd service.

But if they do, there is a race during boot between pmcd and pmlogger, right? And if pmlogger loses it, a cron job will try to restart it not more than 30 minutes later, right?

I think it is OK to explicitly enable pmcd if your pmlogger configuration depends on it, but pmlogger.service should probably get an After=pmcd.service option.
> --- Comment #2 from Marius Vollmer <mvollmer> ---
> > pmlogger deployments can be complex and do not necessarily actually depend
> > on local pmcd service.
>
> But if they do, there is a race during boot between pmcd and pmlogger, right?

*nod*

> And if pmlogger loses it, a cron job will try to restart it not more than 30
> minutes later, right?

That's correct. In the future I'm thinking we can drop that 30 minutes to effectively zero via auto-reconnect, but for now it's up to half an hour, yes.

> I think it is OK to explicitly enable pmcd if your pmlogger configuration
> depends on it, but pmlogger.service should probably get a After=pmcd.service
> option.

Yeah, as long as that doesn't mean it *requires* pmcd to start (we'll need to verify that one, otherwise pmlogger start might hang - d'ya know?) - should be fine I expect.

cheers.

--
Nathan
> Yeah, as long as that doesn't mean it *requires* pmcd to start

That is my understanding.
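For reference, systemd treats After= as an ordering directive only: it affects start order when both units are being started, and it does not pull the other unit in the way Requires= or Wants= would. A minimal sketch of adding the ordering as a local drop-in, rather than editing the packaged unit, might look like the following. The function name write_pmlogger_dropin and the file name after-pmcd.conf are made up for illustration; the default directory follows the usual systemd drop-in layout.

```shell
#!/bin/sh
# Sketch: write an ordering-only drop-in for pmlogger.service.
# write_pmlogger_dropin and after-pmcd.conf are hypothetical names.
# $1 is the drop-in directory; the default is the usual systemd location.
write_pmlogger_dropin() {
    dropin_dir="${1:-/etc/systemd/system/pmlogger.service.d}"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/after-pmcd.conf" <<'EOF'
[Unit]
# Ordering only: start pmlogger after pmcd when both are being started.
# No Requires= or Wants= here, so pmlogger does not pull pmcd in.
After=pmcd.service
EOF
}
```

After calling write_pmlogger_dropin as root, "systemctl daemon-reload" would be needed for systemd to pick up the change.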
Here is one way to test this.

- Add a "sleep 5" at the top of /usr/share/pcp/lib/pmcd.
- systemctl enable pmlogger
- systemctl stop pmcd pmlogger
- systemctl start pmcd pmlogger

This should cause pmlogger to fail. Adding "After=pmcd.service" to pmlogger.service will make this work again.
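The steps above can be sketched as a small script. This is only an illustration: the helper names run and reproduce_race are made up, the "sleep 5" edit is assumed to have been made by hand beforehand, and the final "systemctl is-active" check is an addition for reporting the result, not part of the original steps.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Set DRY_RUN=1 to print the commands
# instead of running them (running them for real needs root).
# run and reproduce_race are hypothetical helper names.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce_race() {
    # Assumes "sleep 5" was already added by hand at the top of
    # /usr/share/pcp/lib/pmcd, as in the first step.
    run systemctl enable pmlogger
    run systemctl stop pmcd pmlogger
    run systemctl start pmcd pmlogger
    # Added for reporting only: exits non-zero if pmlogger is not active.
    run systemctl is-active pmlogger
}
```

Running it with DRY_RUN=1 first shows exactly which commands would be issued before anything touches the services.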
Unfortunately, even "After=pmcd.service" cannot entirely solve this race condition. The pmcd process might not fully initialize by the time that a subsequently-started pmlogger might start looking for it.

http://ewontfix.com/15/
(In reply to Frank Ch. Eigler from comment #6)
> Unfortunately, even "After=pmcd.service" cannot entirely solve this
> race condition. The pmcd process might not fully initialize by the
> time that a subsequently-started pmlogger might start looking for it.

So "/usr/share/pcp/lib/pmcd start" returns before pmcd is ready to accept connections? Let's fix that, too, then. Maybe using socket activation for pmcd would be the best option. What do you think?

> http://ewontfix.com/15/

FUD
(In reply to Marius Vollmer from comment #7)
> Maybe using socket activation for
> pmcd would be the best option. What do you think?

Or the auto-reconnect feature for pmlogger? Would that transparently apply to all users of PM_CONTEXT_HOST?
(In reply to Marius Vollmer from comment #7)
> So "/usr/share/pcp/lib/pmcd start" returns before pmcd is ready to accept
> connections?

I see that it uses pmcd_wait, so I would assume that it does indeed wait until pmcd is ready to accept connections before returning. Am I confused? Could you clarify?
Marius, I don't see pmcd_wait being invoked during a
    sh -x /usr/share/pcp/lib/pmcd start
run, but I might just be missing it. It sounds like a reasonable addition to that script, after around line 489.

There is a bit of deferred computation after that point, involving additional .NeedInstall PMDAs, which can cause momentary stoppage/restarting of pmcd. That too could leave pmlogger momentarily out of luck. Perhaps that _pmda_setup& stuff should be foregrounded, and then followed by _start_pmcheck. That looks like a pretty solid promise that after "service pmcd start", it'll stay up awhile.
(In reply to Frank Ch. Eigler from comment #10)
> Marius, I don't see pmcd_wait being invoked during a
>     sh -x /usr/share/pcp/lib/pmcd start
> run, but I might just be missing it.

True, pmcd_wait is called by the _start_pmcheck function, which in turn is never called. This seems to be a regression introduced when splitting rc_pcp into rc_pmcd and rc_pmlogger, 855ca1137a.
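To illustrate the kind of wait step that is missing here, the following is a generic retry helper, not the code from rc_pmcd. The name wait_until and its arguments are made up for this sketch; the readiness probe is passed in as a command, so nothing is assumed about pmcd_wait's own options.

```shell
#!/bin/sh
# wait_until TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES
# times, sleeping DELAY seconds between attempts. Returns 0 on the first
# success and 1 if every attempt fails. (Hypothetical helper.)
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        # Only sleep if another attempt will follow.
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

A start script could call something like "wait_until 30 1 pmcd_wait" before returning, assuming pmcd_wait is on PATH (it may live in a PCP-specific directory instead) and that its exit status reflects whether pmcd is accepting connections.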
(In reply to Frank Ch. Eigler from comment #10)
> There is a bit of deferred computation after that point,
> involving additional .NeedInstall PMDAs, which can cause
> momentary stoppage/restarting of pmcd.

This is an odd feature. Can we ignore and deprecate it and blame all breakage that it causes on the PMDAs that make use of it?

> That too could leave pmlogger momentarily out of luck.

A good fix would also be to make pmlogger, pmie, and maybe all users of PM_CONTEXT_HOST robust against loss of connection to pmcd. This would remove the need to synchronize during startup as well.

> Perhaps that
> _pmda_setup& stuff should be foregrounded, and then
> followed by _start_pmcheck.

It was only recently backgrounded:

commit 739fdda0cb46c67812e3bbf5cf99c51e83f5d80c
Author: Nathan Scott <nathans>
Date:   Mon Dec 8 16:36:56 2014 +1100

    rc_pmcd: execute _pmda_setup in the background

    Amer reports that use of the .NeedInstall mechanism for PMDAs
    introduces longer image startup times due to the rc_pmcd script
    performing PMDA installation serially.  There's no reason for
    that - this processing can be done in the background as soon as
    pmcd has started (just like we do with pmloggers in the
    rc_pmlogger script already).

    Test qa/300 is tweaked to give a little more time before
    verifying the .NeedInstall processing has / has not been done.

My first reaction is to say that the .NeedInstall processing shouldn't be done at all via rc_pmcd. Is it a hack to get around some package system limitations?
pcp-3.10.5-1.fc22 has been submitted as an update for Fedora 22. https://admin.fedoraproject.org/updates/pcp-3.10.5-1.fc22
pcp-3.10.5-1.fc21 has been submitted as an update for Fedora 21. https://admin.fedoraproject.org/updates/pcp-3.10.5-1.fc21
pcp-3.10.5-1.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/pcp-3.10.5-1.fc20
pcp-3.10.5-1.el5 has been submitted as an update for Fedora EPEL 5. https://admin.fedoraproject.org/updates/pcp-3.10.5-1.el5
Package pcp-3.10.5-1.el5:
* should fix your issue,
* was pushed to the Fedora EPEL 5 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing pcp-3.10.5-1.el5'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2015-6718/pcp-3.10.5-1.el5
then log in and leave karma (feedback).
pcp-3.10.6-1.fc22 has been submitted as an update for Fedora 22. https://admin.fedoraproject.org/updates/pcp-3.10.6-1.fc22
pcp-3.10.6-1.fc21 has been submitted as an update for Fedora 21. https://admin.fedoraproject.org/updates/pcp-3.10.6-1.fc21
pcp-3.10.6-1.el5 has been submitted as an update for Fedora EPEL 5. https://admin.fedoraproject.org/updates/pcp-3.10.6-1.el5
pcp-3.10.6-1.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.
pcp-3.10.6-1.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.
pcp-3.10.6-1.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.