Bug 1185760
| Summary: | Default pmlogger config depends on pmcd but doesn't ensure it is running | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Marius Vollmer <mvollmer> |
| Component: | pcp | Assignee: | Nathan Scott <nathans> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Priority: | unspecified |
| Version: | 21 | CC: | brolley, fche, lberk, mgoodwin, nathans, pcp, scox |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | 3.10.6-1.el5 | Doc Type: | Bug Fix |
| Last Closed: | 2015-08-13 16:57:06 UTC | Type: | Bug |
| Bug Blocks: | 1185740 | | |
Description
Marius Vollmer
2015-01-26 08:53:14 UTC
See comment #c1 in bz 1185764 - pmlogger deployments can be complex and do not necessarily actually depend on local pmcd service.

> pmlogger deployments can be complex and do not necessarily actually depend on local pmcd service.

But if they do, there is a race during boot between pmcd and pmlogger, right?
And if pmlogger loses it, a cron job will try to restart it not more than 30 minutes later, right?
I think it is OK to explicitly enable pmcd if your pmlogger configuration depends on it, but pmlogger.service should probably get an After=pmcd.service option.

> --- Comment #2 from Marius Vollmer <mvollmer> ---
> > pmlogger deployments can be complex and do not necessarily actually depend
> > on local pmcd service.
>
> But if they do, there is a race during boot between pmcd and pmlogger, right?

*nod*

> And if pmlogger loses it, a cron job will try to restart it not more than 30
> minutes later, right?

That's correct. In the future I'm thinking we can drop that 30mins to effectively zero via auto-reconnect, but for now it's up to half an hour, yes.

> I think it is OK to explicitly enable pmcd if your pmlogger configuration
> depends on it, but pmlogger.service should probably get a After=pmcd.service
> option.

Yeah, as long as that doesn't mean it *requires* pmcd to start (we'll need to verify that one, otherwise pmlogger start might hang - dya know?) - should be fine I expect.

cheers.

--
Nathan

> Yeah, as long as that doesn't mean it *requires* pmcd to start
That is my understanding.
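
The ordering-only relationship discussed above could be tried locally with a systemd drop-in rather than by editing the packaged unit file. The following is a minimal sketch, not the change that was eventually shipped: the function name and the drop-in file name are hypothetical, and the directory is passed in so the helper can be exercised without root. In systemd, After= only orders start-up when both units are being started; it does not pull pmcd in or make pmlogger fail without it (that would be Requires=).

```shell
# Hypothetical helper (not part of PCP): write a systemd drop-in that
# orders pmlogger after pmcd without adding a hard dependency.
write_after_pmcd_dropin() {
    dropin_dir=$1
    # Create the drop-in directory if it does not exist yet.
    mkdir -p "$dropin_dir" || return 1
    # After= affects ordering only; no Requires= or Wants= is added.
    cat > "$dropin_dir/10-after-pmcd.conf" <<'EOF'
[Unit]
After=pmcd.service
EOF
}
```

On a real system this would be called as root with `/etc/systemd/system/pmlogger.service.d` as the directory, followed by `systemctl daemon-reload` so that systemd picks up the new ordering.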
Here is one way to test this.

- Add a "sleep 5" at the top of /usr/share/pcp/lib/pmcd.
- systemctl enable pmlogger
- systemctl stop pmcd pmlogger
- systemctl start pmcd pmlogger

This should cause pmlogger to fail. Adding "After=pmcd.service" to pmlogger.service will make this work again.

Unfortunately, even "After=pmcd.service" cannot entirely solve this race condition. The pmcd process might not fully initialize by the time that a subsequently-started pmlogger might start looking for it.

http://ewontfix.com/15/

(In reply to Frank Ch. Eigler from comment #6)
> Unfortunately, even "After=pmcd.service" cannot entirely solve this
> race condition. The pmcd process might not fully initialize by the
> time that a subsequently-started pmlogger might start looking for it.

So "/usr/share/pcp/lib/pmcd start" returns before pmcd is ready to accept connections? Let's fix that, too, then. Maybe using socket activation for pmcd would be the best option. What do you think?

> http://ewontfix.com/15/

FUD

(In reply to Marius Vollmer from comment #7)
> Maybe using socket activation for
> pmcd would be the best option. What do you think?

Or the auto-reconnect feature for pmlogger? Would that transparently apply to all users of PM_CONTEXT_HOST?

(In reply to Marius Vollmer from comment #7)
> So "/usr/share/pcp/lib/pmcd start" returns before pmcd is ready to accept
> connections?

I see that it uses pmcd_wait, so I would assume that it does indeed wait until pmcd is ready to accept connections before returning. Am I confused? Could you clarify?

Marius, I don't see pmcd_wait being invoked during a
  sh -x /usr/share/pcp/lib/pmcd start
run, but I might just be missing it. It sounds like a reasonable addition to that script, after around line 489.

There is a bit of deferred computation after that point, involving additional .NeedInstall PMDAs, which can cause momentary stoppage/restarting of pmcd. That too could leave pmlogger momentarily out of luck.
Perhaps that _pmda_setup& stuff should be foregrounded, and then followed by _start_pmcheck. That looks like a pretty solid promise that after "service pmcd start", it'll stay up awhile.

(In reply to Frank Ch. Eigler from comment #10)
> Marius, I don't see pmcd_wait being invoked during a
> sh -x /usr/share/pcp/lib/pmcd start
> run, but I might just be missing it.

True, pmcd_wait is called by the _start_pmcheck function, which in turn is never called. This seems to be a regression introduced when splitting rc_pcp into rc_pmcd and rc_pmlogger, 855ca1137a.

(In reply to Frank Ch. Eigler from comment #10)
> There is a bit of deferred computation after that point,
> involving additional .NeedInstall PMDAs, which can cause
> momentary stoppage/restarting of pmcd.

This is an odd feature. Can we ignore and deprecate it and blame all breakage that it causes on the PMDAs that make use of it?

> That too could leave pmlogger momentarily out of luck.

A good fix would also be to make pmlogger, pmie, and maybe all users of PM_CONTEXT_HOST robust against loss of connection to pmcd. This would remove the need to synchronize during startup as well.

> Perhaps that
> _pmda_setup& stuff should be foregrounded, and then
> followed by _start_pmcheck.

It was only recently backgrounded:

commit 739fdda0cb46c67812e3bbf5cf99c51e83f5d80c
Author: Nathan Scott <nathans>
Date: Mon Dec 8 16:36:56 2014 +1100

    rc_pmcd: execute _pmda_setup in the background

    Amer reports that use of the .NeedInstall mechanism for PMDAs
    introduces longer image startup times due to the rc_pmcd script
    performing PMDA installation serially. There's no reason for
    that - this processing can be done in the background as soon as
    pmcd has started (just like we do with pmloggers in the
    rc_pmlogger script already).

    Test qa/300 is tweaked to give a little more time before
    verifying the .NeedInstall processing has / has not been done.

My first reaction is to say that the .NeedInstall processing shouldn't be done at all via rc_pmcd.
Is it a hack to get around some package system limitations?

pcp-3.10.5-1.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/pcp-3.10.5-1.fc22

pcp-3.10.5-1.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/pcp-3.10.5-1.fc21

pcp-3.10.5-1.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/pcp-3.10.5-1.fc20

pcp-3.10.5-1.el5 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/pcp-3.10.5-1.el5

Package pcp-3.10.5-1.el5:
* should fix your issue,
* was pushed to the Fedora EPEL 5 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing pcp-3.10.5-1.el5'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2015-6718/pcp-3.10.5-1.el5
then log in and leave karma (feedback).

pcp-3.10.6-1.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/pcp-3.10.6-1.fc22

pcp-3.10.6-1.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/pcp-3.10.6-1.fc21

pcp-3.10.6-1.el5 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/pcp-3.10.6-1.el5

pcp-3.10.6-1.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.

pcp-3.10.6-1.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.

pcp-3.10.6-1.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.
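
For readers who want to work around the readiness race raised in comments #6 to #10 on an unfixed version, one option is to poll until pmcd answers before starting pmlogger. The following is a minimal sketch under stated assumptions, not the fix that went into PCP: retry_until is a hypothetical helper, and the readiness probe is whatever command the caller supplies.

```shell
# Hypothetical helper (not part of PCP): run a command repeatedly until it
# succeeds, giving up after a fixed number of attempts.
# Usage: retry_until TRIES DELAY_SECONDS COMMAND [ARGS...]
retry_until() {
    retry_tries=$1
    retry_delay=$2
    shift 2
    retry_attempt=1
    while [ "$retry_attempt" -le "$retry_tries" ]; do
        # Stop as soon as the probe command reports success.
        if "$@"; then
            return 0
        fi
        retry_attempt=$((retry_attempt + 1))
        # Sleep between attempts, but not after the final failure.
        if [ "$retry_attempt" -le "$retry_tries" ]; then
            sleep "$retry_delay"
        fi
    done
    return 1
}
```

A possible use would be `retry_until 30 1 pmcd_wait && systemctl start pmlogger`, assuming pmcd_wait is on the PATH and exits with status 0 once pmcd accepts connections. Making pmlogger itself reconnect, as suggested in the thread, would remove the need for this kind of start-up synchronization.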