Bug 1046131
Summary: pacemaker LSB agent resources do not have recurring monitor operations canceled correctly
Product: Red Hat Enterprise Linux 6
Component: pacemaker
Version: 6.6
Reporter: David Vossel <dvossel>
Assignee: David Vossel <dvossel>
QA Contact: Cluster QE <mspqa-list>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: ZStream
Target Milestone: rc
Hardware: All
OS: Unspecified
CC: cluster-maint, djansa, dvossel, fdinitto, jkortus, jruemker, jsvarova, tlavigne
Fixed In Version: pacemaker-1.1.10-15.el6
Doc Type: Bug Fix
Doc Text: Previously, LSB scripts managed by Pacemaker did not cancel their recurring monitor operations correctly when stopping. Consequently, monitor operations failed after resources had successfully stopped. This bug has been fixed, and recurring monitor operations are now correctly canceled before the LSB resource is stopped.
Last Closed: 2014-10-14 07:33:36 UTC
Type: Bug
Bug Blocks: 1052345, 1052346
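For context on the Doc Text above: an LSB resource's recurring monitor is the interval-based monitor operation configured on an `lsb:` resource. A minimal sketch of such a configuration, assuming the pcs CLI of that era and a hypothetical init script /etc/init.d/mydaemon (crm shell syntax differs):

```shell
# Hypothetical LSB resource backed by /etc/init.d/mydaemon.
# Pacemaker runs the script's "status" action every 30s as the
# recurring monitor; the fix described above ensures this monitor
# is canceled before the resource's stop action runs.
pcs resource create mydaemon-rsc lsb:mydaemon op monitor interval=30s

# Disabling (stopping) the resource should no longer leave a stale
# recurring monitor behind to fail after the stop completes.
pcs resource disable mydaemon-rsc
```

These commands require a running cluster, so they are shown only as a configuration sketch.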
Description (David Vossel, 2013-12-23 19:13:51 UTC):
This bug is well understood. There is already a fix upstream:
https://github.com/ClusterLabs/pacemaker/commit/1c14b9d69470ff56fd814091867394cd0a1cf61d

Comment 5, Andrew Beekhof:
David: Did you confirm the upstream patch fixed the issue?

Comment 6, Fabio Massimo Di Nitto:
(In reply to Andrew Beekhof from comment #5)
> David: Did you confirm the upstream patch fixed the issue?

Get me a build of 6.5 plus this patch done ASAP and I can test. By ASAP I mean today, because I will need to return the hardware where I can reproduce the issue by the end of this week.

Comment 7, David Vossel:
(In reply to Fabio Massimo Di Nitto from comment #6)
> Get me a build of 6.5 plus this patch done ASAP and I can test.

I made a scratch build you can test with:
https://brewweb.devel.redhat.com/taskinfo?taskID=6827114

-- Vossel

Comment 8, Fabio Massimo Di Nitto:
(In reply to David Vossel from comment #7)
> I made a scratch build you can test with:
> https://brewweb.devel.redhat.com/taskinfo?taskID=6827114

Unless you forgot to apply the patch in the build, or the patch is not correct, I am still seeing the issue, even on a service that was not disabled. (Report in attachment.)

Created attachment 848023 [details]
report from fab2 node
(In reply to Fabio Massimo Di Nitto from comment #8)
> Unless you forgot to apply the patch in the build, or the patch is not
> correct, I am still seeing the issue.

We talked on IRC, and I mentioned that the RPM I provided did not have the patch applied. Forget that; I was wrong. The RPM is fine.

Looking at your crm_report in detail, I don't actually see this issue occurring with the new RPMs. The crm_report definitely shows the issue occurring before you updated, but the failure after the update appears to be a real failure unrelated to these patches.

Pacemaker starts with the new RPMs at approximately 04:58:42. After that, the only monitor failure is for the 'heat-engine' resource. That is a real failure, not related to this cancel-recurring-op bug.
Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com crmd: notice: process_lrm_event: LRM operation heat-engine_start_0 (call=221, rc=0, cib-update=66, confirmed=true) ok
Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com crmd: notice: te_rsc_command: Initiating action 208: monitor heat-engine_monitor_60000 on fab2 (local)
Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com crmd: notice: process_lrm_event: LRM operation heat-engine_monitor_60000 (call=235, rc=0, cib-update=68, confirmed=false) ok
Jan 10 05:00:16 [8486] fab2.cloud.lab.eng.bos.redhat.com crmd: notice: process_lrm_event: LRM operation heat-engine_monitor_60000 (call=235, rc=7, cib-update=146, confirmed=false) not running
Jan 10 05:00:16 [8486] fab2.cloud.lab.eng.bos.redhat.com crmd: notice: process_lrm_event: fab2-heat-engine_monitor_60000:235 [ openstack-heat-engine dead but pid file exists\n ]
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com crmd: notice: te_rsc_command: Initiating action 11: stop heat-engine_stop_0 on fab2 (local)
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com crmd: notice: process_lrm_event: fab2-heat-engine_monitor_60000:235 [ openstack-heat-engine dead but pid file exists\n ]
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com crmd: notice: process_lrm_event: LRM operation heat-engine_stop_0 (call=477, rc=0, cib-update=150, confirmed=true) ok

Please re-test with this new RPM:
https://brewweb.devel.redhat.com/taskinfo?taskID=6843956

I have included another fix for cancelling recurring operations that is related to this issue. There are now two patches for this issue upstream:
https://github.com/ClusterLabs/pacemaker/commit/1c14b9d69470ff56fd814091867394cd0a1cf61d
https://github.com/davidvossel/pacemaker/commit/390e425667152277fb1ea541a42708b3c4d23a94

I can confirm that with those two patches applied the problem doesn't occur anymore. I have run start/stop iterations for over two hours with no sign of issues.
Generally I could spot the problem within a minute or two :)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1544.html
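As background on the rc=7 monitor results in the heat-engine log lines above: for lsb: resources, Pacemaker's recurring monitor simply runs the init script's status action and maps its LSB exit code to an OCF return code (non-running LSB status codes such as 3 surface as OCF rc 7, "not running"). A toy stand-in for that status semantic, with a hypothetical pid-file convention, not Pacemaker's actual code:

```shell
# Toy model of an LSB init script's "status" action (hypothetical
# pid-file layout; not Pacemaker code). A recurring monitor on an
# lsb: resource runs something like this and maps the exit code:
#   0 -> resource running (monitor rc=0)
#   3 -> resource not running (monitor rc=7, as in the logs above)
lsb_status() {
    pidfile="$1"
    if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        echo "running"
        return 0    # LSB: service is running
    else
        echo "stopped"
        return 3    # LSB: service is not running
    fi
}
```

A stale pid file whose process has exited still reports "stopped" here; a real init script would distinguish that case ("dead but pid file exists"), which is the message visible in the monitor log output above.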