Bug 1046131 - pacemaker LSB agent resources do not have recurring monitor operations canceled correctly.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: pacemaker
Version: 6.6
Hardware: All
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: David Vossel
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 1052345 1052346
 
Reported: 2013-12-23 19:13 UTC by David Vossel
Modified: 2014-10-14 07:33 UTC (History)

Fixed In Version: pacemaker-1.1.10-15.el6
Doc Type: Bug Fix
Doc Text:
Previously, lsb scripts managed by Pacemaker did not cancel their recurring monitor operations correctly when stopping. Consequently, monitor operations failed after resources had successfully stopped. This bug has been fixed, and recurring monitor operations are now correctly canceled before the lsb resource is stopped.
Clone Of:
Environment:
Last Closed: 2014-10-14 07:33:36 UTC
Target Upstream Version:


Attachments
report from fab2 node (190.66 KB, application/x-bzip)
2014-01-10 05:04 UTC, Fabio Massimo Di Nitto


Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 674333 | None | None | None | Never
Red Hat Product Errata | RHBA-2014:1544 | normal | SHIPPED_LIVE | pacemaker bug fix update | 2014-10-14 01:21:31 UTC

Description David Vossel 2013-12-23 19:13:51 UTC
Description of problem:

Disabled lsb resource agents do not have their recurring monitor actions cancelled correctly.

How reproducible:

About 50%. The failure is timing related. We understand how it happens, but it could be difficult to reproduce.

Steps to Reproduce:
1. Create an lsb resource with a recurring monitor operation.
2. Wait for the lsb resource to start in the cluster.
3. Disable the lsb resource.

Actual results:

The lsb resource stops, and then its monitor actions fail. The recurring monitor action should have been canceled when the lsb resource stopped.


Expected results:

The lsb agent stops and the recurring monitor operations are canceled.
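
To make the failure mode concrete, here is a minimal Python sketch of the race. It is a toy model, not Pacemaker code: `LsbResource`, `ToyExecutor`, and `run_scenario` are hypothetical names invented for illustration, and the only borrowed facts are the LSB status exit codes (0 for running, 3 for not running). Under those assumptions it shows that a recurring monitor left scheduled across a stop reports a failure, while canceling it before the stop does not.

```python
class LsbResource:
    """Toy stand-in for an LSB-managed service (hypothetical)."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def status(self):
        # LSB status exit codes: 0 = running, 3 = not running.
        return 0 if self.running else 3


class ToyExecutor:
    """Hypothetical executor holding one recurring monitor per resource."""

    def __init__(self, resource):
        self.resource = resource
        self.monitor_scheduled = False
        self.failures = []

    def start(self):
        self.resource.start()
        self.monitor_scheduled = True

    def tick(self):
        # Called once per monitor interval; only fires while scheduled.
        if self.monitor_scheduled and self.resource.status() != 0:
            self.failures.append("monitor: not running")

    def stop(self, cancel_monitor_first):
        # The behavior described in this report corresponds to
        # cancel_monitor_first=False; the expected behavior to True.
        if cancel_monitor_first:
            self.monitor_scheduled = False
        self.resource.stop()


def run_scenario(cancel_monitor_first):
    """Start, monitor once, stop, then let one more interval elapse."""
    executor = ToyExecutor(LsbResource())
    executor.start()
    executor.tick()
    executor.stop(cancel_monitor_first)
    executor.tick()
    return executor.failures
```

In this model `run_scenario(False)` records one spurious monitor failure after the stop, and `run_scenario(True)` records none.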

Comment 1 David Vossel 2013-12-23 19:14:51 UTC
This bug is well understood.  There is already a fix upstream.

https://github.com/ClusterLabs/pacemaker/commit/1c14b9d69470ff56fd814091867394cd0a1cf61d

Comment 5 Andrew Beekhof 2014-01-08 04:08:48 UTC
David: Did you confirm the upstream patch fixed the issue?

Comment 6 Fabio Massimo Di Nitto 2014-01-08 04:53:04 UTC
(In reply to Andrew Beekhof from comment #5)
> David: Did you confirm the upstream patch fixed the issue?

Get me a build done of 6.5 + this patch ASAP and I can test. By ASAP I mean today because I'll need to return the hardware where I can reproduce end of this week.

Comment 7 David Vossel 2014-01-08 16:16:59 UTC
(In reply to Fabio Massimo Di Nitto from comment #6)
> Get me a build done of 6.5 + this patch ASAP and I can test. By ASAP I mean
> today because I'll need to return the hardware where I can reproduce end of
> this week.

I made a scratch build you can test with.

https://brewweb.devel.redhat.com/taskinfo?taskID=6827114

-- Vossel

Comment 8 Fabio Massimo Di Nitto 2014-01-10 05:03:52 UTC
(In reply to David Vossel from comment #7)
> I made a scratch build you can test with.
> 
> https://brewweb.devel.redhat.com/taskinfo?taskID=6827114
> 
> -- Vossel

Unless you forgot to apply the patch in the build, or the patch is not correct, I am still seeing the issue. Even on a service that was not disabled.

(report in attachment)

Comment 9 Fabio Massimo Di Nitto 2014-01-10 05:04:45 UTC
Created attachment 848023 [details]
report from fab2 node

Comment 10 David Vossel 2014-01-10 23:21:33 UTC
(In reply to Fabio Massimo Di Nitto from comment #8)
> Unless you forgot to apply the patch in the build, or the patch is not
> correct, I am still seeing the issue. Even on a service that was not
> disabled.
> 
> (report in attachment)


We talked on IRC, where I mentioned that the RPM I provided did not have the patch applied. Forget that, I was wrong. The RPM is fine.

Looking at your crm_report in detail, I don't actually see this issue occurring with the new RPMs. The crm_report definitely shows the issue occurring before the update, but the failure after the update appears to be a real failure unrelated to these patches.

Pacemaker starts with the new RPMs at ~04:58:42. After that, the only monitor failure is for the 'heat-engine' resource. That's a real failure, unrelated to this cancel-recurring-op bug: the monitor reports "not running" at 05:00:16, before the stop is initiated at 05:00:17.

Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	LRM operation heat-engine_start_0 (call=221, rc=0, cib-update=66, confirmed=true) ok
Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: te_rsc_command: 	Initiating action 208: monitor heat-engine_monitor_60000 on fab2 (local)
Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	LRM operation heat-engine_monitor_60000 (call=235, rc=0, cib-update=68, confirmed=false) ok
Jan 10 05:00:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	LRM operation heat-engine_monitor_60000 (call=235, rc=7, cib-update=146, confirmed=false) not running
Jan 10 05:00:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	fab2-heat-engine_monitor_60000:235 [ openstack-heat-engine dead but pid file exists\n ]
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: te_rsc_command: 	Initiating action 11: stop heat-engine_stop_0 on fab2 (local)
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	fab2-heat-engine_monitor_60000:235 [ openstack-heat-engine dead but pid file exists\n ]
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	LRM operation heat-engine_stop_0 (call=477, rc=0, cib-update=150, confirmed=true) ok
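
The distinction drawn from the log above can be sketched in Python. This is only an illustration of the reasoning, not a Pacemaker tool: `classify_monitor_failures` and the two regular expressions are hypothetical helpers written against the crmd message formats visible in this excerpt, and they assume the log is in chronological order. A failed monitor seen after a stop was initiated for the same resource would point at this bug; a failed monitor seen while the resource was still expected to run is a real failure.

```python
import re

# Patterns assume the crmd message formats shown in the excerpt above.
OP_RE = re.compile(
    r"LRM operation (\S+)_(start|stop|monitor)_\d+ \(call=\d+, rc=(\d+),"
)
STOP_RE = re.compile(r"Initiating action \d+: stop (\S+)_stop_0 on ")

# A subset of the lines quoted above from the fab2 report.
LOG = """
Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: LRM operation heat-engine_start_0 (call=221, rc=0, cib-update=66, confirmed=true) ok
Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: LRM operation heat-engine_monitor_60000 (call=235, rc=0, cib-update=68, confirmed=false) ok
Jan 10 05:00:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: LRM operation heat-engine_monitor_60000 (call=235, rc=7, cib-update=146, confirmed=false) not running
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: te_rsc_command: Initiating action 11: stop heat-engine_stop_0 on fab2 (local)
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: LRM operation heat-engine_stop_0 (call=477, rc=0, cib-update=150, confirmed=true) ok
"""


def classify_monitor_failures(log_text):
    """Return (resource, rc, kind) for each failed monitor, in log order."""
    stop_requested = set()  # resources with a stop initiated and no start since
    results = []
    for line in log_text.splitlines():
        stop = STOP_RE.search(line)
        if stop:
            stop_requested.add(stop.group(1))
            continue
        op = OP_RE.search(line)
        if not op:
            continue
        resource, action, rc = op.group(1), op.group(2), int(op.group(3))
        if action == "start" and rc == 0:
            # A successful start means monitors are expected to pass again.
            stop_requested.discard(resource)
        elif action == "monitor" and rc != 0:
            if resource in stop_requested:
                kind = "after stop request"
            else:
                kind = "while expected running"
            results.append((resource, rc, kind))
    return results


print(classify_monitor_failures(LOG))
```

For the lines above, the only entry reported is the heat-engine monitor with rc=7, classified as "while expected running", which matches the reading that this is a genuine service failure.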

Comment 11 David Vossel 2014-01-10 23:24:23 UTC
Please re-test with this new RPM
https://brewweb.devel.redhat.com/taskinfo?taskID=6843956

I have included another fix for canceling recurring operations that is related to this issue.

There are now two patches related to this issue upstream.
https://github.com/ClusterLabs/pacemaker/commit/1c14b9d69470ff56fd814091867394cd0a1cf61d
https://github.com/davidvossel/pacemaker/commit/390e425667152277fb1ea541a42708b3c4d23a94

Comment 12 Fabio Massimo Di Nitto 2014-01-13 10:26:24 UTC
I can confirm that with those 2 patches applied the problem doesn't occur anymore.

I ran start/stop iterations for over 2 hours with no sign of the issue. Previously I could generally spot the problem within a minute or two :)
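
A start/stop iteration of this kind could be scripted roughly as follows. This is a sketch under stated assumptions, not the procedure actually used: `soak_test` and `run_cmd` are hypothetical helpers, the resource is toggled with `pcs resource disable` and `pcs resource enable`, and failures are detected by looking for a "Failed ... actions" heading in one-shot `crm_mon -1` output, whose exact wording varies between Pacemaker versions. The settle time is likewise a guess and should exceed the monitor interval.

```python
import re
import subprocess
import time


def run_cmd(args):
    """Run a command and return its standard output as text."""
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout


def soak_test(resource, iterations, settle_seconds=90,
              run=run_cmd, sleep=time.sleep):
    """Disable and re-enable a resource repeatedly.

    Returns the zero-based iteration at which the cluster status first
    shows failed actions, or None if every iteration stayed clean.
    """
    for iteration in range(iterations):
        run(["pcs", "resource", "disable", resource])
        # Wait long enough for a stale recurring monitor to fire, if any.
        sleep(settle_seconds)
        status = run(["crm_mon", "-1"])
        # Assumption: failures are listed under a "Failed ... actions" heading.
        if re.search(r"failed.*actions", status, re.IGNORECASE):
            return iteration
        run(["pcs", "resource", "enable", resource])
        sleep(settle_seconds)
    return None
```

The `run` and `sleep` parameters are injectable so the loop can be exercised without a cluster. Note that failures recorded before the run would also trip the check, so the failure history should be cleared first.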

Comment 17 errata-xmlrpc 2014-10-14 07:33:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1544.html

