Bug 1046131

Summary: pacemaker LSB agent resources do not have recurring monitor operations canceled correctly.
Product: Red Hat Enterprise Linux 6
Component: pacemaker
Version: 6.6
Reporter: David Vossel <dvossel>
Assignee: David Vossel <dvossel>
QA Contact: Cluster QE <mspqa-list>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: cluster-maint, djansa, dvossel, fdinitto, jkortus, jruemker, jsvarova, tlavigne
Target Milestone: rc
Keywords: ZStream
Hardware: All
OS: Unspecified
Fixed In Version: pacemaker-1.1.10-15.el6
Doc Type: Bug Fix
Doc Text:
Previously, LSB scripts managed by Pacemaker did not cancel their recurring monitor operations correctly when stopping. Consequently, monitor operations failed after resources had successfully stopped. This bug has been fixed, and recurring monitor operations are now correctly canceled before the LSB resource is stopped.
Last Closed: 2014-10-14 07:33:36 UTC
Type: Bug
Bug Blocks: 1052345, 1052346

Attachments:
report from fab2 node

Description David Vossel 2013-12-23 19:13:51 UTC
Description of problem:

Disabled LSB resources do not have their recurring monitor actions canceled correctly.

How reproducible:

About 50%. The failure is timing related: we understand how it happens, but it can be difficult to reproduce.

Steps to Reproduce:
1. Create an LSB resource with a recurring monitor operation.
2. Wait for the LSB resource to start in the cluster.
3. Disable the LSB resource (see the example commands below).
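
As a concrete example, the reproduction could look like this on a pcs-managed RHEL 6 cluster (the resource name "myservice" and its init script are placeholders, not from the original report):

# 1. create an LSB resource with a recurring monitor operation
pcs resource create myservice lsb:myservice op monitor interval=30s

# 2. wait for the resource to report Started
crm_mon -1

# 3. disable the resource, then watch for monitor failures after the stop
pcs resource disable myservice
crm_mon -1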

Actual results:

The LSB resource stops, and then its monitor actions fail. The recurring monitor should be canceled when the LSB resource stops.


Expected results:

The LSB resource stops and its recurring monitor operations are canceled.

Comment 1 David Vossel 2013-12-23 19:14:51 UTC
This bug is well understood.  There is already a fix upstream.

https://github.com/ClusterLabs/pacemaker/commit/1c14b9d69470ff56fd814091867394cd0a1cf61d
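
For reference, the commit can be inspected locally with git, using the hash from the link above:

git clone https://github.com/ClusterLabs/pacemaker.git
cd pacemaker
git show 1c14b9d69470ff56fd814091867394cd0a1cf61d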

Comment 5 Andrew Beekhof 2014-01-08 04:08:48 UTC
David: Did you confirm the upstream patch fixed the issue?

Comment 6 Fabio Massimo Di Nitto 2014-01-08 04:53:04 UTC
(In reply to Andrew Beekhof from comment #5)
> David: Did you confirm the upstream patch fixed the issue?

Get me a build done of 6.5 + this patch ASAP and I can test. By ASAP I mean today because I'll need to return the hardware where I can reproduce end of this week.

Comment 7 David Vossel 2014-01-08 16:16:59 UTC
(In reply to Fabio Massimo Di Nitto from comment #6)
> Get me a build done of 6.5 + this patch ASAP and I can test. By ASAP I mean
> today because I'll need to return the hardware where I can reproduce end of
> this week.

I made a scratch build you can test with.

https://brewweb.devel.redhat.com/taskinfo?taskID=6827114

-- Vossel

Comment 8 Fabio Massimo Di Nitto 2014-01-10 05:03:52 UTC
(In reply to David Vossel from comment #7)
> I made a scratch build you can test with.
>
> https://brewweb.devel.redhat.com/taskinfo?taskID=6827114

Unless you forgot to apply the patch in the build, or the patch is not correct, I am still seeing the issue. Even on a service that was not disabled.

(report in attachment)

Comment 9 Fabio Massimo Di Nitto 2014-01-10 05:04:45 UTC
Created attachment 848023 [details]
report from fab2 node

Comment 10 David Vossel 2014-01-10 23:21:33 UTC
(In reply to Fabio Massimo Di Nitto from comment #8)
> Unless you forgot to apply the patch in the build, or the patch is not
> correct, I am still seeing the issue. Even on a service that was not
> disabled.
>
> (report in attachment)


We talked on IRC, and I mentioned that the RPM I provided did not have the patch applied. Forget that; I was wrong. The RPM is fine.

Looking at your crm_report in detail, I don't actually see this issue occurring with the new RPMs. The crm_report definitely shows the issue occurring before you updated, but the failure after the update appears to be a real failure unrelated to these patches.

Pacemaker starts with the new RPMs at ~04:58:42. After that, the only monitor failure is for the 'heat-engine' resource. That is a real service failure, nothing related to this cancel-recurring-op bug:

Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	LRM operation heat-engine_start_0 (call=221, rc=0, cib-update=66, confirmed=true) ok
Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: te_rsc_command: 	Initiating action 208: monitor heat-engine_monitor_60000 on fab2 (local)
Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	LRM operation heat-engine_monitor_60000 (call=235, rc=0, cib-update=68, confirmed=false) ok
Jan 10 05:00:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	LRM operation heat-engine_monitor_60000 (call=235, rc=7, cib-update=146, confirmed=false) not running
Jan 10 05:00:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	fab2-heat-engine_monitor_60000:235 [ openstack-heat-engine dead but pid file exists\n ]
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: te_rsc_command: 	Initiating action 11: stop heat-engine_stop_0 on fab2 (local)
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	fab2-heat-engine_monitor_60000:235 [ openstack-heat-engine dead but pid file exists\n ]
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	LRM operation heat-engine_stop_0 (call=477, rc=0, cib-update=150, confirmed=true) ok
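
For anyone following the analysis: "dead but pid file exists" is the output of the init script's status action, which exits with code 1 in that state per the LSB spec; pacemaker maps a non-zero LSB status result to OCF_NOT_RUNNING, which is the rc=7 in the log above. A quick manual check on the node would be something like:

service openstack-heat-engine status
echo "status exit code: $?"   # 1 = dead but pid file exists, 3 = stopped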

Comment 11 David Vossel 2014-01-10 23:24:23 UTC
Please re-test with this new RPM:
https://brewweb.devel.redhat.com/taskinfo?taskID=6843956

I have included another fix for canceling recurring operations that is related to this issue.

There are now two patches related to this issue upstream.
https://github.com/ClusterLabs/pacemaker/commit/1c14b9d69470ff56fd814091867394cd0a1cf61d
https://github.com/davidvossel/pacemaker/commit/390e425667152277fb1ea541a42708b3c4d23a94
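
A sketch of picking up the scratch build on a test node (the package file names are placeholders for whatever the Brew task produces):

yum localinstall pacemaker-*.rpm pacemaker-libs-*.rpm pacemaker-cluster-libs-*.rpm
rpm -q pacemaker            # confirm the new build is installed
service pacemaker restart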

Comment 12 Fabio Massimo Di Nitto 2014-01-13 10:26:24 UTC
I can confirm that with those two patches applied, the problem doesn't occur anymore.

I have run start/stop iterations for over 2 hours with no sign of the issue (a sketch of such a loop follows below). Generally I could spot the problem within a minute or two :)
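
For the record, a soak loop along these lines would exercise the same path (assuming pcs and a resource named "myservice"; both are placeholders):

while true; do
    pcs resource disable myservice
    sleep 30
    pcs resource enable myservice
    sleep 30
    crm_mon -1 | grep -i fail    # any monitor-after-stop failure shows up here
done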

Comment 17 errata-xmlrpc 2014-10-14 07:33:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1544.html