1046131 – pacemaker LSB agent resources do not have recurring monitor operations canceled correctly.


Bug 1046131 - pacemaker LSB agent resources do not have recurring monitor operations canceled correctly.


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	6.6
Hardware:	All
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	David Vossel
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1052345 1052346

Reported:	2013-12-23 19:13 UTC by David Vossel
Modified:	2014-10-14 07:33 UTC (History)
CC List:	8 users (show)
Fixed In Version:	pacemaker-1.1.10-15.el6
Doc Type:	Bug Fix
Doc Text:	Previously, lsb scripts managed by Pacemaker did not cancel their recurring monitor operations correctly when stopping. Consequently, monitor operations failed after resources had successfully stopped. This bug has been fixed, and recurring monitor operations are now correctly canceled before the lsb resource is stopped.
Clone Of:
Environment:
Last Closed:	2014-10-14 07:33:36 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
report from fab2 node (190.66 KB, application/x-bzip) 2014-01-10 05:04 UTC, Fabio Massimo Di Nitto	no flags	Details

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	674333	0	None	None	None	Never
Red Hat Product Errata	RHBA-2014:1544	0	normal	SHIPPED_LIVE	pacemaker bug fix update	2014-10-14 01:21:31 UTC

Description David Vossel 2013-12-23 19:13:51 UTC

Description of problem:

Disabled lsb resource agents do not have their recurring monitor actions cancelled correctly.

How reproducible:

About 50%. This is timing related: we understand how it happens, but it can be difficult to reproduce.

Steps to Reproduce:
1. Create an lsb resource with a recurring monitor operation.
2. Wait for the lsb resource to start in the cluster.
3. Disable the lsb resource.
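
The steps above can be sketched as pcs command lines. This is only a minimal sketch, assuming the cluster is managed with pcs; the resource name "my-lsb", the init script name "my-service", and the 60-second monitor interval are hypothetical placeholders, and the helper only builds the commands rather than running them.

```python
import shlex


def reproduction_commands(resource, init_script, interval="60s"):
    """Build the pcs command lines for the reproduction steps.

    All argument values are hypothetical placeholders; nothing is executed.
    """
    return [
        # Steps 1-2: create an lsb resource with a recurring monitor
        # operation; the cluster then starts it.
        ["pcs", "resource", "create", resource, f"lsb:{init_script}",
         "op", "monitor", f"interval={interval}"],
        # Step 3: disable the resource, which makes the cluster stop it.
        ["pcs", "resource", "disable", resource],
    ]


if __name__ == "__main__":
    for cmd in reproduction_commands("my-lsb", "my-service"):
        print(shlex.join(cmd))
```

Because the failure is timing related, running the disable step once may not be enough to trigger it.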

Actual results:

The lsb resource stops, and then its monitor actions fail. The recurring monitor action should be canceled when the lsb resource stops.


Expected results:

The lsb agent stops and the recurring monitor operations are canceled.
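
To illustrate why the ordering matters, here is a toy model of the two behaviors. It is not Pacemaker code; the class and function names are invented for illustration, and it only assumes that a recurring monitor which fires after the service has stopped is reported as a failure.

```python
class ToyLsbResource:
    """Toy stand-in for an lsb resource with one recurring monitor."""

    def __init__(self):
        self.running = True
        self.monitor_scheduled = True
        self.monitor_failures = 0

    def cancel_monitor(self):
        self.monitor_scheduled = False

    def stop(self):
        self.running = False

    def monitor_tick(self):
        # A monitor that is still scheduled counts as failed if it fires
        # while the service is no longer running.
        if self.monitor_scheduled and not self.running:
            self.monitor_failures += 1


def disable(resource, cancel_first):
    """Stop the resource, optionally canceling the monitor beforehand."""
    if cancel_first:
        resource.cancel_monitor()
    resource.stop()
    resource.monitor_tick()  # the next monitor interval elapses
    return resource.monitor_failures
```

In this model, disable(ToyLsbResource(), cancel_first=False) records one monitor failure (the actual result), while cancel_first=True records none (the expected result).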

Comment 1 David Vossel 2013-12-23 19:14:51 UTC

This bug is well understood.  There is already a fix upstream.

https://github.com/ClusterLabs/pacemaker/commit/1c14b9d69470ff56fd814091867394cd0a1cf61d

Comment 5 Andrew Beekhof 2014-01-08 04:08:48 UTC

David: Did you confirm the upstream patch fixed the issue?

Comment 6 Fabio Massimo Di Nitto 2014-01-08 04:53:04 UTC

(In reply to Andrew Beekhof from comment #5)
> David: Did you confirm the upstream patch fixed the issue?

Get me a build done of 6.5 + this patch ASAP and I can test. By ASAP I mean today because I'll need to return the hardware where I can reproduce end of this week.

Comment 7 David Vossel 2014-01-08 16:16:59 UTC

(In reply to Fabio Massimo Di Nitto from comment #6)
> Get me a build done of 6.5 + this patch ASAP and I can test. By ASAP I mean
> today because I'll need to return the hardware where I can reproduce end of
> this week.

I made a scratch build you can test with.

https://brewweb.devel.redhat.com/taskinfo?taskID=6827114

-- Vossel

Comment 8 Fabio Massimo Di Nitto 2014-01-10 05:03:52 UTC

(In reply to David Vossel from comment #7)
> I made a scratch build you can test with.
> 
> https://brewweb.devel.redhat.com/taskinfo?taskID=6827114
> 
> -- Vossel

Unless you forgot to apply the patch in the build, or the patch is not correct, I am still seeing the issue. Even on a service that was not disabled.

(report in attachment)

Comment 9 Fabio Massimo Di Nitto 2014-01-10 05:04:45 UTC

Created attachment 848023
report from fab2 node

Comment 10 David Vossel 2014-01-10 23:21:33 UTC

(In reply to Fabio Massimo Di Nitto from comment #8)
> Unless you forgot to apply the patch in the build, or the patch is not
> correct, I am still seeing the issue. Even on a service that was not
> disabled.
> 
> (report in attachment)


We talked on IRC, and I mentioned that the rpm I provided did not have the patch applied. Forget that; I was wrong. The rpm is fine.

Looking at your crm_report in detail, I don't actually see this issue occurring with the new rpms. The crm_report definitely shows the issue occurring before you updated, but the failure after the update appears to be a real failure unrelated to these patches.

Pacemaker starts with the new rpms at ~04:58:42. After that, the only monitor failure is for the 'heat-engine' resource. That's a real failure, unrelated to this cancel-recurring-op bug.

Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	LRM operation heat-engine_start_0 (call=221, rc=0, cib-update=66, confirmed=true) ok
Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: te_rsc_command: 	Initiating action 208: monitor heat-engine_monitor_60000 on fab2 (local)
Jan 10 04:59:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	LRM operation heat-engine_monitor_60000 (call=235, rc=0, cib-update=68, confirmed=false) ok
Jan 10 05:00:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	LRM operation heat-engine_monitor_60000 (call=235, rc=7, cib-update=146, confirmed=false) not running
Jan 10 05:00:16 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	fab2-heat-engine_monitor_60000:235 [ openstack-heat-engine dead but pid file exists\n ]
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: te_rsc_command: 	Initiating action 11: stop heat-engine_stop_0 on fab2 (local)
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	fab2-heat-engine_monitor_60000:235 [ openstack-heat-engine dead but pid file exists\n ]
Jan 10 05:00:17 [8486] fab2.cloud.lab.eng.bos.redhat.com       crmd:   notice: process_lrm_event: 	LRM operation heat-engine_stop_0 (call=477, rc=0, cib-update=150, confirmed=true) ok
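
A small script can pull the failed monitor results out of log lines like the ones above. This is a rough sketch: the regular expression is written against the line format in this excerpt only, and the function name is made up for illustration.

```python
import re

# Matches the "LRM operation ..." result lines logged by crmd in the excerpt.
LRM_RESULT = re.compile(
    r"LRM operation (?P<key>\S+) "
    r"\(call=(?P<call>\d+), rc=(?P<rc>\d+), cib-update=\d+, "
    r"confirmed=(?:true|false)\) (?P<status>.+)$"
)


def failed_monitors(lines):
    """Return (resource, interval_ms, rc, status) for monitors with rc != 0."""
    failures = []
    for line in lines:
        match = LRM_RESULT.search(line)
        if not match:
            continue
        # Operation keys look like <resource>_<action>_<interval in ms>.
        resource, action, interval = match["key"].rsplit("_", 2)
        rc = int(match["rc"])
        if action == "monitor" and rc != 0:
            failures.append(
                (resource, int(interval), rc, match["status"].strip())
            )
    return failures
```

Fed the lines above, it reports a single entry: heat-engine, interval 60000, rc=7, "not running". That matches the reading that heat-engine is the only monitor failure after the update.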

Comment 11 David Vossel 2014-01-10 23:24:23 UTC

Please re-test with this new RPM
https://brewweb.devel.redhat.com/taskinfo?taskID=6843956

I have included another fix for cancelling recurring operations that is related to this issue.

There are now two patches related to this issue upstream.
https://github.com/ClusterLabs/pacemaker/commit/1c14b9d69470ff56fd814091867394cd0a1cf61d
https://github.com/davidvossel/pacemaker/commit/390e425667152277fb1ea541a42708b3c4d23a94

Comment 12 Fabio Massimo Di Nitto 2014-01-13 10:26:24 UTC

I can confirm that with those 2 patches applied the problem doesn't occur anymore.

I ran start/stop iterations for over 2 hours with no sign of issues. Previously I could generally spot the problem within a minute or two :)

Comment 17 errata-xmlrpc 2014-10-14 07:33:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1544.html
