Bug 846983 - Incorrect cancelling of monitoring operation on M/S resource when record-pending="true"
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: pacemaker
Version: 6.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Andrew Beekhof
QA Contact: Cluster QE
URL: http://bugs.clusterlabs.org/show_bug....
Reported: 2012-08-09 07:31 EDT by Ales Nosek
Modified: 2013-02-21 04:51 EST
CC: 2 users

Fixed In Version: pacemaker-1.1.8-4.el6
Doc Type: Bug Fix
Doc Text:
Cause: "Cancel" operations were incorrectly stored in the cluster status. Consequence: The cluster detected the stale operations and tried to clean the up, causing additional logging and confusing GUIs. Fix: Exclude them from being stored in the cluster status. Result: There are no actions matching *_cancel_* stored in the cluster status.
Last Closed: 2013-02-21 04:51:18 EST
Type: Bug


External Trackers
Tracker ID: Red Hat Product Errata RHBA-2013:0375
Priority: normal
Status: SHIPPED_LIVE
Summary: pacemaker bug fix and enhancement update
Last Updated: 2013-02-20 15:52:23 EST

Description Ales Nosek 2012-08-09 07:31:34 EDT
Version-Release number of selected component (if applicable):

Pacemaker 1.1.7-6 from the RHEL 6.3 High-Availability Add-on

Steps to Reproduce:
1. Configure a M/S resource and set record-pending="true"
2. Put the node where the Master is running in standby mode
3. Cluster will move the Master to the second node
4. You'll see errors in the log file as the cluster tries to cancel the monitoring operation on the Slave resource (condensed command sketch below)
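
A condensed sketch of the reproduction, assuming the crm shell and the node names from the configuration dump below (record-pending="true" is already set there via op_defaults):

$ # dem_vx01_cas27b currently holds the Master role; standby forces a failover to the peer
$ crm node standby dem_vx01_cas27b
$ # the spurious cancel of the Slave monitor then shows up in the logs
$ grep cancel /var/log/messages | tail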

$ crm configure show
node dem_vx01_cas27a \
        attributes standby="off"
node dem_vx01_cas27b \
        attributes standby="on"
primitive disk0 ocf:linbit:drbd \
        params drbd_resource="disk0" \
        operations $id="disk0-operations" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100" \
        op monitor interval="10" role="Master" timeout="20" \
        op monitor interval="20" role="Slave" timeout="20"
primitive stonith_dem_vx01_cas27a stonith:fence_ipmilan \
        operations $id="stonith_dem_vx01_cas27a-operations" \
        op monitor interval="15" timeout="15" start-delay="15" \
        op monitor interval="0" timeout="15" \
        params ipaddr="dem_vx01_cas27a_rsc" login="stonith" passwd="x" lanplus="true" pcmk_host_list="dem_vx01_cas27a"
primitive stonith_dem_vx01_cas27b stonith:fence_ipmilan \
        operations $id="stonith_dem_vx01_cas27b-operations" \
        op monitor interval="15" timeout="15" start-delay="15" \
        op monitor interval="0" timeout="15" \
        params ipaddr="dem_vx01_cas27b_rsc" login="stonith" passwd="x" lanplus="true" pcmk_host_list="dem_vx01_cas27b" \
        meta target-role="Started"
ms ms_disk0 disk0 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location stonith_dem_vx01_cas27a_loc stonith_dem_vx01_cas27a -inf: dem_vx01_cas27a
location stonith_dem_vx01_cas27b_loc stonith_dem_vx01_cas27b -inf: dem_vx01_cas27b
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-6-148fccfd5985c5590cc601123c6c16e966b85d14" \
        cluster-infrastructure="openais" \
        last-lrm-refresh="1344503667" \
        no-quorum-policy="ignore" \
        expected-quorum-votes="2" \
        startup-fencing="false"
rsc_defaults $id="rsc_defaults-options" \
        resource-stickiness="100" \
        failure-timeout="60"
op_defaults $id="op_defaults-options" \
        record-pending="true"

$ crm_mon -rf1
============
Last updated: Thu Aug  9 11:16:07 2012
Last change: Thu Aug  9 11:14:27 2012
Stack: openais
Current DC: dem_vx01_cas27a - partition with quorum
Version: 1.1.7-6-148fccfd5985c5590cc601123c6c16e966b85d14
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ dem_vx01_cas27a dem_vx01_cas27b ]

Full list of resources:

 Master/Slave Set: ms_disk0 [disk0]
     Masters: [ dem_vx01_cas27b ]
     Slaves: [ dem_vx01_cas27a ]
 stonith_dem_vx01_cas27a        (stonith:fence_ipmilan):        Started dem_vx01_cas27b
 stonith_dem_vx01_cas27b        (stonith:fence_ipmilan):        Started dem_vx01_cas27a

Migration summary:
* Node dem_vx01_cas27b: 
* Node dem_vx01_cas27a:

$ crm node standby dem_vx01_cas27b

$ tail -f /var/log/messages
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: te_rsc_command: Initiating action 12: monitor disk0:0_monitor_10000 on dem_vx01_cas27a (local)
Aug  9 11:17:56 dem_vx01_cas27a lrmd: [17520]: info: rsc:disk0:0:74: monitor
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: te_rsc_command: Recording pending op disk0:0_monitor_10000 in the CIB
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: create_operation_update: cib_action_update: Updating resouce disk0:0 after pending monitor op (interval=10000)
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: te_rsc_command: Initiating action 4: cancel disk0:0_cancel_20000 on dem_vx01_cas27a (local)
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: cancel_op: No pending op found for disk0:0:-1
Aug  9 11:17:56 dem_vx01_cas27a lrmd: [17520]: info: on_msg_cancel_op: no operation with id -1
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: te_rsc_command: Recording pending op disk0:0_cancel_20000 in the CIB
Aug  9 11:17:56 dem_vx01_cas27a cib[17518]:     info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='dem_vx01_cas27a']//lrm_resource[@id='disk0:0']/lrm_rsc_op[@id='disk0:0_cancel_20000'] (origin=local/crmd/217, version=2.1428.6): ok (rc=0)
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: create_operation_update: cib_action_update: Updating resouce disk0:0 after pending cancel op (interval=20000)
 
Actual results:
The following snippet remains in cib.xml:
$ cibadmin -Q
            <lrm_rsc_op id="disk0:0_cancel_20000" operation_key="disk0:0_cancel_20000" operation="cancel" crm-debug-origin="cib_action_update" crm_feature_set="3.0.6" transition-key="4:28:0:3c9a3839-a2cc-4a52-95c9-07da2f02f1e5" transition-magic="-1:14;4:28:0:3c9a3839-a2cc-4a52-95c9-07da2f02f1e5" call-id="-1" rc-code="14" op-status="-1" interval="20000" op-digest="10c845ea895312d505c0de998afa00f6"/>
          </lrm_resource>
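
Until the fixed packages are installed, the stale entry can usually be cleared with a resource cleanup; this is a hedged sketch reusing the resource and node names from this report, and exact cleanup behaviour may differ between 1.1.x releases:

$ # re-probe the clone on the affected node, which rewrites its lrm history in the CIB
$ crm resource cleanup ms_disk0 dem_vx01_cas27a
$ # confirm the stray cancel entry is gone
$ cibadmin -Q | grep _cancel_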

Additional info:
When record-pending is set to false, everything works correctly and no disk0:0_cancel_20000 operation appears in the transition graph at all.
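
A minimal workaround sketch, assuming the same crm shell syntax used in the configuration above: turning off recording of pending operations avoids the problem entirely.

$ # stop recording pending operations cluster-wide
$ crm configure op_defaults record-pending="false"

Any cancel entry already recorded in the CIB would still need to be cleared separately, for example with the cleanup sketched under "Actual results".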
Comment 2 Andrew Beekhof 2012-09-21 05:23:09 EDT
We'd need a crm_report for the time period covered by the testcase in order to be able to comment.
Comment 3 Andrew Beekhof 2012-10-11 21:40:51 EDT
Resolved upstream with patch:

   https://github.com/beekhof/pacemaker/commit/197275f

Will be included in 6.4
Comment 6 Jaroslav Kortus 2013-01-21 12:03:27 EST
I have not observed the described behaviour with pacemaker-1.1.8-7.el6.x86_64.
cibadmin -Q | grep cancel shows no matches.
Comment 8 errata-xmlrpc 2013-02-21 04:51:18 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0375.html
