Bug 846983

Summary: Incorrect cancelling of monitoring operation on M/S resource when record-pending="true"
Product: Red Hat Enterprise Linux 6
Reporter: Ales Nosek <ales.nosek>
Component: pacemaker
Assignee: Andrew Beekhof <abeekhof>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: unspecified
Priority: unspecified
Version: 6.3
CC: cluster-maint, jkortus
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
URL: http://bugs.clusterlabs.org/show_bug.cgi?id=5109
Fixed In Version: pacemaker-1.1.8-4.el6
Doc Type: Bug Fix
Doc Text:
Cause: "Cancel" operations were incorrectly stored in the cluster status.
Consequence: The cluster detected the stale operations and tried to clean them up, causing additional logging and confusing GUIs.
Fix: Cancel operations are now excluded from being stored in the cluster status.
Result: No actions matching *_cancel_* are stored in the cluster status.
Last Closed: 2013-02-21 09:51:18 UTC
Type: Bug

Description Ales Nosek 2012-08-09 11:31:34 UTC
Version-Release number of selected component (if applicable):

Pacemaker 1.1.7-6 from the RHEL 6.3 High-Availability Add-on

Steps to Reproduce:
1. Configure an M/S resource and set record-pending="true"
2. Put the node where the Master is running into standby mode
3. The cluster will move the Master to the second node
4. Errors appear in the log file as the cluster tries to cancel the monitor operation on the Slave resource (see the command sketch below)
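
A minimal command-line sketch of these steps, assuming the crm shell and the node names from the configuration shown below (the grep pattern is purely illustrative):

$ crm configure op_defaults record-pending="true"
$ crm node standby dem_vx01_cas27b
$ grep -E 'cancel_op|_cancel_' /var/log/messages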

$ crm configure show
node dem_vx01_cas27a \
        attributes standby="off"
node dem_vx01_cas27b \
        attributes standby="on"
primitive disk0 ocf:linbit:drbd \
        params drbd_resource="disk0" \
        operations $id="disk0-operations" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100" \
        op monitor interval="10" role="Master" timeout="20" \
        op monitor interval="20" role="Slave" timeout="20"
primitive stonith_dem_vx01_cas27a stonith:fence_ipmilan \
        operations $id="stonith_dem_vx01_cas27a-operations" \
        op monitor interval="15" timeout="15" start-delay="15" \
        op monitor interval="0" timeout="15" \
        params ipaddr="dem_vx01_cas27a_rsc" login="stonith" passwd="x" lanplus="true" pcmk_host_list="dem_vx01_cas27a"
primitive stonith_dem_vx01_cas27b stonith:fence_ipmilan \
        operations $id="stonith_dem_vx01_cas27b-operations" \
        op monitor interval="15" timeout="15" start-delay="15" \
        op monitor interval="0" timeout="15" \
        params ipaddr="dem_vx01_cas27b_rsc" login="stonith" passwd="x" lanplus="true" pcmk_host_list="dem_vx01_cas27b" \
        meta target-role="Started"
ms ms_disk0 disk0 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location stonith_dem_vx01_cas27a_loc stonith_dem_vx01_cas27a -inf: dem_vx01_cas27a
location stonith_dem_vx01_cas27b_loc stonith_dem_vx01_cas27b -inf: dem_vx01_cas27b
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-6-148fccfd5985c5590cc601123c6c16e966b85d14" \
        cluster-infrastructure="openais" \
        last-lrm-refresh="1344503667" \
        no-quorum-policy="ignore" \
        expected-quorum-votes="2" \
        startup-fencing="false"
rsc_defaults $id="rsc_defaults-options" \
        resource-stickiness="100" \
        failure-timeout="60"
op_defaults $id="op_defaults-options" \
        record-pending="true"

$ crm_mon -rf1
============
Last updated: Thu Aug  9 11:16:07 2012
Last change: Thu Aug  9 11:14:27 2012
Stack: openais
Current DC: dem_vx01_cas27a - partition with quorum
Version: 1.1.7-6-148fccfd5985c5590cc601123c6c16e966b85d14
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ dem_vx01_cas27a dem_vx01_cas27b ]

Full list of resources:

 Master/Slave Set: ms_disk0 [disk0]
     Masters: [ dem_vx01_cas27b ]
     Slaves: [ dem_vx01_cas27a ]
 stonith_dem_vx01_cas27a        (stonith:fence_ipmilan):        Started dem_vx01_cas27b
 stonith_dem_vx01_cas27b        (stonith:fence_ipmilan):        Started dem_vx01_cas27a

Migration summary:
* Node dem_vx01_cas27b: 
* Node dem_vx01_cas27a:

$ crm node standby dem_vx01_cas27b

$ tail -f /var/log/messages
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: te_rsc_command: Initiating action 12: monitor disk0:0_monitor_10000 on dem_vx01_cas27a (local)
Aug  9 11:17:56 dem_vx01_cas27a lrmd: [17520]: info: rsc:disk0:0:74: monitor
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: te_rsc_command: Recording pending op disk0:0_monitor_10000 in the CIB
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: create_operation_update: cib_action_update: Updating resouce disk0:0 after pending monitor op (interval=10000)
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: te_rsc_command: Initiating action 4: cancel disk0:0_cancel_20000 on dem_vx01_cas27a (local)
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: cancel_op: No pending op found for disk0:0:-1
Aug  9 11:17:56 dem_vx01_cas27a lrmd: [17520]: info: on_msg_cancel_op: no operation with id -1
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: te_rsc_command: Recording pending op disk0:0_cancel_20000 in the CIB
Aug  9 11:17:56 dem_vx01_cas27a cib[17518]:     info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='dem_vx01_cas27a']//lrm_resource[@id='disk0:0']/lrm_rsc_op[@id='disk0:0_cancel_20000'] (origin=local/crmd/217, version=2.1428.6): ok (rc=0)
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: create_operation_update: cib_action_update: Updating resouce disk0:0 after pending cancel op (interval=20000)
 
Actual results:
The following snippet remains in cib.xml:
$ cibadmin -Q
            <lrm_rsc_op id="disk0:0_cancel_20000" operation_key="disk0:0_cancel_20000" operation="cancel" crm-debug-origin="cib_action_update" crm_feature_set="3.0.6" transition-key="4:28:0:3c9a3839-a2cc-4a52-95c9-07da2f02f1e5" transition-magic="-1:14;4:28:0:3c9a3839-a2cc-4a52-95c9-07da2f02f1e5" call-id="-1" rc-code="14" op-status="-1" interval="20000" op-digest="10c845ea895312d505c0de998afa00f6"/>
          </lrm_resource>

Additional info:
When setting record-pending=false everything works as expected and there is no disk0:0_cancel_20000 operation in the transition graph at all.
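
On an affected cluster the stale entry can be located and, if needed, cleared by hand. A sketch, assuming the cibadmin and crm_resource tools shipped with this Pacemaker release and the resource/node names from the configuration above:

$ cibadmin -Q | grep '_cancel_'                                       # shows the leftover lrm_rsc_op entry
$ crm_resource --cleanup --resource ms_disk0 --node dem_vx01_cas27a   # clears the operation history on that node

The cleanup only removes the recorded operation history; it does not address the underlying recording bug.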

Comment 2 Andrew Beekhof 2012-09-21 09:23:09 UTC
We'd need a crm_report for the time period covered by the testcase in order to be able to comment.

Comment 3 Andrew Beekhof 2012-10-12 01:40:51 UTC
Resolved upstream with patch:

   https://github.com/beekhof/pacemaker/commit/197275f

Will be included in 6.4

Comment 6 Jaroslav Kortus 2013-01-21 17:03:27 UTC
I have not observed the described behaviour with pacemaker-1.1.8-7.el6.x86_64.
cibadmin -Q | grep cancel shows no matches.
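
For reference, a quick re-check of the original scenario on the fixed packages can be scripted as below; <node-running-master> is a placeholder for whichever node currently holds the Master role:

$ crm node standby <node-running-master>
$ cibadmin -Q | grep cancel    # prints nothing on pacemaker-1.1.8-7.el6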

Comment 8 errata-xmlrpc 2013-02-21 09:51:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0375.html