Bug 846983 - Incorrect cancelling of monitoring operation on M/S resource when record-pending="true"
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: pacemaker
Version: 6.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Andrew Beekhof
QA Contact: Cluster QE
URL: http://bugs.clusterlabs.org/show_bug....
Reported: 2012-08-09 07:31 EDT by Ales Nosek
Modified: 2013-02-21 04:51 EST
CC: 2 users

Fixed In Version: pacemaker-1.1.8-4.el6
Doc Type: Bug Fix
Doc Text:
Cause: "Cancel" operations were incorrectly stored in the cluster status. Consequence: The cluster detected the stale operations and tried to clean the up, causing additional logging and confusing GUIs. Fix: Exclude them from being stored in the cluster status. Result: There are no actions matching *_cancel_* stored in the cluster status.
Last Closed: 2013-02-21 04:51:18 EST
Type: Bug


External Trackers
Tracker ID: Red Hat Product Errata RHBA-2013:0375
Priority: normal
Status: SHIPPED_LIVE
Summary: pacemaker bug fix and enhancement update
Last Updated: 2013-02-20 15:52:23 EST

Description Ales Nosek 2012-08-09 07:31:34 EDT
Version-Release number of selected component (if applicable):

Pacemaker 1.1.7-6 from the RHEL 6.3 High-Availability Add-on

Steps to Reproduce:
1. Configure a M/S resource and set record-pending="true"
2. Put the node where the Master is running in standby mode
3. Cluster will move the Master to the second node
4. You'll see errors in the log file as the cluster tries to cancel the monitoring operation on the Slave resource (condensed command sketch below)
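
A condensed sketch of the reproduction, assuming the crm shell and the node names from the configuration dump below (record-pending="true" is already set there via op_defaults):

$ # dem_vx01_cas27b currently holds the Master role; standby forces a failover to the peer
$ crm node standby dem_vx01_cas27b
$ # the spurious cancel of the Slave monitor then shows up in the logs
$ grep cancel /var/log/messages | tail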

$ crm configure show
node dem_vx01_cas27a \
        attributes standby="off"
node dem_vx01_cas27b \
        attributes standby="on"
primitive disk0 ocf:linbit:drbd \
        params drbd_resource="disk0" \
        operations $id="disk0-operations" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100" \
        op monitor interval="10" role="Master" timeout="20" \
        op monitor interval="20" role="Slave" timeout="20"
primitive stonith_dem_vx01_cas27a stonith:fence_ipmilan \
        operations $id="stonith_dem_vx01_cas27a-operations" \
        op monitor interval="15" timeout="15" start-delay="15" \
        op monitor interval="0" timeout="15" \
        params ipaddr="dem_vx01_cas27a_rsc" login="stonith" passwd="x" lanplus="true" pcmk_host_list="dem_vx01_cas27a"
primitive stonith_dem_vx01_cas27b stonith:fence_ipmilan \
        operations $id="stonith_dem_vx01_cas27b-operations" \
        op monitor interval="15" timeout="15" start-delay="15" \
        op monitor interval="0" timeout="15" \
        params ipaddr="dem_vx01_cas27b_rsc" login="stonith" passwd="x" lanplus="true" pcmk_host_list="dem_vx01_cas27b" \
        meta target-role="Started"
ms ms_disk0 disk0 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location stonith_dem_vx01_cas27a_loc stonith_dem_vx01_cas27a -inf: dem_vx01_cas27a
location stonith_dem_vx01_cas27b_loc stonith_dem_vx01_cas27b -inf: dem_vx01_cas27b
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-6-148fccfd5985c5590cc601123c6c16e966b85d14" \
        cluster-infrastructure="openais" \
        last-lrm-refresh="1344503667" \
        no-quorum-policy="ignore" \
        expected-quorum-votes="2" \
        startup-fencing="false"
rsc_defaults $id="rsc_defaults-options" \
        resource-stickiness="100" \
        failure-timeout="60"
op_defaults $id="op_defaults-options" \
        record-pending="true"

$ crm_mon -rf1
============
Last updated: Thu Aug  9 11:16:07 2012
Last change: Thu Aug  9 11:14:27 2012
Stack: openais
Current DC: dem_vx01_cas27a - partition with quorum
Version: 1.1.7-6-148fccfd5985c5590cc601123c6c16e966b85d14
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ dem_vx01_cas27a dem_vx01_cas27b ]

Full list of resources:

 Master/Slave Set: ms_disk0 [disk0]
     Masters: [ dem_vx01_cas27b ]
     Slaves: [ dem_vx01_cas27a ]
 stonith_dem_vx01_cas27a        (stonith:fence_ipmilan):        Started dem_vx01_cas27b
 stonith_dem_vx01_cas27b        (stonith:fence_ipmilan):        Started dem_vx01_cas27a

Migration summary:
* Node dem_vx01_cas27b: 
* Node dem_vx01_cas27a:

$ crm node standby dem_vx01_cas27b

$ tail -f /var/log/messages
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: te_rsc_command: Initiating action 12: monitor disk0:0_monitor_10000 on dem_vx01_cas27a (local)
Aug  9 11:17:56 dem_vx01_cas27a lrmd: [17520]: info: rsc:disk0:0:74: monitor
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: te_rsc_command: Recording pending op disk0:0_monitor_10000 in the CIB
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: create_operation_update: cib_action_update: Updating resouce disk0:0 after pending monitor op (interval=10000)
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: te_rsc_command: Initiating action 4: cancel disk0:0_cancel_20000 on dem_vx01_cas27a (local)
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: cancel_op: No pending op found for disk0:0:-1
Aug  9 11:17:56 dem_vx01_cas27a lrmd: [17520]: info: on_msg_cancel_op: no operation with id -1
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: te_rsc_command: Recording pending op disk0:0_cancel_20000 in the CIB
Aug  9 11:17:56 dem_vx01_cas27a cib[17518]:     info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='dem_vx01_cas27a']//lrm_resource[@id='disk0:0']/lrm_rsc_op[@id='disk0:0_cancel_20000'] (origin=local/crmd/217, version=2.1428.6): ok (rc=0)
Aug  9 11:17:56 dem_vx01_cas27a crmd[17523]:     info: create_operation_update: cib_action_update: Updating resouce disk0:0 after pending cancel op (interval=20000)
 
Actual results:
The following snippet remains in cib.xml:
$ cibadmin -Q
            <lrm_rsc_op id="disk0:0_cancel_20000" operation_key="disk0:0_cancel_20000" operation="cancel" crm-debug-origin="cib_action_update" crm_feature_set="3.0.6" transition-key="4:28:0:3c9a3839-a2cc-4a52-95c9-07da2f02f1e5" transition-magic="-1:14;4:28:0:3c9a3839-a2cc-4a52-95c9-07da2f02f1e5" call-id="-1" rc-code="14" op-status="-1" interval="20000" op-digest="10c845ea895312d505c0de998afa00f6"/>
          </lrm_resource>
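
Until the fixed packages are installed, the stale entry can usually be cleared with a resource cleanup; this is a hedged sketch reusing the resource and node names from this report, and exact cleanup behaviour may differ between 1.1.x releases:

$ # re-probe the clone on the affected node, which rewrites its lrm history in the CIB
$ crm resource cleanup ms_disk0 dem_vx01_cas27a
$ # confirm the stray cancel entry is gone
$ cibadmin -Q | grep _cancel_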

Additional info:
When record-pending is set to false, everything works correctly and no disk0:0_cancel_20000 operation appears in the transition graph at all.
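
A minimal workaround sketch, assuming the same crm shell syntax used in the configuration above: turning off recording of pending operations avoids the problem entirely.

$ # stop recording pending operations cluster-wide
$ crm configure op_defaults record-pending="false"

Any cancel entry already recorded in the CIB would still need to be cleared separately, for example with the cleanup sketched under "Actual results".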
Comment 2 Andrew Beekhof 2012-09-21 05:23:09 EDT
We'd need a crm_report for the time period covered by the testcase in order to be able to comment.
Comment 3 Andrew Beekhof 2012-10-11 21:40:51 EDT
Resolved upstream with patch:

   https://github.com/beekhof/pacemaker/commit/197275f

Will be included in 6.4
Comment 6 Jaroslav Kortus 2013-01-21 12:03:27 EST
I have not observed the described behaviour with pacemaker-1.1.8-7.el6.x86_64.
cibadmin -Q | grep cancel shows no matches.
Comment 8 errata-xmlrpc 2013-02-21 04:51:18 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0375.html
