Bug 1085451 - property stonith-enabled=false bug behavior
Summary: property stonith-enabled=false bug behavior
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Andrew Beekhof
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-04-08 15:52 UTC by Miroslav Lisik
Modified: 2018-07-25 10:17 UTC (History)
6 users

Fixed In Version: pacemaker-1.1.12-4.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-03-05 09:59:59 UTC
Target Upstream Version:
Embargoed:


Attachments
crm_report (178.95 KB, application/x-bzip)
2014-04-08 15:52 UTC, Miroslav Lisik


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3541781 0 None None None 2018-07-25 10:17:18 UTC
Red Hat Product Errata RHBA-2015:0440 0 normal SHIPPED_LIVE pacemaker bug fix and enhancement update 2015-03-05 14:37:57 UTC

Description Miroslav Lisik 2014-04-08 15:52:12 UTC
Created attachment 884146 [details]
crm_report

Description of problem:
When the cluster property stonith-enabled is set to "false" and the cluster is running resources with the attribute 'on_fail=fence', fencing is still performed despite 'stonith-enabled=false'.

Version-Release number of selected component (if applicable):
pacemaker-1.1.10-29.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. Set up a cluster with a stonith device and a cloned resource with the attribute 'on_fail=fence' (the standard cluster-qa setup test does this)

2. Run 'pcs cluster enable' on all nodes

3. Set property: 
'pcs property set stonith-enabled=false'

4. Make a node unclean,
e.g. by running 'reboot -f' on it

5. Check the logs on the surviving nodes. Fencing is performed despite 'stonith-enabled=false'

6. After the node comes back, disable the resource with the attribute 'on_fail=fence'
 
7. Make a node unclean again,
e.g. by running 'reboot -f' on it

8. Check the logs on the surviving nodes. No fencing is performed. (This is the correct behavior when stonith-enabled=false.)

Actual results:
When the cluster is running resources with the attribute 'on_fail=fence' and the property stonith-enabled=false is set, fencing is performed.


Expected results:
I expect the cluster property 'stonith-enabled=false' to take priority over the resource attribute 'on_fail=fence', so that no fencing is performed.


Additional info:
crm_report attached.

Comment 2 Andrew Beekhof 2014-04-08 23:33:52 UTC
Although valid for pacemaker, stonith-enabled=false isn't supported (or supportable) by Red Hat.  In particular, and as jkortus noticed, pacemaker will report this configuration as invalid:

jkortus: pengine[12955]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
jkortus: +1 for this message in syslog :-)

I'm inclined to close this as invalid.

Comment 3 Jaroslav Kortus 2014-04-09 08:16:18 UTC
While it is certainly not a clever thing to set on live instances, I would still like to see it behave consistently and according to user expectations. Can we make it so that stonith-enabled=false really disables fencing? :)

Comment 4 Andrew Beekhof 2014-04-15 06:10:38 UTC
We're not turning on fencing:

    } else if (safe_str_eq(value, "fence")) {
        action->on_fail = action_fail_fence;
        value = "node fencing";

        if (is_set(data_set->flags, pe_flag_stonith_enabled) == FALSE) {
            crm_config_err("Specifying on_fail=fence and" " stonith-enabled=false makes no sense");
            action->on_fail = action_fail_stop;
            action->fail_role = RSC_ROLE_STOPPED;
            value = "stop resource";
        }

It seems to be because of:

Apr  8 16:37:05 virt-078 pengine[19006]: warning: pe_fence_node: Node virt-080.cluster-qe.lab.eng.brq.redhat.com is unclean because it is partially and/or un-expectedly down

Which is the result of a botched cleanup 4 years ago :-(

The fix is: https://github.com/beekhof/pacemaker/commit/cfd845fc7

Comment 7 michal novacek 2014-12-01 13:54:18 UTC
I have verified with pacemaker-1.1.12-13.el7.x86_64 that fencing does not occur for a resource with on-fail=fence when the cluster property stonith-enabled is set to false.

---

[root@virt-069 ~]# pcs status
Cluster name: STSRHTS31212
Last updated: Mon Dec  1 14:51:53 2014
Last change: Mon Dec  1 14:43:19 2014
Stack: corosync
Current DC: virt-063 (1) - partition with quorum
Version: 1.1.12-a14efad
3 Nodes configured
10 Resources configured


Online: [ virt-063 virt-069 virt-072 ]

Full list of resources:

 fence-virt-063 (stonith:fence_xvm):    Started virt-063 
 fence-virt-069 (stonith:fence_xvm):    Started virt-072 
 fence-virt-072 (stonith:fence_xvm):    Started virt-069 
 Clone Set: dlm-clone [dlm]
     Stopped: [ virt-063 virt-069 virt-072 ]
 Clone Set: clvmd-clone [clvmd]
     Stopped: [ virt-063 virt-069 virt-072 ]
 le-dummy       (ocf::heartbeat:Dummy): Started virt-072 

Failed actions:

PCSD Status:
  virt-063: Online
  virt-069: Online
  virt-072: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@virt-069 ~]# pcs resource show le-dummy
 Resource: le-dummy (class=ocf provider=heartbeat type=Dummy)
  Operations: start interval=0s timeout=20 (le-dummy-start-timeout-20)
              stop interval=0s timeout=20 (le-dummy-stop-timeout-20)
              monitor interval=60s on-fail=fence (le-dummy-monitor-on-fail-fence)

[root@virt-069 ~]# iptables -A INPUT ! -i lo -p udp -j DROP && \
iptables -A OUTPUT ! -o lo -p udp -j DROP


/var/log/messages:
Dec  1 14:46:04 virt-072 corosync[2695]: [TOTEM ] A processor failed, forming new configuration.
Dec  1 14:46:06 virt-072 corosync[2695]: [TOTEM ] A new membership (10.34.71.72:84) was formed. Members left: 1 2
Dec  1 14:46:06 virt-072 corosync[2695]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec  1 14:46:06 virt-072 corosync[2695]: [QUORUM] Members[1]: 3
Dec  1 14:46:06 virt-072 corosync[2695]: [MAIN  ] Completed service synchronization, ready to provide service.
Dec  1 14:46:06 virt-072 attrd[2714]: notice: crm_update_peer_state: attrd_peer_change_cb: Node virt-063[1] - state is now lost (was member)
Dec  1 14:46:06 virt-072 attrd[2714]: notice: attrd_peer_remove: Removing all virt-063 attributes for attrd_peer_change_cb
Dec  1 14:46:06 virt-072 attrd[2714]: notice: attrd_peer_change_cb: Lost attribute writer virt-063
Dec  1 14:46:06 virt-072 attrd[2714]: notice: crm_update_peer_state: attrd_peer_change_cb: Node virt-069[2] - state is now lost (was member)
Dec  1 14:46:06 virt-072 attrd[2714]: notice: attrd_peer_remove: Removing all virt-069 attributes for attrd_peer_change_cb
Dec  1 14:46:06 virt-072 crmd[2716]: notice: pcmk_quorum_notification: Membership 84: quorum lost (1)
Dec  1 14:46:06 virt-072 crmd[2716]: notice: crm_update_peer_state: pcmk_quorum_notification: Node virt-069[2] - state is now lost (was member)
Dec  1 14:46:06 virt-072 kernel: dlm: closing connection to node 1
Dec  1 14:46:06 virt-072 kernel: dlm: closing connection to node 2
Dec  1 14:46:06 virt-072 pacemakerd[2710]: notice: pcmk_quorum_notification: Membership 84: quorum lost (1)
Dec  1 14:46:06 virt-072 pacemakerd[2710]: notice: crm_update_peer_state: pcmk_quorum_notification: Node virt-063[1] - state is now lost (was member)
Dec  1 14:46:06 virt-072 pacemakerd[2710]: notice: crm_update_peer_state: pcmk_quorum_notification: Node virt-069[2] - state is now lost (was member)
Dec  1 14:46:06 virt-072 crmd[2716]: warning: match_down_event: No match for shutdown action on 2
Dec  1 14:46:06 virt-072 crmd[2716]: notice: peer_update_callback: Stonith/shutdown of virt-069 not matched
Dec  1 14:46:06 virt-072 crmd[2716]: notice: crm_update_peer_state: pcmk_quorum_notification: Node virt-063[1] - state is now lost (was member)
Dec  1 14:46:06 virt-072 crmd[2716]: warning: match_down_event: No match for shutdown action on 1
Dec  1 14:46:06 virt-072 crmd[2716]: notice: peer_update_callback: Stonith/shutdown of virt-063 not matched
Dec  1 14:46:06 virt-072 crmd[2716]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec  1 14:46:06 virt-072 crmd[2716]: warning: match_down_event: No match for shutdown action on 1
Dec  1 14:46:06 virt-072 crmd[2716]: notice: peer_update_callback: Stonith/shutdown of virt-063 not matched
Dec  1 14:46:06 virt-072 crmd[2716]: warning: match_down_event: No match for shutdown action on 2
Dec  1 14:46:06 virt-072 crmd[2716]: notice: peer_update_callback: Stonith/shutdown of virt-069 not matched
Dec  1 14:46:07 virt-072 pengine[2715]: notice: cluster_status: We do not have quorum - fencing and resource management disabled
Dec  1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec  1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec  1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec  1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec  1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec  1 14:46:07 virt-072 pengine[2715]: notice: LogActions: Start   fence-virt-063      (virt-072 - blocked)
Dec  1 14:46:07 virt-072 pengine[2715]: notice: LogActions: Start   fence-virt-069      (virt-072 - blocked)
Dec  1 14:46:07 virt-072 pengine[2715]: notice: LogActions: Start   le-dummy    (virt-072 - blocked)
Dec  1 14:46:07 virt-072 pengine[2715]: notice: process_pe_message: Calculated Transition 16: /var/lib/pacemaker/pengine/pe-input-18.bz2
Dec  1 14:46:07 virt-072 pengine[2715]: notice: process_pe_message: Configuration ERRORs found during PE processing.  Please run "crm_verify -L" to identify issues.
Dec  1 14:46:07 virt-072 crmd[2716]: notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-18.bz2): Complete
Dec  1 14:46:07 virt-072 crmd[2716]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec  1 14:46:11 virt-072 systemd: Starting Cleanup of Temporary Directories...
Dec  1 14:46:11 virt-072 systemd: Started Cleanup of Temporary Directories.

Comment 9 errata-xmlrpc 2015-03-05 09:59:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0440.html

