Created attachment 884146 [details]
crm_report

Description of problem:
When the cluster property stonith-enabled is set to "false" and the cluster is running resources with the attribute 'on_fail=fence', fencing is performed despite 'stonith-enabled=false'.

Version-Release number of selected component (if applicable):
pacemaker-1.1.10-29.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. Set up a cluster with a stonith device and a cloned resource with the attribute 'on_fail=fence' (the standard cluster-qa setup test does this)
2. Run 'pcs cluster enable' on all nodes
3. Set the property: 'pcs property set stonith-enabled=false'
4. Make some node unclean, e.g. run 'reboot -f' on it
5. Look into the logs on the live nodes. Fencing is performed despite 'stonith-enabled=false'
6. After the node comes back, disable the resource with the attribute 'on_fail=fence'
7. Make some node unclean again, e.g. run 'reboot -f' on it
8. Look into the logs on the live nodes. No fencing is performed (the correct behavior when stonith-enabled=false)

Actual results:
When the cluster is running resources with the attribute 'on_fail=fence' and the property stonith-enabled=false is set, fencing is performed.

Expected results:
I expect the cluster property 'stonith-enabled=false' to take priority over the resource attribute 'on_fail=fence', so that no fencing is performed.

Additional info:
crm_report attached.
Although valid for pacemaker, stonith-enabled=false isn't supported (or supportable) by Red Hat. In particular, and as jkortus noticed, pacemaker will report this configuration as invalid:

jkortus: pengine[12955]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
jkortus: +1 for this message in syslog :-)

I'm inclined to close this as invalid.
While it is certainly not a clever thing to set on live instances, I would still like it to behave consistently and according to user expectations. Can we have it so that stonith-enabled=false really disables fencing? :)
We're not turning on fencing:

    } else if (safe_str_eq(value, "fence")) {
        action->on_fail = action_fail_fence;
        value = "node fencing";

        if (is_set(data_set->flags, pe_flag_stonith_enabled) == FALSE) {
            crm_config_err("Specifying on_fail=fence and"
                           " stonith-enabled=false makes no sense");
            action->on_fail = action_fail_stop;
            action->fail_role = RSC_ROLE_STOPPED;
            value = "stop resource";
        }

It seems to be because of:

Apr 8 16:37:05 virt-078 pengine[19006]: warning: pe_fence_node: Node virt-080.cluster-qe.lab.eng.brq.redhat.com is unclean because it is partially and/or un-expectedly down

which is the result of a botched cleanup 4 years ago :-(

The fix is: https://github.com/beekhof/pacemaker/commit/cfd845fc7
I have verified that the fencing does not occur with a resource set on-fail=fence when the cluster property stonith-enabled is set to false, with pacemaker-1.1.12-13.el7.x86_64.

---

[root@virt-069 ~]# pcs status
Cluster name: STSRHTS31212
Last updated: Mon Dec 1 14:51:53 2014
Last change: Mon Dec 1 14:43:19 2014
Stack: corosync
Current DC: virt-063 (1) - partition with quorum
Version: 1.1.12-a14efad
3 Nodes configured
10 Resources configured

Online: [ virt-063 virt-069 virt-072 ]

Full list of resources:

 fence-virt-063 (stonith:fence_xvm): Started virt-063
 fence-virt-069 (stonith:fence_xvm): Started virt-072
 fence-virt-072 (stonith:fence_xvm): Started virt-069
 Clone Set: dlm-clone [dlm]
     Stopped: [ virt-063 virt-069 virt-072 ]
 Clone Set: clvmd-clone [clvmd]
     Stopped: [ virt-063 virt-069 virt-072 ]
 le-dummy (ocf::heartbeat:Dummy): Started virt-072

Failed actions:

PCSD Status:
  virt-063: Online
  virt-069: Online
  virt-072: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@virt-069 ~]# pcs resource show le-dummy
 Resource: le-dummy (class=ocf provider=heartbeat type=Dummy)
  Operations: start interval=0s timeout=20 (le-dummy-start-timeout-20)
              stop interval=0s timeout=20 (le-dummy-stop-timeout-20)
              monitor interval=60s on-fail=fence (le-dummy-monitor-on-fail-fence)

[root@virt-069 ~]# iptables -A INPUT ! -i lo -p udp -j DROP && \
    iptables -A OUTPUT ! -o lo -p udp -j DROP

/var/log/messages:
Dec 1 14:46:04 virt-072 corosync[2695]: [TOTEM ] A processor failed, forming new configuration.
Dec 1 14:46:06 virt-072 corosync[2695]: [TOTEM ] A new membership (10.34.71.72:84) was formed. Members left: 1 2
Dec 1 14:46:06 virt-072 corosync[2695]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 1 14:46:06 virt-072 corosync[2695]: [QUORUM] Members[1]: 3
Dec 1 14:46:06 virt-072 corosync[2695]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 14:46:06 virt-072 attrd[2714]: notice: crm_update_peer_state: attrd_peer_change_cb: Node virt-063[1] - state is now lost (was member)
Dec 1 14:46:06 virt-072 attrd[2714]: notice: attrd_peer_remove: Removing all virt-063 attributes for attrd_peer_change_cb
Dec 1 14:46:06 virt-072 attrd[2714]: notice: attrd_peer_change_cb: Lost attribute writer virt-063
Dec 1 14:46:06 virt-072 attrd[2714]: notice: crm_update_peer_state: attrd_peer_change_cb: Node virt-069[2] - state is now lost (was member)
Dec 1 14:46:06 virt-072 attrd[2714]: notice: attrd_peer_remove: Removing all virt-069 attributes for attrd_peer_change_cb
Dec 1 14:46:06 virt-072 crmd[2716]: notice: pcmk_quorum_notification: Membership 84: quorum lost (1)
Dec 1 14:46:06 virt-072 crmd[2716]: notice: crm_update_peer_state: pcmk_quorum_notification: Node virt-069[2] - state is now lost (was member)
Dec 1 14:46:06 virt-072 kernel: dlm: closing connection to node 1
Dec 1 14:46:06 virt-072 kernel: dlm: closing connection to node 2
Dec 1 14:46:06 virt-072 pacemakerd[2710]: notice: pcmk_quorum_notification: Membership 84: quorum lost (1)
Dec 1 14:46:06 virt-072 pacemakerd[2710]: notice: crm_update_peer_state: pcmk_quorum_notification: Node virt-063[1] - state is now lost (was member)
Dec 1 14:46:06 virt-072 pacemakerd[2710]: notice: crm_update_peer_state: pcmk_quorum_notification: Node virt-069[2] - state is now lost (was member)
Dec 1 14:46:06 virt-072 crmd[2716]: warning: match_down_event: No match for shutdown action on 2
Dec 1 14:46:06 virt-072 crmd[2716]: notice: peer_update_callback: Stonith/shutdown of virt-069 not matched
Dec 1 14:46:06 virt-072 crmd[2716]: notice: crm_update_peer_state: pcmk_quorum_notification: Node virt-063[1] - state is now lost (was member)
Dec 1 14:46:06 virt-072 crmd[2716]: warning: match_down_event: No match for shutdown action on 1
Dec 1 14:46:06 virt-072 crmd[2716]: notice: peer_update_callback: Stonith/shutdown of virt-063 not matched
Dec 1 14:46:06 virt-072 crmd[2716]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec 1 14:46:06 virt-072 crmd[2716]: warning: match_down_event: No match for shutdown action on 1
Dec 1 14:46:06 virt-072 crmd[2716]: notice: peer_update_callback: Stonith/shutdown of virt-063 not matched
Dec 1 14:46:06 virt-072 crmd[2716]: warning: match_down_event: No match for shutdown action on 2
Dec 1 14:46:06 virt-072 crmd[2716]: notice: peer_update_callback: Stonith/shutdown of virt-069 not matched
Dec 1 14:46:07 virt-072 pengine[2715]: notice: cluster_status: We do not have quorum - fencing and resource management disabled
Dec 1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec 1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec 1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec 1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec 1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec 1 14:46:07 virt-072 pengine[2715]: notice: LogActions: Start fence-virt-063 (virt-072 - blocked)
Dec 1 14:46:07 virt-072 pengine[2715]: notice: LogActions: Start fence-virt-069 (virt-072 - blocked)
Dec 1 14:46:07 virt-072 pengine[2715]: notice: LogActions: Start le-dummy (virt-072 - blocked)
Dec 1 14:46:07 virt-072 pengine[2715]: notice: process_pe_message: Calculated Transition 16: /var/lib/pacemaker/pengine/pe-input-18.bz2
Dec 1 14:46:07 virt-072 pengine[2715]: notice: process_pe_message: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues.
Dec 1 14:46:07 virt-072 crmd[2716]: notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-18.bz2): Complete
Dec 1 14:46:07 virt-072 crmd[2716]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec 1 14:46:11 virt-072 systemd: Starting Cleanup of Temporary Directories...
Dec 1 14:46:11 virt-072 systemd: Started Cleanup of Temporary Directories.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0440.html