Created attachment 884146 [details]
crm_report

Description of problem:
When the cluster property stonith-enabled is set to "false" and the cluster is running resources with the attribute 'on_fail=fence', fencing is performed despite 'stonith-enabled=false'.

Version-Release number of selected component (if applicable):
pacemaker-1.1.10-29.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. Set up a cluster with a stonith device and a cloned resource with the attribute 'on_fail=fence' (the standard cluster-qa setup test does this)
2. Run 'pcs cluster enable' on all nodes
3. Set the property: 'pcs property set stonith-enabled=false'
4. Make some node unclean, e.g. run 'reboot -f' on it
5. Look into the logs on the live nodes. Fencing is performed despite 'stonith-enabled=false'
6. After the node comes back, disable the resource with the attribute 'on_fail=fence'
7. Make some node unclean again, e.g. run 'reboot -f' on it
8. Look into the logs on the live nodes. No fencing is performed (the correct behavior when stonith-enabled=false)

Actual results:
When the cluster is running resources with the attribute 'on_fail=fence' and the property stonith-enabled=false is set, fencing is performed.

Expected results:
I expect the cluster property 'stonith-enabled=false' to take priority over the resource attribute 'on_fail=fence', so that no fencing is performed.

Additional info:
crm_report attached.
Although valid for pacemaker, stonith-enabled=false isn't supported (or supportable) by Red Hat. In particular, and as jkortus noticed, pacemaker will report this configuration as invalid:

jkortus: pengine[12955]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
jkortus: +1 for this message in syslog :-)

I'm inclined to close this as invalid.
While it is certainly not a clever thing to set on live instances, I would still like it to behave consistently and according to user expectations. Can we have it so that stonith-enabled=false really disables fencing? :)
We're not turning on fencing:

    } else if (safe_str_eq(value, "fence")) {
        action->on_fail = action_fail_fence;
        value = "node fencing";

        if (is_set(data_set->flags, pe_flag_stonith_enabled) == FALSE) {
            crm_config_err("Specifying on_fail=fence and"
                           " stonith-enabled=false makes no sense");
            action->on_fail = action_fail_stop;
            action->fail_role = RSC_ROLE_STOPPED;
            value = "stop resource";
        }

It seems to be because of:

Apr 8 16:37:05 virt-078 pengine[19006]: warning: pe_fence_node: Node virt-080.cluster-qe.lab.eng.brq.redhat.com is unclean because it is partially and/or un-expectedly down

which is the result of a botched cleanup 4 years ago :-(

The fix is: https://github.com/beekhof/pacemaker/commit/cfd845fc7
I have verified that the fencing does not occur with a resource set on-fail=fence when the cluster property stonith-enabled is set to false, with pacemaker-1.1.12-13.el7.x86_64.

---

[root@virt-069 ~]# pcs status
Cluster name: STSRHTS31212
Last updated: Mon Dec 1 14:51:53 2014
Last change: Mon Dec 1 14:43:19 2014
Stack: corosync
Current DC: virt-063 (1) - partition with quorum
Version: 1.1.12-a14efad
3 Nodes configured
10 Resources configured

Online: [ virt-063 virt-069 virt-072 ]

Full list of resources:

 fence-virt-063 (stonith:fence_xvm): Started virt-063
 fence-virt-069 (stonith:fence_xvm): Started virt-072
 fence-virt-072 (stonith:fence_xvm): Started virt-069
 Clone Set: dlm-clone [dlm]
     Stopped: [ virt-063 virt-069 virt-072 ]
 Clone Set: clvmd-clone [clvmd]
     Stopped: [ virt-063 virt-069 virt-072 ]
 le-dummy (ocf::heartbeat:Dummy): Started virt-072

Failed actions:

PCSD Status:
  virt-063: Online
  virt-069: Online
  virt-072: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@virt-069 ~]# pcs resource show le-dummy
 Resource: le-dummy (class=ocf provider=heartbeat type=Dummy)
  Operations: start interval=0s timeout=20 (le-dummy-start-timeout-20)
              stop interval=0s timeout=20 (le-dummy-stop-timeout-20)
              monitor interval=60s on-fail=fence (le-dummy-monitor-on-fail-fence)

[root@virt-069 ~]# iptables -A INPUT ! -i lo -p udp -j DROP && \
    iptables -A OUTPUT ! -o lo -p udp -j DROP

/var/log/messages:
Dec 1 14:46:04 virt-072 corosync[2695]: [TOTEM ] A processor failed, forming new configuration.
Dec 1 14:46:06 virt-072 corosync[2695]: [TOTEM ] A new membership (10.34.71.72:84) was formed. Members left: 1 2
Dec 1 14:46:06 virt-072 corosync[2695]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 1 14:46:06 virt-072 corosync[2695]: [QUORUM] Members[1]: 3
Dec 1 14:46:06 virt-072 corosync[2695]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 14:46:06 virt-072 attrd[2714]: notice: crm_update_peer_state: attrd_peer_change_cb: Node virt-063[1] - state is now lost (was member)
Dec 1 14:46:06 virt-072 attrd[2714]: notice: attrd_peer_remove: Removing all virt-063 attributes for attrd_peer_change_cb
Dec 1 14:46:06 virt-072 attrd[2714]: notice: attrd_peer_change_cb: Lost attribute writer virt-063
Dec 1 14:46:06 virt-072 attrd[2714]: notice: crm_update_peer_state: attrd_peer_change_cb: Node virt-069[2] - state is now lost (was member)
Dec 1 14:46:06 virt-072 attrd[2714]: notice: attrd_peer_remove: Removing all virt-069 attributes for attrd_peer_change_cb
Dec 1 14:46:06 virt-072 crmd[2716]: notice: pcmk_quorum_notification: Membership 84: quorum lost (1)
Dec 1 14:46:06 virt-072 crmd[2716]: notice: crm_update_peer_state: pcmk_quorum_notification: Node virt-069[2] - state is now lost (was member)
Dec 1 14:46:06 virt-072 kernel: dlm: closing connection to node 1
Dec 1 14:46:06 virt-072 kernel: dlm: closing connection to node 2
Dec 1 14:46:06 virt-072 pacemakerd[2710]: notice: pcmk_quorum_notification: Membership 84: quorum lost (1)
Dec 1 14:46:06 virt-072 pacemakerd[2710]: notice: crm_update_peer_state: pcmk_quorum_notification: Node virt-063[1] - state is now lost (was member)
Dec 1 14:46:06 virt-072 pacemakerd[2710]: notice: crm_update_peer_state: pcmk_quorum_notification: Node virt-069[2] - state is now lost (was member)
Dec 1 14:46:06 virt-072 crmd[2716]: warning: match_down_event: No match for shutdown action on 2
Dec 1 14:46:06 virt-072 crmd[2716]: notice: peer_update_callback: Stonith/shutdown of virt-069 not matched
Dec 1 14:46:06 virt-072 crmd[2716]: notice: crm_update_peer_state: pcmk_quorum_notification: Node virt-063[1] - state is now lost (was member)
Dec 1 14:46:06 virt-072 crmd[2716]: warning: match_down_event: No match for shutdown action on 1
Dec 1 14:46:06 virt-072 crmd[2716]: notice: peer_update_callback: Stonith/shutdown of virt-063 not matched
Dec 1 14:46:06 virt-072 crmd[2716]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec 1 14:46:06 virt-072 crmd[2716]: warning: match_down_event: No match for shutdown action on 1
Dec 1 14:46:06 virt-072 crmd[2716]: notice: peer_update_callback: Stonith/shutdown of virt-063 not matched
Dec 1 14:46:06 virt-072 crmd[2716]: warning: match_down_event: No match for shutdown action on 2
Dec 1 14:46:06 virt-072 crmd[2716]: notice: peer_update_callback: Stonith/shutdown of virt-069 not matched
Dec 1 14:46:07 virt-072 pengine[2715]: notice: cluster_status: We do not have quorum - fencing and resource management disabled
Dec 1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec 1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec 1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec 1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec 1 14:46:07 virt-072 pengine[2715]: error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense
Dec 1 14:46:07 virt-072 pengine[2715]: notice: LogActions: Start fence-virt-063 (virt-072 - blocked)
Dec 1 14:46:07 virt-072 pengine[2715]: notice: LogActions: Start fence-virt-069 (virt-072 - blocked)
Dec 1 14:46:07 virt-072 pengine[2715]: notice: LogActions: Start le-dummy (virt-072 - blocked)
Dec 1 14:46:07 virt-072 pengine[2715]: notice: process_pe_message: Calculated Transition 16: /var/lib/pacemaker/pengine/pe-input-18.bz2
Dec 1 14:46:07 virt-072 pengine[2715]: notice: process_pe_message: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues.
Dec 1 14:46:07 virt-072 crmd[2716]: notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-18.bz2): Complete
Dec 1 14:46:07 virt-072 crmd[2716]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec 1 14:46:11 virt-072 systemd: Starting Cleanup of Temporary Directories...
Dec 1 14:46:11 virt-072 systemd: Started Cleanup of Temporary Directories.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0440.html