Bug 1566720 - corosync/pacemaker fences a node without a working stonith device
Summary: corosync/pacemaker fences a node without a working stonith device
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: doc-High_Availability_Add-On_Reference
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Assignee: Steven J. Levine
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-12 20:50 UTC by Strahil Nikolov
Modified: 2019-03-06 01:10 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-27 21:00:36 UTC
Target Upstream Version:



Description Strahil Nikolov 2018-04-12 20:50:10 UTC
Description of problem:
When all members of a cluster are in standby mode (all resources, including stonith, are stopped), crmd can still fence a node even though the stonith resource that would perform the fencing is stopped.

Version-Release number of selected component (if applicable):
corosync-2.4.3-2.el7.x86_64
corosynclib-2.4.3-2.el7.x86_64
pacemaker-1.1.18-11.el7.x86_64
pacemaker-cli-1.1.18-11.el7.x86_64
pacemaker-cluster-libs-1.1.18-11.el7.x86_64
pacemaker-doc-1.1.18-11.el7.x86_64
pacemaker-libs-1.1.18-11.el7.x86_64
pcs-0.9.162-5.el7_5.1.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Set up a 2-node cluster (2 VMs on the same host) with fence_xvm stonith resources.
2. Test stonith via:
# pcs stonith fence nodeA
3. Set all nodes to standby:
# pcs node standby --all
4. Crash one of the machines:
# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger

Actual results:
The crashed node is fenced even though the stonith resource is stopped.


Expected results:
The crashed node creates a kdump image and then reboots (as configured in /etc/kdump.conf).

Additional info:
Partial Logs from the partner node:
Apr 12 23:39:35 iscsi2 stonith-ng[1393]:  notice: Node iscsi1 state is now lost
Apr 12 23:39:35 iscsi2 stonith-ng[1393]:  notice: Purged 1 peer with id=1 and/or uname=iscsi1 from the membership cache
Apr 12 23:39:35 iscsi2 crmd[1397]:  notice: Stonith/shutdown of iscsi1 not matched
Apr 12 23:39:35 iscsi2 crmd[1397]:  notice: Stonith/shutdown of iscsi1 not matched
Apr 12 23:39:36 iscsi2 pengine[1396]: warning: Cluster node iscsi1 will be fenced: peer is no longer part of the cluster
Apr 12 23:39:36 iscsi2 pengine[1396]: warning: Scheduling Node iscsi1 for STONITH
Apr 12 23:39:36 iscsi2 pengine[1396]:  notice:  * Fence (reboot) iscsi1 'peer is no longer part of the cluster'
Apr 12 23:39:36 iscsi2 stonith-ng[1393]:  notice: Client crmd.1397.eec4355e wants to fence (reboot) 'iscsi1' with device '(any)'
Apr 12 23:39:36 iscsi2 stonith-ng[1393]:  notice: Requesting peer fencing (reboot) of iscsi1
Apr 12 23:39:36 iscsi2 stonith-ng[1393]:  notice: XVM can fence (reboot) iscsi1: dynamic-list
Apr 12 23:39:37 iscsi2 stonith-ng[1393]:  notice: Operation 'reboot' [1783] (call 3 from crmd.1397) for host 'iscsi1' with device 'XVM' returned: 0 (OK)
Apr 12 23:39:37 iscsi2 stonith-ng[1393]:  notice: Operation reboot of iscsi1 by iscsi2 for crmd.1397@iscsi2.af42fbe3: OK
Apr 12 23:39:37 iscsi2 crmd[1397]:  notice: Stonith operation 3/1:3:0:ede2610b-a8cb-4de0-afb6-c8dadb8de42e: OK (0)
Apr 12 23:39:59 iscsi2 stonith-ng[1393]:  notice: Node iscsi1 state is now member

Comment 2 Ken Gaillot 2018-04-12 21:30:14 UTC
Hi,

It is intentional behavior that any cluster node can fence any other cluster node with any fence device, regardless of whether the fence resource is started or stopped. Whether the resource is started controls only the recurring monitor for the device, not whether it can be used. The exceptions are:

* Configuring stonith-enabled=false will disable fencing altogether (note that Red Hat does not support clusters when fencing is disabled, as it is not suitable for a production environment)

* Disabling the fence device (pcs stonith disable) will prevent any node from using that device, and banning a fence device from a node (pcs constraint location ... avoids) will prevent that node from using the device.
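For reference, the three exceptions above correspond to the following pcs commands; a sketch only, using the device name (XVM) and node names (iscsi1/iscsi2) from the logs in this report:

# Disable fencing cluster-wide (not supported by Red Hat for production):
# pcs property set stonith-enabled=false
#
# Disable a specific fence device so that no node can use it:
# pcs stonith disable XVM
#
# Ban the fence device from a particular node so that node cannot use it:
# pcs constraint location XVM avoids iscsi2

Note that putting nodes in standby (pcs node standby) stops the fence resource's recurring monitor but, per the above, does not prevent the device from being used for fencing.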

How was the fence device stopped in this case?

Comment 3 Strahil Nikolov 2018-04-13 16:15:55 UTC
Dear Ken,

thank you very much for your detailed clarification.
I put both nodes in "standby", which stopped all cluster resources, including the fence_xvm resource.
As I didn't explicitly stop (disable) the stonith resource, the cluster retained the ability to fence the unresponsive node, and it seems it did its job properly.

I think that this clarification is very suitable for "5.2. General Properties of Fencing Devices" section of the documentation.

Let's mark this as NOTABUG.

Comment 4 Ken Gaillot 2018-04-13 22:35:17 UTC
(In reply to Strahil Nikolov from comment #3)
> I think that this clarification is very suitable for "5.2. General
> Properties of Fencing Devices" section of the documentation.

Good idea, reassigning this as a documentation bug.

Docs: We want to clarify when a fence device can be used, per Comment 2.

