Bug 1566720 - corosync/pacemaker fences a node without a working stonith device
Summary: corosync/pacemaker fences a node without a working stonith device
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: doc-High_Availability_Add-On_Reference
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Assignee: Steven J. Levine
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-12 20:50 UTC by Strahil Nikolov
Modified: 2019-03-06 01:10 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-27 21:00:36 UTC
Target Upstream Version:



Description Strahil Nikolov 2018-04-12 20:50:10 UTC
Description of problem:
When all members of a cluster are in standby mode (all resources, including stonith, are stopped), crmd can still fence a node even though the stonith resource that would perform the fencing is stopped.

Version-Release number of selected component (if applicable):
corosync-2.4.3-2.el7.x86_64
corosynclib-2.4.3-2.el7.x86_64
pacemaker-1.1.18-11.el7.x86_64
pacemaker-cli-1.1.18-11.el7.x86_64
pacemaker-cluster-libs-1.1.18-11.el7.x86_64
pacemaker-doc-1.1.18-11.el7.x86_64
pacemaker-libs-1.1.18-11.el7.x86_64
pcs-0.9.162-5.el7_5.1.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Set up a 2-node cluster (2 VMs on the same host) with fence_xvm stonith resources.
2. Test stonith via:
# pcs stonith fence nodeA
3. Set all nodes to standby:
# pcs node standby --all
4. Crash one of the machines:
# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger

Actual results:
The crashed node is fenced even though the stonith resource is stopped.


Expected results:
The crashed node creates a kdump image and then reboots (as configured in /etc/kdump.conf).

Additional info:
Partial Logs from the partner node:
Apr 12 23:39:35 iscsi2 stonith-ng[1393]:  notice: Node iscsi1 state is now lost
Apr 12 23:39:35 iscsi2 stonith-ng[1393]:  notice: Purged 1 peer with id=1 and/or uname=iscsi1 from the membership cache
Apr 12 23:39:35 iscsi2 crmd[1397]:  notice: Stonith/shutdown of iscsi1 not matched
Apr 12 23:39:35 iscsi2 crmd[1397]:  notice: Stonith/shutdown of iscsi1 not matched
Apr 12 23:39:36 iscsi2 pengine[1396]: warning: Cluster node iscsi1 will be fenced: peer is no longer part of the cluster
Apr 12 23:39:36 iscsi2 pengine[1396]: warning: Scheduling Node iscsi1 for STONITH
Apr 12 23:39:36 iscsi2 pengine[1396]:  notice:  * Fence (reboot) iscsi1 'peer is no longer part of the cluster'
Apr 12 23:39:36 iscsi2 stonith-ng[1393]:  notice: Client crmd.1397.eec4355e wants to fence (reboot) 'iscsi1' with device '(any)'
Apr 12 23:39:36 iscsi2 stonith-ng[1393]:  notice: Requesting peer fencing (reboot) of iscsi1
Apr 12 23:39:36 iscsi2 stonith-ng[1393]:  notice: XVM can fence (reboot) iscsi1: dynamic-list
Apr 12 23:39:37 iscsi2 stonith-ng[1393]:  notice: Operation 'reboot' [1783] (call 3 from crmd.1397) for host 'iscsi1' with device 'XVM' returned: 0 (OK)
Apr 12 23:39:37 iscsi2 stonith-ng[1393]:  notice: Operation reboot of iscsi1 by iscsi2 for crmd.1397@iscsi2.af42fbe3: OK
Apr 12 23:39:37 iscsi2 crmd[1397]:  notice: Stonith operation 3/1:3:0:ede2610b-a8cb-4de0-afb6-c8dadb8de42e: OK (0)
Apr 12 23:39:59 iscsi2 stonith-ng[1393]:  notice: Node iscsi1 state is now member

Comment 2 Ken Gaillot 2018-04-12 21:30:14 UTC
Hi,

It is intentional behavior that any cluster node can fence any other cluster node with any fence device, regardless of whether the fence resource is started or stopped. Whether the resource is started controls only the recurring monitor for the device, not whether it can be used. The exceptions are:

* Configuring stonith-enabled=false will disable fencing altogether (note that Red Hat does not support clusters when fencing is disabled, as it is not suitable for a production environment)

* Disabling the fence device (pcs stonith disable) will prevent any node from using that device, and banning a fence device from a node (pcs constraint location ... avoids) will prevent that node from using the device.
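For reference, the three exceptions above correspond to the following pcs commands; a sketch only, using the device name (XVM) and node names (iscsi1/iscsi2) from the logs in this report:

# Disable fencing cluster-wide (not supported by Red Hat for production):
# pcs property set stonith-enabled=false
#
# Disable a specific fence device so that no node can use it:
# pcs stonith disable XVM
#
# Ban the fence device from a particular node so that node cannot use it:
# pcs constraint location XVM avoids iscsi2

Note that putting nodes in standby (pcs node standby) stops the fence resource's recurring monitor but, per the above, does not prevent the device from being used for fencing.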

How was the fence device stopped in this case?

Comment 3 Strahil Nikolov 2018-04-13 16:15:55 UTC
Dear Ken,

thank you very much for your detailed clarification.
I put both nodes in "standby", which stopped all cluster resources, including the fence_xvm resource.
As I didn't explicitly stop (disable) the stonith resource, the cluster retained the ability to fence the unresponsive node, and it seems it did its job properly.

I think that this clarification is very suitable for "5.2. General Properties of Fencing Devices" section of the documentation.

Let's mark this as NOTABUG.

Comment 4 Ken Gaillot 2018-04-13 22:35:17 UTC
(In reply to Strahil Nikolov from comment #3)
> I think that this clarification is very suitable for "5.2. General
> Properties of Fencing Devices" section of the documentation.

Good idea, reassigning this as a documentation bug.

Docs: We want to clarify when a fence device can be used, per Comment 2.

