Bug 1449982

Summary:	SBD with Storage Integration Must NEVER Fall-Back to Quorum-Based-Watchdog-Self-Fencing on 2-node clusters: Data loss risk
Product:	Red Hat Enterprise Linux 7	Reporter:	Daniel Peess <dpeess>
Component:	pacemaker	Assignee:	Ken Gaillot <kgaillot>
Status:	CLOSED DUPLICATE	QA Contact:	cluster-qe <cluster-qe>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	7.4	CC:	abeekhof, cfeist, cluster-maint, kwenning, mlisik, mreinke
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-10-19 14:11:43 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Daniel Peess 2017-05-11 09:47:21 UTC

Description of problem:

In #1449155 Klaus Wenninger warned me that SBD might fall back to quorum-based watchdog self-fencing if poison-pill fencing fails and 'stonith-watchdog-timeout' is (still) set.

For 2-node clusters without auto-tie-breaker, this is a serious data loss risk:
2-node clusters without additional arbitrators never lose quorum,
so both nodes might fail to send poison pills,
wait for 'stonith-watchdog-timeout' to expire,
and then both take over exclusive resources anyway: data loss.
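
The risky combination described above can be expressed as a small check. This is only a sketch: the function name `watchdog_fallback_risk` is hypothetical, it assumes the quorum options appear as `two_node: 1` and `auto_tie_breaker: 1` lines in a corosync.conf-style file, and it treats an empty or zero 'stonith-watchdog-timeout' as disabled. It does not query a live cluster.

```shell
# Hypothetical helper (not part of pcs, pacemaker or sbd).
# $1: path to a corosync.conf-style file
# $2: value of stonith-watchdog-timeout (empty or 0 is assumed to mean disabled)
# Prints "risk" if watchdog-only self-fencing could be relied on in a
# 2-node cluster without auto-tie-breaker, otherwise "ok".
watchdog_fallback_risk() {
    conf=$1
    timeout=$2
    # With two_node: 1 each node stays quorate after a split, so quorum loss
    # never triggers self-fencing on either side.
    if grep -Eq '^[[:space:]]*two_node:[[:space:]]*1[[:space:]]*$' "$conf" &&
       ! grep -Eq '^[[:space:]]*auto_tie_breaker:[[:space:]]*1[[:space:]]*$' "$conf"; then
        case $timeout in
            ''|0|0s) echo "ok" ;;    # watchdog-only fencing not enabled
            *)       echo "risk" ;;  # both nodes may assume the peer self-fenced
        esac
    else
        echo "ok"
    fi
}
```

For example, `watchdog_fallback_risk /etc/corosync/corosync.conf 10s` would flag a 2-node cluster where the timeout is still set.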

Even after I ran:
$ pcs property unset stonith-watchdog-timeout;

I still get 'Relying on watchdog integration for fencing' in corosync.log.
Luckily, my SBD fencing agents work properly and fence via poison pill before that happens.
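
To see how often the message is logged, something like the following could be used. It is a minimal sketch: the function name `count_watchdog_msgs` is hypothetical, and the log path is passed as a parameter because its location (for example /var/log/cluster/corosync.log) is an assumption about the local setup.

```shell
# Hypothetical helper: count log lines containing the watchdog message.
# $1: path to the log file to search
count_watchdog_msgs() {
    # -F matches the text literally, -c prints the number of matching lines
    grep -cF -- 'Relying on watchdog integration for fencing' "$1"
}
```

Running it before and after unsetting the property shows whether new occurrences are still being written.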

Can we enforce that 2-node clusters never activate or fall back to watchdog-only self-fencing, regardless of whether stonith-watchdog-timeout is set?

Version-Release number of selected component (if applicable):
RHEL 7.3 with RHEL 7.4 dev SBD packages with poison-pill storage integration.

How to reproduce:
-) Set up SBD with poison-pill fencing.
-) Mistakenly set stonith-watchdog-timeout (for example, because the difference between the two SBD modes is not clear).
-) Put both SBD fencing agents into stopped mode.
-) Check whether both nodes start their RAs on split-brain.

Comment 9 Klaus Wenninger 2017-10-19 14:11:43 UTC

If anything were to be implemented solely for this issue, it would probably have to be done on the pacemaker side.
But as said before, this is probably most effectively handled by making the mechanism described in bz1443666 automatically remove all (2) cluster nodes from the list of nodes that are fenced via watchdog fencing if 2-node is enabled.

*** This bug has been marked as a duplicate of bug 1443666 ***