Bug 1449982

Summary: SBD with Storage Integration Must NEVER Fall-Back to Quorum-Based-Watchdog-Self-Fencing on 2-node clusters: Data loss risk
Product: Red Hat Enterprise Linux 7
Reporter: Daniel Peess <dpeess>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED DUPLICATE
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Priority: unspecified
Version: 7.4
CC: abeekhof, cfeist, cluster-maint, kwenning, mlisik, mreinke
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Last Closed: 2017-10-19 14:11:43 UTC
Type: Bug

Description Daniel Peess 2017-05-11 09:47:21 UTC
Description of problem:

In #1449155 Klaus Wenninger warned me that SBD might fall back to quorum-based watchdog self-fencing if poison-pill fencing fails and 'stonith-watchdog-timeout' is (still) set.

For 2-node clusters without auto-tie-breaker, this is a serious data loss risk:
2-node clusters without additional arbitrators never lose quorum,
so both nodes might fail to send poison pills,
wait for 'stonith-watchdog-timeout' to expire,
assume the peer has self-fenced,
and both take over exclusive resources anyway: data loss.
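
The risk condition described above can be written down as a small configuration check. This is only a sketch, not something pacemaker, sbd, or pcs provides: the function names `two_node_enabled` and `watchdog_fallback_risk` are hypothetical, it assumes that `two_node: 1` in the corosync quorum section marks a 2-node cluster, and it assumes that an unset or zero 'stonith-watchdog-timeout' means watchdog-only fencing is not configured. It does not look at auto-tie-breaker at all.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of pcs, pacemaker, or sbd.

# two_node_enabled CONF_FILE
# Succeeds if the corosync.conf-style file contains a "two_node: 1" line.
two_node_enabled() {
    grep -Eq '^[[:space:]]*two_node:[[:space:]]*1[[:space:]]*$' "$1"
}

# watchdog_fallback_risk CONF_FILE TIMEOUT
# TIMEOUT is the current stonith-watchdog-timeout value ("" if unset).
# Assumption: "", 0, and 0s all mean the timeout is not in effect.
# Prints RISK if the timeout is set on a 2-node cluster, otherwise OK.
watchdog_fallback_risk() {
    conf=$1
    timeout=$2
    case $timeout in
        ''|0|0s) echo OK; return 0 ;;
    esac
    if two_node_enabled "$conf"; then
        echo RISK
    else
        echo OK
    fi
}
```

On a cluster node this might be called as `watchdog_fallback_risk /etc/corosync/corosync.conf "$value"`, where `$value` is whatever the cluster currently reports for 'stonith-watchdog-timeout'.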

Even after I run:
$ pcs property unset stonith-watchdog-timeout;

I still get 'Relying on watchdog integration for fencing' in corosync.log.
Luckily, my SBD fencing agents work properly and fence via poison pill before that happens.
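
To check whether that message keeps appearing after the property has been unset, the log can be scanned for it. A minimal sketch: the function name `count_watchdog_fallback` is hypothetical, and the log path shown in the usage note is an assumption that should be adjusted to wherever corosync.log lives on the node.

```shell
#!/bin/sh
# count_watchdog_fallback LOG_FILE
# Prints how many lines of LOG_FILE contain the message quoted above.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable under "set -e"; the count itself is still printed.
count_watchdog_fallback() {
    grep -cF 'Relying on watchdog integration for fencing' "$1" || true
}
```

For example, `count_watchdog_fallback /var/log/cluster/corosync.log` (assumed path) run before and after the `pcs property unset` call shows whether new occurrences are still being written.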

Can we enforce that 2-node clusters never activate or fall back to watchdog-only self-fencing, regardless of whether stonith-watchdog-timeout is set?

Version-Release number of selected component (if applicable):
RHEL 7.3 with RHEL 7.4 dev SBD packages with poison-pill storage integration.

How to reproduce:
-) Set up SBD with poison-pill fencing.
-) Mistakenly set stonith-watchdog-timeout (because you do not know exactly what the difference between those 2 SBD modes is).
-) Put both SBD fencing agents into the stopped state.
-) Check whether both nodes start their resource agents (RAs) on split-brain.

Comment 9 Klaus Wenninger 2017-10-19 14:11:43 UTC
If anything were implemented solely for this issue, it would probably have to be done on the pacemaker side.
But as said before, this is probably handled most effectively by having the mechanism described in bz1443666 automatically remove all (2) cluster nodes from the list of nodes that are fenced via watchdog fencing if 2-node is enabled.

*** This bug has been marked as a duplicate of bug 1443666 ***