Bug 1449982 - SBD with Storage Integration Must NEVER Fall-Back to Quorum-Based-Watchdog-Self-Fencing on 2-node clusters: Data loss risk
Summary: SBD with Storage Integration Must NEVER Fall-Back to Quorum-Based-Watchdog-Self-Fencing on 2-node clusters: Data loss risk
Keywords:
Status: CLOSED DUPLICATE of bug 1443666
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.4
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-11 09:47 UTC by Daniel Peess
Modified: 2020-12-14 08:39 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-19 14:11:43 UTC
Target Upstream Version:
Embargoed:


Attachments: (none)

Description Daniel Peess 2017-05-11 09:47:21 UTC
Description of problem:

In #1449155, Klaus Wenninger warned me that SBD might fall back to quorum-based watchdog self-fencing if poison-pill fencing fails and 'stonith-watchdog-timeout' is (still) set.

For 2-node clusters without an auto-tie-breaker, this is a serious data loss risk:
a 2-node cluster without an additional arbitrator never loses quorum,
so both nodes might fail to deliver the poison pill,
wait out 'stonith-watchdog-timeout',
and then both take over the exclusive resources anyway: data loss.
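
For context, roughly what such a setup looks like; the shared-disk path, the timeout value, and the exact pcs syntax below are from memory of the 7.4 packages, so treat this as an example rather than my literal configuration:

# quorum section of /etc/corosync/corosync.conf on a typical 2-node cluster:
quorum {
    provider: corosync_votequorum
    two_node: 1
}

$ pcs stonith sbd enable --device=/dev/disk/by-id/example-shared-disk
$ pcs property set stonith-watchdog-timeout=10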

Even after I set:
$ pcs property unset stonith-watchdog-timeout;

I still get 'Relying on watchdog integration for fencing' in corosync.log.
Luckily my SBD fencing agents work properly and fence via poison pill before that happens.
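
For completeness, this is how I double-check that the property is really gone (pcs/pacemaker output details may vary by version):

$ pcs property show stonith-watchdog-timeout
$ crm_attribute --query --name stonith-watchdog-timeout --type crm_config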

Can we enforce that 2-node clusters never activate or fall back to watchdog-only self-fencing behaviour, regardless of whether stonith-watchdog-timeout is set?

Version-Release number of selected component (if applicable):
RHEL 7.3 with the RHEL 7.4 development SBD packages that include poison-pill storage integration.

How to reproduce:
-) Set up SBD with poison-pill fencing.
-) Mistakenly set stonith-watchdog-timeout as well (because the difference between the two SBD modes is not obvious).
-) Put both SBD fencing agents into stopped mode.
-) Check whether both nodes start their RAs on split-brain (a rough command sketch follows this list).
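
A rough sketch of those steps as commands; the stonith resource id 'sbd-fencing' is made up, and this should obviously only be tried on a test cluster:

$ pcs property set stonith-watchdog-timeout=10   # the mistaken setting
$ pcs stonith disable sbd-fencing                # stop the poison-pill fence agent(s)
# ...then break cluster communication between the two nodes (e.g. block the corosync ports)
$ pcs status
$ grep -i watchdog /var/log/cluster/corosync.log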

Comment 9 Klaus Wenninger 2017-10-19 14:11:43 UTC
If any implementation were done solely for this issue, it would probably have to be on the Pacemaker side.
But as said before, this is probably most effectively taken care of by making the mechanism described in bz1443666 automatically remove all (2) cluster nodes from the list of nodes that are fenced via watchdog-fencing when 2-node mode is enabled.
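
(For anyone wanting to check whether 2-node mode is active, the flag shows up in corosync-quorumtool; output line shown from memory:)

$ corosync-quorumtool -s | grep Flags
Flags:            2Node Quorate
$ grep -A3 'quorum {' /etc/corosync/corosync.conf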

*** This bug has been marked as a duplicate of bug 1443666 ***

