Bug 1702727

Summary: sbd doesn't detect non-responsive corosync daemon
Product: Red Hat Enterprise Linux 8
Version: 8.0
Component: sbd
Reporter: Klaus Wenninger <kwenning>
Assignee: Klaus Wenninger <kwenning>
QA Contact: cluster-qe <cluster-qe>
CC: cfeist, cluster-maint, jfriesse, kgaillot, mlisik, mnovacek
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Target Milestone: rc
Target Release: 8.1
Hardware: Unspecified
OS: Unspecified
Flags: pm-rhel: mirror+
Fixed In Version: sbd-1.4.0-10.el8
Last Closed: 2019-11-05 20:46:42 UTC
Type: Bug

Description Klaus Wenninger 2019-04-24 15:04:40 UTC
Description of problem:

At startup, sbd connects to corosync via the same CPG protocol that
pacemaker uses, but afterwards it doesn't detect corosync becoming
unresponsive. (Other nodes disconnecting is detected, though.)
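
For illustration, here is a minimal, hypothetical sketch of how a CPG
client can detect a frozen corosync daemon. The open socket alone reveals
nothing when corosync is merely SIGSTOPped, so the client multicasts a
heartbeat to itself and treats missing self-delivery as unresponsiveness.
This is only a sketch of the general technique against the libcpg API,
not the actual sbd patch; the group name, timeouts, and exit behaviour
are made up.

/* cpg_liveness.c -- hypothetical demo; build with: gcc cpg_liveness.c -lcpg */
#include <corosync/cpg.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>

static time_t last_self_delivery;

/* Called by cpg_dispatch() for every message delivered to the group. */
static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
    uint32_t local = 0;

    /* Only our own heartbeat coming back proves corosync is alive. */
    if (cpg_local_get(handle, &local) == CS_OK &&
        nodeid == local && pid == (uint32_t)getpid()) {
        last_self_delivery = time(NULL);
    }
}

int main(void)
{
    cpg_handle_t handle;
    cpg_callbacks_t callbacks = { .cpg_deliver_fn = deliver_cb };
    struct cpg_name group = { .length = 13, .value = "liveness-demo" };
    int fd;

    if (cpg_initialize(&handle, &callbacks) != CS_OK ||
        cpg_join(handle, &group) != CS_OK) {
        fprintf(stderr, "cannot connect to corosync\n");
        return 1;
    }
    cpg_fd_get(handle, &fd);
    last_self_delivery = time(NULL);

    for (;;) {
        struct iovec iov = { .iov_base = (void *)"ping", .iov_len = 4 };
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        /* Send a heartbeat through corosync back to ourselves. */
        cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);

        if (poll(&pfd, 1, 1000) > 0) {
            cpg_dispatch(handle, CS_DISPATCH_ALL);
        }
        /* A frozen corosync keeps the socket open but never echoes the
         * heartbeat back -- exactly the condition this bug is about.  */
        if (time(NULL) - last_self_delivery > 5) {
            fprintf(stderr, "corosync unresponsive, would self-fence now\n");
            return 2;
        }
        sleep(1);
    }
}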

This behaviour is critical because the node where corosync has become
non-responsive still believes it is properly connected and that
everything is fine, so it sees no reason to shut down resources or
suicide.
The other nodes, on the other hand, see the first node disappear and
expect it to suicide within stonith-watchdog-timeout seconds. After
that timeout they start taking over resources and thus create a
split-brain situation.

This behaviour is critical both in setups with watchdog-fencing and in
those with a shared disk.
While the danger described above is obvious with watchdog-fencing, with
a shared disk a simultaneous failure of access to the disk and corosync
becoming non-responsive leads to split-brain as well (the local
pacemaker isn't properly informed via corosync that the node itself has
issues, and a stale CIB is passed to sbd).

Version-Release number of selected component (if applicable):

1.3.1-18

How reproducible:

100%

Steps to Reproduce:
1. Set up a 3-node cluster using sbd with watchdog-fencing
2. Issue 'killall -STOP corosync' on the 1st cluster node
3. Check the status of the cluster using crm_mon on the 1st cluster node
4. Check the status of the cluster using crm_mon on the 2nd cluster node

Actual results:

On the 1st node all nodes appear to be fine, while the 2nd node shows
the true state of the cluster: the 1st node is gone.

Expected results:

The 1st node should detect that corosync is no longer responsive and
thus suicide.
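
For context, "suicide" here means self-fencing via the hardware
watchdog. Below is a minimal, hypothetical sketch of that mechanism
(not sbd's actual code): the daemon keeps feeding /dev/watchdog while
healthy and simply stops when the health check fails, letting the
watchdog timer reset the node. corosync_alive() is a made-up stand-in
for a real check such as the CPG heartbeat sketched above.

/* watchdog_suicide.c -- hypothetical sketch of watchdog-based self-fencing */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Placeholder for the real health check (e.g. the CPG heartbeat above). */
static int corosync_alive(void) { return 1; }

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd < 0) { perror("/dev/watchdog"); return 1; }

    while (corosync_alive()) {
        (void)write(fd, "\0", 1);  /* feed ("tickle") the watchdog */
        sleep(1);
    }
    /* Deliberately stop feeding, and never write the magic 'V' that
     * would disarm the device: the kernel/hardware watchdog fires
     * after its timeout and resets the node. */
    for (;;)
        pause();
}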


Additional info:

Comment 3 michal novacek 2019-09-06 14:15:58 UTC
I have verified that sbd fencing works correctly when corosync dies, using sbd-1.4.0-15.el8.x86_64.

---

Before the fix: sbd-1.4.0-8.el8.x86_64

> virt-148$ killall -STOP corosync
> virt-148$ sleep 60 && crm_mon -1
Cluster name: STSRHTS27505
Stack: corosync
Current DC: virt-151.ipv6 (version 2.0.2-3.el8-744a30d655) - partition with quorum
Last updated: Fri Sep  6 16:13:17 2019
Last change: Fri Sep  6 15:50:11 2019 by root via cibadmin on virt-148

3 nodes configured
6 resources configured

Online: [ virt-148 virt-150 virt-151.ipv6 ]

Full list of resources:

 Clone Set: locking-clone [locking]
     Started: [ virt-148 virt-150 virt-151.ipv6 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
  sbd: active/enabled

> virt-150$ crm_mon -1
Stack: corosync
Current DC: virt-151.ipv6 (version 2.0.2-3.el8-744a30d655) - partition with quorum
Last updated: Fri Sep  6 16:13:49 2019
Last change: Fri Sep  6 15:50:11 2019 by root via cibadmin on virt-148

3 nodes configured
6 resources configured

Node virt-148: UNCLEAN (offline)
Online: [ virt-150 virt-151.ipv6 ]

Active resources:

 Clone Set: locking-clone [locking]
     Resource Group: locking:1
         dlm	(ocf::pacemaker:controld):	Started virt-148 (UNCLEAN)
         lvmlockd	(ocf::heartbeat:lvmlockd):	Started virt-148 (UNCLEAN)
     Started: [ virt-150 virt-151.ipv6 ]

Failed Fencing Actions:
* reboot of virt-148 failed: delegate=, client=stonith-api.8855, origin=virt-150,
    last-failed='Fri Sep  6 16:13:49 2019'

---

Fixed version: sbd-1.4.0-15.el8.x86_64

> virt-148$ date
Fri Sep  6 15:47:49 CEST 2019
> virt-148$ killall -STOP corosync

> virt-150$ tail -f /var/log/cluster/corosync.log
...

Sep 06 15:48:24 [25933] virt-150 corosync info    [KNET  ] link: host: 1 link: 0 is down
Sep 06 15:48:24 [25933] virt-150 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 06 15:48:24 [25933] virt-150 corosync warning [KNET  ] host: host: 1 has no active links
Sep 06 15:48:24 [25933] virt-150 corosync notice  [TOTEM ] Token has not been received in 1237 ms
Sep 06 15:48:25 [25933] virt-150 corosync notice  [TOTEM ] A processor failed, forming new configuration.
Sep 06 15:48:27 [25933] virt-150 corosync notice  [TOTEM ] A new membership (2:28) was formed. Members left: 1
Sep 06 15:48:27 [25933] virt-150 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 1
Sep 06 15:48:27 [25933] virt-150 corosync warning [CPG   ] downlist left_list: 1 received
Sep 06 15:48:27 [25933] virt-150 corosync warning [CPG   ] downlist left_list: 1 received
Sep 06 15:48:27 [25933] virt-150 corosync notice  [QUORUM] Members[2]: 2 3
Sep 06 15:48:27 [25933] virt-150 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 06 15:50:08 [25933] virt-150 corosync info    [KNET  ] rx: host: 1 link: 0 is up
Sep 06 15:50:08 [25933] virt-150 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 06 15:50:08 [25933] virt-150 corosync notice  [TOTEM ] A new membership (1:32) was formed. Members joined: 1
Sep 06 15:50:08 [25933] virt-150 corosync warning [CPG   ] downlist left_list: 0 received
Sep 06 15:50:08 [25933] virt-150 corosync warning [CPG   ] downlist left_list: 0 received
Sep 06 15:50:08 [25933] virt-150 corosync warning [CPG   ] downlist left_list: 0 received
Sep 06 15:50:08 [25933] virt-150 corosync notice  [QUORUM] Members[3]: 1 2 3
Sep 06 15:50:08 [25933] virt-150 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.

Comment 5 errata-xmlrpc 2019-11-05 20:46:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3344