Bug 1702727 - sbd doesn't detect non-responsive corosync-daemon
Summary: sbd doesn't detect non-responsive corosync-daemon
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: sbd
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 8.1
Assignee: Klaus Wenninger
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-24 15:04 UTC by Klaus Wenninger
Modified: 2020-11-14 11:04 UTC
CC List: 6 users

Fixed In Version: sbd-1.4.0-10.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-05 20:46:42 UTC
Type: Bug
Target Upstream Version:
Embargoed:
Flags: pm-rhel: mirror+


Attachments: none


Links:
Red Hat Product Errata RHBA-2019:3344 (last updated 2019-11-05 20:46:55 UTC)

Description Klaus Wenninger 2019-04-24 15:04:40 UTC
Description of problem:

At startup sbd connects to corosync via the same CPG protocol
that pacemaker uses, but afterwards it doesn't detect corosync
becoming unresponsive. (Other nodes disconnecting is detected,
though.)

This behaviour is critical because the node where corosync has
become non-responsive still believes it is properly connected
and that everything is fine, so it sees no reason to shut down
resources or suicide.
The other nodes, on the other hand, see the first node disappear
and expect it to suicide within stonith-watchdog-timeout seconds.
Once that timeout expires they start taking over resources and
thus create a split-brain situation.

This behaviour is critical both in setups with watchdog-fencing
and in setups with a shared disk.
While the danger described above is obvious with watchdog-fencing,
with a shared disk a simultaneous loss of access to the disk and
corosync becoming non-responsive leads to split-brain as well
(the local pacemaker isn't properly informed via corosync that
the node itself has issues, and an outdated CIB is passed to sbd).
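
For reference, watchdog-fencing with sbd is typically enabled
roughly as below (a minimal sketch; the device-less setup and the
10-second timeout value are illustrative assumptions, not taken
from this report):

> node$ pcs cluster stop --all
> node$ pcs stonith sbd enable                        # watchdog-only mode, no shared device
> node$ pcs cluster start --all
> node$ pcs property set stonith-watchdog-timeout=10  # should exceed SBD_WATCHDOG_TIMEOUT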

Version-Release number of selected component (if applicable):

1.3.1-18

How reproducible:

100%

Steps to Reproduce:
1. Setup a 3-node-cluster using sbd with watchdog-fencing
2. Issue 'killall -STOP corosync' on 1st cluster-node
3. Check status of cluster using crm_mon on 1st cluster-node
4. Check status of cluster using crm_mon on 2nd cluster-node
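
In other words (a sketch of the reproduction; node1/node2 are
placeholder host names):

> node1$ killall -STOP corosync   # freeze corosync without killing it
> node1$ crm_mon -1               # bug: node1 still reports all nodes online
> node2$ crm_mon -1               # node2 reports node1 as UNCLEAN / offline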

Actual results:

On the 1st node all nodes appear to be fine, while the 2nd node
shows the true state of the cluster: the 1st node is gone.

Expected results:

The 1st node should detect that corosync isn't responsive anymore
and thus suicide.


Additional info:

Comment 3 michal novacek 2019-09-06 14:15:58 UTC
I have verified that sbd fencing works correctly when corosync dies in sbd-1.4.0-15.el8.x86_64.

---

Before the fix: sbd-1.4.0-8.el8.x86_64

> virt-148$ killall -STOP corosync
> virt-148$ sleep 60 && crm_mon 1
Cluster name: STSRHTS27505
Stack: corosync
Current DC: virt-151.ipv6 (version 2.0.2-3.el8-744a30d655) - partition with quorum
Last updated: Fri Sep  6 16:13:17 2019
Last change: Fri Sep  6 15:50:11 2019 by root via cibadmin on virt-148

3 nodes configured
6 resources configured

Online: [ virt-148 virt-150 virt-151.ipv6 ]

Full list of resources:

 Clone Set: locking-clone [locking]
     Started: [ virt-148 virt-150 virt-151.ipv6 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
  sbd: active/enabled

> virt-150$ crm_mon -1
Stack: corosync
Current DC: virt-151.ipv6 (version 2.0.2-3.el8-744a30d655) - partition with quorum
Last updated: Fri Sep  6 16:13:49 2019
Last change: Fri Sep  6 15:50:11 2019 by root via cibadmin on virt-148

3 nodes configured
6 resources configured

Node virt-148: UNCLEAN (offline)
Online: [ virt-150 virt-151.ipv6 ]

Active resources:

 Clone Set: locking-clone [locking]
     Resource Group: locking:1
         dlm	(ocf::pacemaker:controld):	Started virt-148 (UNCLEAN)
         lvmlockd	(ocf::heartbeat:lvmlockd):	Started virt-148 (UNCLEAN)
     Started: [ virt-150 virt-151.ipv6 ]

Failed Fencing Actions:
* reboot of virt-148 failed: delegate=, client=stonith-api.8855, origin=virt-150,
    last-failed='Fri Sep  6 16:13:49 2019'

---

Fixed version: sbd-1.4.0-15.el8.x86_64

> virt-148$ date
Fri Sep  6 15:47:49 CEST 2019
> virt-148$ killall -STOP corosync

> virt-150$ tail -f /var/log/cluster/corosync.log
...

Sep 06 15:48:24 [25933] virt-150 corosync info    [KNET  ] link: host: 1 link: 0 is down
Sep 06 15:48:24 [25933] virt-150 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 06 15:48:24 [25933] virt-150 corosync warning [KNET  ] host: host: 1 has no active links
Sep 06 15:48:24 [25933] virt-150 corosync notice  [TOTEM ] Token has not been received in 1237 ms
Sep 06 15:48:25 [25933] virt-150 corosync notice  [TOTEM ] A processor failed, forming new configuration.
Sep 06 15:48:27 [25933] virt-150 corosync notice  [TOTEM ] A new membership (2:28) was formed. Members left: 1
Sep 06 15:48:27 [25933] virt-150 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 1
Sep 06 15:48:27 [25933] virt-150 corosync warning [CPG   ] downlist left_list: 1 received
Sep 06 15:48:27 [25933] virt-150 corosync warning [CPG   ] downlist left_list: 1 received
Sep 06 15:48:27 [25933] virt-150 corosync notice  [QUORUM] Members[2]: 2 3
Sep 06 15:48:27 [25933] virt-150 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 06 15:50:08 [25933] virt-150 corosync info    [KNET  ] rx: host: 1 link: 0 is up
Sep 06 15:50:08 [25933] virt-150 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 06 15:50:08 [25933] virt-150 corosync notice  [TOTEM ] A new membership (1:32) was formed. Members joined: 1
Sep 06 15:50:08 [25933] virt-150 corosync warning [CPG   ] downlist left_list: 0 received
Sep 06 15:50:08 [25933] virt-150 corosync warning [CPG   ] downlist left_list: 0 received
Sep 06 15:50:08 [25933] virt-150 corosync warning [CPG   ] downlist left_list: 0 received
Sep 06 15:50:08 [25933] virt-150 corosync notice  [QUORUM] Members[3]: 1 2 3
Sep 06 15:50:08 [25933] virt-150 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.

Comment 5 errata-xmlrpc 2019-11-05 20:46:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3344

