Bug 1221680 - trying to configure a cluster with sbd and watchdog fencing causes endless resets
Summary: trying to configure a cluster with sbd and watchdog fencing causes endless resets
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sbd
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Andrew Beekhof
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-05-14 14:32 UTC by michal novacek
Modified: 2019-03-06 01:11 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-08-12 06:05:19 UTC
Target Upstream Version:


Attachments (Terms of Use)
pcs cluster report (430.12 KB, application/x-bzip)
2015-05-14 14:32 UTC, michal novacek
'pcs cluster report' after several reboots (560.91 KB, application/x-bzip)
2015-05-14 14:33 UTC, michal novacek
pcs cluster report (284.60 KB, application/x-bzip)
2015-06-02 14:38 UTC, michal novacek


Links
Red Hat Bugzilla 1134245 (last updated 2021-01-20 06:05:38 UTC)
Red Hat Bugzilla 1233590 (last updated 2021-01-20 06:05:38 UTC)

Internal Links: 1134245 1233590

Description michal novacek 2015-05-14 14:32:34 UTC
Created attachment 1025456 [details]
pcs cluster report

Description of problem:
I have a running two-node quorate pacemaker cluster with auto tie breaker. SBD
is configured but disabled. There is no fencing configured. (1)

After enabling sbd, both systems end up being reset endlessly. The only way I
can get out of that situation is by disabling sbd again.

I thought the problem might be that the cluster has too little time to come up
before it is reset by the watchdog, but setting stonith-watchdog-timeout to 30s
(which is plenty for the cluster on a virtual machine to come up) did not
change anything.

I'm attaching 'pcs cluster report' output from before and after, in the hope
that something relevant can be found there.


Version-Release number of selected component (if applicable):
corosync-2.3.4-5.el7.x86_64
pacemaker-1.1.12-22.el7_1.2.x86_64

How reproducible: always

Steps to reproduce:
1/ Have a working cluster with no fencing.
2/ systemctl enable sbd
3/ systemctl restart corosync

Actual result:
Endless reboot loop.

Expected results:
Cluster becomes quorate with no reboots.

Additional info:
(1)
[root@virt-050 ~]# corosync-quorumtool 
Quorum information
------------------
Date:             Thu May 14 15:06:42 2015
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          260
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2  
Flags:            Quorate AutoTieBreaker 

Membership information
----------------------
    Nodeid      Votes Name
         1          1 virt-050 (local)
         2          1 virt-051

[root@virt-050 ~]# pcs status
Cluster name: STSRHTS2609
Last updated: Thu May 14 15:05:39 2015
Last change: Thu May 14 14:58:57 2015
Stack: corosync
Current DC: virt-051 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
6 Resources configured


Online: [ virt-050 virt-051 ]

Full list of resources:

 Clone Set: dlm-clone [dlm]
     Started: [ virt-050 virt-051 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ virt-050 virt-051 ]

PCSD Status:
  virt-050: Online
  virt-051: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@virt-050 ~]# pcs property 
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS2609
 dc-version: 1.1.12-a14efad
 have-watchdog: false
 no-quorum-policy: freeze

[root@virt-050 ~]# systemctl status sbd
sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; disabled)
   Active: inactive (dead)


[root@virt-050 ~]# grep -v \# /etc/sysconfig/sbd | sort | uniq

SBD_DELAY_START=no
SBD_OPTS=-W
SBD_PACEMAKER=yes
SBD_STARTMODE=clean
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

Comment 1 michal novacek 2015-05-14 14:33:28 UTC
Created attachment 1025457 [details]
'pcs cluster report' after several reboots

Comment 3 Andrew Beekhof 2015-05-20 01:39:08 UTC
Two things:

> have-watchdog: false

This implies that sbd is not running or not properly configured.
This is backed up by:

>  Active: inactive (dead)

Probably because you've set 

> SBD_OPTS=-W

which disables the system watchdog. Don't do that.


Also:

> controld(dlm)[4556]:	2015/05/14_16:03:15 ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing

This is why the node is rebooting.
Make sure the cluster is working before adding the dlm.

Comment 4 michal novacek 2015-06-02 14:38:38 UTC
Created attachment 1033837 [details]
pcs cluster report

This report was taken after both nodes were rebooted by sbd.

Comment 5 michal novacek 2015-06-02 14:41:07 UTC
I deleted the stonith devices and the dlm/clvmd resources from the cluster and
removed -W from /etc/sysconfig/sbd. Both nodes are still rebooted shortly
after starting the cluster with 'pcs cluster start --all'.

----

virt-050# grep -v \# /etc/sysconfig/sbd | sort | uniq
SBD_DELAY_START=no
SBD_OPTS=
SBD_PACEMAKER=yes
SBD_STARTMODE=clean
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

virt-050# pcs config
Cluster Name: STSRHTS16341
Corosync Nodes:
 virt-050 virt-051 
Pacemaker Nodes:
 virt-050 virt-051 

Resources: 

Stonith Devices: 
Fencing Levels: 

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS16341
 dc-version: 1.1.12-a14efad
 have-watchdog: false
 no-quorum-policy: freeze
 stonith-watchdog-timeout: 0s

[root@virt-050 ~]# systemctl status sbd
sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; disabled)
   Active: inactive (dead)

[root@virt-050 ~]# pcs cluster stop --all --wait
virt-051: Stopping Cluster (pacemaker)...
virt-050: Stopping Cluster (pacemaker)...
virt-050: Stopping Cluster (corosync)...
virt-051: Stopping Cluster (corosync)...

[root@virt-050 ~]# rpm -q corosync pacemaker
corosync-2.3.4-5.el7.x86_64
pacemaker-1.1.12-22.el7_1.2.x86_64
[root@virt-050 ~]# pcs cluster status
Error: cluster is not currently running on this node

[root@virt-050 ~]# systemctl status corosync
corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled)
   Active: inactive (dead)

...
[root@virt-050 ~]# systemctl enable sbd
ln -s '/usr/lib/systemd/system/sbd.service' '/etc/systemd/system/corosync.service.requires/sbd.service'
[root@virt-050 ~]# ssh virt-051 systemctl enable sbd
Warning: Permanently added 'virt-051,10.34.71.51' (ECDSA) to the list of known hosts.
ln -s '/usr/lib/systemd/system/sbd.service' '/etc/systemd/system/corosync.service.requires/sbd.service'

[root@virt-050 ~]# pcs cluster start --all --wait
virt-051: Starting Cluster...
virt-050: Starting Cluster...

[root@virt-050 ~]# pcs config
Cluster Name: STSRHTS16341
Corosync Nodes:
 virt-050 virt-051 
Pacemaker Nodes:
 virt-050 virt-051 

Resources: 

Stonith Devices: 
Fencing Levels: 

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS16341
 dc-version: 1.1.12-a14efad
 have-watchdog: true
 no-quorum-policy: freeze
 stonith-watchdog-timeout: 0s

[root@virt-050 ~]# pcs status
Cluster name: STSRHTS16341
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Tue Jun  2 16:32:34 2015
Last change: Tue Jun  2 16:29:19 2015
Stack: corosync
Current DC: virt-051 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
0 Resources configured

Online: [ virt-050 virt-051 ]

Full list of resources:

PCSD Status:
  virt-050: Online
  virt-051: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

<reboot occurs after ~1s>

Comment 7 Andrew Beekhof 2015-07-15 03:39:46 UTC
Does /dev/watchdog exist?

Comment 8 Andrew Beekhof 2015-07-15 06:00:15 UTC
It seems you have:

        <nvpair id="cib-bootstrap-options-stonith-watchdog-timeout" name="stonith-watchdog-timeout" value="0s"/>

Fairly sure that wouldn't be helping.
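
As a hedged aside (not part of the original exchange): a zero value effectively gives the cluster no grace period for watchdog fencing. Common guidance is to set stonith-watchdog-timeout to roughly twice SBD_WATCHDOG_TIMEOUT (5s in this report); the 10s value below is an assumed example, not taken from this bug:

```shell
# Sketch, assumed values: raise stonith-watchdog-timeout well above
# SBD_WATCHDOG_TIMEOUT (5s in this report); 10s is an example, not a mandate.
pcs property set stonith-watchdog-timeout=10s
```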

Comment 9 Andrew Beekhof 2015-07-15 23:24:31 UTC
I cannot reproduce this, even with stonith-watchdog-timeout: 0s
There are also no logs from the nodes at the time of the reported symptoms.

Can I get access to the nodes?

Comment 11 Andrew Beekhof 2015-07-31 02:10:41 UTC
By adding this to the sysconfig file: 

SBD_OPTS="-v -v"

and tailing /var/log/messages, I was able to see that we were hitting this log message:

        LOGONCE(pcmk_health_unknown, LOG_WARNING, "Node state: UNKNOWN");

Essentially there is a mismatch between the name the cluster knows the node by and the name sbd knows it by.

Setting SBD_OPTS="-n virt-073" allowed the node to start normally.
I will change this message in a future version to be:

        LOGONCE(pcmk_health_unknown, LOG_WARNING, "Node state: %s is UNKNOWN", local_uname);

Comment 12 michal novacek 2015-08-06 08:25:42 UTC
To keep this information for future use: the node names pacemaker uses need
to match the names sbd uses.

To achieve that, either use $(uname -n) for the node names (which is the sbd
default) OR pass '-n <node name>' in the SBD_OPTS parameter in
/etc/sysconfig/sbd to tell sbd which names you use for the nodes.
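
The name check described in this comment can be sketched as a small shell illustration (the node names below are made-up examples, not taken from this cluster; in practice the pacemaker-side name would come from corosync.conf or 'crm_node -n'):

```shell
# Illustration with made-up example names (not from this cluster).
pcmk_name="virt-073"                 # name the cluster (pacemaker) uses for this node
sbd_default="virt-073.example.com"   # what 'uname -n' might return: sbd's default
if [ "$pcmk_name" != "$sbd_default" ]; then
    # Mismatch: sbd would report "Node state: UNKNOWN".
    # Pinning the name in /etc/sysconfig/sbd avoids it:
    echo "SBD_OPTS=\"-n $pcmk_name\""
fi
```

Running this prints the SBD_OPTS line to add to /etc/sysconfig/sbd when the names differ.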

I believe this can be closed as NOTABUG.

Comment 13 Andrew Beekhof 2015-08-12 06:05:19 UTC
Ack. Closing

