Bug 1221680

Summary: trying to configure cluster with sbd and watchdog fencing causes endless resets
Product:          Red Hat Enterprise Linux 7
Component:        sbd
Version:          7.1
Hardware:         Unspecified
OS:               Unspecified
Status:           CLOSED NOTABUG
Severity:         unspecified
Priority:         unspecified
Reporter:         michal novacek <mnovacek>
Assignee:         Andrew Beekhof <abeekhof>
QA Contact:       cluster-qe <cluster-qe>
CC:               abeekhof, cfeist, djansa
Keywords:         TestBlocker
Target Milestone: rc
Doc Type:         Bug Fix
Type:             Bug
Last Closed:      2015-08-12 06:05:19 UTC
Attachments:
  pcs cluster report
  'pcs cluster report' after several reboots
  pcs cluster report

Description michal novacek 2015-05-14 14:32:34 UTC
Created attachment 1025456 [details]
pcs cluster report

Description of problem:
I have a running pacemaker two node quorate cluster with auto tie breaker. SBD
is configured but disabled. There is no fencing configured. (1)

After enabling sbd (see steps to reproduce below), I end up with both systems
being reset endlessly. The only way I can get out of that situation is by
disabling sbd.

I thought the problem might be that the cluster does not have enough time to
come up before it is reset by the watchdog, but setting stonith-watchdog-timeout
to 30s (which is plenty for the cluster on a virtual machine to come up) did not
change anything.

I'm attaching 'pcs cluster report' output from before and after the reboots, in
the hope that something relevant can be found in it.


Version-Release number of selected component (if applicable):
corosync-2.3.4-5.el7.x86_64
pacemaker-1.1.12-22.el7_1.2.x86_64

How reproducible: always

Steps to reproduce:
1/ Have a working cluster with no fencing.
2/ systemctl enable sbd
3/ systemctl restart corosync

Actual result:
Endless reboot loop.

Expected results:
Cluster becoming quorate with no reboots.

Additional info:
(1)
[root@virt-050 ~]# corosync-quorumtool 
Quorum information
------------------
Date:             Thu May 14 15:06:42 2015
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          260
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2  
Flags:            Quorate AutoTieBreaker 

Membership information
----------------------
    Nodeid      Votes Name
         1          1 virt-050 (local)
         2          1 virt-051

[root@virt-050 ~]# pcs status
Cluster name: STSRHTS2609
Last updated: Thu May 14 15:05:39 2015
Last change: Thu May 14 14:58:57 2015
Stack: corosync
Current DC: virt-051 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
6 Resources configured


Online: [ virt-050 virt-051 ]

Full list of resources:

 Clone Set: dlm-clone [dlm]
     Started: [ virt-050 virt-051 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ virt-050 virt-051 ]

PCSD Status:
  virt-050: Online
  virt-051: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@virt-050 ~]# pcs property 
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS2609
 dc-version: 1.1.12-a14efad
 have-watchdog: false
 no-quorum-policy: freeze

[root@virt-050 ~]# systemctl status sbd
sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; disabled)
   Active: inactive (dead)


[root@virt-050 ~]# grep -v \# /etc/sysconfig/sbd | sort | uniq

SBD_DELAY_START=no
SBD_OPTS=-W
SBD_PACEMAKER=yes
SBD_STARTMODE=clean
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

Comment 1 michal novacek 2015-05-14 14:33:28 UTC
Created attachment 1025457 [details]
'pcs cluster report' after several reboots

Comment 3 Andrew Beekhof 2015-05-20 01:39:08 UTC
Two things:

> have-watchdog: false

This implies that sbd is not running or not properly configured.
This is backed up by:

>  Active: inactive (dead)

Probably because you've set 

> SBD_OPTS=-W

which disables the system watchdog. Don't do that.
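
For reference, the corresponding /etc/sysconfig/sbd would then look roughly like
this (a minimal sketch based on the configuration quoted above, with only
SBD_OPTS changed):

        SBD_DELAY_START=no
        SBD_OPTS=
        SBD_PACEMAKER=yes
        SBD_STARTMODE=clean
        SBD_WATCHDOG_DEV=/dev/watchdog
        SBD_WATCHDOG_TIMEOUT=5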


Also:

> controld(dlm)[4556]:	2015/05/14_16:03:15 ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing

This is why the node is rebooting.
Make sure the cluster is working before adding the dlm.
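
For illustration, one way to take those resources out while debugging is to
delete the primitives, which also removes their clone wrappers (a sketch,
assuming the resource IDs from the pcs status output above):

        pcs resource delete dlm
        pcs resource delete clvmd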

Comment 4 michal novacek 2015-06-02 14:38:38 UTC
Created attachment 1033837 [details]
pcs cluster report

This report is taken after both nodes are rebooted by sbd.

Comment 5 michal novacek 2015-06-02 14:41:07 UTC
I deleted the stonith devices and the dlm/clvmd resources from the cluster and
removed -W from /etc/sysconfig/sbd. Both nodes are still rebooted a short while
after starting the cluster with 'pcs cluster start --all'.

----

virt-050# grep -v \# /etc/sysconfig/sbd | sort | uniq
SBD_DELAY_START=no
SBD_OPTS=
SBD_PACEMAKER=yes
SBD_STARTMODE=clean
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

virt-050# pcs config
Cluster Name: STSRHTS16341
Corosync Nodes:
 virt-050 virt-051 
Pacemaker Nodes:
 virt-050 virt-051 

Resources: 

Stonith Devices: 
Fencing Levels: 

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS16341
 dc-version: 1.1.12-a14efad
 have-watchdog: false
 no-quorum-policy: freeze
 stonith-watchdog-timeout: 0s

[root@virt-050 ~]# systemctl status sbd
sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; disabled)
   Active: inactive (dead)

[root@virt-050 ~]# pcs cluster stop --all --wait
virt-051: Stopping Cluster (pacemaker)...
virt-050: Stopping Cluster (pacemaker)...
virt-050: Stopping Cluster (corosync)...
virt-051: Stopping Cluster (corosync)...

[root@virt-050 ~]# rpm -q corosync pacemaker
corosync-2.3.4-5.el7.x86_64
pacemaker-1.1.12-22.el7_1.2.x86_64
[root@virt-050 ~]# pcs cluster status
Error: cluster is not currently running on this node

[root@virt-050 ~]# systemctl status corosync
corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled)
   Active: inactive (dead)

...
[root@virt-050 ~]# systemctl enable sbd
ln -s '/usr/lib/systemd/system/sbd.service' '/etc/systemd/system/corosync.service.requires/sbd.service'
[root@virt-050 ~]# ssh virt-051 systemctl enable sbd
Warning: Permanently added 'virt-051,10.34.71.51' (ECDSA) to the list of known hosts.
ln -s '/usr/lib/systemd/system/sbd.service' '/etc/systemd/system/corosync.service.requires/sbd.service'

[root@virt-050 ~]# pcs cluster start --all --wait
virt-051: Starting Cluster...
virt-050: Starting Cluster...

[root@virt-050 ~]# pcs config
Cluster Name: STSRHTS16341
Corosync Nodes:
 virt-050 virt-051 
Pacemaker Nodes:
 virt-050 virt-051 

Resources: 

Stonith Devices: 
Fencing Levels: 

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS16341
 dc-version: 1.1.12-a14efad
 have-watchdog: true
 no-quorum-policy: freeze
 stonith-watchdog-timeout: 0s

[root@virt-050 ~]# pcs status
Cluster name: STSRHTS16341
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Tue Jun  2 16:32:34 2015
Last change: Tue Jun  2 16:29:19 2015
Stack: corosync
Current DC: virt-051 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
0 Resources configured

Online: [ virt-050 virt-051 ]

Full list of resources:

PCSD Status:
  virt-050: Online
  virt-051: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

<reboot occurs after ~1s>

Comment 7 Andrew Beekhof 2015-07-15 03:39:46 UTC
Does /dev/watchdog exist?

Comment 8 Andrew Beekhof 2015-07-15 06:00:15 UTC
It seems you have:

        <nvpair id="cib-bootstrap-options-stonith-watchdog-timeout" name="stonith-watchdog-timeout" value="0s"/>

Fairly sure that wouldn't be helping.
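
For reference, the timeout is normally set to something comfortably larger than
SBD_WATCHDOG_TIMEOUT; a sketch of setting it with pcs (10s here is only an
illustration, roughly twice the 5s watchdog timeout configured above):

        pcs property set stonith-watchdog-timeout=10s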

Comment 9 Andrew Beekhof 2015-07-15 23:24:31 UTC
I cannot reproduce this, even with stonith-watchdog-timeout: 0s
There are also no logs from the nodes at the time of the reported symptoms.

Can I get access to the nodes?

Comment 11 Andrew Beekhof 2015-07-31 02:10:41 UTC
By adding this to the sysconfig file: 

SBD_OPTS="-v -v"

and tailing /var/log/messages, I was able to see that we were hitting this log message:

        LOGONCE(pcmk_health_unknown, LOG_WARNING, "Node state: UNKNOWN");

Essentially there is a mismatch between the name the cluster knows the node by
and the name sbd knows it by.

Setting SBD_OPTS="-n virt-073" allowed the node to start normally.
I will change this message in a future version to be:

        LOGONCE(pcmk_health_unknown, LOG_WARNING, "Node state: %s is UNKNOWN", local_uname);
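
A quick way to compare the two names on a node (a sketch using standard tools:
crm_node prints the name the cluster uses for the local node, and uname -n is
what sbd falls back to):

        crm_node -n
        uname -n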

Comment 12 michal novacek 2015-08-06 08:25:42 UTC
To keep this information for future use: the names pacemaker uses for the nodes
need to match the names sbd uses.

To achieve that, either use $(uname -n) for the node names (which is the sbd
default) OR pass the '-n <node name>' option in the SBD_OPTS parameter in
/etc/sysconfig/sbd to tell sbd which names you use for the nodes.
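
For example (a sketch only; virt-050 stands in for whatever name the cluster
actually uses for that node), the second approach means putting this in
/etc/sysconfig/sbd on each node, with that node's own cluster node name:

        SBD_OPTS="-n virt-050"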

I believe this can be closed as NOTABUG.

Comment 13 Andrew Beekhof 2015-08-12 06:05:19 UTC
Ack. Closing