Bug 1221680
| Summary: | trying to configure cluster with sbd and watchdog fencing causes endless reset | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | michal novacek <mnovacek> |
| Component: | sbd | Assignee: | Andrew Beekhof <abeekhof> |
| Status: | CLOSED NOTABUG | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Priority: | unspecified |
| Version: | 7.1 | CC: | abeekhof, cfeist, djansa |
| Target Milestone: | rc | Keywords: | TestBlocker |
| Target Release: | --- | Doc Type: | Bug Fix |
| Hardware: | Unspecified | OS: | Unspecified |
| Type: | Bug | Regression: | --- |
| Last Closed: | 2015-08-12 06:05:19 UTC | | |
Created attachment 1025457 [details]
'pcs cluster report' after several reboots
Two things:

> have-watchdog: false

This implies that sbd is not running or not properly configured. This is backed up by:

> Active: inactive (dead)

Probably because you've set:

> SBD_OPTS=-W

which disables the system watchdog. Don't do that.

Also:

> controld(dlm)[4556]: 2015/05/14_16:03:15 ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing

This is why the node is rebooting. Make sure the cluster is working before adding the dlm.

Created attachment 1033837 [details]
pcs cluster report
This report was taken after both nodes had been rebooted by sbd.
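A minimal sketch of checking the two symptoms above together on a node like virt-050, assuming a RHEL 7 system; wdctl comes from util-linux and is not part of this report itself:

```sh
# 'have-watchdog' only becomes true once pacemaker sees a running sbd,
# so check the watchdog device, the daemon, and the property in one pass.
ls -l /dev/watchdog
wdctl /dev/watchdog                  # driver and timeout of the device (util-linux)
systemctl status sbd                 # should be active, not "inactive (dead)"
pcs property | grep have-watchdog    # should report "true" before relying on watchdog fencing
```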
I deleted the stonith devices and the dlm/clvmd resources from the cluster and removed -W from /etc/sysconfig/sbd. Both nodes are still rebooted shortly after starting the cluster with 'pcs cluster start --all'.

----

virt-050# grep -v \# /etc/sysconfig/sbd | sort | uniq
SBD_DELAY_START=no
SBD_OPTS=
SBD_PACEMAKER=yes
SBD_STARTMODE=clean
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

virt-050# pcs config
Cluster Name: STSRHTS16341
Corosync Nodes:
 virt-050 virt-051
Pacemaker Nodes:
 virt-050 virt-051
Resources:
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS16341
 dc-version: 1.1.12-a14efad
 have-watchdog: false
 no-quorum-policy: freeze
 stonith-watchdog-timeout: 0s

[root@virt-050 ~]# systemctl status sbd
sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; disabled)
   Active: inactive (dead)

[root@virt-050 ~]# pcs cluster stop --all --wait
virt-051: Stopping Cluster (pacemaker)...
virt-050: Stopping Cluster (pacemaker)...
virt-050: Stopping Cluster (corosync)...
virt-051: Stopping Cluster (corosync)...

[root@virt-050 ~]# rpm -q corosync pacemaker
corosync-2.3.4-5.el7.x86_64
pacemaker-1.1.12-22.el7_1.2.x86_64

[root@virt-050 ~]# pcs cluster status
Error: cluster is not currently running on this node

[root@virt-050 ~]# systemctl status corosync
corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled)
   Active: inactive (dead)
...

[root@virt-050 ~]# systemctl enable sbd
ln -s '/usr/lib/systemd/system/sbd.service' '/etc/systemd/system/corosync.service.requires/sbd.service'

[root@virt-050 ~]# ssh virt-051 systemctl enable sbd
Warning: Permanently added 'virt-051,10.34.71.51' (ECDSA) to the list of known hosts.
ln -s '/usr/lib/systemd/system/sbd.service' '/etc/systemd/system/corosync.service.requires/sbd.service'

[root@virt-050 ~]# pcs cluster start --all --wait
virt-051: Starting Cluster...
virt-050: Starting Cluster...

[root@virt-050 ~]# pcs config
Cluster Name: STSRHTS16341
Corosync Nodes:
 virt-050 virt-051
Pacemaker Nodes:
 virt-050 virt-051
Resources:
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS16341
 dc-version: 1.1.12-a14efad
 have-watchdog: true
 no-quorum-policy: freeze
 stonith-watchdog-timeout: 0s

[root@virt-050 ~]# pcs status
Cluster name: STSRHTS16341
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Tue Jun  2 16:32:34 2015
Last change: Tue Jun  2 16:29:19 2015
Stack: corosync
Current DC: virt-051 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
0 Resources configured

Online: [ virt-050 virt-051 ]

Full list of resources:

PCSD Status:
  virt-050: Online
  virt-051: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

<reboot occurs after ~1s>

Does /dev/watchdog exist?

It seems you have:

> <nvpair id="cib-bootstrap-options-stonith-watchdog-timeout" name="stonith-watchdog-timeout" value="0s"/>

Fairly sure that wouldn't be helping.

I cannot reproduce this, even with stonith-watchdog-timeout: 0s. There are also no logs from the nodes at the time of the reported symptoms. Can I get access to the nodes?
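A sketch of the two points raised above; the 10s figure is an assumption based on the common guidance of roughly twice SBD_WATCHDOG_TIMEOUT (5 in the sysconfig shown), not a value taken from this report:

```sh
# On a VM, /dev/watchdog may only appear after a watchdog module is
# loaded; softdog is the generic software fallback.
ls -l /dev/watchdog || modprobe softdog

# Replace the suspect 0s value with a timeout comfortably above
# SBD_WATCHDOG_TIMEOUT=5 from /etc/sysconfig/sbd.
pcs property set stonith-watchdog-timeout=10s
```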
By adding this to the sysconfig file:

SBD_OPTS="-v -v"

and tailing /var/log/messages, I was able to see that we were hitting this log message:

    LOGONCE(pcmk_health_unknown, LOG_WARNING, "Node state: UNKNOWN");

Essentially there is a mismatch between the name the cluster knows the node by and the name sbd knows it by. Setting SBD_OPTS="-n virt-073" allowed the node to start normally.

I will change this message in a future version to be:

    LOGONCE(pcmk_health_unknown, LOG_WARNING, "Node state: %s is UNKNOWN", local_uname);

To keep this information for future use: the names pacemaker uses for nodes need to match the names sbd uses. To achieve that, either use $(uname -n) for node names (which is the sbd default) OR use the '-n <node name>' option in the SBD_OPTS variable in /etc/sysconfig/sbd to tell sbd which names you use for nodes.

I believe this can be closed as NOTABUG.

Ack. Closing.
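A sketch of the name check this implies (crm_node ships with pacemaker; virt-073 is the node name quoted above):

```sh
# Compare the two views of this node's name:
uname -n         # the name sbd defaults to
crm_node -n      # the name pacemaker/corosync use for this node

# If they differ, tell sbd which name the cluster uses, e.g. in
# /etc/sysconfig/sbd:
#   SBD_OPTS="-n virt-073"
```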
Created attachment 1025456 [details]
pcs cluster report

Description of problem:
I have a running pacemaker two-node quorate cluster with auto tie breaker. SBD is configured but disabled. There is no fencing configured. (1)

I end up with both systems being reset endlessly. The only way I can get out of that situation is by disabling sbd. I thought the problem might be that the cluster has too little time to come up before it is reset by the watchdog, but setting stonith-watchdog-timeout to 30s (which is plenty for the cluster on a virtual machine to come up) did not change anything.

I'm attaching 'pcs cluster report' output from before and after, in the hope that something relevant can be found there.

Version-Release number of selected component (if applicable):
corosync-2.3.4-5.el7.x86_64
pacemaker-1.1.12-22.el7_1.2.x86_64

How reproducible: always

Steps to reproduce:
1/ Have a working cluster with no fencing.
2/ systemctl enable sbd
3/ systemctl restart corosync

Actual results:
Endless reboot loop.

Expected results:
Cluster becoming quorate with no reboots.

Additional info: (1)

[root@virt-050 ~]# corosync-quorumtool
Quorum information
------------------
Date:             Thu May 14 15:06:42 2015
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          260
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate AutoTieBreaker

Membership information
----------------------
    Nodeid      Votes Name
         1          1 virt-050 (local)
         2          1 virt-051

[root@virt-050 ~]# pcs status
Cluster name: STSRHTS2609
Last updated: Thu May 14 15:05:39 2015
Last change: Thu May 14 14:58:57 2015
Stack: corosync
Current DC: virt-051 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
6 Resources configured

Online: [ virt-050 virt-051 ]

Full list of resources:

 Clone Set: dlm-clone [dlm]
     Started: [ virt-050 virt-051 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ virt-050 virt-051 ]

PCSD Status:
  virt-050: Online
  virt-051: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@virt-050 ~]# pcs property
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS2609
 dc-version: 1.1.12-a14efad
 have-watchdog: false
 no-quorum-policy: freeze

[root@virt-050 ~]# systemctl status sbd
sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; disabled)
   Active: inactive (dead)

[root@virt-050 ~]# grep -v \# /etc/sysconfig/sbd | sort | uniq
SBD_DELAY_START=no
SBD_OPTS=-W
SBD_PACEMAKER=yes
SBD_STARTMODE=clean
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
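For anyone hitting the same loop, a sketch of the reporter's escape hatch ("disabling sbd"), assuming a shell can be reached between reboots or via rescue mode:

```sh
# 'systemctl enable sbd' hooked the daemon into corosync via
# corosync.service.requires, so disabling it detaches it again.
pcs cluster stop --all     # stop the stack first if it is still up
systemctl disable sbd      # removes the corosync.service.requires link
```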