Bug 1377724

Summary: pacemaker-remote restart causes watchdog-reboot with sbd and pacemaker-watcher
Product: Red Hat Enterprise Linux 7 Reporter: Klaus Wenninger <kwenning>
Component: sbd Assignee: Klaus Wenninger <kwenning>
Status: CLOSED WONTFIX QA Contact: cluster-qe <cluster-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.3 CC: cfeist, fdinitto, kgaillot, kwenning, mlisik
Target Milestone: rc   
Target Release: 7.9   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1693262 Environment:
Last Closed: 2020-12-15 07:46:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1693262    

Description Klaus Wenninger 2016-09-20 13:04:05 UTC
Description of problem:
When running pacemaker-remote with sbd and pacemaker-watcher,
once a cluster node is connected, running

  systemctl restart pacemaker_remote

triggers a watchdog reboot.

Version-Release number of selected component (if applicable):
sbd-1.2.1-21.el7

How reproducible:
100%

Steps to Reproduce:
1. Set up pacemaker-remote with sbd and pacemaker-watcher
2. Wait until the cluster node is connected
3. Issue 'systemctl restart pacemaker_remote'

Actual results:
watchdog-reboot

Expected results:
pacemaker-remote and sbd should both restart and 
cluster-node should be able to reconnect

Additional info:
This behaviour is due to how the sbd-remote unit file is configured:
it just waits for the inquisitor process of sbd to die before allowing
systemd to restart pacemaker-remote.

As a workaround you can do:

systemctl stop pacemaker_remote
sleep 10
systemctl start pacemaker_remote

This is not the reason why the package update in bz1372009 fails.

Setting KillMode=mixed in the sbd-remote unit file fixes the issue.
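
A minimal sketch of how that setting could be tried without editing the packaged unit file, using a systemd drop-in. The systemd directive is spelled KillMode=. The unit name sbd_remote.service, the drop-in directory and the file name 10-killmode.conf are assumptions for illustration, not taken from the package. Comment 6 below reports that with upstream packages this setting does not solve the issue, so treat it as an experiment only.

```shell
# Write a drop-in that sets KillMode=mixed for the sbd remote unit.
# ASSUMPTIONS: unit name sbd_remote.service and file name 10-killmode.conf
# are illustrative. Pass another directory as $1 to try this out safely.
write_killmode_dropin() {
    dropin_dir=${1:-/etc/systemd/system/sbd_remote.service.d}
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/10-killmode.conf" <<'EOF'
[Service]
KillMode=mixed
EOF
}
```

After writing the file into the real drop-in directory, 'systemctl daemon-reload' makes systemd pick it up, and 'systemctl show -p KillMode sbd_remote.service' shows the effective value.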

Comment 6 Klaus Wenninger 2017-11-03 12:23:34 UTC
With upstream pacemaker & sbd packages and systemd from rhel-7.4, setting KillMode=mixed definitely doesn't solve the issue.

Using PartOf= in the systemd unit to make sbd_remote start with pacemaker_remote leads to uncoordinated restarts of sbd_remote & pacemaker_remote (systemctl restart pacemaker_remote).
The restart of sbd is so quick that it still sees the pacemaker_remote instance from before the restart, only to lose the connection to the restarted pacemaker_remote immediately afterwards. As sbd doesn't (and shouldn't) automatically reconnect to the new instance, a reboot is triggered.

A possible way out would be to specify that sbd_remote is started after pacemaker_remote.
Stopping then happens in the opposite order, so the problems above do not occur.
On the other hand, once sbd_remote has been stopped ahead of pacemaker_remote, it can no longer monitor the shutdown of pacemaker_remote and of all the services running under control of pacemaker_remote.
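
The ordering variant described above could be sketched as a drop-in that adds After=. In systemd, a unit with After=X is started after X and, on shutdown, stopped before X, which is exactly the trade-off mentioned. The unit names sbd_remote.service and pacemaker_remote.service, the drop-in directory and the file name 20-ordering.conf are assumptions for illustration. The sketch also assumes the shipped unit carries no conflicting Before=pacemaker_remote.service line, since both together would form an ordering cycle.

```shell
# Write a drop-in that orders the sbd remote unit after pacemaker_remote.
# ASSUMPTIONS: unit names and the file name 20-ordering.conf are
# illustrative. Pass another directory as $1 to try this out safely.
write_ordering_dropin() {
    dropin_dir=${1:-/etc/systemd/system/sbd_remote.service.d}
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/20-ordering.conf" <<'EOF'
[Unit]
After=pacemaker_remote.service
EOF
}
```

As with any drop-in, 'systemctl daemon-reload' is needed afterwards. 'systemctl show -p After -p Before sbd_remote.service' lists the resulting ordering dependencies.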

Better solutions would be:
- make systemd start sbd_remote after pacemaker_remote while still stopping 
  it after pacemaker_remote has been stopped
- make systemd, when restarting a service, first stop the service + 
  PartOf= services and only afterwards start them all up again
- make sbd_remote watch for a running pacemaker_remote (the one whose PID
  it has already grabbed) and only stop once that is gone
  (quick test-implementation with probably issues found under
  https://github.com/ClusterLabs/sbd/pull/33)
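
The idea behind the last option, waiting until a previously recorded PID is gone before stopping, can be illustrated with a small shell sketch. This is not the code from the pull request. The function name wait_for_pid_exit and the default timeout are hypothetical. It relies on 'kill -0', which sends no signal and only tests whether the PID can be signalled, so it assumes sufficient privileges (e.g. root) and ignores PID reuse.

```shell
# Wait until the process with the given PID is gone, or give up after a timeout.
# Usage: wait_for_pid_exit PID [TIMEOUT_SECONDS]   (hypothetical helper)
# Returns 0 once the PID no longer exists, 1 if it is still alive at the timeout.
wait_for_pid_exit() {
    pid=$1
    timeout=${2:-30}
    waited=0
    # kill -0 delivers no signal; it only checks that the PID can be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A stop handler could call it as 'wait_for_pid_exit "$remote_pid" 60' with the PID recorded at connect time, and only proceed with its own shutdown once the call returns 0.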

Comment 8 Klaus Wenninger 2018-06-20 11:53:35 UTC
BZ1593254 also deals with the orchestration of startup/stop/restart of sbd-remote & pacemaker-remote.
Thus the 2 BZs should get a coordinated solution, instead of e.g. going the route described in the PR above, which only takes care of the restart issue.

Comment 11 RHEL Program Management 2020-12-15 07:46:09 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.