Description of problem:
When running pacemaker-remote with sbd and the pacemaker-watcher, a 'systemctl restart pacemaker_remote' triggers a watchdog reboot once the cluster node is connected.

Version-Release number of selected component (if applicable):
sbd-1.2.1-21.el7

How reproducible:
100%

Steps to Reproduce:
1. Set up pacemaker-remote with sbd and the pacemaker-watcher.
2. Wait until the cluster node is connected.
3. Issue 'systemctl restart pacemaker_remote'.

Actual results:
Watchdog reboot.

Expected results:
pacemaker-remote and sbd should both restart, and the cluster node should be able to reconnect.

Additional info:
This behaviour is due to how the sbd-remote unit file is configured: systemd just waits for the sbd inquisitor process to die before restarting pacemaker-remote.

As a workaround you can do:

    systemctl stop pacemaker_remote
    sleep 10
    systemctl start pacemaker_remote

This is not the reason why the package update in bz1372009 fails. Setting KillMode=mixed in the sbd-remote unit file fixes the issue.
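The KillMode=mixed change mentioned above can be applied as a systemd drop-in instead of editing the packaged unit file. The helper below is a hypothetical sketch: the function name, the unit name sbd_remote.service and the drop-in file name 50-killmode.conf are assumptions, not taken from the sbd packaging. Note that the following comment reports that KillMode=mixed does not solve the issue with upstream packages.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that sets KillMode=mixed.
# Assumptions: the sbd remote unit is named sbd_remote.service and overrides
# live under /etc/systemd/system; both can be overridden via arguments.
write_killmode_dropin() {
    root="${1:-/etc/systemd/system}"   # directory holding unit overrides
    unit="${2:-sbd_remote.service}"    # assumed unit name
    dir="$root/$unit.d"

    mkdir -p "$dir" || return 1

    # KillMode=mixed: on stop, SIGTERM goes only to the main process, while
    # the final SIGKILL goes to all processes left in the unit's cgroup.
    cat > "$dir/50-killmode.conf" <<'EOF'
[Service]
KillMode=mixed
EOF
}
```

Usage (as root): write_killmode_dropin && systemctl daemon-reload, then rerun the steps to reproduce. A drop-in survives package updates that replace the shipped unit file.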
Using upstream pacemaker & sbd packages with systemd from rhel-7.4, setting KillMode=mixed definitely doesn't solve the issue.

Using PartOf= in the systemd unit to make sbd_remote start with pacemaker_remote leads to uncoordinated restarts of sbd_remote and pacemaker_remote on 'systemctl restart pacemaker_remote'. sbd restarts so quickly that it still sees the pacemaker_remote instance from before the restart, only to lose the connection immediately afterwards as pacemaker_remote is restarted. As it doesn't (and shouldn't) automatically reconnect to the new instance, a reboot is triggered.

One possible way out would be to specify that sbd_remote is started after pacemaker_remote. Stopping then happens in the opposite order, so the problems above don't occur. On the other hand, sbd_remote would then be stopped before pacemaker_remote, so it could no longer monitor the shutdown of pacemaker_remote and of all the services running under the control of pacemaker_remote.

Better solutions would be:
- make systemd start sbd_remote after pacemaker_remote while still stopping it only after pacemaker_remote has been stopped
- make systemd, when restarting a service, first stop the service plus its PartOf= services and only afterwards start them all up again
- make sbd_remote watch the running pacemaker_remote (the one whose PID it has already grabbed) and just stop once that is gone (a quick test implementation, probably with issues, can be found at https://github.com/ClusterLabs/sbd/pull/33)
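The third option (stop once the originally seen pacemaker_remote process is gone, rather than latching onto a new instance) can be illustrated with a small polling loop. This is a hypothetical shell sketch of the idea only, not the code from the PR above; the function name, the polling interval and the use of kill -0 as the liveness check are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch: block while the process with the given PID exists and
# return as soon as it is gone. A watcher built on this would remember the
# pacemaker_remote PID it saw at startup and shut itself down afterwards,
# instead of attaching to a newly started instance.
watch_pid_until_gone() {
    pid="$1"
    interval="${2:-1}"   # polling interval in seconds (assumed default)

    [ -n "$pid" ] || return 2   # no PID given: usage error

    # kill -0 sends no signal; it only checks that the PID can be signalled.
    # Caveats: it also fails with EPERM for other users' processes (this
    # assumes running as root, like sbd), and it cannot tell a recycled PID
    # from the original process.
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
    done
    return 0
}
```

For example, a hypothetical wrapper could call watch_pid_until_gone "$remote_pid" (remote_pid being the PID recorded at startup) and then exit, so that a restarted pacemaker_remote is never silently adopted. Polling is used here only to keep the sketch short.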
BZ1593254 also deals with the orchestration of start/stop/restart of sbd-remote and pacemaker-remote. Thus the two BZs should get a common, orchestrated solution instead of, e.g., the route described in the PR above, which only takes care of the restart issue.
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.