Bug 1525981 - systemd failing to start sbd doesn't prevent pacemaker-start
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sbd
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.5
Assignee: Klaus Wenninger
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Depends On:
Blocks: 1593254 1693266
Reported: 2017-12-14 14:21 UTC by Klaus Wenninger
Modified: 2019-03-27 12:27 UTC (History)

Fixed In Version: sbd-1.3.1-7.el7
Doc Type: Bug Fix
Doc Text:
Pacemaker no longer starts up when "sbd" is enabled but not started successfully by "systemd". Previously, if "sbd" did not start properly, "systemd" would still start Pacemaker. As a result, reboots triggered by an "sbd" poison pill would not be performed, and "fence_sbd" would not detect this; in the case of quorum-based watchdog fencing, nodes losing quorum would not self-fence either. With this fix, if "sbd" does not come up properly, Pacemaker is not started. This should prevent all sources of data corruption due to "sbd" not coming up.
Clone Of:
Clones: 1593254
Environment:
Last Closed: 2018-04-10 16:55:50 UTC
Target Upstream Version:


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:0924 None None None 2018-04-10 16:56:11 UTC

Description Klaus Wenninger 2017-12-14 14:21:35 UTC
Description of problem:

If for some reason systemd fails to start sbd, it will still start up pacemaker. As a result, pacemaker is not observed by sbd, the watchdog is possibly not engaged, and the poison pill is not read from disk.

Version-Release number of selected component (if applicable):

sbd-1.3.1-2.el7

How reproducible:

Steps to Reproduce:

1. start with an sbd-config with shared disk

2. make sbd wait for msgwait on startup

/etc/sysconfig/sbd:
SBD_DELAY_START=yes

3. choose a short (5s) startup-timeout for sbd-service

/etc/systemd/system/sbd.service.d/sbd.conf:
[Service]
TimeoutStartSec=5

4. configure msgwait=10s

[root@node2 ~]# sbd -d /dev/vdb dump
==Dumping header on disk /dev/vdb
Header version     : 2.1
UUID               : ad2116f8-fa24-431b-a2e0-fa3f373e049e
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 45
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 10
==Header on disk /dev/vdb is dumped

5. check that pacemaker, corosync & sbd are down

6. start pacemaker (systemctl start pacemaker)


Actual results:

sbd times out on being started by systemd (expected)
pacemaker still comes up (not expected)
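
The sbd timeout in this reproducer follows from the two values configured in the steps: with SBD_DELAY_START=yes sbd waits for msgwait (10s) on startup, which is longer than the 5s start timeout. A minimal sketch of that comparison, assuming the dump format shown above; the helper names msgwait_from_dump and start_fits_timeout are made up for this illustration and are not part of sbd:

```shell
#!/bin/sh
# Extract the msgwait timeout (in seconds) from `sbd -d <device> dump`
# output read on stdin.
msgwait_from_dump() {
    awk -F: '/^Timeout \(msgwait\)/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Report whether a delayed sbd start can finish within the start timeout.
# $1 = msgwait in seconds, $2 = TimeoutStartSec in seconds.
# Equal values are treated as a timeout here (a conservative assumption).
start_fits_timeout() {
    if [ "$1" -lt "$2" ]; then
        echo "fits"
    else
        echo "times out"
    fi
}

# Example with the values from the steps above:
#   sbd -d /dev/vdb dump | msgwait_from_dump
#   start_fits_timeout 10 5
```

With msgwait=10 and TimeoutStartSec=5 the second helper reports "times out", which matches the expected sbd start failure.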


Expected results:

pacemaker shouldn't come up if sbd doesn't


Additional info:

Having

/etc/systemd/system/pacemaker.service.d/sbd.conf:
[Unit]
Requires=sbd.service

solves the issue.
sbd-service being 'PartOf' corosync is just not enough to make
systemd wait/check for sbd-start before starting pacemaker.
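
For reference, the drop-in above could be staged with a few lines of shell. This is only a sketch: UNIT_ROOT is a hypothetical variable introduced here so the file lands in a scratch directory first; on a real node the directory would be /etc/systemd/system.

```shell
#!/bin/sh
# UNIT_ROOT is a hypothetical knob for this sketch; on a real node it
# would be /etc/systemd/system.
UNIT_ROOT="${UNIT_ROOT:-$(mktemp -d)}"
dropin_dir="$UNIT_ROOT/pacemaker.service.d"

mkdir -p "$dropin_dir"
# Same content as the drop-in quoted above.
cat > "$dropin_dir/sbd.conf" <<'EOF'
[Unit]
Requires=sbd.service
EOF
echo "wrote $dropin_dir/sbd.conf"

# Per systemd.unit(5), a failed Requires= dependency only keeps this unit
# from starting when an ordering dependency (After=) on it is in place as
# well. This sketch assumes sbd.service already orders itself before
# pacemaker.
#
# On a real node, make systemd pick up the change and check the result:
#   systemctl daemon-reload
#   systemctl show -p Requires pacemaker.service
```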

As installation of the sbd-package shouldn't automatically enable it
or require it to be enabled, just adding this file as is to the sbd-package
is no solution.
Though adding templates for both files
/etc/systemd/system/sbd.service.d/sbd.conf
/etc/systemd/system/pacemaker.service.d/sbd.conf
with the actual content being commented out might be useful.


Another, possibly cleaner solution, would be the introduction of an additional
target - containing corosync and optionally sbd - that is required by pacemaker.
What remains open is how to distribute the target and the dependencies on it across the 3 packages corosync, sbd and pacemaker.

Cleanest would be modifying all 3 and adding the target to corosync:
Make corosync have an [Install] section that adds it to the new target.
Make sbd have an [Install] section that adds it to the new target.
And make pacemaker depend on the new target.
All directly in the 3 unit-files ...

Though it should be possible to achieve the modifications for corosync and pacemaker by sbd delivering the appropriate snippets.
So sbd would be the only package to be modified - with the drawback of being a little bit more confusing.

Comment 2 Klaus Wenninger 2017-12-14 15:48:35 UTC
Even simpler fix:

sbd.service:
...
[Install]
...
RequiredBy=pacemaker.service
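
A RequiredBy= line in [Install] takes effect only when the unit is enabled: at that point systemd creates a link for the unit in a pacemaker.service.requires/ directory. The sketch below imitates that result by hand so it can be looked at. UNIT_ROOT and SBD_UNIT are hypothetical variables for this illustration, and the unit path is an assumption about where the package installs sbd.service.

```shell
#!/bin/sh
# Imitates what enabling a unit with RequiredBy=pacemaker.service leaves
# behind. UNIT_ROOT and SBD_UNIT are hypothetical knobs for this sketch; a
# real `systemctl enable sbd` works on /etc/systemd/system and finds the
# unit file itself.
UNIT_ROOT="${UNIT_ROOT:-$(mktemp -d)}"
SBD_UNIT="${SBD_UNIT:-/usr/lib/systemd/system/sbd.service}"
requires_dir="$UNIT_ROOT/pacemaker.service.requires"

mkdir -p "$requires_dir"
ln -sf "$SBD_UNIT" "$requires_dir/sbd.service"
ls -l "$requires_dir"

# On a real node the link can be checked after enabling sbd:
#   ls -l /etc/systemd/system/pacemaker.service.requires/
```

Since the link is created at enable time, a node where sbd was enabled with an older unit file keeps its old set of links until the unit is enabled again.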

Comment 3 Ken Gaillot 2017-12-14 17:24:30 UTC
(In reply to Klaus Wenninger from comment #2)
> Even simpler fix:
> 
> sbd.service:
> ...
> [Install]
> ...
> RequiredBy=pacemaker.service

Sounds good :)

However the target idea is interesting on its own ... perhaps a 'cluster-layer.target' that includes any components below pacemaker in the cluster stack. That might simplify supporting multiple cluster-layer implementations again in the future.
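
To make the target idea a bit more concrete, it could look roughly like the files staged below. Nothing like this was implemented in this bug. The file contents are assumptions for illustration, and UNIT_ROOT is a hypothetical scratch directory standing in for /etc/systemd/system.

```shell
#!/bin/sh
# Sketch of the target idea; none of these files exist in any package.
# UNIT_ROOT is a hypothetical scratch directory for this illustration.
UNIT_ROOT="${UNIT_ROOT:-$(mktemp -d)}"

# The target itself only groups the units below pacemaker.
cat > "$UNIT_ROOT/cluster-layer.target" <<'EOF'
[Unit]
Description=Cluster components below pacemaker (hypothetical)
EOF

# pacemaker requires the target and is ordered after it.
mkdir -p "$UNIT_ROOT/pacemaker.service.d"
cat > "$UNIT_ROOT/pacemaker.service.d/cluster-layer.conf" <<'EOF'
[Unit]
Requires=cluster-layer.target
After=cluster-layer.target
EOF

# corosync.service and sbd.service would each carry, in the [Install]
# section of their own unit file:
#   RequiredBy=cluster-layer.target
# so that enabling them adds them to the target.
ls -R "$UNIT_ROOT"
```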

Comment 4 Klaus Wenninger 2017-12-14 17:38:39 UTC
True - we should think the target-idea over.

But for now I would suggest going with the solution suggested in
https://github.com/ClusterLabs/sbd/pull/39
as it solves the issue with a change in just a single package.
A solution touching multiple packages would also introduce strict version dependencies to work properly.

Comment 10 Klaus Wenninger 2018-01-15 16:57:46 UTC
With

https://github.com/ClusterLabs/sbd/pull/42

such a conditional reenabling would be added.
I was kind of surprised that this isn't being taken care of somehow automatically or that systemd-rpm-scriptlet-macros under /lib/rpm/macros.d/macros.systemd wouldn't at least provide a macro for that purpose.

Comment 11 Klaus Wenninger 2018-01-15 18:56:43 UTC
(In reply to Klaus Wenninger from comment #10)
See the controversial discussion on the pull request.
Personally, I'm not so convinced that an automatic reenable is that good an idea.
Actually the missing update of the dependency-links is a general issue.
I don't even think there is a need to document it - maybe in a release note that states the bug fixes.
Actually the man-page of systemctl is quite explicit on that topic already:

       reenable NAME...
           Reenable one or more unit files, as specified on the command line.
           This is a combination of disable and enable and is
           useful to reset the symlinks a unit is enabled with to the defaults
           configured in the "[Install]" section of the unit file.
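
Done by hand, a conditional reenable could be sketched as below. This is not the content of the pull request, just an illustration using systemctl is-enabled and systemctl reenable; the function name reenable_if_enabled is made up here.

```shell
#!/bin/sh
# Refresh the enablement symlinks of a unit, but only if it is already
# enabled; a unit that is not enabled is left untouched.
reenable_if_enabled() {
    unit="$1"
    if systemctl is-enabled --quiet "$unit" 2>/dev/null; then
        systemctl reenable "$unit"
    else
        echo "$unit is not enabled; leaving it alone"
    fi
}

# Usage on a node after updating the sbd package:
#   reenable_if_enabled sbd.service
```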

Comment 15 Klaus Wenninger 2018-02-15 00:30:46 UTC
Steven:

Yes, changed a lot but I guess it still says what I meant.
But it leaves the impression that sbd is always needed by pacemaker.
Guess if we alter the first sentence a little bit that should clarify the issue.

Comment 18 errata-xmlrpc 2018-04-10 16:55:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0924

