Bug 1064991 - host rebooted on sanlock restart
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: sanlock
Version: 21
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: David Teigland
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-02-13 16:34 UTC by Sandro Bonazzola
Modified: 2015-01-12 15:20 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-01-12 15:20:58 UTC
Type: Bug
Embargoed:



Description Sandro Bonazzola 2014-02-13 16:34:27 UTC
While installing oVirt hosted engine, as part of the process sanlock is reconfigured by vdsm-tool.

looking at the logs: http://ur1.ca/gmf4p

I can see:

Feb 13 15:24:53 localhost systemd-sanlock[8979]: Waiting for sanlock (3615) to stop:[FAILED] 

Feb 13 15:26:23 localhost systemd[1]: sanlock.service stopping timed out. Killing.
Feb 13 15:26:23 localhost wdmd[625]: client dead ci 2 fd 9 pid 3615 renewal 6122 expire 6202 sanlock_83b03f85-5e6c-426d-8fc3-7626ff181d90:1

Feb 13 15:27:27 localhost kernel: [ 6200.608165] watchdog watchdog0: watchdog did not stop!
Feb 13 15:27:27 localhost wdmd[625]: /dev/watchdog0 closed unclean

So systemd is killing the sanlock process, which causes the watchdog to reboot the host.

This should be avoided.

Comment 1 David Teigland 2014-02-13 17:22:31 UTC
OK, I'll search through the systemd documentation for an option to prevent that.

It is probably worth checking if vdsm could clean up all its sanlock lockspaces before it tries to stop sanlock.

Comment 2 David Teigland 2014-02-14 18:49:29 UTC
The systemd documentation here:
http://www.freedesktop.org/software/systemd/man/systemd.service.html#TimeoutStopSec=

seems to say that setting

TimeoutStopSec=0

will do what we want.  However, it doesn't work in practice.
systemctl stop sanlock still eventually sends SIGKILL.

I plan to add this setting to both
/usr/lib/systemd/system/sanlock.service
/usr/lib/systemd/system/wdmd.service

and file a bug against systemd.

Comment 3 David Teigland 2014-02-14 18:51:21 UTC
I was testing this on RHEL7 with systemd-207-11.el7.x86_64

Comment 4 David Teigland 2014-02-14 18:59:28 UTC
opened bug 1065493 against RHEL7 systemd.

Comment 5 Antoni Segura Puimedon 2014-03-24 13:41:55 UTC
I'm submitting a patch for having wdmd.service and sanlock.service be plain unit files without relying much on the sysV scripts. They'll be configured with SendSIGKILL=No so that the error above doesn't happen.

Comment 7 Fedora End Of Life 2015-01-09 22:24:07 UTC
This message is a notice that Fedora 19 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 19. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained. Approximately 4 (four) weeks from now this bug will
be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 19 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 8 Allon Mureinik 2015-01-11 07:01:16 UTC
(In reply to David Teigland from comment #4)
> opened bug 1065493 against RHEL7 systemd.
This was closed as NOTABUG

(In reply to Antoni Segura Puimedon from comment #5)
> I'm submitting a patch for having wdmd.service and sanlock.service be plain
> unit files without relying much on the sysV scripts. They'll be configured
> with SendSIGKILL=No so that the error above doesn't happen.
There's no external tracker attached to this BZ, and I did not find any such patch in git (which doesn't necessarily mean it's not there, of course).

Nir/David - what are our next steps?

(Also, tentatively moving this bug to F21 so it isn't mistakenly closed)

Comment 9 David Teigland 2015-01-12 15:20:58 UTC
I believe the addition of "SendSIGKILL=no" to the systemd unit file fixed this.
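
For illustration only, a minimal sketch of what that setting could look like if applied as a local drop-in override rather than in the packaged unit file. The drop-in path and file name are assumptions and are not taken from this bug; only the SendSIGKILL= option itself is documented systemd behavior.

```ini
# Hypothetical drop-in: /etc/systemd/system/sanlock.service.d/no-sigkill.conf
# (path and file name are assumptions, not from the sanlock package)
[Service]
# Do not follow up SIGTERM with SIGKILL when the stop timeout expires,
# so a sanlock daemon that still holds lockspaces is not killed by systemd.
SendSIGKILL=no
```

After a "systemctl daemon-reload", the effective value can be checked with "systemctl show sanlock -p SendSIGKILL". The same setting would presumably be needed for wdmd.service as well, since comment 5 proposed it for both units.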

