Bug 982379 - Watchdog failure doesn't stop existing process
Summary: Watchdog failure doesn't stop existing process
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 19
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-07-08 20:54 UTC by smfairley
Modified: 2013-12-09 02:32 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-12-09 02:32:38 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
Simple watchdog process service file (104 bytes, text/plain)
2013-07-08 20:56 UTC, smfairley
no flags Details
Executable that notifies the watchdog too slowly (287 bytes, text/x-csrc)
2013-07-08 20:57 UTC, smfairley
no flags Details

Description smfairley 2013-07-08 20:54:34 UTC
Description of problem:

systemd watchdog doesn't actually stop a process that fails to notify the watchdog quickly enough. The watchdog failure runs through the restart state, and starts a new process, but when in the restart state, the existing process isn't actually stopped (around line 2541 in service.c, line 19 of service_stop).

Version-Release number of selected component (if applicable):
systemd-204-9.fc19.x86_64

How reproducible:

Run a simple service that doesn't notify the watchdog quickly enough - copy attached minimal-service.service into /lib/systemd/system/, compile attached test program: gcc minimal-service.c -o /tmp/minimal-service -l systemd-daemon

Steps to Reproduce:
1. cp minimal-service.service /lib/systemd/system/
2. gcc minimal-service.c -o /tmp/minimal-service -l systemd-daemon
3. Start the service: systemctl start minimal-service.service

Actual results:
Multiple minimal-service processes are started:
systemctl status minimal-service.service
minimal-service.service - Test service
   Loaded: loaded (/usr/lib/systemd/system/minimal-service.service; static)
   Active: active (running) since Mon 2013-07-08 15:45:32 CDT; 707ms ago
 Main PID: 3151 (minimal-service)
   CGroup: name=systemd:/system/minimal-service.service
           ├─3118 /tmp/minimal-service
           ├─3122 /tmp/minimal-service
           ├─3127 /tmp/minimal-service
           ├─3132 /tmp/minimal-service
           ├─3145 /tmp/minimal-service
           ├─3146 /tmp/minimal-service
           ├─3147 /tmp/minimal-service
           ├─3148 /tmp/minimal-service
           ├─3149 /tmp/minimal-service
           ├─3150 /tmp/minimal-service
           └─3151 /tmp/minimal-service


Expected results:
A single minimal-service executable is running - original gets killed and a new one started.

Additional info:

Adding systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M to the kernel command line gives some more details in dmesg:
[  358.174559] systemd[1]: Starting Test service...
[  358.174790] systemd[1]: About to execute: /tmp/minimal-service
[  358.177418] systemd[1]: Forked /tmp/minimal-service as 2252
[  358.178227] systemd[2252]: Executing: /tmp/minimal-service
[  358.178571] systemd[1]: minimal-service.service changed dead -> running
[  358.178602] systemd[1]: Job minimal-service.service/start finished, result=done
[  358.178622] systemd[1]: Started Test service.
[  358.179508] systemd[1]: Got notification message for unit minimal-service.service
[  358.179516] systemd[1]: minimal-service.service: Got message
[  358.179522] systemd[1]: minimal-service.service: got WATCHDOG=1
[  358.597560] systemd[1]: Accepted connection on private bus.
[  358.598282] systemd[1]: Got D-Bus request: org.freedesktop.DBus.Properties.GetAll() on /org/freedesktop/systemd1/unit/minimal_2dservice_2eservice
[  358.598681] systemd[1]: SELinux access check scon=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tcon=unconfined_u:object_r:systemd_unit_file_t:s0 tclass=service perm=status path=/usr/lib/systemd/system/minimal-service.service cmdline=systemctl status minimal-service.service: 0
[  358.600549] systemd[1]: Got D-Bus request: org.freedesktop.DBus.Local.Disconnected() on /org/freedesktop/DBus/Local
[  360.149983] systemd[1]: Accepted connection on private bus.
[  360.151527] systemd[1]: Got D-Bus request: org.freedesktop.DBus.Properties.GetAll() on /org/freedesktop/systemd1/unit/minimal_2dservice_2eservice
[  360.152003] systemd[1]: SELinux access check scon=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tcon=unconfined_u:object_r:systemd_unit_file_t:s0 tclass=service perm=status path=/usr/lib/systemd/system/minimal-service.service cmdline=systemctl status minimal-service.service: 0
[  360.156743] systemd[1]: Got D-Bus request: org.freedesktop.DBus.Local.Disconnected() on /org/freedesktop/DBus/Local
[  361.179685] systemd[1]: minimal-service.service watchdog timeout!
[  361.179707] systemd[1]: minimal-service.service changed running -> failed
[  361.179936] systemd[1]: Unit minimal-service.service entered failed state.
[  361.180334] systemd[1]: minimal-service.service changed failed -> auto-restart
[  361.280498] systemd[1]: minimal-service.service holdoff time over, scheduling restart.
[  361.280511] systemd[1]: Trying to enqueue job minimal-service.service/restart/fail
[  361.280684] systemd[1]: Installed new job minimal-service.service/restart as 2352
[  361.280701] systemd[1]: Enqueued job minimal-service.service/restart as 2352
[  361.280707] systemd[1]: minimal-service.service scheduled restart job.
[  361.280725] systemd[1]: Stopping Test service...
[  361.280760] systemd[1]: minimal-service.service changed auto-restart -> dead
[  361.280884] systemd[1]: Job minimal-service.service/restart finished, result=done
[  361.280901] systemd[1]: Converting job minimal-service.service/restart -> minimal-service.service/start
[  361.281001] systemd[1]: Starting Test service...
[  361.281190] systemd[1]: About to execute: /tmp/minimal-service
[  361.281587] systemd[1]: Forked /tmp/minimal-service as 2261
[  361.281980] systemd[1]: minimal-service.service changed dead -> running
[  361.282001] systemd[1]: Job minimal-service.service/start finished, result=done
[  361.282120] systemd[1]: Started Test service.
[  361.282315] systemd[2261]: Executing: /tmp/minimal-service
[  361.283873] systemd[1]: Got notification message for unit minimal-service.service
[  361.283887] systemd[1]: minimal-service.service: Got message
[  361.283897] systemd[1]: minimal-service.service: got WATCHDOG=1

A manual stop (which does work as expected):
[  920.715878] systemd[1]: Stopping Test service...
[  920.716621] systemd[1]: minimal-service.service changed running -> stop-sigterm
[  920.716687] systemd[1]: Set up jobs progress timerfd.
[  920.716930] systemd[1]: Got D-Bus request: org.freedesktop.systemd1.Manager.GetUnit() on /org/freedesktop/systemd1
[  920.719187] systemd[1]: SELinux access check scon=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tcon=unconfined_u:object_r:systemd_unit_file_t:s0 tclass=service perm=status path=/usr/lib/systemd/system/minimal-service.service cmdline=systemctl stop minimal-service.service: 0
[  920.719408] systemd[1]: Received SIGCHLD from PID 2464 (minimal-service).
[  920.719561] systemd[1]: Got SIGCHLD for process 2464 (minimal-service)
[  920.719868] systemd[1]: Child 2464 died (code=killed, status=15/TERM)
[  920.719976] systemd[1]: Got SIGCHLD for process 2468 (minimal-service)
[  920.721346] systemd[1]: Child 2468 died (code=killed, status=15/TERM)
[  920.721372] systemd[1]: Child 2468 belongs to minimal-service.service
[  920.721412] systemd[1]: minimal-service.service: main process exited, code=killed, status=15/TERM
[  920.721745] systemd[1]: minimal-service.service changed stop-sigterm -> dead
[  920.722363] systemd[1]: Job minimal-service.service/stop finished, result=done
[  920.722397] systemd[1]: Stopped Test service.

Comment 1 smfairley 2013-07-08 20:56:00 UTC
Created attachment 770676 [details]
Simple watchdog process service file

Comment 2 smfairley 2013-07-08 20:57:06 UTC
Created attachment 770677 [details]
Executable that notifies the watchdog too slowly

Comment 3 Lennart Poettering 2013-07-26 01:05:21 UTC
Fixed in 205/206.

Comment 4 Zbigniew Jędrzejewski-Szmek 2013-12-09 02:32:38 UTC
This was actually included in the last release. In the next update I'm going to push a follow-up commit which runs ExecStartPost= on watchdog failure.


Note You need to log in before you can comment on or make changes to this bug.