Description of problem: systemd watchdog doesn't actually stop a process that fails to notify the watchdog quickly enough. The watchdog failure runs through the restart state, and starts a new process, but when in the restart state, the existing process isn't actually stopped (around line 2541 in service.c, line 19 of service_stop). Version-Release number of selected component (if applicable): systemd-204-9.fc19.x86_64 How reproducible: Run a simple service that doesn't notify the watchdog quickly enough - copy attached minimal-service.service into /lib/systemd/system/, compile attached test program: gcc minimal-service.c -o /tmp/minimal-service -l systemd-daemon Steps to Reproduce: 1. cp minimal-service.service /lib/systemd/system/ 2. gcc minimal-service.c -o /tmp/minimal-service -l systemd-daemon 3. Start the service: systemctl start minimal-service.service Actual results: Multiple minimal-service processes are started: systemctl status minimal-service.service minimal-service.service - Test service Loaded: loaded (/usr/lib/systemd/system/minimal-service.service; static) Active: active (running) since Mon 2013-07-08 15:45:32 CDT; 707ms ago Main PID: 3151 (minimal-service) CGroup: name=systemd:/system/minimal-service.service ├─3118 /tmp/minimal-service ├─3122 /tmp/minimal-service ├─3127 /tmp/minimal-service ├─3132 /tmp/minimal-service ├─3145 /tmp/minimal-service ├─3146 /tmp/minimal-service ├─3147 /tmp/minimal-service ├─3148 /tmp/minimal-service ├─3149 /tmp/minimal-service ├─3150 /tmp/minimal-service └─3151 /tmp/minimal-service Expected results: A single minimal-service executable is running - original gets killed and a new one started. Additional info: Adding systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M to the kernel command line gives some more details in dmesg: [ 358.174559] systemd[1]: Starting Test service... [ 358.174790] systemd[1]: About to execute: /tmp/minimal-service [ 358.177418] systemd[1]: Forked /tmp/minimal-service as 2252 [ 358.178227] systemd[2252]: Executing: /tmp/minimal-service [ 358.178571] systemd[1]: minimal-service.service changed dead -> running [ 358.178602] systemd[1]: Job minimal-service.service/start finished, result=done [ 358.178622] systemd[1]: Started Test service. [ 358.179508] systemd[1]: Got notification message for unit minimal-service.service [ 358.179516] systemd[1]: minimal-service.service: Got message [ 358.179522] systemd[1]: minimal-service.service: got WATCHDOG=1 [ 358.597560] systemd[1]: Accepted connection on private bus. [ 358.598282] systemd[1]: Got D-Bus request: org.freedesktop.DBus.Properties.GetAll() on /org/freedesktop/systemd1/unit/minimal_2dservice_2eservice [ 358.598681] systemd[1]: SELinux access check scon=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tcon=unconfined_u:object_r:systemd_unit_file_t:s0 tclass=service perm=status path=/usr/lib/systemd/system/minimal-service.service cmdline=systemctl status minimal-service.service: 0 [ 358.600549] systemd[1]: Got D-Bus request: org.freedesktop.DBus.Local.Disconnected() on /org/freedesktop/DBus/Local [ 360.149983] systemd[1]: Accepted connection on private bus. [ 360.151527] systemd[1]: Got D-Bus request: org.freedesktop.DBus.Properties.GetAll() on /org/freedesktop/systemd1/unit/minimal_2dservice_2eservice [ 360.152003] systemd[1]: SELinux access check scon=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tcon=unconfined_u:object_r:systemd_unit_file_t:s0 tclass=service perm=status path=/usr/lib/systemd/system/minimal-service.service cmdline=systemctl status minimal-service.service: 0 [ 360.156743] systemd[1]: Got D-Bus request: org.freedesktop.DBus.Local.Disconnected() on /org/freedesktop/DBus/Local [ 361.179685] systemd[1]: minimal-service.service watchdog timeout! [ 361.179707] systemd[1]: minimal-service.service changed running -> failed [ 361.179936] systemd[1]: Unit minimal-service.service entered failed state. [ 361.180334] systemd[1]: minimal-service.service changed failed -> auto-restart [ 361.280498] systemd[1]: minimal-service.service holdoff time over, scheduling restart. [ 361.280511] systemd[1]: Trying to enqueue job minimal-service.service/restart/fail [ 361.280684] systemd[1]: Installed new job minimal-service.service/restart as 2352 [ 361.280701] systemd[1]: Enqueued job minimal-service.service/restart as 2352 [ 361.280707] systemd[1]: minimal-service.service scheduled restart job. [ 361.280725] systemd[1]: Stopping Test service... [ 361.280760] systemd[1]: minimal-service.service changed auto-restart -> dead [ 361.280884] systemd[1]: Job minimal-service.service/restart finished, result=done [ 361.280901] systemd[1]: Converting job minimal-service.service/restart -> minimal-service.service/start [ 361.281001] systemd[1]: Starting Test service... [ 361.281190] systemd[1]: About to execute: /tmp/minimal-service [ 361.281587] systemd[1]: Forked /tmp/minimal-service as 2261 [ 361.281980] systemd[1]: minimal-service.service changed dead -> running [ 361.282001] systemd[1]: Job minimal-service.service/start finished, result=done [ 361.282120] systemd[1]: Started Test service. [ 361.282315] systemd[2261]: Executing: /tmp/minimal-service [ 361.283873] systemd[1]: Got notification message for unit minimal-service.service [ 361.283887] systemd[1]: minimal-service.service: Got message [ 361.283897] systemd[1]: minimal-service.service: got WATCHDOG=1 A manual stop (which does work as expected): [ 920.715878] systemd[1]: Stopping Test service... [ 920.716621] systemd[1]: minimal-service.service changed running -> stop-sigterm [ 920.716687] systemd[1]: Set up jobs progress timerfd. [ 920.716930] systemd[1]: Got D-Bus request: org.freedesktop.systemd1.Manager.GetUnit() on /org/freedesktop/systemd1 [ 920.719187] systemd[1]: SELinux access check scon=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tcon=unconfined_u:object_r:systemd_unit_file_t:s0 tclass=service perm=status path=/usr/lib/systemd/system/minimal-service.service cmdline=systemctl stop minimal-service.service: 0 [ 920.719408] systemd[1]: Received SIGCHLD from PID 2464 (minimal-service). [ 920.719561] systemd[1]: Got SIGCHLD for process 2464 (minimal-service) [ 920.719868] systemd[1]: Child 2464 died (code=killed, status=15/TERM) [ 920.719976] systemd[1]: Got SIGCHLD for process 2468 (minimal-service) [ 920.721346] systemd[1]: Child 2468 died (code=killed, status=15/TERM) [ 920.721372] systemd[1]: Child 2468 belongs to minimal-service.service [ 920.721412] systemd[1]: minimal-service.service: main process exited, code=killed, status=15/TERM [ 920.721745] systemd[1]: minimal-service.service changed stop-sigterm -> dead [ 920.722363] systemd[1]: Job minimal-service.service/stop finished, result=done [ 920.722397] systemd[1]: Stopped Test service.
Created attachment 770676 [details] Simple watchdog process service file
Created attachment 770677 [details] Executable that notifies the watchdog too slowly
Fixed in 205/206.
This was actually included in the last release. In the next update I'm going to push a follow-up commit which runs ExecStartPost= on watchdog failure.