Bug 917075 - Sometimes systemd can't restart sendmail service
Summary: Sometimes systemd can't restart sendmail service
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 18
Hardware: x86_64
OS: Linux
unspecified
urgent
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-03-01 15:52 UTC by Konstantin Volkov
Modified: 2013-05-16 02:58 UTC (History)
10 users (show)

Fixed In Version: systemd-201-2.fc18.6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-05-16 02:58:28 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
patch to fix race issues with pid files (6.57 KB, patch)
2013-03-03 14:23 UTC, Zbigniew Jędrzejewski-Szmek
no flags Details | Diff

Description Konstantin Volkov 2013-03-01 15:52:32 UTC
Description of problem:

Sometimes systemd can't restart sendmail service: there the race between inotify_add_watch calls and /run/sendmail.pid creation/write.

Version-Release number of selected component (if applicable):
systemd-197-1.fc18.1.x86_64

How reproducible:
Depends on host CPU, approximately 1 of 1000 for Intel Core i5.


Steps to Reproduce:
1. Run the

while [ 1 -eq 1 ]; do systemctl restart sendmail; [ $? -ne 0 ] && break; sleep 2; done

script
2. Wait for a hour
3. See for

---
Job for sendmail.service failed. See 'systemctl status sendmail.service' and 'journalctl -xn' for details.
---
  
Actual results:
Fail

Expected results:
Starts OK

Additional info:

systemd debug logs diff for success/fail restart:

---
 sendmail.service changed start-pre -> start
-Child 927 belongs to sendmail.service
+Child 872 belongs to sendmail.service
 sendmail.service: control process exited, code=exited status=0
 sendmail.service got final SIGCHLD for state start
 PID file /run/sendmail.pid not readable (yet?) after start.
 Setting watch for sendmail.service's PID file /run/sendmail.pid
-inotify event for sendmail.service
-Main PID loaded: 928
+Got notification message for unit httpd.service
+httpd.service: Got message
+httpd.service: got STATUS=Total requests: 0; Current requests/sec: 0; Current traffic:   0 B/sec
+Got notification message for unit httpd.service
+httpd.service: Got message
+httpd.service: got STATUS=Total requests: 0; Current requests/sec: 0; Current traffic:   0 B/sec
+Got notification message for unit httpd.service
+httpd.service: Got message
+httpd.service: got STATUS=Total requests: 0; Current requests/sec: 0; Current traffic:   0 B/sec
+Got notification message for unit httpd.service
+httpd.service: Got message
+httpd.service: got STATUS=Total requests: 0; Current requests/sec: 0; Current traffic:   0 B/sec
+Got notification message for unit httpd.service
+httpd.service: Got message
+httpd.service: got STATUS=Total requests: 0; Current requests/sec: 0; Current traffic:   0 B/sec
+Got notification message for unit httpd.service
+httpd.service: Got message
+httpd.service: got STATUS=Total requests: 0; Current requests/sec: 0; Current traffic:   0 B/sec
+Got notification message for unit httpd.service
+httpd.service: Got message
+httpd.service: got STATUS=Total requests: 0; Current requests/sec: 0; Current traffic:   0 B/sec
+Got notification message for unit httpd.service
+httpd.service: Got message
+httpd.service: got STATUS=Total requests: 0; Current requests/sec: 0; Current traffic:   0 B/sec
+Got notification message for unit httpd.service
+httpd.service: Got message
+httpd.service: got STATUS=Total requests: 0; Current requests/sec: 0; Current traffic:   0 B/sec
+sendmail.service operation timed out. Terminating.
 Stopping watch for sendmail.service's PID file /run/sendmail.pid
-sendmail.service changed start -> running
-Job sendmail.service/start finished, result=done
---

As you can see, the inotify event for sendmail.pid is lost.

This caused by race in systemd path_spec_watch() function:

sendmail create the /run/sendmail.pid late and keeps it open. So, if the file created after the first inotify_add_watch() call from path_spec_watch(), but before second inotify_add_watch() and sendmail don't write the pid before path_spec_watch() and service_retry_pid_file() it will never catched.

Here is the patch, that fix it:

---
--- path.c.orig 2013-03-01 15:28:51.312575945 -0500
+++ path.c      2013-03-01 15:37:46.991450760 -0500
@@ -92,6 +92,9 @@
                 if (!exists)
                         flags |= IN_DELETE_SELF | IN_ATTRIB | IN_CREATE | IN_MOVED_TO;
 
+                if (s->type == PATH_MODIFIED)
+                        flags |= IN_MODIFY;
+
                 if (inotify_add_watch(s->inotify_fd, k, flags) >= 0)
                         exists = true;
         } while (slash != k);
---

Please, apply.

Comment 1 Zbigniew Jędrzejewski-Szmek 2013-03-03 14:23:04 UTC
Created attachment 704566 [details]
patch to fix race issues with pid files

Just RFC at the moment

Comment 2 Zbigniew Jędrzejewski-Szmek 2013-03-03 14:23:46 UTC
Yep, there's indeed a race condition. It takes a few minutes to reproduce
if burst limits are removed. I used:
  StartLimitInterval=0
for sm-client.service and sendmail.service, and:
  RateLimitInterval=0
  RateLimitBurst=0
for journald.conf.

Sequence that you encountered:
PID is checked -> EEXIST
inotify_add_watch(dir/PID) -> EEXIST
daemons opens file
inotify_add_watch(dir, IN_MOVE_SELF | IN_DELETE_SELF | IN_ATTRIB | IN_CREATE | IN_MOVED_TO)
PID is checked -> EINVAL
daemon writes to file
-- hang --

Your proposal:
PID is checked -> EEXIST
inotify_add_watch(dir/PID) -> EEXIST
daemons opens file
inotify_add_watch(dir, IN_MOVE_SELF | IN_DELETE_SELF | IN_ATTRIB | IN_CREATE | IN_MOVED_TO | IN_MODIFY)
PID is checked -> EINVAL
daemon writes to file
inotify event
PID is checked -> OK

Scenario where it is not enough: dir1/dir2/PID
PID is checked -> EEXIST
inotify_add_watch(dir1/dir2/PID) -> EEXIST
inotify_add_watch(dir1/dir2) -> EEXIST
daemon creates dir1/dir2
inotify_add_watch(dir, IN_MOVE_SELF | IN_DELETE_SELF | IN_ATTRIB | IN_CREATE | IN_MOVED_TO)
PID is checked -> EINVAL
daemon opens file
daemon writes to file
-- hang --

I've prepared a patch which should make the watches robust. I'll give it some testing and if everything passes, I'll apply it in mainline. A release is imminent, so it should land in a package near you soon.

Comment 3 Zbigniew Jędrzejewski-Szmek 2013-03-03 23:01:54 UTC
I let 'while true; do sudo systemctl restart sendmail; [ $? -ne 0 ] && break; echo -n .; done' run in a loop since my last comment, and it didn't fail. With approx. 5 iterations per second, this gives 120000 iterations.

Comment 4 Zbigniew Jędrzejewski-Szmek 2013-03-04 07:11:15 UTC
Should be fixed in http://cgit.freedesktop.org/systemd/systemd/commit/?id=bc41f93e9.

Comment 5 Konstantin Volkov 2013-03-04 12:48:25 UTC
I checked with fix from:

http://cgit.freedesktop.org/systemd/systemd/commit/?id=bc41f93e9

Seems like it's fixed. Thanx!

Comment 6 Michal Schmidt 2013-03-04 12:59:20 UTC
Good work Zbigniew.

Comment 7 Fedora Update System 2013-04-10 20:13:11 UTC
systemd-201-2.fc18.1 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/systemd-201-2.fc18.1

Comment 8 Fedora Update System 2013-04-11 23:26:13 UTC
Package systemd-201-2.fc18.2:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing systemd-201-2.fc18.2'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-5452/systemd-201-2.fc18.2
then log in and leave karma (feedback).

Comment 9 Fedora Update System 2013-04-16 00:00:38 UTC
Package systemd-201-2.fc18.4:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing systemd-201-2.fc18.4'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-5452/systemd-201-2.fc18.4
then log in and leave karma (feedback).

Comment 10 Fedora Update System 2013-04-18 02:38:28 UTC
Package systemd-201-2.fc18.5:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing systemd-201-2.fc18.5'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-5452/systemd-201-2.fc18.5
then log in and leave karma (feedback).

Comment 11 Fedora Update System 2013-05-07 13:40:23 UTC
systemd-201-2.fc18.6 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/FEDORA-2013-5452/systemd-201-2.fc18.6

Comment 12 Fedora Update System 2013-05-09 10:02:34 UTC
Package systemd-201-2.fc18.6:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing systemd-201-2.fc18.6'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-5452/systemd-201-2.fc18.6
then log in and leave karma (feedback).

Comment 13 Fedora Update System 2013-05-16 02:58:28 UTC
systemd-201-2.fc18.6 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.