Bug 1556931

Summary: "systemctl start" sits forever waiting when issuing systemd-reload at the same time
Product: Red Hat Enterprise Linux 7 Reporter: Renaud Métrich <rmetrich>
Component: systemdAssignee: systemd-maint
Status: CLOSED DUPLICATE QA Contact: qe-baseos-daemons
Severity: medium Docs Contact:
Priority: medium    
Version: 7.4CC: dtardon, systemd-maint-list, usrleon
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-22 08:08:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Reproducer
none
strace of working/non-working "systemctl start" command none

Description Renaud Métrich 2018-03-15 14:53:05 UTC
Description of problem:

When starting an already started service using "systemctl start <service>", and that <service> requires another <service2> to be started, and finally issuing a systemctl daemon-reload, we can see that the "systemctl start <service1>" command sits forever polling the private systemd socket (/run/systemd/private).


Version-Release number of selected component (if applicable):

systemd-219-42.el7_4.10.x86_64


How reproducible:

Always


Steps to Reproduce:
1. Install the "test" and "test-init" services from attached reproducer

  # cd / && tar xf /root/reproducer.tar

2. Issue a daemon-reload and Start the "test" service

  # systemctl daemon-reload
  # systemctl start test.service

3. Run the scenario

  # systemctl start test & sleep 0.1; systemctl daemon-reload; sleep 10; ps -eaf | grep -i systemctl; sleep 1; pkill systemctl


Actual results:

root      2371  1101  0 15:34 pts/0    00:00:00 systemctl start test
root      2400  1101  0 15:34 pts/0    00:00:00 grep --color=auto -i systemctl


Expected results:

No "systemctl start test" after 10 seconds


Additional info:

In the reproducer.tar archive, there is the journal at that time (debug mode, "systemctl.journal") and strace of the command ("systemctl.log")

Comment 2 Renaud Métrich 2018-03-15 14:54:10 UTC
Created attachment 1408455 [details]
Reproducer

Comment 3 Renaud Métrich 2018-03-15 15:00:57 UTC
I cannot reproduce the issue if the "test" service is inactive.
The "test" service requires a stateless "test-init" service.
Having the "test-init" with RemainAfterExit=yes (stateful service) and "test-init" already started doesn't make the issue happen.

This seems the issue is due to having the "test-init" service be spawned at the time of the systemd-reload.

Comment 4 Renaud Métrich 2018-03-15 15:15:56 UTC
Using strace without "-x" (no hex dump), we can indeed see the query on test-init.service:

3683  16:12:11.210692 recvmsg(3<socket:[40648]>, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\3\1s\0\n\0\0\0JobRemoved\0\0\0\0\0\0\10\1g\0\4uoss\0\0\0\0\0\0\0\265\f\0\0\"\0\0\0/org/freedesktop/systemd1/job/3253\0\0\21\0\0\0test-init.service\0\0\0\4\0\0\0done\0", 197}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, {pid=1, uid=0, gid=0}}], msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 197 <0.000009>

Then the status (inactive/dead):

3683  16:12:11.210861 recvmsg(3<socket:[40648]>, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1/unit/test_2dinit_2eservice\0\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0\35\0\0\0org.freedesktop.systemd1.Unit\0\0\0\240\2\0\0\v\0\0\0ActiveState\0\1s\0\0\10\0\0\0inactive\0\0\0\0\0\0\0\0\10\0\0\0SubState\0\1s\0\4\0\0\0dead\0\0\0\0\0\0\0\0\25\0\0\0InactiveExitTimestamp\0\1t\0\0\0\0\227Oi\352tg\5\0\36\0\0\0InactiveExitTimestampMonotonic\0\1t\0\0\0\272\311\242\315\0\0\0\0\24\0\0\0ActiveEnterTimestamp\0\1t\0\0\0\0\0\313\24\272\272tg\5\0\35\0\0\0ActiveEnterTimestampMonotonic\0\1t\0\0\0\0\356\216\363\235\0\0\0\0\23\0\0\0ActiveExitTimestamp\0\1t\0\0\0\0\0\0\2021\253\347tg\5\0\34\0\0\0ActiveExitTimestampMonotonic\0\1t\0\0\0\0\0\244\253\344\312\0\0\0\0\26\0\0\0InactiveEnterTimestamp\0\1t\0\0\0\352\323x\352tg\5\0\37\0\0\0InactiveEnterTimestampMonotonic\0\1t\0\0\rN\262\315\0\0\0\0\3\0\0\0Job\0\4(uo)\0\0\0\0\0\0\0\1\0\0\0/\0\0\0\0\0\0\0\17\0\0\0ConditionResult\0\1b\0\0\1\0\0\0\0\0\0\0\f\0\0\0AssertResult\0\1b\0\1\0\0\0\22\0\0\0ConditionTimestamp\0\1t\0\0\0\0\0\0\0TLi\352tg\5\0\33\0\0\0ConditionTimestampMonotonic\0\1t\0\0\0\0\0\0w\306\242\315\0\0\0\0\17\0\0\0AssertTimestamp\0\1t\0\0TLi\352tg\5\0\30\0\0\0AssertTimestampMonotonic\0\1t\0w\306\242\315\0\0\0\0\0\0\0\0", 860}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, {pid=1, uid=0, gid=0}}], msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 860 <0.000008>


And finally some recvmsg/poll forever until command is killed:

3683  16:12:11.210903 recvmsg(3<socket:[40648]>, 0x7fff169f2f10, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) <0.000008>
3683  16:12:11.210935 ppoll([{fd=3<socket:[40648]>, events=POLLIN}], 1, NULL, NULL, 8) = ? ERESTARTNOHAND (To be restarted if no handler) <10.424494>

Comment 5 Renaud Métrich 2018-03-16 10:01:08 UTC
Created attachment 1408708 [details]
strace of working/non-working "systemctl start" command

Comment 6 Renaud Métrich 2018-03-16 10:10:32 UTC
Comparing systemctl_ok.strace and systemctl_ko.strace, we can see the following:

- sendmsg() are identical
- recvmsg() are identical, up to reading test-init.service status

recvmsg() missing from systemctl_ko.strace is the following: notification job "test.service" is done

systemd likely 'lost' that information while reloading

Comment 7 Renaud Métrich 2018-03-19 08:05:43 UTC
Also happens with other services which depend on another service and the latter has to run (e.g. sshd.service depends on sshd-keygen.service and there is no host key in /etc/ssh/).

Comment 8 Maxim Lvov 2018-05-24 18:12:52 UTC
I've got issue with same symptoms, in my case there was a problem with dbus service.
System Centos-7.5 built from scratch. dbus=1.10 systemd=219
1. dbus creates socket in /run/dbus/ but services like hostnamectl and systemd expect it in /var/run/dbus
2. Package systemd-sysv needs to be installed

Comment 9 Maxim Lvov 2018-05-25 19:35:43 UTC
(In reply to Maxim Lvov from comment #8)
> I've got issue with same symptoms, in my case there was a problem with dbus
> service.
> System Centos-7.5 built from scratch. dbus=1.10 systemd=219
> 1. dbus creates socket in /run/dbus/ but services like hostnamectl and
> systemd expect it in /var/run/dbus
> 2. Package systemd-sysv needs to be installed

Actually my issue was that during image build /var/run/ directory was created and thus system-tmpfiles doesn't create link because in /usr/lib/tmpfiles.d/var.conf it's configured to create it only if directory/link doesn't exist. For some reason it wasn't issue in 7.4

Comment 10 David Tardon 2018-06-22 08:08:03 UTC

*** This bug has been marked as a duplicate of bug 1351793 ***