Bug 753586 - try-restart of a service scheduled to start disturbs its ordering (should be a NOP)
Summary: try-restart of a service scheduled to start disturbs its ordering (should be ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 16
Hardware: All
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
: 645569 814946 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-11-13 16:20 UTC by Tomasz Kepczynski
Modified: 2012-06-05 23:10 UTC (History)
19 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-05-29 10:33:40 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
dmesg output, grepped by 'systemd\[1\]' (145.95 KB, text/plain)
2011-11-14 17:17 UTC, Tomasz Kepczynski
no flags Details
self-contained test program (2.62 KB, text/plain)
2012-03-21 23:30 UTC, Jay Fenlason
no flags Details

Description Tomasz Kepczynski 2011-11-13 16:20:41 UTC
Description of problem:
I have a system which uses /etc/init.d/network to bringup network interfaces (bridging is involved). The system also is a part of nis domain but ypbind fails to start and network logins are impossible. autofs is also affected as it is supposed to use some nis maps.

Version-Release number of selected component (if applicable):
systemd-37-3.fc16.x86_64
ypbind-1.33-7.fc16.x86_64
autofs-5.0.6-3.fc16.x86_64

How reproducible:
Always when system boots.

Steps to Reproduce:
1. disable NetowrkManager
2. enable network service
3. enable ypbind
4. reboot system
5. login as root and issue ypwhich

Actual results:
cannot communicate with ypbind (or something like that)

Expected results:
server the system is bound to

Additional info:
Used to work before systemd times (Fedora 12 for sure).

Comment 1 Michal Schmidt 2011-11-14 09:50:42 UTC
What does "systemctl status network.service ypbind.service" say?

Comment 2 Tomasz Kepczynski 2011-11-14 10:40:13 UTC
It was showing that ypbind failed, I will post the whole output in the evening, when I get back home.

Comment 3 Michal Schmidt 2011-11-14 11:18:55 UTC
It's more likely a problem with ypbind than systemd. Reassigning.

Comment 4 Tomasz Kepczynski 2011-11-14 11:22:30 UTC
Unless it was started too early (for example before rpcbind). I am able to start it from commandline (using systemctl) without problems.

Comment 5 Michal Schmidt 2011-11-14 11:41:41 UTC
To check that, please boot with "log_buf_len=1M systemd.log_level=debug systemd.log_target=kmsg" and attach the output of the "dmesg" command here.

Comment 6 Honza Horak 2011-11-14 12:24:30 UTC
Do you see anything ypbind related in /var/log/messages?

Comment 7 Tomasz Kepczynski 2011-11-14 12:26:11 UTC
Yes, there were a few messages, I will include them here together with systemd log and "systemctl status network.service ypbind.service" later today.

Comment 8 Tomasz Kepczynski 2011-11-14 17:15:22 UTC
/var/log/messages:

Nov 14 17:50:29 vesemir setsebool: The allow_ypbind policy boolean was changed to 1 by root
Nov 14 17:50:29 vesemir dbus[1320]: avc:  received policyload notice (seqno=2)
Nov 14 17:50:29 vesemir dbus-daemon[1320]: dbus[1320]: avc:  received policyload notice (seqno=2)
Nov 14 17:50:29 vesemir dbus[1320]: [system] Reloaded configuration
Nov 14 17:50:29 vesemir dbus-daemon[1320]: dbus[1320]: [system] Reloaded configuration
Nov 14 17:50:29 vesemir ypbind[1622]: Cannot register service: RPC: Unable to receive; errno = Connection refused
Nov 14 17:50:29 vesemir ypbind[1622]: Unable to register (YPBINDPROG, YPBINDVERS, udp).
Nov 14 17:50:29 vesemir ypbind: Binding NIS service
Nov 14 17:50:40 vesemir dbus[1320]: [system] Activating via systemd: service name='org.freedesktop.ConsoleKit' unit='console-kit-daemon.service'
Nov 14 17:50:40 vesemir dbus-daemon[1320]: dbus[1320]: [system] Activating via systemd: service name='org.freedesktop.ConsoleKit' unit='console-kit-daemon.service'
Nov 14 17:50:40 vesemir dbus[1320]: [system] Activating service name='org.freedesktop.PolicyKit1' (using servicehelper)
Nov 14 17:50:40 vesemir dbus-daemon[1320]: dbus[1320]: [system] Activating service name='org.freedesktop.PolicyKit1' (using servicehelper)
Nov 14 17:50:40 vesemir polkitd[1697]: started daemon version 0.102 using authority implementation `local' version `0.102'
Nov 14 17:50:40 vesemir dbus[1320]: [system] Successfully activated service 'org.freedesktop.PolicyKit1'
Nov 14 17:50:40 vesemir dbus-daemon[1320]: dbus[1320]: [system] Successfully activated service 'org.freedesktop.PolicyKit1'
Nov 14 17:50:40 vesemir dbus[1320]: [system] Successfully activated service 'org.freedesktop.ConsoleKit'
Nov 14 17:50:40 vesemir dbus-daemon[1320]: dbus[1320]: [system] Successfully activated service 'org.freedesktop.ConsoleKit'
Nov 14 17:51:29 vesemir ypbind-post-waitbind[1623]: rpcinfo: can't contact portmapper: RPC: Timed out
Nov 14 17:51:31 vesemir ypbind: Binding took 62 seconds
Nov 14 17:51:31 vesemir ypbind: NIS domain: nis.jot23.org, NIS server:

-------------------------------------------------------
systemctl status network.service ypbind.service output:

network.service - LSB: Bring up/down networking
          Loaded: loaded (/etc/rc.d/init.d/network)
          Active: active (running) since Mon, 14 Nov 2011 17:51:32 +0100; 7min ago
         Process: 1254 ExecStart=/etc/rc.d/init.d/network start (code=exited, status=0/SUCCESS)
          CGroup: name=systemd:/system/network.service
                  └ 1747 /sbin/dhclient -H vesemir -1 -q -cf /etc/dhcp/dhcl...

ypbind.service - NIS/YP (Network Information Service) Clients to NIS Domain Binder
          Loaded: loaded (/lib/systemd/system/ypbind.service; enabled)
          Active: failed since Mon, 14 Nov 2011 17:51:31 +0100; 7min ago
         Process: 1623 ExecStartPost=/usr/lib/ypbind/ypbind-post-waitbind (code=exited, status=0/SUCCESS)
         Process: 1622 ExecStart=/usr/sbin/ypbind -n $OTHER_YPBIND_OPTS (code=exited, status=1/FAILURE)
         Process: 1620 ExecStartPre=/usr/sbin/setsebool allow_ypbind=1 (code=exited, status=0/SUCCESS)
         Process: 1617 ExecStartPre=/usr/lib/ypbind/ypbind-pre-setdomain (code=exited, status=0/SUCCESS)
          CGroup: name=systemd:/system/ypbind.service

------------------------------------------------------------
When I give systemctl restart ypbind.service from command prompt then the status changes:

network.service - LSB: Bring up/down networking
          Loaded: loaded (/etc/rc.d/init.d/network)
          Active: active (running) since Mon, 14 Nov 2011 17:51:32 +0100; 8min ago
         Process: 1254 ExecStart=/etc/rc.d/init.d/network start (code=exited, status=0/SUCCESS)
          CGroup: name=systemd:/system/network.service
                  └ 1747 /sbin/dhclient -H vesemir -1 -q -cf /etc/dhcp/dhcl...

ypbind.service - NIS/YP (Network Information Service) Clients to NIS Domain Binder
          Loaded: loaded (/lib/systemd/system/ypbind.service; enabled)
          Active: active (running) since Mon, 14 Nov 2011 18:00:13 +0100; 8s ago
         Process: 1947 ExecStartPost=/usr/lib/ypbind/ypbind-post-waitbind (code=exited, status=0/SUCCESS)
         Process: 1944 ExecStartPre=/usr/sbin/setsebool allow_ypbind=1 (code=exited, status=0/SUCCESS)
         Process: 1941 ExecStartPre=/usr/lib/ypbind/ypbind-pre-setdomain (code=exited, status=0/SUCCESS)
        Main PID: 1946 (ypbind)
          CGroup: name=systemd:/system/ypbind.service
                  └ 1946 /usr/sbin/ypbind -n

Comment 9 Tomasz Kepczynski 2011-11-14 17:17:09 UTC
Created attachment 533577 [details]
dmesg output, grepped by 'systemd\[1\]'

Comment 10 Honza Horak 2011-11-15 11:57:18 UTC
This is what is important in dmesg output.

[    8.927416] systemd[1]: Installed new job rpcbind.socket/start as 64
[    8.927481] systemd[1]: Installed new job rpcbind.service/start as 84
[    8.927512] systemd[1]: Installed new job ypbind.service/start as 96
[   16.348618] systemd[1]: rpcbind.socket changed dead -> listening
[   16.348629] systemd[1]: Job rpcbind.socket/start finished, result=done

So far so good.

[   21.637868] systemd[1]: Trying to enqueue job ypbind.service/try-restart/replace
[   21.637916] systemd[1]: Installed new job ypbind.service/restart as 355
[   21.637923] systemd[1]: Enqueued job ypbind.service/try-restart as 355

That's the first problem - ypbind is started the second time - who is restarting it and why?

[   21.670211] systemd[1]: ypbind.service changed dead -> start-pre
[   21.763365] systemd[1]: About to execute: /usr/sbin/ypbind -n $OTHER_YPBIND_OPTS
[   21.841852] systemd[1]: Child 1622 belongs to ypbind.service
[   21.841859] systemd[1]: ypbind.service: main process exited, code=exited, status=1

ypbind doesn't start, which is right, because rpcbind is not started yet.

[   21.847924] systemd[1]: Incoming traffic on rpcbind.socket
[   21.847976] systemd[1]: Trying to enqueue job rpcbind.service/start/replace
[   21.848241] systemd[1]: Enqueued job rpcbind.service/start as 84
[   21.848254] systemd[1]: rpcbind.socket changed listening -> running

rpcbind is started after rpcbind's socket is touched, but it's too late. 

There is "After=rpcbind.service" in ypbind's unit file, shouldn't rpcbind have been already started when ypbind is executed? I think it should.

Comment 11 Michal Schmidt 2011-11-15 13:36:30 UTC
(In reply to comment #10)
> [   21.637868] systemd[1]: Trying to enqueue job
> ypbind.service/try-restart/replace
> [   21.637916] systemd[1]: Installed new job ypbind.service/restart as 355
> [   21.637923] systemd[1]: Enqueued job ypbind.service/try-restart as 355
> 
> That's the first problem - ypbind is started the second time - who is
> restarting it and why?

Probably /etc/dhcp/dhclient.d/nis.sh

> There is "After=rpcbind.service" in ypbind's unit file, shouldn't rpcbind have
> been already started when ypbind is executed? I think it should.

Right. I don't see nis.sh using "--ignore-dependencies", so it seems that the bug is in systemd when processing the try-restart request for a unit that already has a start job installed and is waiting for its dependencies.

I'll try to reproduce it with a simpler test case...

Comment 12 Michal Schmidt 2011-11-15 13:44:42 UTC
Reproduced with a couple of units:

$ cat sleep.service 
[Unit]
Description=Sleep for 1 minute

[Service]
Type=oneshot
ExecStart=/bin/sleep 60

$ cat hello.service 
[Unit]
Description=Hello World
After=sleep.service
Wants=sleep.service

[Service]
Type=oneshot
ExecStart=/bin/echo "Hello World"


and running "systemctl try-restart hello.service" while "systemctl start hello.service" is still waiting.

Comment 13 Stephen Tweedie 2011-11-26 13:54:31 UTC
I'm seeing the same: it's making yp boot fail on all my systems updated from f14 to f16.

I seem to have a workaround for now by starting rpcbind.service manally in /etc/dhcp/dhclient.d/nis.sh.

Comment 14 Jóhann B. Guðmundsson 2012-01-29 20:09:02 UTC
Is this still an issue on a fully updated release or can this bug be closed?

Comment 15 Tomasz Kepczynski 2012-01-30 18:56:07 UTC
I didn't see any indication of the fix above so I didn't expect it to work. I was right.
No, this bug can't be closed, it still does not work.

<rant mode on>
It is suprising that people in XXI century still believe in miracles.
<rant mode off>

Comment 16 Attila 2012-02-21 10:28:42 UTC
Same on FC15 and FC16 32Bit. Here what i did:

systemctl start ypbind.service

systemctl status ypbind.service

ypbind.service - NIS/YP (Network Information Service) Clients to NIS Domain Binder
          Loaded: loaded (/lib/systemd/system/ypbind.service; enabled)
          Active: active (running) since Tue, 21 Feb 2012 10:43:03 +0100; 10s ago
         Process: 2387 ExecStartPost=/usr/lib/ypbind/ypbind-post-waitbind (code=exited, status=0/SUCCESS)
         Process: 2383 ExecStartPre=/usr/sbin/setsebool allow_ypbind=1 (code=exited, status=1/FAILURE)
         Process: 2381 ExecStartPre=/usr/lib/ypbind/ypbind-pre-setdomain (code=exited, status=0/SUCCESS)
        Main PID: 2386 (ypbind)
          CGroup: name=systemd:/system/ypbind.service
                  └ 2386 /usr/sbin/ypbind -n


_________________________________________________________________________


dmesg |grep ypbind

[   22.465608] ypbind[966]: Binding NIS service
[   22.511117] ypbind[964]: Kann den Service nicht registrieren: RPC: Kann nicht empfangen; errno = Verbindungsaufbau abgelehnt
[   22.511183] ypbind[964]: Unable to register (YPBINDPROG, YPBINDVERS, udp).
[   22.512153] systemd[1]: ypbind.service: main process exited, code=exited, status=1
[   82.674666] ypbind-post-waitbind[965]: rpcinfo: can't contact portmapper: RPC: Timed out
[   84.702656] ypbind[972]: Binding took 62 seconds
[   84.704746] ypbind[974]: NIS domain: XXX, ypbind not registered with rpcbind.
[   84.715488] systemd[1]: Unit ypbind.service entered failed state.


_________________________________________________________________________

My workaround is to disable the service ypbind (systemctl disable ypbind.service) and add this to /etc/rc.d/rc.local

systemctl start ypbind.service

Is there any chance to fix this bug?

Comment 17 Michal Schmidt 2012-03-01 15:21:13 UTC
*** Bug 645569 has been marked as a duplicate of this bug. ***

Comment 19 Jay Fenlason 2012-03-21 23:27:02 UTC
The problem is when job_type_merge() in job.c merges a JOB_TRY_RESTART and a JOB_START job, it turns the TRY_RESTART job into a RESTART job, but loses all the dependencies on the JOB_START job.  So it tries to do the RESTART immediately, even though none of the dependencies of the JOB_START have been satisfied.

The systemd developers have know about this bug since 2010, and there's been no progress toward fixing it, even though there's a clear reproducer in comment #12, so don't hold your breath waiting for a fix.

Comment 20 Jay Fenlason 2012-03-21 23:30:54 UTC
Created attachment 571888 [details]
self-contained test program

Save this file as /usr/local/bin/abc.pl and make it executable.
Run abc.pl create m
do systemctl start m.service
Observe that only the a.service is started.

Comment 21 Michal Schmidt 2012-03-23 10:21:37 UTC
(In reply to comment #19)
> The problem is when job_type_merge() in job.c merges a JOB_TRY_RESTART and a
> JOB_START job, it turns the TRY_RESTART job into a RESTART job, but loses all
> the dependencies on the JOB_START job.  So it tries to do the RESTART
> immediately, even though none of the dependencies of the JOB_START have been
> satisfied.

Right. But the fix should not consist of transferring the dependencies into the merged job. The right behaviour is to have the TRY_RESTART job take a shortcut and finish immediately with success - because the service is still inactive, so try-restart should be a NOP. The scheduled START job should not be disturbed.

Comment 22 Michal Schmidt 2012-03-28 09:22:03 UTC
This should avoid the running of the services out of order:
http://cgit.freedesktop.org/systemd/systemd/commit/?id=dd17d38879503f018fdf6bbff822970afcddb6f1

It's still not the optimal solution though (see comment #21).

Comment 23 Jay Fenlason 2012-03-28 20:25:41 UTC
I built a local systemd with that patch and tried it, but it didn't improve matters with either ypbind or my test program.

Comment 24 Michal Schmidt 2012-03-28 21:31:52 UTC
Jay,

could you test if replacing
  service ypbind condrestart
with
  systemctl try-restart --no-block ypbind.service
in /etc/dhcp/dhclient.d/nis.sh helps?

And similarly, in your abc.pl try adding "--no-block" to the try-restart call.

Comment 25 Jay Fenlason 2012-03-29 18:10:14 UTC
That seems to make them both work, at least on lab3.  Lab4 is being more problematic, but its configuration appears to be messed up.  In any case, perhaps we should change how systemd treats service condrestart requests so as to undo breakage of existing scripts?

Comment 26 Michal Schmidt 2012-04-25 07:12:56 UTC
*** Bug 814946 has been marked as a duplicate of this bug. ***

Comment 27 Michal Schmidt 2012-04-25 16:40:14 UTC
Upstream patch to complete the fix:

http://cgit.freedesktop.org/systemd/systemd/commit/?id=e0209d83e7b30153f43b1a633c955f66eb2c2e4a

Comment 28 Stephen Tweedie 2012-05-04 11:37:40 UTC
Will there be an f16 build with this fix?

Comment 29 Michal Schmidt 2012-05-04 23:44:42 UTC
F16 candidate build:
http://koji.fedoraproject.org/koji/taskinfo?taskID=4054950

Comment 30 Stephen Tweedie 2012-05-08 18:09:11 UTC
Thanks!  Running it locally now.  I can only reproduce the problem intermittently, though; I'll see how it looks in a bit.

Comment 31 Fedora Update System 2012-05-18 23:38:08 UTC
systemd-44-9.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/systemd-44-9.fc17

Comment 32 Fedora Update System 2012-05-18 23:57:18 UTC
systemd-37-20.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/systemd-37-20.fc16

Comment 33 Fedora Update System 2012-05-19 18:33:22 UTC
Package systemd-44-9.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing systemd-44-9.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-8127/systemd-44-9.fc17
then log in and leave karma (feedback).

Comment 34 Fedora Update System 2012-05-29 10:33:40 UTC
systemd-44-12.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 35 Fedora Update System 2012-06-05 23:10:09 UTC
systemd-37-25.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.