Bug 1250955 - Long-running type=oneshot script times out if the script calls systemctl daemon-reload
Long-running type=oneshot script times out if the script calls systemctl daem...
Status: CLOSED EOL
Product: Fedora
Classification: Fedora
Component: systemd (Show other bugs)
22
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: systemd-maint
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-08-06 07:21 EDT by Marius Vollmer
Modified: 2016-07-19 13:23 EDT (History)
11 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-07-19 13:23:17 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Reproducer (2.56 KB, text/plain)
2015-08-07 03:10 EDT, Marius Vollmer
no flags Details

  None (edit)
Description Marius Vollmer 2015-08-06 07:21:59 EDT
If you pass a long-running script to virt-builder --firstboot, systemd will kill it after a timeout.

I think systemd considers the whole run of the script as 'start-up', and will timeout after TimeoutStartSec / DefaultTimeoutStartSec.

libguestfs-tools-c-1.28.10-1.fc21.x86_64
Comment 1 Richard W.M. Jones 2015-08-06 07:37:57 EDT
We're using Type=oneshot, and systemd.service says:

       TimeoutStartSec=
           Configures the time to wait for start-up. If a daemon service does
           not signal start-up completion within the configured time, the
           service will be considered failed and will be shut down again.
           Takes a unit-less value in seconds, or a time span value such as
           "5min 20s". Pass "0" to disable the timeout logic. Defaults to
           DefaultTimeoutStartSec= from the manager configuration file, except
           when Type=oneshot is used, in which case the timeout is disabled by
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           default (see systemd-system.conf(5)).
           ^^^^^^^

If it's not doing that, then maybe this patch?

diff --git a/customize/firstboot.ml b/customize/firstboot.ml
index 29ec544..c15a8dd 100644
--- a/customize/firstboot.ml
+++ b/customize/firstboot.ml
@@ -92,6 +92,7 @@ ExecStart=%s/firstboot.sh start
 RemainAfterExit=yes
 StandardOutput=journal+console
 StandardError=inherit
+TimeoutStartSec=0
 
 [Install]
 WantedBy=default.target
Comment 2 Marius Vollmer 2015-08-07 02:22:04 EDT
Here is some evidence, for the record.

# journalctl -l -u firstboot.service | cat
-- Logs begin at Thu 2015-08-06 05:27:07 EDT, end at Thu 2015-08-06 05:42:16 EDT. --
Aug 06 05:27:10 localhost.localdomain systemd[1]: Starting virt-sysprep firstboot service...
Aug 06 05:27:10 localhost.localdomain firstboot.sh[405]: /usr/lib/virt-sysprep/firstboot.sh start
Aug 06 05:27:10 localhost.localdomain firstboot.sh[405]: Scripts dir: /usr/lib/virt-sysprep/scripts
Aug 06 05:27:10 localhost.localdomain firstboot.sh[405]: === Running /usr/lib/virt-sysprep/scripts/0001--tmp-setup-fraw-28058 ===

[...]

Aug 06 05:30:50 m10 firstboot.sh[405]: which                    x86_64 2.20-12.fc23                     rawhide  46 k
Aug 06 05:30:50 m10 firstboot.sh[405]: xfsprogs                 x86_64 3.2.4-1.fc24                     rawhide 892 k
Aug 06 05:30:50 m10 firstboot.sh[405]: xkeyboard-config         noarch 2.15-2.fc23                      rawhide 760 k
Aug 06 05:30:50 m10 firstboot.sh[405]: xz                       x86_64 5.2.1-3.fc23                     rawhide 147 k
Aug 06 05:30:50 m10 firstboot.sh[405]: xz-libs                  x86_64 5.2.1-3.fc23                     rawhide  90 k
Aug 06 05:30:50 m10 firstboot.sh[405]: zlib                     x86_64 1.2.8-8.fc23                     rawhide  95 k
Aug 06 05:30:50 m10 firstboot.sh[405]: Transaction Summary
Aug 06 05:30:50 m10 firstboot.sh[405]: ================================================================================
Aug 06 05:30:50 m10 firstboot.sh[405]: Install   27 Packages
Aug 06 05:30:50 m10 firstboot.sh[405]: Upgrade  253 Packages
Aug 06 05:30:50 m10 firstboot.sh[405]: Total download size: 204 M
Aug 06 05:30:50 m10 firstboot.sh[405]: Downloading Packages:
Aug 06 05:30:53 m10 systemd[1]: firstboot.service start operation timed out. Terminating.
Aug 06 05:30:53 m10 systemd[1]: Failed to start firstboot.service.
Aug 06 05:30:53 m10 systemd[1]: Unit firstboot.service entered failed state.
Aug 06 05:30:53 m10 systemd[1]: firstboot.service failed.
Comment 3 Marius Vollmer 2015-08-07 03:09:51 EDT
Adding TimeoutStartSec=0 did not help (or I have done it wrong).  I attach the script to reproduce this.

My hunch right now is this:

 - firstboot.service gets removed early on somehow (I didn't find
   the code that does this.)

 - Some time later someone executes "systemctl daemon-reload" (as might happen during package installation).

 - Systemd gets confused and times out the running firstboot.service despite Type=oneshot.
Comment 4 Marius Vollmer 2015-08-07 03:10:45 EDT
Created attachment 1060248 [details]
Reproducer
Comment 5 Marius Vollmer 2015-08-07 03:13:25 EDT
(In reply to Marius Vollmer from comment #4)
> Created attachment 1060248 [details]
> Reproducer

You probably need to adjust the qemu-kvm command.  The network config works for me, but it's magic as far as I am concerned.
Comment 6 Marius Vollmer 2015-08-07 03:30:50 EDT
Putting this at the top of the --firstboot script helps:

    ln -s /usr/lib/virt-sysprep/firstboot.service /etc/systemd/system/

The timout does not happen and the script finishes normally.  (Or actually, dnf hangs, which is the thing that I wanted to provoke.)
Comment 7 Richard W.M. Jones 2015-08-07 04:14:38 EDT
Do you have a simple reproducer for the bug?
Comment 8 Richard W.M. Jones 2015-08-07 04:15:38 EDT
Ignore that - the reproducer is in comment 4.
Comment 9 Richard W.M. Jones 2015-08-07 04:20:43 EDT
This was my attempt at a minimal reproducer (not successfully):

virt-builder fedora-22 --firstboot-command 'sleep 60; echo reloading; systemctl daemon-reload; echo reloaded; sleep 60'
qemu-kvm -nodefaults -display none -hda fedora-22.img -m 1024 -serial stdio

The output (lightly edited) is:

[    8.293979] firstboot.sh[397]: /usr/lib/virt-sysprep/firstboot.sh start
[    8.301537] firstboot.sh[397]: Scripts dir: /usr/lib/virt-sysprep/scripts
[    8.338522] firstboot.sh[397]: === Running /usr/lib/virt-sysprep/scripts/0001-sleep-60--echo-reloading--systemctl-daemon-reload--echo-relo ===
[   68.354411] firstboot.sh[397]: reloading
[   68.420697] systemd-sysv-generator[724]: Overwriting existing symlink /run/systemd/generator.late/network.service with real service
[   68.423146] systemd-sysv-generator[724]: Overwriting existing symlink /run/systemd/generator.late/netconsole.service with real service
[   68.466432] firstboot.sh[397]: reloaded

I suspect it's not daemon-reload that's causing the problem.
Comment 10 Marius Vollmer 2015-08-07 05:17:40 EDT
(In reply to Richard W.M. Jones from comment #9)
> This was my attempt at a minimal reproducer (not successfully):
> 
> virt-builder fedora-22 --firstboot-command 'sleep 60; echo reloading;
> systemctl daemon-reload; echo reloaded; sleep 60'

Try this:

    --firstboot-command 'sleep 10; echo reloading; systemctl daemon-reload; echo reloaded; sleep 300'

With that I get the timeout:

# systemctl status firstboot -l
● firstboot.service
   Loaded: not-found (Reason: No such file or directory)
   Active: failed (Result: timeout) since Fri 2015-08-07 05:14:45 EDT; 12s ago
 Main PID: 394 (code=killed, signal=TERM)

Aug 07 05:13:05 localhost.localdomain systemd[1]: Starting virt-sysprep firstboot service...
Aug 07 05:13:05 localhost.localdomain firstboot.sh[394]: /usr/lib/virt-sysprep/firstboot.sh start
Aug 07 05:13:05 localhost.localdomain firstboot.sh[394]: Scripts dir: /usr/lib/virt-sysprep/scripts
Aug 07 05:13:05 localhost.localdomain firstboot.sh[394]: === Running /usr/lib/virt-sysprep/scripts/0001-sleep-10--echo-reloading--systemctl-daemon-reload--echo-reloaded--sleep-300 ===
Aug 07 05:13:15 localhost.localdomain firstboot.sh[394]: reloading
Aug 07 05:13:15 localhost.localdomain firstboot.sh[394]: reloaded
Aug 07 05:14:45 localhost.localdomain systemd[1]: firstboot.service start operation timed out. Terminating.
Aug 07 05:14:45 localhost.localdomain systemd[1]: Failed to start firstboot.service.
Aug 07 05:14:45 localhost.localdomain systemd[1]: Unit firstboot.service entered failed state.
Aug 07 05:14:45 localhost.localdomain systemd[1]: firstboot.service failed.


The default timeout is 90 seconds and starts counting from the daemon-reload, so your "sleep 60" was too short.
Comment 11 Marius Vollmer 2015-08-07 05:18:49 EDT
I am debating with myself whether this could possibly count as a bug in systemd. Wdyt?
Comment 12 Richard W.M. Jones 2015-08-10 04:06:56 EDT
I get slightly different behaviour.  The firstboot process seems to be
killed by the reload, or else cannot produce any output.

[    8.598598] firstboot.sh[394]: === Running /usr/lib/virt-sysprep/scripts/0001-sleep-10--echo-reloading--systemctl-daemon-reload--echo-relo ===
[   18.616499] firstboot.sh[394]: reloading
[   18.682547] systemd-sysv-generator[723]: Overwriting existing symlink /run/systemd/generator.late/netconsole.service with real service
[   18.688280] systemd-sysv-generator[723]: Overwriting existing symlink /run/systemd/generator.late/network.service with real service
[   18.732751] firstboot.sh[394]: reloaded

(and no further output even after about 10 minutes)

With:

$ virt-builder fedora-22 \
    --firstboot-command 'sleep 10; echo reloading; systemctl daemon-reload; echo reloaded; sleep 120; echo finished'

I get:

[    8.666625] firstboot.sh[397]: === Running /usr/lib/virt-sysprep/scripts/0001-sleep-10--echo-reloading--systemctl-daemon-reload--echo-relo ===
[   18.682183] firstboot.sh[397]: reloading
[   18.747675] systemd-sysv-generator[724]: Overwriting existing symlink /run/systemd/generator.late/netconsole.service with real service
[   18.752127] systemd-sysv-generator[724]: Overwriting existing symlink /run/systemd/generator.late/network.service with real service
[   18.798305] firstboot.sh[397]: reloaded

(and no further output even after about 5 minutes)

Since we should be using identical versions of systemd (in the image) I
don't think we have a clear, reliable reproducer yet.
Comment 13 Marius Vollmer 2015-08-10 06:30:54 EDT
(In reply to Richard W.M. Jones from comment #12)
> I get slightly different behaviour.  The firstboot process seems to be
> killed by the reload, or else cannot produce any output.

I think we are seeing the same things, actually.  Please log into the VM and execute

    # systemctl status firstboot -l

to monitor the state of firstboot.service.

This will show the output of firstboot, plus everything that systemd has to say about it, such as:

    Aug 07 05:14:45 localhost.localdomain systemd[1]: firstboot.service start operation timed out. Terminating.
Comment 14 Marius Vollmer 2015-08-10 06:32:36 EDT
(In reply to Marius Vollmer from comment #13)

> This will show the output of firstboot, plus everything that systemd has to
> say about it, [...]

To clarify this further, the messages from systemd about firstboot.service do not show up on the console during boot, but the output of firstboot.service does.  I think that explains the difference of what you see vs what I have copy/pasted.
Comment 15 Richard W.M. Jones 2015-08-11 12:05:03 EDT
I agree, and I think this is a systemd bug.
Comment 16 Richard W.M. Jones 2015-08-11 12:07:36 EDT
The unit file is:

-----
[Unit]
Description=virt-sysprep firstboot service
After=network.target
Before=prefdm.service

[Service]
Type=oneshot
ExecStart=/usr/lib/virt-sysprep/firstboot.sh start
RemainAfterExit=yes
StandardOutput=journal+console
StandardError=inherit

[Install]
WantedBy=default.target
-----

The exec'd script calls dnf, and package installed by dnf calls
systemd daemon-reload.

According to my understanding, the Type=oneshot script should
not time out, but it seems to in this case.

I observed this with:

systemd-219-13.fc22.i686
Comment 17 Richard W.M. Jones 2015-08-11 12:08:54 EDT
> The exec'd script calls dnf, and package installed by dnf calls
> systemd daemon-reload.

systemctl daemon-reload
Comment 18 Marius Vollmer 2015-08-11 13:39:37 EDT
(In reply to Richard W.M. Jones from comment #15)
> I agree, and I think this is a systemd bug.

One important piece of the puzzle is that the firstboot.service unit file is removed before "systemctl daemon-reload".

As mentioned in comment 6, preventing the removal of the unit file will stop the timeout from triggering, as expected.
Comment 19 Richard W.M. Jones 2015-08-11 15:28:18 EDT
As far as I know, the firstboot unit file is *not* removed.  Or
at least, in the tests I have been conducting it is not removed.

$ virt-builder fedora-22 --firstboot-command 'echo hello' --root-password password:123456

$ qemu-kvm -nodefaults -display none -hda fedora-22.img -m 1024 -serial stdio
[ guest boots ]
...
[    9.343888] firstboot.sh[395]: /usr/lib/virt-sysprep/firstboot.sh start
[    9.352713] firstboot.sh[395]: Scripts dir: /usr/lib/virt-sysprep/scripts
[    9.390065] firstboot.sh[395]: === Running /usr/lib/virt-sysprep/scripts/0001-echo-hello ===
[    9.395240] firstboot.sh[395]: hello

Fedora release 22 (Twenty Two)
Kernel 4.0.4-301.fc22.x86_64 on an x86_64 (ttyS0)

localhost login: root
Password: 
[root@localhost ~]# ls -l /usr/lib/virt-sysprep/
total 8
-rw-r--r--. 1 root root  276 Aug 11 15:25 firstboot.service
-rwxr-xr-x. 1 root root 1099 Aug 11 15:25 firstboot.sh
drwxr-xr-x. 2 root root    6 Aug 11 15:25 scripts
drwxr-xr-x. 2 root root    6 Aug 11 15:25 scripts-done
[root@localhost ~]# ls -l /etc/systemd/system/default.target.wants/
total 0
lrwxrwxrwx. 1 root root 39 Aug 11 15:25 firstboot.service -> /usr/lib/virt-sysprep/firstboot.service
Comment 20 Marius Vollmer 2015-08-12 03:19:43 EDT
(In reply to Richard W.M. Jones from comment #19)
> As far as I know, the firstboot unit file is *not* removed.

(Aha!  I think we are getting closer.)

The files and the symlink are indeed still there, but systemd will considers it "not-found" after a daemon-reload.

So the core of this issue seems to be that the symlink in /etc/systemd/system/default.target.wants/ does not point into any of the directories that systemd looks into for unit files.  Systemd seems to treat this inconsistently: On boot, the unit file is loaded, on daemon-reload, it is not.


$ virt-builder fedora-22 --firstboot-command 'echo hello' --root-password password:123456

$ qemu-kvm -nodefaults -display none -hda fedora-22.img -m 1024 -serial stdio
[ guest boots ]
localhost login: root
Password: 

[root@localhost ~]# systemctl status firstboot
● firstboot.service - virt-sysprep firstboot service
   Loaded: loaded (/usr/lib/virt-sysprep/firstboot.service)
   [...]

[root@localhost ~]# systemctl daemon-reload

[root@localhost ~]# systemctl status firstboot
● firstboot.service
   Loaded: not-found (Reason: No such file or directory)
   [...]
Comment 21 Jóhann B. Guðmundsson 2015-08-12 04:45:10 EDT
That unit file should reside in /usr/lib/systemd/system/ directory not /usr/lib/virt-sysprep/ and simlinks should not be followed outside any of the checked directories for unit files. If does that it's a bug.

Any particular reason you are placing that unit file in /usr/lib/virt-sysprep/ ?
Comment 22 Richard W.M. Jones 2015-08-12 05:34:46 EDT
(In reply to Jóhann B. Guðmundsson from comment #21)
> That unit file should reside in /usr/lib/systemd/system/ directory not
> /usr/lib/virt-sysprep/ and simlinks should not be followed outside any of
> the checked directories for unit files. If does that it's a bug.

It seems like it is (some of the time) following such a symlink.

> Any particular reason you are placing that unit file in
> /usr/lib/virt-sysprep/ ?

Not especially.  I'm going to move it to see if that helps.
Comment 23 Richard W.M. Jones 2015-08-12 05:47:33 EDT
I pushed this which fixes the problem from the libguestfs p.o.v:

https://github.com/libguestfs/libguestfs/commit/f3a7cf994b5c7350c32fd6011c4224d81f7ee86a

However there may still be a systemd issue here: it seems to
handle the unit file inconsistently, recognizing it at boot
but then not recognizing it after systemctl daemon-reload.
Comment 24 Jóhann B. Guðmundsson 2015-08-12 06:53:41 EDT
Could be the systemd implementation in dracut doing some voodoo or shipping some old fedora firstboot unit or script etc. Plenty of reason why that might work ( it should not read firstboot from  /usr/lib/virt-sysprep/ directory )  

Now for another question why aren't you using the systemd built in firstboot feature (  man systemd-firstboot ) rather then shipping your own as in is it not better we tried to fix any limitation systemd firsboot might have instead of having a semi functional re-implementation of it somewhere else?

What are you doing which it does not do?
Comment 25 Richard W.M. Jones 2015-08-12 07:34:51 EDT
"firstboot" in this context really means "next boot" (especially
for virt-customize).  Firstboot in systemd is really the first boot,
which is not what we want.  But yes having a "only once at next boot"
implementation in systemd would be useful.
Comment 26 Jóhann B. Guðmundsson 2015-08-12 11:32:06 EDT
Hmm so if systemd can count "boots" and systemd-firstboot could be told to run on spesific boot like boot nr.2 or nr.4 etc ( or after next reboot ) then that would solve that problem? 

What are you doing in the "second" boot that you cannot accomplish in the first one? ( relates to if that additional functionality belongs natively in machinectl instead )
Comment 27 Richard W.M. Jones 2015-08-12 11:43:44 EDT
If there was a way to get the current boot nr from the filesystem
(without running some program) then yes.

virt-customize and virt-builder allow you to install arbitrary
scripts to run on first boot (really next boot in the case of
virt-customize).  This works on Windows too.
http://libguestfs.org/virt-builder.1.html#first-boot-scripts
Comment 28 Lennart Poettering 2016-02-10 09:39:54 EST
Should be fixed with current git.
Comment 29 Mike McCune 2016-03-28 19:38:32 EDT
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune@redhat.com with any questions
Comment 30 Fedora End Of Life 2016-07-19 13:23:17 EDT
Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Note You need to log in before you can comment on or make changes to this bug.