Bug 1369794

Summary: anaconda can no longer enable non-systemd services (so current F25 and Rawhide Cloud images don't bring up networking)
Product: [Fedora] Fedora
Component: anaconda
Version: 25
Status: CLOSED EOL
Severity: urgent
Priority: unspecified
Reporter: Lukas Brabec <lbrabec>
Assignee: Anaconda Maintenance Team <anaconda-maint-list>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: anaconda-maint-list, awilliam, dennis, g.kaviyarasu, jonathan, kevin, kparal, lbrabec, lnykryn, mark, pbrobinson, robatino, sbueno, vanmeeuwen+fedora
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-12-12 10:24:30 UTC
Type: Bug

Description Lukas Brabec 2016-08-24 12:23:44 UTC
Using testcloud to create a test instance, I wasn't able to obtain an IP address with the image Fedora-Cloud-Base-25_Alpha-1.1.x86_64.qcow2. Testcloud ended with the exception: "Instance test25 has failed to boot in 30 seconds".

Immediately after instance creation, I tried to connect to the cloud instance using `virsh console`, with no success: the boot process produced no output.

I also tried creating a new VM in virt-manager. While the VM boots with the F24 cloud image, a VM with Fedora-Cloud-Base-25_Alpha-1.1.x86_64.qcow2 is stuck on:
"Booting From Hard Disk..."

Comment 1 Fedora Blocker Bugs Application 2016-08-24 12:32:34 UTC
Proposed as a Blocker for 25-alpha by Fedora user lbrabec using the blocker tracking app because:

 This bug can be a violation of alpha criterion:
Supported cloud environments: Release-blocking cloud images must boot in the Fedora OpenStack Cloud and in Amazon EC2.

While I tested this only locally with testcloud, I think this bug should be investigated further.

Comment 2 Adam Williamson 2016-08-24 14:37:13 UTC
It failed in autocloud too:

https://apps.fedoraproject.org/autocloud/jobs/451/output

Comment 3 Adam Williamson 2016-08-24 14:39:54 UTC
per https://apps.fedoraproject.org/autocloud/jobs/?family=b&arch=&image_type=qcow2&status=f , it's been failing in F25 and Rawhide approximately forever. There are only successful results for F24.

Comment 4 Kevin Fenzi 2016-08-24 14:48:31 UTC
Looking at the build logs this might be related to trying to enable 'network' (which is not a systemd unit): 

...
00:50:43,595 INFO program: Running... systemctl enable network --root /mnt/sysimage
00:50:43,619 INFO program: network.service is not a native service, redirecting to systemd-sysv-install.
00:50:43,619 INFO program: Executing: /usr/lib/systemd/systemd-sysv-install --root=/mnt/sysimage enable network
00:50:43,620 INFO program: Failed to execute /usr/lib/systemd/systemd-sysv-install: No such file or directory
00:50:43,620 DEBUG program: Return code: 1
00:50:43,621 DEBUG anaconda: running handleException
00:50:43,622 CRIT anaconda: Traceback (most recent call last):#012#012  File "/usr/lib64/python3.5/site-packages/pyanaconda/threads.py", line 251, in run#012    threading.Thread.run(self, *args, **kwargs)#012#012  File "/usr/lib64/python3.5/threading.py", line 862, in run#012    self._target(*self._args, **self._kwargs)#012#012  File "/usr/lib64/python3.5/site-packages/pyanaconda/install.py", line 77, in doConfiguration#012    ksdata.services.execute(storage, ksdata, instClass)#012#012  File "/usr/lib64/python3.5/site-packages/pyanaconda/kickstart.py", line 1664, in execute#012    iutil.enable_service(svc)#012#012  File "/usr/lib64/python3.5/site-packages/pyanaconda/iutil.py", line 787, in enable_service#012    raise ValueError("Error enabling service %s: %s" % (service, ret))#012#012ValueError: Error enabling service network: 1
00:50:44,117 DEBUG anaconda: Gtk cannot be initialized
00:50:44,117 DEBUG anaconda: In a non-main thread, sending a message with exception data
00:50:44,118 INFO anaconda: Thread Done: AnaConfigurationThread (140307388299008)
00:50:44,783 DEBUG anaconda: running handleException
00:50:44,784 CRIT anaconda: Traceback (most recent call last):#012#012  File "/usr/lib64/python3.5/site-packages/pyanaconda/threads.py", line 251, in run#012    threading.Thread.run(self, *args, **kwargs)#012#012  File "/usr/lib64/python3.5/threading.py", line 862, in run#012    self._target(*self._args, **self._kwargs)#012#012  File "/usr/lib64/python3.5/site-packages/pyanaconda/install.py", line 77, in doConfiguration#012    ksdata.services.execute(storage, ksdata, instClass)#012#012  File "/usr/lib64/python3.5/site-packages/pyanaconda/kickstart.py", line 1664, in execute#012    iutil.enable_service(svc)#012#012  File "/usr/lib64/python3.5/site-packages/pyanaconda/iutil.py", line 787, in enable_service#012    raise ValueError("Error enabling service %s: %s" % (service, ret))#012#012ValueError: Error enabling service network: 1
00:50:44,786 DEBUG anaconda: Gtk cannot be initialized
00:50:44,786 DEBUG anaconda: In the main thread, running exception handler
Waiting for factory-build-288d1c60-0e4a-4bad-a58c-02cf7e73d61d to finish installing, 6910/7200
...

So, perhaps some anaconda change related to non systemd unit file enabling?
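
The tracebacks in the log excerpt above are hard to read because every newline shows up as `#012`. That looks like the three-digit octal escape that syslog daemons such as rsyslog substitute for control characters. A minimal sketch for turning such a line back into a readable traceback follows; the helper name `decode_syslog_escapes` is made up for this illustration, and the sample is abridged from the log above.

```python
import re

# "#" followed by exactly three octal digits, e.g. "#012" for a newline.
_ESCAPE = re.compile(r"#([0-7]{3})")


def decode_syslog_escapes(line):
    """Replace #ooo octal escapes with the characters they stand for.

    Hypothetical helper for reading logs. A literal "#012" that was part of
    the original message would be decoded too, so treat the output as an aid
    for reading rather than an exact reconstruction.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), line)


# Abridged from the CRIT line in the build log quoted above.
sample = (
    "CRIT anaconda: Traceback (most recent call last):#012#012"
    '  File "/usr/lib64/python3.5/site-packages/pyanaconda/iutil.py", '
    "line 787, in enable_service#012"
    '    raise ValueError("Error enabling service %s: %s" % (service, ret))'
    "#012#012"
    "ValueError: Error enabling service network: 1"
)

print(decode_syslog_escapes(sample))
```

The last line of the decoded output is the actual error, `ValueError: Error enabling service network: 1`, where the trailing `1` is the return code logged just before the traceback.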

Comment 5 Adam Williamson 2016-08-24 15:04:17 UTC
well:

00:50:43,620 INFO program: Failed to execute /usr/lib/systemd/systemd-sysv-install: No such file or directory

seems to be the problem.

Comment 6 Adam Williamson 2016-08-24 15:05:27 UTC
yeah, it's obvious if you compare to an f24 log. f24:

05:20:35,441 INFO program: Running... systemctl enable network
05:20:35,453 INFO program: network.service is not a native service, redirecting to systemd-sysv-install
05:20:35,453 INFO program: Executing /usr/lib/systemd/systemd-sysv-install enable network
05:20:35,454 DEBUG program: Return code: 0

f25:

00:50:43,595 INFO program: Running... systemctl enable network --root /mnt/sysimage
00:50:43,619 INFO program: network.service is not a native service, redirecting to systemd-sysv-install.
00:50:43,619 INFO program: Executing: /usr/lib/systemd/systemd-sysv-install --root=/mnt/sysimage enable network
00:50:43,620 INFO program: Failed to execute /usr/lib/systemd/systemd-sysv-install: No such file or directory
00:50:43,620 DEBUG program: Return code: 1

Comment 7 Adam Williamson 2016-08-24 15:39:43 UTC
Aha. I think I see it. Note the difference in the commands:

05:20:35,441 INFO program: Running... systemctl enable network
00:50:43,595 INFO program: Running... systemctl enable network --root /mnt/sysimage

in F24, this was run without --root (thus, we can presume, in a chroot to the installed system, or else it wouldn't have worked). In F25 it's run with --root .

Thus in F24 we'll wind up using systemd-sysv-install from the installed system chroot too, but in F25 we'll be using the one from the installer environment. Only it's not there in the installer environment, because lorax runtime-cleanup.tmpl has this:

## no services to turn on/off (keep the /etc/init.d link though)
removefrom chkconfig --allbut /etc/init.d

and systemd-sysv-install is in chkconfig. now we can tweak that bit of lorax so it doesn't strip systemd-sysv-install, but then it'll be interesting to see if this redirection from systemctl to systemd-sysv-install really works properly with --root...
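
To make the difference concrete, here is a minimal Python sketch of the two invocation styles. It is not anaconda's code: `build_enable_invocation` is a hypothetical helper, and the F24 chroot is the presumption stated above rather than something the log shows directly. Only the argument lists are taken from the two log lines.

```python
def build_enable_invocation(service, sysroot, use_root_flag):
    """Return (argv, chroot_dir) for enabling `service` in the installed system.

    Hypothetical helper for illustration only. chroot_dir is None when the
    command is meant to run directly in the installer environment.
    """
    if use_root_flag:
        # F25 style: systemctl runs in the installer environment and is
        # pointed at the target with --root. Any helper it redirects to
        # (systemd-sysv-install) is looked up in the installer environment.
        return ["systemctl", "enable", service, "--root", sysroot], None
    # F24 style (presumed): plain systemctl, executed inside a chroot of the
    # installed system, so helpers come from the installed system.
    return ["systemctl", "enable", service], sysroot


f24_argv, f24_chroot = build_enable_invocation(
    "network", "/mnt/sysimage", use_root_flag=False)
f25_argv, f25_chroot = build_enable_invocation(
    "network", "/mnt/sysimage", use_root_flag=True)

print("F24:", " ".join(f24_argv), "(chroot: %s)" % f24_chroot)
print("F25:", " ".join(f25_argv), "(chroot: %s)" % f25_chroot)
```

The point of returning the chroot directory separately is that it decides where systemd-sysv-install is resolved, which is exactly what differs between the two releases.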

Comment 8 Adam Williamson 2016-08-24 15:40:05 UTC
https://github.com/rhinstaller/anaconda/commit/412ca74154bef8ac232e5b3be820a182d77c30f6 is the commit that changed anaconda's behaviour, for the record.

Comment 9 Adam Williamson 2016-08-24 15:47:49 UTC
nirik points out that /usr/lib/systemd/systemd-sysv-install is just a symlink to /sbin/chkconfig , so we'd need to keep both of those in the installer environment. But here's a more pressing problem:

[adamw@adam etc]$ systemctl --root=/tmp/fakesys enable network
network.service is not a native service, redirecting to systemd-sysv-install.
Executing: /usr/lib/systemd/systemd-sysv-install --root=/tmp/fakesys enable network
--root=/tmp/fakesys: unknown option

i.e. I don't think we can rely on `systemctl --root` to work with non-systemd services.
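
One conceivable way for an installer to cope with this is to use `--root` only for services that have a native unit file in the target, and to fall back to running in a chroot for SysV scripts. The sketch below is purely an illustration of that idea, not what anaconda or any of the patches referenced in this bug do. The function names are hypothetical, and the directory list is a simplifying assumption based on the conventional Fedora layout (systemd searches more locations than these two).

```python
import os

# Assumed locations, relative to the target root. This is a simplification:
# systemd also looks in other directories (for example under /run).
UNIT_DIRS = ("etc/systemd/system", "usr/lib/systemd/system")
SYSV_DIR = "etc/rc.d/init.d"


def service_kind(sysroot, service):
    """Classify `service` in `sysroot` as "native", "sysv" or "unknown".

    Hypothetical helper. lexists() is used so that symlinks inside the
    target are not followed out into the host filesystem.
    """
    for unit_dir in UNIT_DIRS:
        unit = os.path.join(sysroot, unit_dir, service + ".service")
        if os.path.lexists(unit):
            return "native"
    if os.path.lexists(os.path.join(sysroot, SYSV_DIR, service)):
        return "sysv"
    return "unknown"


def can_use_root_flag(sysroot, service):
    """Only native units are safe to enable with `systemctl --root` here,
    since systemd-sysv-install rejected --root in the test above."""
    return service_kind(sysroot, service) == "native"
```

With this, `can_use_root_flag("/mnt/sysimage", "network")` would be false on a system where `network` exists only as an init script, and the caller would need some other route, such as a chroot.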

Comment 10 Adam Williamson 2016-08-24 15:53:45 UTC
setting back to anaconda, since just fixing the lorax stripping wouldn't be enough here. I think for short term we may simply have to revert the anaconda commit.

Comment 11 Adam Williamson 2016-08-24 16:13:24 UTC
https://www.happyassassin.net/updates/1369794.0.img is an updates.img reverting the anaconda commit, for testing. I tested with https://www.happyassassin.net/ks/testsvc.ks , which just does:

services --enabled=network

and confirmed that indeed it crashes with a stock F25 installer image, works with the patch reverted via the updates image. It's just barely possible that reverting this would have non-obvious other consequences - for one, it probably renders https://github.com/rhinstaller/anaconda/commit/b35fe094bcd9f792bb8eb9e0ed3679c175f632fa moot - but I think it's probably our best short-term option to fix Cloud for Alpha. The 'proper' fix would, I guess, be to get chkconfig to support --root (so I cc'ed lnykryn), and of course then not strip it out of the installer.

Comment 12 Adam Williamson 2016-08-24 16:38:05 UTC
https://github.com/rhinstaller/anaconda/pull/749 should deal with this on the anaconda side, but sbueno isn't sure about doing another anaconda build for Alpha, so I will also see if we can work around this in the kickstart (by taking network out of the `services` line and just manually enabling it in %post).

Comment 13 Adam Williamson 2016-08-24 16:39:42 UTC
whoops, forgot to mention, https://www.happyassassin.net/updates/1369794.1.img includes the patch from https://github.com/rhinstaller/anaconda/pull/749 . I tested it and it seems to work fine.

Comment 14 Adam Williamson 2016-08-24 17:25:34 UTC
As an alternative to patching anaconda, https://pagure.io/fedora-kickstarts/pull-request/52 should work around this in fedora-kickstarts . I did not test it directly as I don't have a full compose chain set up here, but I did test the concept, with these three kickstarts:

https://www.happyassassin.net/ks/testsvc.ks (has `services --enabled=network`)
https://www.happyassassin.net/ks/testsvc2.ks (has `chkconfig network on` in %post instead)
https://www.happyassassin.net/ks/testsvc3.ks (has neither `services` line nor `chkconfig` in %post)

The first causes a current Fedora 25 installer to crash (unless you use one of the updates images). The second installs clean and has the network service enabled. The third installs clean and does not have the network service enabled. That's all as I'd expect.

Comment 15 Adam Williamson 2016-08-24 17:42:22 UTC
oh, obviously I'm +1 blocker on this, it violates the cited criterion.

Comment 16 Kevin Fenzi 2016-08-24 17:44:34 UTC
+1 blocker.

Comment 17 Adam Williamson 2016-08-24 17:46:43 UTC
That's +3, setting accepted. We have applied the fedora-kickstarts workaround for this, but I don't think we should mark the bug as fixed; rather, if that works, we should just drop the blocker status. The anaconda bug is still valid, we are just working around it.

Comment 18 Adam Williamson 2016-08-24 19:07:32 UTC
So this might turn out to have been a bit of a hijack...

With the kickstart workaround and a few other things we ran into along the way, we can do a Cloud base image compose where there's no crash of the post-install setup thread, so all the service enablement happens and %post happens:

https://koji.fedoraproject.org/koji/taskinfo?taskID=15365954

unfortunately, it still doesn't freaking *boot*. So it seems like the 'post-install setup thread crashes' bug wasn't actually causing the 'image doesn't boot' bug. (I was kinda figuring that the failure to do the %post workaround for #1147998 was what was causing the image not to be bootable, but apparently not).

So we may need to create another bug for the boot failure. This is still definitely a real anaconda bug, though.

Comment 19 Samantha N. Bueno 2016-08-24 19:40:47 UTC
I have a question: why is the onus on us to fix this? Why doesn't the network service migrate to using systemd after all these years?

I'm kind of starting to see this as unearthing the historical relics that need to be updated. systemd isn't new anymore.

Comment 20 Adam Williamson 2016-08-24 20:36:58 UTC
well, you'd be best asking lukas I guess, but I believe the network service is basically kinda resistant to systemd conversion. the main reason we still have it is for backward compatibility with all the use cases that rely on its behaviour, and I believe that making it into a systemd service would kind of unavoidably change its behaviour, at which point the value of having it is substantially diminished. but I'm not an expert on that, IMBW.

the onus isn't *necessarily* on anaconda to fix it; probably the 'best' fix would be to make chkconfig handle --root. that's why I filed a bug on that, and marked it as See Also: - it's https://bugzilla.redhat.com/show_bug.cgi?id=1369916 . but it does seem worth tracking this separately as it is possible to fix it on the anaconda side without chkconfig being fixed, if that turns out to be a problem for some reason. However, if that bug does get fixed, we could close this one right away, as anaconda would then work fine without changes.

I understand what you're saying, but in the real world I think we're _probably_ not going to get away with 'you can't work with sysv services any more, sorry'...at least not yet.

Comment 21 Adam Williamson 2016-08-24 21:55:44 UTC
oh, I did actually file a new bug for the syslinux boot fail even after this problem is worked around: https://bugzilla.redhat.com/show_bug.cgi?id=1369934 .

Comment 22 Samantha N. Bueno 2016-08-24 23:07:40 UTC
(In reply to Adam Williamson from comment #20)
> well, you'd be best asking lukas I guess, but I believe the network service
> is basically kinda resistant to systemd conversion. the main reason we still
> have it is for backward compatibility with all the use cases that rely on
> its behaviour, and I believe that making it into a systemd service would
> kind of unavoidably change its behaviour, at which point the value of having
> it is substantially diminished. but I'm not an expert on that, IMBW.

Ok, that's a fair point -- so then I pose the question to Lukas, about why the network service hasn't been migrated to use systemd. I just haven't kept up with a lot of the mail flying around in fedora-devel, so if that discussion ever took place there, I missed it.

I'm not sure which Lukas to needinfo here, since I can think of like three off the top of my head. :-/

> I understand what you're saying, but in the real world I think we're
> _probably_ not going to get away with 'you can't work with sysv services any
> more, sorry'...at least not yet.

Sure, "not yet" is fine and understandable, but I think we should consider deprecating old sysvinit style services then, so there is enough time for people to migrate them to our not-so-new-anymore default.

Comment 23 Samantha N. Bueno 2016-08-24 23:11:47 UTC
And I'm still voicing my dissent on this being a blocker. I see the criteria, but I need to bring up a few glaring points:

(a) The blocker criteria are arbitrary. We set those guidelines, therefore we can change them. We've also just ignored them before.

(b) I'm not sure how large the target audience is for Fedora cloud, especially in an alpha release. If that's a small group of people, this isn't worth it.

(c) The cloud images have been completely broken since the end of F24, and nobody noticed until today, the day before go/no-go. Not even the Cloud Working Group noticed.

Comment 24 Adam Williamson 2016-08-24 23:22:26 UTC
we never 'just ignore' the criteria. we sometimes change them on the fly, and we sometimes come up with extremely flexible interpretations of them, and we sometimes say "we can't block on this otherwise-blocker-worthy issue because we have no ability to fix it in any remotely reasonable time frame", but we never just *ignore* them. ;)

as the criteria stand this is quite clearly an automatic blocker. Cloud base is a release blocking image, and it completely fails to boot. There's really just no ambiguity there.

Whoever wants to be in charge of these things (it should be Cloud WG, but I hear rumblings that Cloud WG is obsolete or something) could declare that Cloud base is no longer a blocking image, at which point this magically ceases to be a blocker. Or FESCo or the Board or someone could say 'Cloud WG is clearly not doing a great job and thus none of their images are blocking any more'. I don't mind at all, personally, if whoever's responsible wants to do that, though it seems a bit like sharp practice. But so long as it's a blocking image, this bug is clearly a blocker.

We already worked around it in kickstarts, anyway, and have a new compose running which should work.

Comment 25 Adam Williamson 2016-08-24 23:22:56 UTC
oh, the Lukas in question is lnykryn. He's already in CC.

Comment 26 Lukáš Nykrýn 2016-08-25 06:39:51 UTC
There are a couple of historical reasons why we don't want to have a unit file for that. I also maintain chkconfig, and in the past I tried to implement --root, but it was "too ugly and nobody complained", so I did not finish it. But I will move it to my todo list again.

Comment 27 Kevin Fenzi 2016-08-25 14:40:45 UTC
There was also some move to change the cloud image to use systemd-networkd which would avoid this issue for that case at least.

Comment 28 Adam Williamson 2016-08-25 17:10:51 UTC
the workaround via kickstarts does appear to have worked in the Alpha-1.2 compose, so this issue is no longer release blocking.

Comment 29 Fedora End Of Life 2017-12-12 10:24:30 UTC
Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.