Bug 1719057

Summary: Installer boot fails if any option requiring network access during initramfs phase is used
Product: [Fedora] Fedora
Reporter: Adam Williamson <awilliam>
Component: dracut
Assignee: dracut-maint-list
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rawhide
CC: bgalvani, bgilbert, dan, dcbw, dracut-maint-list, fgiudici, gnome-sig, jkonecny, john.j5live, jonathan, lkundrak, mclasen, me, mkolman, rhughes, robatino, rstrode, sandmann, yaneti, zbyszek
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard: openqa
Last Closed: 2019-07-03 15:22:29 UTC
Type: Bug
Bug Blocks: 1644937    

Description Adam Williamson 2019-06-10 23:26:50 UTC
In openQA testing of yesterday's Rawhide compose, all the kickstart tests failed. So did the tests that use an updates image that is hosted on a network server. In each case, the test failed because the system failed to boot to the installer, instead booting to the dracut rescue prompt.

I think what's going on here is that any scenario which requires the network to be brought up during the initramfs phase - which includes the use of a kickstart or updates image retrieved over the network - causes the boot to fail.

Here are the failed tests:

https://openqa.fedoraproject.org/tests/410060
https://openqa.fedoraproject.org/tests/410063
https://openqa.fedoraproject.org/tests/410048
https://openqa.fedoraproject.org/tests/410008
https://openqa.fedoraproject.org/tests/410265
https://openqa.fedoraproject.org/tests/410264

I'm blaming this on NetworkManager because the tests passed on the previous compose (20190604.n.0) and neither anaconda nor dbus nor any other obvious suspect changed in the 0609.n.1 compose. NetworkManager *did* change, and the changelog looks a bit suspicious for this bug:

  * Tue Jun 04 2019 Lubomir Rintel <lkundrak> - 1:1.20.0-0.2
  - Update the 1.20.0 snapshot
  - Re-enable the initrd generator

Those sure look like relevant changes to me.

This is pretty easy to reproduce: just download an installer image from the 20190609.n.1 compose - e.g. https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20190609.n.1/compose/Server/x86_64/iso/Fedora-Server-dvd-x86_64-Rawhide-20190609.n.1.iso - boot it, and add a kickstart or updates.img from a network server to the boot options. e.g. add 'inst.ks=http://fedorapeople.org/groups/qa/kickstarts/firewall-configured-net.ks' . That should be enough to trigger the bug.
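
As a rough illustration of which boot command lines fall into the affected category, here is a minimal Python sketch. It is not how anaconda or dracut decide this. The function name, the list of options (only inst.ks and inst.updates, the two cases named in this report) and the list of URL schemes are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Assumption: only the two options named in this report are checked;
# other installer options may also need the network in the initramfs.
NETWORK_OPTIONS = ("inst.ks", "inst.updates")

# Assumption: URL schemes treated here as "retrieved over the network".
NETWORK_SCHEMES = ("http", "https", "ftp", "nfs")


def needs_initramfs_network(cmdline: str) -> bool:
    """Return True if a checked option points at a network location."""
    for arg in cmdline.split():
        key, sep, value = arg.partition("=")
        if not sep or key not in NETWORK_OPTIONS:
            continue
        if urlparse(value).scheme in NETWORK_SCHEMES:
            return True
    return False


if __name__ == "__main__":
    ks = "inst.ks=http://fedorapeople.org/groups/qa/kickstarts/firewall-configured-net.ks"
    print(needs_initramfs_network("quiet " + ks))
    print(needs_initramfs_network("quiet rhgb"))
```

A command line containing the inst.ks example above is classified as needing the network, while a plain boot without such options is not.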

Proposing as a Beta blocker as a violation of "The installer must be able to use all available kickstart delivery methods" - https://fedoraproject.org/wiki/Fedora_30_Beta_Release_Criteria#Kickstart_delivery .

Comment 1 Adam Williamson 2019-06-11 21:16:28 UTC
Looking at the journal from the rescue shell, there seems to be a cycle of NetworkManager starting up, running into three dbus errors because dbus is not running (I'm not sure whether that's expected or not in the initramfs environment), exiting with the network device in state 'disconnected', then restarting and going through the whole cycle again. It does this hundreds of times. The end of the process looks like this:

device (ens3): carrier: link connected
manager: (ens3): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
device (ens3): state change: unmanaged -> unavailable
sleep-monitor-sd: failed to acquire D-Bus proxy: Could not connect: No such file or directory
firewall: could not connect to system D-Bus (Could not connect: No such file or directory)
ifcfg-rh: dbus: couldn't initialize system bus: Could not connect: No such file or directory
device (ens3): state change: unavailable -> disconnected
manager: startup complete
quitting now that startup is complete
exiting (success)

Then a half second later it starts up again:

NetworkManager (version 1.20.0-0.2.fc31) is starting... (after a restart)

and goes through the same process.
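
To put a number on the restart loop, a small Python sketch like the following could be run over a saved copy of the journal. It is only a sketch: the function name is made up for this example, and it assumes the journal text contains the message fragments quoted above ("NetworkManager (version ...) is starting..." and "exiting (success)").

```python
def count_nm_cycles(journal_lines):
    """Count NetworkManager startups and clean exits in journal text.

    Assumption: matching on the message fragments quoted in this report
    is enough to identify the relevant lines.
    """
    starts = 0
    clean_exits = 0
    for line in journal_lines:
        if "NetworkManager (version" in line and "is starting" in line:
            starts += 1
        elif "exiting (success)" in line:
            clean_exits += 1
    return starts, clean_exits


if __name__ == "__main__":
    sample = [
        "manager: startup complete",
        "quitting now that startup is complete",
        "exiting (success)",
        "NetworkManager (version 1.20.0-0.2.fc31) is starting... (after a restart)",
    ]
    print(count_nm_cycles(sample))
```

If both counts are in the hundreds and roughly equal, that matches the start/exit/restart cycle described above.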

Comment 2 Lubomir Rintel 2019-06-12 13:05:44 UTC
Thanks for the report. The fix for dracut is here: https://github.com/dracutdevs/dracut/pull/578

If the dracut maintainers are willing to review and apply the patch, I'd prefer that we not revert the change in NetworkManager.

Comment 3 Adam Williamson 2019-06-12 15:10:12 UTC
It just so happens I'm a proven packager. Soo...;)

Comment 4 Adam Williamson 2019-06-12 15:23:58 UTC
https://koji.fedoraproject.org/koji/taskinfo?taskID=35504593

Let's see how the next compose goes.

Comment 5 Lubomir Rintel 2019-06-14 07:13:46 UTC
(In reply to Adam Williamson from comment #3)
> It just so happens I'm a proven packager. Soo...;)

Ah, okay, me too, but I thought this sort of thing should get an upstream ack.
Guess this is all right, thanks for doing that.

Comment 6 Adam Williamson 2019-06-14 15:53:20 UTC
eh, if upstream doesn't like it he can take it out again. :P I like composes that work!

Unfortunately we're not getting any composes at all ATM, I think partly because of the libgit2 module drama, so don't know if this is fixed yet.

Comment 7 Dan Horák 2019-07-01 15:57:42 UTC
I suspect bug #1725872 might be another variant of this one ...

Comment 8 Adam Williamson 2019-07-03 15:22:29 UTC
This one was actually fixed by the change I made back on June 12, thanks for the reminder to close it :)