Rawhide kernels after 4.17.0-0.rc2 cause nightly rawhide composes to fail when trying to create the base cloud image.
This is created with oz.
https://koji.fedoraproject.org/koji/taskinfo?taskID=26610483 (the top level task)
which in turn gives:
https://koji.fedoraproject.org/koji/taskinfo?taskID=26610493 (the blocking x86_64 task)
you can see from the screenshot there that the image boots to do the install, but hangs at:
clocksource: tsc: mask: 0xfffffffffffffffff max_cycles: 0x212738e3f90, max_idle_ns: 440795225370 ns
The install fails because it hangs there too long.
Happy to try and gather more info on this.
I've attached a log with the kernel command line "debug initcall_debug ignore_loglevel console=tty0 console=ttyS0,115200n8 systemd.log_level=debug systemd.log_target=console init=/bin/bash" added.
Do note that after it stops printing, I can actually enter characters, and those are printed back, so are newlines etc, but I see no output other than the characters I enter by hand.
If I use the rawhide vmlinuz, it results in this same behaviour with either rawhide or f28 initramfs.
With f28 vmlinuz, it does boot correctly with either rawhide or f28 initramfs.
Created attachment 1428608 [details]
This is reproducible on my system with the initrd and vmlinuz of https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20180426.n.0/compose/Everything/x86_64/os/isolinux/
QEMU command: qemu-system-x86_64 --initrd initrd.img --kernel vmlinuz -m 2048 -append "console=tty0 console=ttyS0,115200n8 debug rd.debug initcall_debug ignore_loglevel" -machine q35,accel=kvm
I just found out that if I leave the specified qemu command running, there is a chance that the system actually continues to boot after waiting for more than 90 seconds.
Not every time, but I'd guess about 10% of the time?
Created attachment 1428609 [details]
Attached is a log of a QEMU session that (after about 110 seconds) did actually continue to boot.
I cut it short at the dracut loop, since there's no disk mounted that it could ever use as sysroot (which is expected, as I didn't mount one).
Laura references https://marc.info/?l=linux-kernel&m=152471887030737&w=2, and if I keep pressing my enter key (or any) on the VGA display of qemu, it does indeed boot within a few seconds.
So this does seem like the same bug.
Marking a F29 Beta blocker per "Bugs which entirely prevent the composition of one or more of the release-blocking images required to be built for a currently-pending (pre-)release"
That makes it an *automatic* blocker. So setting Accepted. https://fedoraproject.org/wiki/QA:SOP_blocker_bug_process#Automatic_blockers
For the record, nirik has untagged the latest kernel build so we should get a successful compose. This isn't a long-term solution, though (for a start, every new kernel package build will get tagged again...)
So I poked at this a little more to see what was going on. init is definitely stuck on wait_for_random_bytes. systemd's getrandom call is non-blocking but it links against gcrypt which seems to be making a blocking call. gcrypt looks to be used for hmac calcuation for the journal (not 100% if that's the cause, still playing with stuff). The upstream kernel discussion seems to be making okay progress so I think we should focus that for now.
I have an oz patch from imcleod that should hopefully add rng passthough to the oz vm's. I've sent a freeze break out to apply it to our image builders. Hopefully this will work around the issue for now.
See also bug #1572944 for a similar thing on 4.16.4.
Debugged with Patrick a little more and he found
https://src.fedoraproject.org/rpms/libgcrypt/blob/master/f/libgcrypt-1.6.2-fips-ctor.patch . The default compose image comes with fips so what's happening is when systemd goes to open the log, it calls libgcrypt initialization which will initialize the random pool and make the blocking call. We have a very good handle on what's happening but no perfect solution at the moment.
Update for rawhide, we have inlined the virtio_rng. Since it isn't a module, it is there early enough. This is sufficient for the compose, but it could still cause issues testing in environments where a virtio_rng isn't available to the guest. Certainly not closing this out yet as there needs to be a better solution.
Hitting this when attempting to install Rawhide on hardware, reproducible on all arches(tested on armhfp, aarch64 and x86_64). Qemu installs work as expected.
Hardware installs are working with Fedora-Rawhide-20180528.n.0 which includes 4.17.0-0.rc6.git3.1.fc29
I am not sure I observe the same, but for me, the "hardly ever finishes
booting" issue started with kernel-4.17.0-0.rc7.git2.1.fc29.x86_64
(4.17.0-0.rc7.git1.1.fc29.x86_64 works fine).
As an obstacle in the boot process, I've identified something related to
[ 3.868366] systemd-udevd (390) used greatest stack depth: 12424 bytes left
[ 275.416359] random: crng init done
[ 275.417044] random: 7 urandom warning(s) missed due to ratelimiting
[ 332.770711] kworker/dying (213) used greatest stack depth: 12232 bytes left
<booting stuck again>
The above is the transcription of the unmodified excerpt from on-screen
messages from boot (I've removed "quiet" from the command-line).
No idea why CRNG initialization took so long (4.5+ minutes) and how
long it would take for system to actually come up (if ever), I simply
rebooted with Ctrl+Alt+Del combo, which the system responded to, and
booted the previous kernel (see above).
It's also possible that LUKS password screen relies on RNG being
already initialized, so the situation may incur a simple deadlock.
re [comment 17]:
Stated issue went away with update to kernel-4.18.0-0.rc0.git2.1.fc29
(no other possibly relevant packages like dracut or systemd got updated;
when rebooted into problematic kernel afterwards, the problem persists).
Laura, Justin, are you aware of anything that needs resolving for F29 Beta here? This bug is still on the F29 Beta blocker list. Thanks!
This should be closed, it is fixed in userspace as well.
This continues to be a problem. I just updated to kernel 4.18.9-200.fc28.x86_64 and attempted to boot about eight times before my random keystrokes met the entrpy criteria -- very frustrating!!
In short, this issue occurs nearly every time I process a kernel update and this is still an issue with kernel 4.18.9-200.fc28.x86_64.
Please let me know of a workaround. Also, should I open a new bug or has one already been opened?
Most of the work is already being tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1572944 , can you put details about your setup there?
are you running rngd? That should provide you all the entropy you need to the kernel
I am afraid that rngd won't help as that is not started yet when the boot process gets stuck.
(In reply to Laura Abbott from comment #23)
> Most of the work is already being tracked at
> https://bugzilla.redhat.com/show_bug.cgi?id=1572944 , can you put details
> about your setup there?
That ticket is closed. Where is a ticket that is actually being worked?
I tried to report my details in bug 1593843, but I get an access denied message: "You are not authorized to access bug #1593843."
Is there an active bug where I can report this issue with Kernel 4.18.9-200.fc28.x86_64?
Because this bug is closed, I submitted the following:
*** Bug 1759603 has been marked as a duplicate of this bug. ***