Bug 1572916

Summary: kernel after 4.17.0-0.rc2.git0.1.fc29 waits for random entropy on boot
Product: [Fedora] Fedora Reporter: Kevin Fenzi <kevin>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: airlied, awilliam, bskeggs, ewk, hdegoede, ichavero, itamar, jarodwilson, jforbes, jglisse, john.j5live, jonathan, josef, jpokorny, kernel-maint, labbott, linville, mattdm, mattia.verga, mchehab, mjg59, mkolman, morrison12, nhorman, noloader, omosnace, pbrobinson, ppisar, puiterwijk, pwhalen, robatino, sheldon.corey, smconvey, steved, tmraz, wehuang
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: AcceptedBlocker
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1593843 (view as bug list) Environment:
Last Closed: 2018-09-09 15:56:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1517011    
Attachments:
Description Flags
kernel-initcall-log
none
kernel-log-booted none

Description Kevin Fenzi 2018-04-28 17:56:35 UTC
Rawhide kernels after 4.17.0-0.rc2 cause nightly rawhide composes to fail when trying to create the base cloud image. 

This is created with oz. 

for 2018-04-28: 

https://koji.fedoraproject.org/koji/taskinfo?taskID=26610483 (the top level task)

which in turn gives: 

https://koji.fedoraproject.org/koji/taskinfo?taskID=26610493 (the blocking x86_64 task)

you can see from the screenshot there that the image boots to do the install, but hangs at:

clocksource: tsc: mask: 0xfffffffffffffffff max_cycles: 0x212738e3f90, max_idle_ns: 440795225370 ns

The install fails because it hangs there too long.

Happy to try and gather more info on this.

Comment 1 Patrick Uiterwijk 2018-04-29 23:23:00 UTC
I've attached a log with the kernel command line "debug initcall_debug ignore_loglevel console=tty0 console=ttyS0,115200n8 systemd.log_level=debug systemd.log_target=console init=/bin/bash" added.

Do note that after it stops printing, I can actually enter characters, and those are printed back, so are newlines etc, but I see no output other than the characters I enter by hand.

If I use the rawhide vmlinuz, it results in this same behaviour with either rawhide or f28 initramfs.
With f28 vmlinuz, it does boot correctly with either rawhide or f28 initramfs.

Comment 2 Patrick Uiterwijk 2018-04-29 23:23:36 UTC
Created attachment 1428608 [details]
kernel-initcall-log

Comment 3 Patrick Uiterwijk 2018-04-29 23:36:19 UTC
This is reproducible on my system with the initrd and vmlinuz of https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20180426.n.0/compose/Everything/x86_64/os/isolinux/

QEMU command: qemu-system-x86_64 --initrd initrd.img --kernel vmlinuz -m 2048 -append "console=tty0 console=ttyS0,115200n8 debug rd.debug initcall_debug ignore_loglevel" -machine q35,accel=kvm

Comment 4 Patrick Uiterwijk 2018-04-30 00:02:16 UTC
I just found out that if I leave the specified qemu command running, there is a chance that the system actually continues to boot after waiting for more than 90 seconds.
Not every time, but I'd guess about 10% of the time?

Comment 5 Patrick Uiterwijk 2018-04-30 00:04:05 UTC
Created attachment 1428609 [details]
kernel-log-booted

Attached is a log of a QEMU session that (after about 110 seconds) did actually continue to boot.
I cut it short at the dracut loop, since there's no disk mounted that it could ever use as sysroot (which is expected, as I didn't mount one).

Comment 6 Patrick Uiterwijk 2018-04-30 00:20:08 UTC
Laura references https://marc.info/?l=linux-kernel&m=152471887030737&w=2, and if I keep pressing my enter key (or any) on the VGA display of qemu, it does indeed boot within a few seconds.
So this does seem like the same bug.

Comment 7 Kevin Fenzi 2018-04-30 01:48:53 UTC
Marking a F29 Beta blocker per "Bugs which entirely prevent the composition of one or more of the release-blocking images required to be built for a currently-pending (pre-)release"

Comment 8 Adam Williamson 2018-04-30 16:52:58 UTC
That makes it an *automatic* blocker. So setting Accepted. https://fedoraproject.org/wiki/QA:SOP_blocker_bug_process#Automatic_blockers

Comment 9 Adam Williamson 2018-04-30 17:01:03 UTC
For the record, nirik has untagged the latest kernel build so we should get a successful compose. This isn't a long-term solution, though (for a start, every new kernel package build will get tagged again...)

Comment 10 Laura Abbott 2018-04-30 22:21:05 UTC
So I poked at this a little more to see what was going on. init is definitely stuck on wait_for_random_bytes. systemd's getrandom call is non-blocking but it links against gcrypt which seems to be making a blocking call. gcrypt looks to be used for hmac calcuation for the journal (not 100% if that's the cause, still playing with stuff). The upstream kernel discussion seems to be making okay progress so I think we should focus that for now.

Comment 11 Kevin Fenzi 2018-04-30 22:30:39 UTC
I have an oz patch from imcleod that should hopefully add rng passthough to the oz vm's. I've sent a freeze break out to apply it to our image builders. Hopefully this will work around the issue for now.

Comment 12 Matthew Miller 2018-05-01 13:37:15 UTC
See also bug #1572944 for a similar thing on 4.16.4.

Comment 13 Laura Abbott 2018-05-01 18:20:09 UTC
Debugged with Patrick a little more and he found 
https://src.fedoraproject.org/rpms/libgcrypt/blob/master/f/libgcrypt-1.6.2-fips-ctor.patch . The default compose image comes with fips so what's happening is when systemd goes to open the log, it calls libgcrypt initialization which will initialize the random pool and make the blocking call. We have a very good handle on what's happening but no perfect solution at the moment.

Comment 14 Justin M. Forbes 2018-05-02 21:48:28 UTC
Update for rawhide, we have inlined the virtio_rng. Since it isn't a module, it is there early enough. This is sufficient for the compose, but it could still cause issues testing in environments where a virtio_rng isn't available to the guest. Certainly not closing this out yet as there needs to be a better solution.

Comment 15 Paul Whalen 2018-05-09 19:05:29 UTC
Hitting this when attempting to install Rawhide on hardware, reproducible on all arches(tested on armhfp, aarch64 and x86_64). Qemu installs work as expected.

Comment 16 Paul Whalen 2018-05-29 14:15:47 UTC
Hardware installs are working with Fedora-Rawhide-20180528.n.0 which includes 4.17.0-0.rc6.git3.1.fc29

Comment 17 Jan Pokorný [poki] 2018-06-04 12:44:06 UTC
I am not sure I observe the same, but for me, the "hardly ever finishes
booting" issue started with kernel-4.17.0-0.rc7.git2.1.fc29.x86_64
(4.17.0-0.rc7.git1.1.fc29.x86_64 works fine).

As an obstacle in the boot process, I've identified something related to
entropy/CRNG/urandom:

[     3.868366] systemd-udevd (390) used greatest stack depth: 12424 bytes left
[   275.416359] random: crng init done
[   275.417044] random: 7 urandom warning(s) missed due to ratelimiting
[   332.770711] kworker/dying (213) used greatest stack depth: 12232 bytes left
<booting stuck again>

The above is the transcription of the unmodified excerpt from on-screen
messages from boot (I've removed "quiet" from the command-line).

No idea why CRNG initialization took so long (4.5+ minutes) and how
long it would take for system to actually come up (if ever), I simply
rebooted with Ctrl+Alt+Del combo, which the system responded to, and
booted the previous kernel (see above).

It's also possible that LUKS password screen relies on RNG being
already initialized, so the situation may incur a simple deadlock.

Comment 18 Jan Pokorný [poki] 2018-06-07 11:51:50 UTC
re [comment 17]:

Stated issue went away with update to kernel-4.18.0-0.rc0.git2.1.fc29
(no other possibly relevant packages like dracut or systemd got updated;
when rebooted into problematic kernel afterwards, the problem persists).

Comment 19 Adam Williamson 2018-09-08 00:06:28 UTC
Laura, Justin, are you aware of anything that needs resolving for F29 Beta here? This bug is still on the F29 Beta blocker list. Thanks!

Comment 20 Justin M. Forbes 2018-09-09 15:56:57 UTC
This should be closed, it is fixed in userspace as well.

Comment 21 Michael Convey 2018-09-24 04:51:27 UTC
This continues to be a problem. I just updated to kernel 4.18.9-200.fc28.x86_64 and attempted to boot about eight times before my random keystrokes met the entrpy criteria -- very frustrating!!

In short, this issue occurs nearly every time I process a kernel update and this is still an issue with kernel 4.18.9-200.fc28.x86_64.

Please let me know of a workaround. Also, should I open a new bug or has one already been opened?

Comment 22 Michael Convey 2018-09-24 04:52:04 UTC
This continues to be a problem. I just updated to kernel 4.18.9-200.fc28.x86_64 and attempted to boot about eight times before my random keystrokes met the entrpy criteria -- very frustrating!!

In short, this issue occurs nearly every time I process a kernel update and this is still an issue with kernel 4.18.9-200.fc28.x86_64.

Please let me know of a workaround. Also, should I open a new bug or has one already been opened?

Comment 23 Laura Abbott 2018-09-24 06:01:41 UTC
Most of the work is already being tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1572944 , can you put details about your setup there?

Comment 24 Neil Horman 2018-09-24 13:17:54 UTC
are you running rngd?  That should provide you all the entropy you need to the kernel

Comment 25 Tomas Mraz 2018-09-25 08:39:46 UTC
I am afraid that rngd won't help as that is not started yet when the boot process gets stuck.

Comment 26 Michael Convey 2018-09-28 16:54:27 UTC
(In reply to Laura Abbott from comment #23)
> Most of the work is already being tracked at
> https://bugzilla.redhat.com/show_bug.cgi?id=1572944 , can you put details
> about your setup there?

That ticket is closed. Where is a ticket that is actually being worked?

Comment 27 Michael Convey 2018-09-28 16:54:57 UTC
(In reply to Laura Abbott from comment #23)
> Most of the work is already being tracked at
> https://bugzilla.redhat.com/show_bug.cgi?id=1572944 , can you put details
> about your setup there?

That ticket is closed. Where is a ticket that is actually being worked?

Comment 28 Michael Convey 2018-09-29 01:17:03 UTC
I tried to report my details in bug 1593843, but I get an access denied message: "You are not authorized to access bug #1593843."

https://bugzilla.redhat.com/show_bug.cgi?id=1593843

Is there an active bug where I can report this issue with Kernel 4.18.9-200.fc28.x86_64?

Comment 29 Michael Convey 2018-10-16 17:52:55 UTC
Because this bug is closed, I submitted the following: 
https://bugzilla.redhat.com/show_bug.cgi?id=1639840

Comment 30 Pavel Valena 2019-10-09 17:01:40 UTC
*** Bug 1759603 has been marked as a duplicate of this bug. ***