Since systemd 257 rc1 landed in Rawhide (I think), openQA has been showing several failed install tests with every compose. In the failure cases, the system apparently got stuck at "Starting initrd-switch-root.service - Switch Root..." and never reached the installer. https://openqa.fedoraproject.org/tests/3019609#step/_boot_to_anaconda/6 is an example of this. In today's Rawhide compose testing, this bug affected six tests. I was able to reproduce this locally by grabbing the current Rawhide Server DVD - https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20241112.n.0/compose/Server/x86_64/iso/Fedora-Server-dvd-x86_64-Rawhide-20241112.n.0.iso - and booting it in a VM. Sometimes it boots fine, but sometimes it stops at that stage apparently forever (I've left it up to 12 minutes at max). But by accident I found that if I hit any key, it makes the boot proceed almost immediately. There's no prompt shown to press a key anywhere, and it doesn't have any immediate visible consequence, but a few seconds later the boot proceeds and ultimately reaches the installer. In the journal after reaching the installer, this is how things look: [journal starts at Nov 12 21:40:19] ... Nov 12 21:40:27 localhost systemd[1]: Finished initrd-udevadm-cleanup-db.service - Cleanup udev Database. Nov 12 21:40:27 localhost systemd[1]: Reached target initrd-switch-root.target - Switch Root. Nov 12 21:40:27 localhost systemd[1]: Finished plymouth-switch-root.service - Plymouth switch root service. Nov 12 21:40:27 localhost systemd[1]: Starting initrd-switch-root.service - Switch Root... Nov 12 21:40:27 localhost systemd[1]: Switching root. Nov 12 21:40:27 localhost systemd-journald[309]: Journal stopped Nov 12 21:52:31 localhost systemd-journald[309]: Received SIGTERM from PID 1 (systemd). Nov 12 21:52:31 localhost kernel: SELinux: policy capability network_peer_controls=1 Nov 12 21:52:31 localhost kernel: SELinux: policy capability open_perms=1 Nov 12 21:52:31 localhost kernel: SELinux: policy capability extended_socket_class=1 ... [journal proceeds as normal and system boots to the installer] you can see that there's twelve minutes where apparently nothing at all happened. I pressed a key around 21:52:25 or so, I guess.
Proposing as an F42 Beta blocker as a violation of "Release-blocking dedicated installer images must boot to the expected boot menu, and then after a reasonable timeout to the installer" - when the bug happens, that requirement is not met unless the user presses a key in the stuck state, which is not obvious or intended.
If the system were waiting for a key press, I'd have expected this to happen every time. One idea: it happens in initrd-switch-root.service, which re-execs PID 1 (and re-initializes lot of stuff), so maybe the hang is caused by low entropy? And pressing a key or two creates just enough of it to fill the pool, so open("/dev/urandom") can finish? (Which would still require an explanation why systemd is falling back to /dev/urandom in the first place, as Fedora's kernel is certainly new enough to support getrandom(GRND_INSECURE). So I'm probably just babbling, but I've got no other ideas ATM...)
*** Bug 2326127 has been marked as a duplicate of this bug. ***
Hmmm. "Real random bytes" are used to encrypt credentials. It's hard to figure out if/when exactly we hit those code paths. But we're certaily doing more credentials in each release, so maybe this is related. Looking at the code, we'll call getrandom() with flags==0 for "real random bytes", and that may block if the random seed pool is not yet initialized, according to getrandom(2). So we don't need to hit the fallback with the device file to block.
Maybe you could do a build that logs when we hit that code path, then we can reproduce the bug and see if those log messages show up at the relevant times?
I can reproduce the issue. 1. wget https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20241112.n.0/compose/Server/x86_64/iso/Fedora-Server-dvd-x86_64-Rawhide-20241112.n.0.iso 2. cp /usr/share/edk2/ovmf/OVMF_VARS.secboot.fd . 3. qemu-kvm -smp 2 -m 3072 -cdrom ./Fedora-Server-dvd-x86_64-Rawhide-20241112.n.0.iso -global 'driver=cfi.pflash01,property=secure,value=on' -drive 'if=pflash,format=raw,unit=0,file=/usr/share/edk2/ovmf/OVMF_CODE.secboot.fd,readonly=on' -drive 'if=pflash,format=raw,unit=1,file=./OVMF_VARS.secboot.fd' -machine q35,smm=on https://paste.centos.org/view/3809c97d are logs from a debug boot. There is a bunch of AVCs: avc: denied { write } for pid=1 comm="systemd" name="core_pattern" ... permissive=1 { read } name="net" { relabelto } name="vga_arbiter" ... With permissive=1 they are probably not an immediate problem, but it's something to fix anyway. I see 'kernel: crng init done' *after* the delay. So it looks like this happened late. But looking at a log where the boot did not hang, 'kernel: crng init done' is also logged after the 'Switching root' message. So if this is a race condition, then it seems we're pretty always pretty close to hitting it. But after looking at all the places which call crypto_random_bytes(), I don't see why we'd call it from pid1. It is called in various places to decrypt credentials, but that happens only if there _is_ an encrypted credential provided. AFAIK, we don't have those in this case. It is also called when encrypting credentials, but that is done by systemd-creds and systemd-pcrlock, not from pid1. After looking at the logs for a while, I filed https://github.com/systemd/systemd/pull/35237, https://github.com/systemd/systemd/issues/35236 for various details about logging in systemd and #2327296 for a crash I'm seeing.
Ping? This is quite a bad bug, people who try to install Fedora are going to think it's broken. And it messes up the openQA tests. I am going to institute a workaround in openQA so the day-to-day results will be consistent, but I don't like doing that as it means we may forget about this bug. I'm worried that none of the upstream issues Zbigniew filed seems to actually be *this* issue. Do I need to file an upstream report for this issue?
I see that Systemd 257 released with such an annoying bug!
I think upstream beyond Zbigniew wasn't aware of it. I was kinda expecting him to forward it but should've just gone ahead and done it :/ it's being worked on now at https://github.com/systemd/systemd/issues/35499 .
FEDORA-2024-6619b6dff3 (systemd-257.1-1.fc42) has been submitted as an update to Fedora 42. https://bodhi.fedoraproject.org/updates/FEDORA-2024-6619b6dff3
That build has the offending patch reverted, so the problem should be resolved for us for now. (Sorry for my inaction here. I started looking into this, but then there were other pressing issues and then I went for holidays for a week and forgot to follow up. Thank you Adam for the debugging and bisecting.)
FEDORA-2024-6619b6dff3 (systemd-257.1-1.fc42) has been pushed to the Fedora 42 stable repository. If problem still persists, please make note of it in this bug report.