2325780 – Installer boot sometimes gets stuck at "Starting initrd-switch-root.service - Switch Root..." until any key is pressed

Bug 2325780 - Installer boot sometimes gets stuck at "Starting initrd-switch-root.service - Switch Root..." until any key is pressed

Summary: Installer boot sometimes gets stuck at "Starting initrd-switch-root.service -...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	systemd
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	systemd-maint
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	openqa
Duplicates (1):	2326127 (view as bug list)
Depends On:
Blocks:	F42BetaBlocker
TreeView+	depends on / blocked

Reported:	2024-11-12 22:01 UTC by Adam Williamson
Modified:	2024-12-20 14:58 UTC (History)
CC List:	12 users (show)
Fixed In Version:	systemd-257.1-1.fc42
Clone Of:
Environment:
Last Closed:	2024-12-20 14:58:50 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	systemd systemd issues 35499#issuecomment-2543773669	0	None	open	System hangs at boot until key press	2024-12-15 16:11:19 UTC
Red Hat Bugzilla	2327296	0	unspecified	NEW	rpcbind is terminated with ABRT in the initrd	2025-02-26 13:17:08 UTC

Description Adam Williamson 2024-11-12 22:01:50 UTC

Since systemd 257 rc1 landed in Rawhide (I think), openQA has been showing several failed install tests with every compose. In the failure cases, the system apparently got stuck at "Starting initrd-switch-root.service - Switch Root..." and never reached the installer. https://openqa.fedoraproject.org/tests/3019609#step/_boot_to_anaconda/6 is an example of this. In today's Rawhide compose testing, this bug affected six tests.

I was able to reproduce this locally by grabbing the current Rawhide Server DVD - https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20241112.n.0/compose/Server/x86_64/iso/Fedora-Server-dvd-x86_64-Rawhide-20241112.n.0.iso - and booting it in a VM. Sometimes it boots fine, but sometimes it stops at that stage apparently forever (I've left it up to 12 minutes at max).

But by accident I found that if I hit any key, it makes the boot proceed almost immediately. There's no prompt shown to press a key anywhere, and it doesn't have any immediate visible consequence, but a few seconds later the boot proceeds and ultimately reaches the installer.

In the journal after reaching the installer, this is how things look:

[journal starts at Nov 12 21:40:19]
...
Nov 12 21:40:27 localhost systemd[1]: Finished initrd-udevadm-cleanup-db.service - Cleanup udev Database.
Nov 12 21:40:27 localhost systemd[1]: Reached target initrd-switch-root.target - Switch Root.
Nov 12 21:40:27 localhost systemd[1]: Finished plymouth-switch-root.service - Plymouth switch root service.
Nov 12 21:40:27 localhost systemd[1]: Starting initrd-switch-root.service - Switch Root...
Nov 12 21:40:27 localhost systemd[1]: Switching root.
Nov 12 21:40:27 localhost systemd-journald[309]: Journal stopped
Nov 12 21:52:31 localhost systemd-journald[309]: Received SIGTERM from PID 1 (systemd).
Nov 12 21:52:31 localhost kernel: SELinux: policy capability network_peer_controls=1
Nov 12 21:52:31 localhost kernel: SELinux: policy capability open_perms=1
Nov 12 21:52:31 localhost kernel: SELinux: policy capability extended_socket_class=1
...
[journal proceeds as normal and system boots to the installer]

you can see that there's twelve minutes where apparently nothing at all happened. I pressed a key around 21:52:25 or so, I guess.

Comment 1 Adam Williamson 2024-11-12 22:03:19 UTC

Proposing as an F42 Beta blocker as a violation of "Release-blocking dedicated installer images must boot to the expected boot menu, and then after a reasonable timeout to the installer" - when the bug happens, that requirement is not met unless the user presses a key in the stuck state, which is not obvious or intended.

Comment 2 David Tardon 2024-11-13 15:08:10 UTC

If the system were waiting for a key press, I'd have expected this to happen every time. One idea: it happens in initrd-switch-root.service, which re-execs PID 1 (and re-initializes lot of stuff), so maybe the hang is caused by low entropy? And pressing a key or two creates just enough of it to fill the pool, so open("/dev/urandom") can finish? (Which would still require an explanation why systemd is falling back to /dev/urandom in the first place, as Fedora's kernel is certainly new enough to support getrandom(GRND_INSECURE). So I'm probably just babbling, but I've got no other ideas ATM...)

Comment 3 Zbigniew Jędrzejewski-Szmek 2024-11-19 09:33:50 UTC

*** Bug 2326127 has been marked as a duplicate of this bug. ***

Comment 4 Zbigniew Jędrzejewski-Szmek 2024-11-19 10:57:07 UTC

Hmmm. "Real random bytes" are used to encrypt credentials. It's hard to figure out if/when exactly we hit those code paths. But we're certaily doing more credentials in each release, so maybe this is related.

Looking at the code, we'll call getrandom() with flags==0 for "real random bytes", and that may block if the random seed pool is not yet initialized, according to getrandom(2). So we don't need to hit the fallback with the device file to block.

Comment 5 Adam Williamson 2024-11-19 15:43:20 UTC

Maybe you could do a build that logs when we hit that code path, then we can reproduce the bug and see if those log messages show up at the relevant times?

Comment 6 Zbigniew Jędrzejewski-Szmek 2024-11-19 17:16:58 UTC

I can reproduce the issue.
1. wget https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20241112.n.0/compose/Server/x86_64/iso/Fedora-Server-dvd-x86_64-Rawhide-20241112.n.0.iso
2. cp /usr/share/edk2/ovmf/OVMF_VARS.secboot.fd .
3. qemu-kvm -smp 2 -m 3072 -cdrom ./Fedora-Server-dvd-x86_64-Rawhide-20241112.n.0.iso -global 'driver=cfi.pflash01,property=secure,value=on' -drive 'if=pflash,format=raw,unit=0,file=/usr/share/edk2/ovmf/OVMF_CODE.secboot.fd,readonly=on' -drive 'if=pflash,format=raw,unit=1,file=./OVMF_VARS.secboot.fd' -machine q35,smm=on

https://paste.centos.org/view/3809c97d are logs from a debug boot.

There is a bunch of AVCs:
avc: denied { write } for pid=1 comm="systemd" name="core_pattern" ... permissive=1
            { read }                           name="net"
            { relabelto }                      name="vga_arbiter"
            ...
With permissive=1 they are probably not an immediate problem, but it's something to fix anyway.

I see 'kernel: crng init done' *after* the delay. So it looks like this happened late.
But looking at a log where the boot did not hang, 'kernel: crng init done' is also
logged after the 'Switching root' message. So if this is a race condition, then it seems
we're pretty always pretty close to hitting it.

But after looking at all the places which call crypto_random_bytes(), I don't see why
we'd call it from pid1. It is called in various places to decrypt credentials, but
that happens only if there _is_ an encrypted credential provided. AFAIK, we don't
have those in this case. It is also called when encrypting credentials, but that is
done by systemd-creds and systemd-pcrlock, not from pid1.

After looking at the logs for a while, I filed https://github.com/systemd/systemd/pull/35237,
https://github.com/systemd/systemd/issues/35236 for various details about logging in systemd
and #2327296 for a crash I'm seeing.

Comment 7 Adam Williamson 2024-12-02 16:49:20 UTC

Ping? This is quite a bad bug, people who try to install Fedora are going to think it's broken. And it messes up the openQA tests. I am going to institute a workaround in openQA so the day-to-day results will be consistent, but I don't like doing that as it means we may forget about this bug. I'm worried that none of the upstream issues Zbigniew filed seems to actually be *this* issue. Do I need to file an upstream report for this issue?

Comment 8 Mikhail 2024-12-15 10:57:34 UTC

I see that Systemd 257 released with such an annoying bug!

Comment 9 Adam Williamson 2024-12-15 16:11:09 UTC

I think upstream beyond Zbigniew wasn't aware of it. I was kinda expecting him to forward it but should've just gone ahead and done it :/ it's being worked on now at https://github.com/systemd/systemd/issues/35499 .

Comment 10 Fedora Update System 2024-12-20 12:26:06 UTC

FEDORA-2024-6619b6dff3 (systemd-257.1-1.fc42) has been submitted as an update to Fedora 42.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-6619b6dff3

Comment 11 Zbigniew Jędrzejewski-Szmek 2024-12-20 12:46:01 UTC

That build has the offending patch reverted, so the problem should be resolved for us for now.

(Sorry for my inaction here. I started looking into this, but then there were other pressing issues and then I went for holidays for a week and forgot to follow up. Thank you Adam for the debugging and bisecting.)

Comment 12 Fedora Update System 2024-12-20 14:58:50 UTC

FEDORA-2024-6619b6dff3 (systemd-257.1-1.fc42) has been pushed to the Fedora 42 stable repository.
If problem still persists, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.