Bug 2325780
| Summary: | Installer boot sometimes gets stuck at "Starting initrd-switch-root.service - Switch Root..." until any key is pressed | ||
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Adam Williamson <awilliam> |
| Component: | systemd | Assignee: | systemd-maint |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | rawhide | CC: | daan.j.demeyer, dtardon, fedoraproject, lnykryn, mikhail.v.gavrilov, msekleta, robatino, ryncsn, suraj.ghimire7, systemd-maint, yuwatana, zbyszek |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | openqa | ||
| Fixed In Version: | systemd-257.1-1.fc42 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2024-12-20 14:58:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2291263 | ||
|
Description
Adam Williamson
2024-11-12 22:01:50 UTC
Proposing as an F42 Beta blocker as a violation of "Release-blocking dedicated installer images must boot to the expected boot menu, and then after a reasonable timeout to the installer" - when the bug happens, that requirement is not met unless the user presses a key in the stuck state, which is not obvious or intended. If the system were waiting for a key press, I'd have expected this to happen every time. One idea: it happens in initrd-switch-root.service, which re-execs PID 1 (and re-initializes lot of stuff), so maybe the hang is caused by low entropy? And pressing a key or two creates just enough of it to fill the pool, so open("/dev/urandom") can finish? (Which would still require an explanation why systemd is falling back to /dev/urandom in the first place, as Fedora's kernel is certainly new enough to support getrandom(GRND_INSECURE). So I'm probably just babbling, but I've got no other ideas ATM...)
*** Bug 2326127 has been marked as a duplicate of this bug. *** Hmmm. "Real random bytes" are used to encrypt credentials. It's hard to figure out if/when exactly we hit those code paths. But we're certaily doing more credentials in each release, so maybe this is related. Looking at the code, we'll call getrandom() with flags==0 for "real random bytes", and that may block if the random seed pool is not yet initialized, according to getrandom(2). So we don't need to hit the fallback with the device file to block. Maybe you could do a build that logs when we hit that code path, then we can reproduce the bug and see if those log messages show up at the relevant times? I can reproduce the issue. 1. wget https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20241112.n.0/compose/Server/x86_64/iso/Fedora-Server-dvd-x86_64-Rawhide-20241112.n.0.iso 2. cp /usr/share/edk2/ovmf/OVMF_VARS.secboot.fd . 3. qemu-kvm -smp 2 -m 3072 -cdrom ./Fedora-Server-dvd-x86_64-Rawhide-20241112.n.0.iso -global 'driver=cfi.pflash01,property=secure,value=on' -drive 'if=pflash,format=raw,unit=0,file=/usr/share/edk2/ovmf/OVMF_CODE.secboot.fd,readonly=on' -drive 'if=pflash,format=raw,unit=1,file=./OVMF_VARS.secboot.fd' -machine q35,smm=on https://paste.centos.org/view/3809c97d are logs from a debug boot. There is a bunch of AVCs: avc: denied { write } for pid=1 comm="systemd" name="core_pattern" ... permissive=1 { read } name="net" { relabelto } name="vga_arbiter" ... With permissive=1 they are probably not an immediate problem, but it's something to fix anyway. I see 'kernel: crng init done' *after* the delay. So it looks like this happened late. But looking at a log where the boot did not hang, 'kernel: crng init done' is also logged after the 'Switching root' message. So if this is a race condition, then it seems we're pretty always pretty close to hitting it. But after looking at all the places which call crypto_random_bytes(), I don't see why we'd call it from pid1. It is called in various places to decrypt credentials, but that happens only if there _is_ an encrypted credential provided. AFAIK, we don't have those in this case. It is also called when encrypting credentials, but that is done by systemd-creds and systemd-pcrlock, not from pid1. After looking at the logs for a while, I filed https://github.com/systemd/systemd/pull/35237, https://github.com/systemd/systemd/issues/35236 for various details about logging in systemd and #2327296 for a crash I'm seeing. Ping? This is quite a bad bug, people who try to install Fedora are going to think it's broken. And it messes up the openQA tests. I am going to institute a workaround in openQA so the day-to-day results will be consistent, but I don't like doing that as it means we may forget about this bug. I'm worried that none of the upstream issues Zbigniew filed seems to actually be *this* issue. Do I need to file an upstream report for this issue? I see that Systemd 257 released with such an annoying bug! I think upstream beyond Zbigniew wasn't aware of it. I was kinda expecting him to forward it but should've just gone ahead and done it :/ it's being worked on now at https://github.com/systemd/systemd/issues/35499 . FEDORA-2024-6619b6dff3 (systemd-257.1-1.fc42) has been submitted as an update to Fedora 42. https://bodhi.fedoraproject.org/updates/FEDORA-2024-6619b6dff3 That build has the offending patch reverted, so the problem should be resolved for us for now. (Sorry for my inaction here. I started looking into this, but then there were other pressing issues and then I went for holidays for a week and forgot to follow up. Thank you Adam for the debugging and bisecting.) FEDORA-2024-6619b6dff3 (systemd-257.1-1.fc42) has been pushed to the Fedora 42 stable repository. If problem still persists, please make note of it in this bug report. |