Bug 2439826
| Summary: | Boot sometimes hangs (apparently) at Starting initrd-switch-root.service | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Adam Williamson <awilliam> |
| Component: | systemd | Assignee: | systemd-maint |
| Status: | NEW --- | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 44 | CC: | daan.j.demeyer, fedoraproject, gary.buhrmaster, hans, kparal, lnykryn, moschlegbz, msekleta, pbrobinson, psklenar, suraj.ghimire7, systemd-maint, tk2345_, yuwatana, zbyszek |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | openqa AcceptedFreezeException | | |
| Fixed In Version: | | Doc Type: | --- |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2362358 | | |
| Attachments: | | | |
Description
Adam Williamson
2026-02-14 08:04:08 UTC
Created attachment 2129449
full journal from a bad boot (where it got stuck)
Created attachment 2129450
full journal from a good boot
Beta freeze is coming up soon, so nominating for a Beta FE - if we figure this out and get a fix I think it should go through the freeze.

+4 in https://pagure.io/fedora-qa/blocker-review/issue/2035 , marking accepted.

I've been re-running my test cannon all day with the test tweaked to set kernel param systemd.log_level=debug before booting; unfortunately, I haven't yet hit the bug once with that tweak. Either I'm just unlucky, or this is some kinda timing issue and enabling debugging happens to change the timing so the bug no longer happens, or happens very rarely...

grep -n plymouth-read-write goodboot.txt badboot.txt
goodboot.txt:911:Feb 13 23:23:27 fedora systemd[1]: Starting plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data...
goodboot.txt:920:Feb 13 23:23:27 fedora audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=plymouth-read-write comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
goodboot.txt:921:Feb 13 23:23:27 fedora systemd[1]: Finished plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data.
badboot.txt:896:Feb 13 23:08:20 fedora systemd[1]: Starting plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data...

Hmmmm. Hans, any chance maybe https://src.fedoraproject.org/rpms/plymouth/c/a14276e5eeae89ed28fb95022133b59eb4995c6b?branch=rawhide caused this, or something?

(In reply to Adam Williamson from comment #7)
> Hmmmm. Hans, any chance maybe
> https://src.fedoraproject.org/rpms/plymouth/c/a14276e5eeae89ed28fb95022133b59eb4995c6b?branch=rawhide
> caused this, or something?

I think that is highly unlikely. It would be interesting to try to reproduce the hang and see whether /var/log/boot.log is created (by preserving the root disk image and looking at it later); that is the main file which gets written out by plymouth-read-write.service.
plymouth-read-write.service calls "/usr/bin/plymouth update-root-fs --read-write", which makes a plymouthd IPC call, which ends up in src/main.c in:

static void
on_system_initialized (state_t *state)
{
        ply_trace ("system now initialized, opening log");
        state->system_initialized = true;
#ifdef PLY_ENABLE_SYSTEMD_INTEGRATION
        if (state->is_attached)
                tell_systemd_to_print_details (state);
#endif
        prepare_logging (state);
}

Since this also interacts with the console logging a bit, I wonder if these failures were seen with the new kmscon or not?

Thanks. There's also a chance this is still https://github.com/systemd/systemd/issues/35499 - I'm planning to see if the change to terminal size discovery in 260 affects this, but hadn't got round to it yet.

(In reply to Adam Williamson from comment #0)
> I have seen this one time on a local VM while testing something else, but
> it's not easy to reproduce organically

Anecdotal confirmation (and, as anecdotal, it has zero real value). I also saw this once on a VM (before branch), and did not spend any time doing any analysis (I expect rawhide to be "special" (where my definition of "special" includes various brokenness)) and just rebooted, but this suggests (as conjectured) some race/timing issue (which is the most annoying of bugs to diagnose).

So it's not totally conclusive, but I looked at Rawhide and Branched results today, and saw several instances of this on Branched (which still has systemd 259) but no instances on Rawhide (systemd 260). So it does look like the change in 260 might have helped here. I'll try and check in a few more times; if the pattern persists it might be worth backporting that change.

I've seen this exact bug; simply pressing the Enter key gets me past the hang.

Update: this is still happening. It's much more common on F44 (systemd 259) than Rawhide (systemd 260), but I definitely *did* see it happen on Rawhide at least once.
Upstream have tried various changes to fix it, but in openQA scattergun testing, none of those changes has reduced the frequency of the problem when backported to 259 so far. See https://github.com/systemd/systemd/pull/41439 for some context.

I have seen it occasionally on real devices as well; sometimes hitting enter/esc will actually make it move forward again. Not sure if it's related, but I often see systemd boot messages repeated during boot. Also not sure if having a serial console configured causes issues here; I have a serial console on a lot of my arm devices for easy debug.
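
The grep comparison earlier in this bug (a "Starting plymouth-read-write.service" line with no matching "Finished" line in the bad boot) could be generalized to every unit in a journal dump. The following is only a rough sketch, not a tool anyone in this bug used: the function name `unfinished_units` is made up for illustration, and it assumes the journal text uses the "unit.service - Description" message format seen in the attached journals. Units that were legitimately still starting when the journal ends will also show up, so the output is a list of candidates, not a diagnosis.

```python
import re

# Matches systemd (PID 1) job messages such as
#   "... systemd[1]: Starting plymouth-read-write.service - Tell Plymouth ..."
# Assumption: the unit name is the first token after the verb, as in the
# journal excerpts quoted in this bug.
_EVENT = re.compile(
    r"systemd\[1\]: (Starting|Started|Finished|Failed to start) (\S+)"
)


def unfinished_units(lines):
    """Return {unit: line_number} for units with a 'Starting' message
    that is never followed by 'Started', 'Finished' or 'Failed to start'."""
    pending = {}
    for number, line in enumerate(lines, start=1):
        match = _EVENT.search(line)
        if not match:
            continue
        event, unit = match.groups()
        # Drop trailing dots in case the message has no description part
        # (e.g. "Starting foo.service...").
        unit = unit.rstrip(".")
        if event == "Starting":
            pending[unit] = number
        else:
            pending.pop(unit, None)
    return pending


if __name__ == "__main__":
    import sys

    # Usage: python unfinished.py badboot.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as journal:
        for unit, number in unfinished_units(journal).items():
            print(f"{number}: {unit}")
```

Running this against both goodboot.txt and badboot.txt and diffing the two outputs would show which units are stuck only in the bad boot, which is the same signal the manual grep gave for plymouth-read-write.service.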