Bug 2439826

Summary:

Boot sometimes hangs (apparently) at Starting initrd-switch-root.service

Product:

[Fedora] Fedora

Reporter:

Adam Williamson <awilliam>

Component:

systemd

Assignee:

systemd-maint

Status:

NEW ---

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

CC:

daan.j.demeyer, fedoraproject, gary.buhrmaster, hans, kparal, lnykryn, moschlegbz, msekleta, pbrobinson, psklenar, suraj.ghimire7, systemd-maint, tk2345_, yuwatana, zbyszek

Target Milestone:

---

Keywords:

Regression

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

openqa AcceptedFreezeException

Fixed In Version:

Doc Type:

---

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

2362358

Attachments:

Description	Flags
full journal from a bad boot (where it got stuck)	none
full journal from a good boot	none

Description Adam Williamson 2026-02-14 08:04:08 UTC

I've noticed in openQA tests lately, boot sometimes just gets stuck. The console shows "Starting initrd-switch-root.service", but logs indicate it got a bit further. This most commonly happens on aarch64, but it does sometimes also happen on x86_64.

https://openqa.stg.fedoraproject.org/tests/5920121 is an example, see https://openqa.stg.fedoraproject.org/tests/5920121#step/_graphical_wait_login/4 . In that case I hacked the test to notice when it happened and reboot so it could upload logs. The journal from the failed boot just sort of...ends, with no obvious errors:

...
Feb 13 23:08:21 fedora systemd[1]: Starting systemd-vconsole-setup.service - Virtual Console Setup...
Feb 13 23:08:21 fedora systemd[1]: systemd-ask-password-console.path - Dispatch Password Requests to Console Directory Watch skipped, unmet condition check ConditionPat>
Feb 13 23:08:21 fedora systemd[1]: fips-crypto-policy-overlay.service - Bind-mount FIPS crypto-policy in FIPS mode skipped, unmet condition check ConditionKernelCommand>
Feb 13 23:08:21 fedora systemd[1]: iscsi-onboot.service - Special handling of early boot iSCSI sessions skipped, unmet condition check ConditionDirectoryNotEmpty=/sys/c>
Feb 13 23:08:21 fedora systemd[1]: iscsi-starter.service skipped, unmet condition check ConditionDirectoryNotEmpty=/var/lib/iscsi/nodes
Feb 13 23:08:21 fedora systemd[1]: modprobe - Load Kernel Module efi_pstore skipped, unmet condition check ConditionKernelModuleLoaded=!efi_pstore
Feb 13 23:08:21 fedora systemd[1]: selinux-autorelabel-mark.service - Mark the need to relabel after reboot skipped, unmet condition check ConditionSecurity=!selinux
Feb 13 23:08:21 fedora systemd[1]: systemd-binfmt.service - Set Up Additional Binary Formats skipped, no trigger condition checks were met.
Feb 13 23:08:21 fedora systemd[1]: systemd-boot-clear-sysfail.service - Clear SysFail Entry If The Boot Is Successful skipped, unmet condition check ConditionPathExists>
Feb 13 23:08:21 fedora systemd[1]: systemd-boot-random-seed.service - Update Boot Loader Random Seed skipped, no trigger condition checks were met.
Feb 13 23:08:21 fedora systemd[1]: systemd-confext.service - Merge System Configuration Images into /etc/ skipped, no trigger condition checks were met.
Feb 13 23:08:21 fedora systemd[1]: ldconfig.service - Rebuild Dynamic Linker Cache skipped, no trigger condition checks were met.
Feb 13 23:08:21 fedora systemd[1]: systemd-hibernate-clear.service - Clear Stale Hibernate Storage Info skipped, unmet condition check ConditionPathExists=/sys/firmware>
Feb 13 23:08:21 fedora systemd[1]: systemd-hwdb-update.service - Rebuild Hardware Database skipped, unmet condition check ConditionNeedsUpdate=/etc
Feb 13 23:08:21 fedora systemd[1]: systemd-journal-catalog-update.service - Rebuild Journal Catalog skipped, unmet condition check ConditionNeedsUpdate=/var
Feb 13 23:08:21 fedora systemd[1]: systemd-pcrmachine.service - TPM PCR Machine ID Measurement skipped, unmet condition check ConditionSecurity=measured-uki
Feb 13 23:08:21 fedora systemd[1]: systemd-pcrproduct.service - TPM NvPCR Product ID Measurement skipped, unmet condition check ConditionSecurity=measured-uki
Feb 13 23:08:21 fedora systemd[1]: systemd-pstore.service - Platform Persistent Storage Archival skipped, unmet condition check ConditionDirectoryNotEmpty=/sys/fs/pstore
Feb 13 23:08:21 fedora systemd[1]: systemd-sysext.service - Merge System Extension Images into /usr/ and /opt/ skipped, no trigger condition checks were met.
Feb 13 23:08:21 fedora systemd[1]: systemd-sysusers.service - Create System Users skipped, no trigger condition checks were met.
Feb 13 23:08:21 fedora systemd[1]: systemd-tpm2-setup-early.service - Early TPM SRK Setup skipped, unmet condition check ConditionSecurity=measured-uki
Feb 13 23:08:21 fedora systemd[1]: systemd-repart.service - Repartition Root Disk skipped, no trigger condition checks were met.
Feb 13 23:08:21 fedora systemd[1]: systemd-tpm2-setup.service - TPM SRK Setup skipped, unmet condition check ConditionSecurity=measured-uki
Feb 13 23:08:21 fedora systemd[1]: systemd-pcrnvdone.service - TPM PCR NvPCR Initialization Separator skipped, unmet condition check ConditionSecurity=measured-uki
Feb 13 23:08:21 fedora systemd[1]: systemd-update-done.service - Update is Completed skipped, no trigger condition checks were met.
Feb 13 23:08:21 fedora systemd[1]: Finished systemd-vconsole-setup.service - Virtual Console Setup.
Feb 13 23:08:21 fedora audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-vconsole-setup comm="system>
Feb 13 23:08:21 fedora systemd[1]: systemd-firstboot.service - Initial Setup skipped, unmet condition check ConditionFirstBoot=yes
Feb 13 23:08:21 fedora systemd[1]: first-boot-complete.target - First Boot Complete skipped, unmet condition check ConditionFirstBoot=yes
Feb 13 23:08:21 fedora systemd[1]: systemd-machine-id-commit.service - Save Transient machine-id to Disk skipped, unmet condition check ConditionPathIsMountPoint=/etc/m>
Feb 13 23:08:21 fedora systemd[1]: Mounting var-lib-nfs-rpc_pipefs.mount - RPC Pipe File System...
Feb 13 23:08:21 fedora kernel: RPC: Registered named UNIX socket transport module.
Feb 13 23:08:21 fedora kernel: RPC: Registered udp transport module.
Feb 13 23:08:21 fedora kernel: RPC: Registered tcp transport module.
Feb 13 23:08:21 fedora kernel: RPC: Registered tcp-with-tls transport module.
Feb 13 23:08:21 fedora kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
Feb 13 23:08:21 fedora systemd[1]: Mounted var-lib-nfs-rpc_pipefs.mount - RPC Pipe File System.
Feb 13 23:08:21 fedora systemd[1]: Reached target rpc_pipefs.target.

there's nothing after that. openQA sat and waited for 15 minutes before rebooting, so clearly it was stuck.

I have seen this one time on a local VM while testing something else, but it's not easy to reproduce organically (I tried 10-15 boots of an installed system earlier and didn't hit it). openQA runs a lot of tests, so it's expected that it runs into rare failures fairly often.

It's hard to tell when this started exactly, since there's been a ton of other breakage lately which sort of pollutes the history. I can see it at least as far back as Jan 2 - https://openqa.fedoraproject.org/tests/4112855#step/_console_wait_login/12 . It's only easy to access back to Dec 30 before that, so it may have been around longer; it could have been introduced by systemd-259-1.fc44 on Dec 18, possibly, I'll try and look into it.

I'm attaching the full journal from an affected boot, and the journal from the subsequent successful boot to compare against. I can try and get a journal with debugging enabled tomorrow or next week.

Comment 1 Adam Williamson 2026-02-14 08:04:45 UTC

Created attachment 2129449 [details]
full journal from a bad boot (where it got stuck)

Comment 2 Adam Williamson 2026-02-14 08:05:08 UTC

Created attachment 2129450 [details]
full journal from a good boot

Comment 3 Adam Williamson 2026-02-14 08:06:18 UTC

Beta freeze is coming up soon, so nominating for a Beta FE - if we figure this out and get a fix I think it should go through the freeze.

Comment 4 Adam Williamson 2026-02-16 17:01:09 UTC

+4 in https://pagure.io/fedora-qa/blocker-review/issue/2035 , marking accepted.

Comment 5 Adam Williamson 2026-02-18 01:23:03 UTC

I've been re-running my test cannon all day with the test tweaked to set kernel param systemd.log_level=debug before booting; unfortunately, I haven't yet hit the bug once with that tweak. Either I'm just unlucky, or this is some kinda timing issue and enabling debugging happens to change the timing so the bug no longer happens, or happens very rarely...

Comment 6 tk2345_ 2026-02-18 08:38:48 UTC

grep -n plymouth-read-write goodboot.txt badboot.txt 

goodboot.txt:911:Feb 13 23:23:27 fedora systemd[1]: Starting plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data...
goodboot.txt:920:Feb 13 23:23:27 fedora audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=plymouth-read-write comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
goodboot.txt:921:Feb 13 23:23:27 fedora systemd[1]: Finished plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data.

badboot.txt:896:Feb 13 23:08:20 fedora systemd[1]: Starting plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data...

Comment 7 Adam Williamson 2026-02-24 02:10:10 UTC

Hmmmm. Hans, any chance maybe https://src.fedoraproject.org/rpms/plymouth/c/a14276e5eeae89ed28fb95022133b59eb4995c6b?branch=rawhide caused this, or something?

Comment 8 Hans de Goede 2026-02-28 11:57:54 UTC

(In reply to Adam Williamson from comment #7)
> Hmmmm. Hans, any chance maybe
> https://src.fedoraproject.org/rpms/plymouth/c/
> a14276e5eeae89ed28fb95022133b59eb4995c6b?branch=rawhide caused this, or
> something?

I think that is highly unlikely.

It would be interesting to try and to reproduce the hang and see if /var/log/boot.log is created (by preserving the root diskimage and looking at it later), that is the main file which get written out on plymouth-read-write.service.

plymouth-read-write.service calls "/usr/bin/plymouth update-root-fs --read-write" which makes a plymouthd IPC call, which end up in src/main.c in:

static void
on_system_initialized (state_t *state)
{
        ply_trace ("system now initialized, opening log");
        state->system_initialized = true;

#ifdef PLY_ENABLE_SYSTEMD_INTEGRATION
        if (state->is_attached)
                tell_systemd_to_print_details (state);
#endif

        prepare_logging (state);
}

since this also interacts with the console logging a bit, I wonder if these failures where seen with the new kmscon or not?

Comment 9 Adam Williamson 2026-02-28 17:48:08 UTC

Thanks. There's also a chance this is still https://github.com/systemd/systemd/issues/35499 - I'm planning to see if the change to terminal size discovery in 260 affects this, but hadn't got round to it yet.

Comment 10 Gary Buhrmaster 2026-03-01 04:25:14 UTC

(In reply to Adam Williamson from comment #0)

> I have seen this one time on a local VM while testing something else, but
> it's not easy to reproduce organically

Anecdotal confirmation (and, as anecdotal, have zero real value).

I also saw this once on a VM (before branch), and did not any spend time doing any analysis (I expect rawhide to be "special" (where my definition of "special" includes various brokenness)) and just rebooted, but this suggests (as conjectured) some race/timing issue (which as the most annoying of bugs to diagnose).

Comment 11 Adam Williamson 2026-03-12 19:02:54 UTC

So it's not totally conclusive, but I looked at Rawhide and Branched results today, and saw several instances of this on Branched (which still has systemd 259) but no instances on Rawhide (systemd 260). So it does look like the change in 260 might have helped here. I'll try and check in a few more times, if the pattern persists it might be worth backporting that change.

Comment 12 M. Schlegel 2026-03-18 20:31:17 UTC

I've seen this exact bug, simply pressing the Enter key gets me past the hang.

Comment 13 Adam Williamson 2026-04-08 19:48:25 UTC

Update: this is still happening. It's much more common on F44 (systemd 259) than Rawhide (systemd 260), but I definitely *did* see it happen on Rawhide at least once. Upstream have tried various changes to fix it but in openQA scattergun testing, none of those changes reduced the frequency of the problem when backported to 259, so far. See https://github.com/systemd/systemd/pull/41439 for some context.

Comment 14 Peter Robinson 2026-04-09 15:11:49 UTC

I have seen it occasionally on real devices as well, sometimes hitting enter/esc will actually make it move forward again. Not sure if it's related but often see repetitive systemd boot messages repeated during boot

Comment 15 Peter Robinson 2026-04-09 21:40:08 UTC

Also not sure if having a serial console configured causes issues here, I have a serial console on a lot of my arm devices for easy debug.