Fedora Account System
Red Hat Associate
Red Hat Customer
I've noticed in openQA tests lately, boot sometimes just gets stuck. The console shows "Starting initrd-switch-root.service", but logs indicate it got a bit further. This most commonly happens on aarch64, but it does sometimes also happen on x86_64. https://openqa.stg.fedoraproject.org/tests/5920121 is an example, see https://openqa.stg.fedoraproject.org/tests/5920121#step/_graphical_wait_login/4 . In that case I hacked the test to notice when it happened and reboot so it could upload logs. The journal from the failed boot just sort of...ends, with no obvious errors: ... Feb 13 23:08:21 fedora systemd[1]: Starting systemd-vconsole-setup.service - Virtual Console Setup... Feb 13 23:08:21 fedora systemd[1]: systemd-ask-password-console.path - Dispatch Password Requests to Console Directory Watch skipped, unmet condition check ConditionPat> Feb 13 23:08:21 fedora systemd[1]: fips-crypto-policy-overlay.service - Bind-mount FIPS crypto-policy in FIPS mode skipped, unmet condition check ConditionKernelCommand> Feb 13 23:08:21 fedora systemd[1]: iscsi-onboot.service - Special handling of early boot iSCSI sessions skipped, unmet condition check ConditionDirectoryNotEmpty=/sys/c> Feb 13 23:08:21 fedora systemd[1]: iscsi-starter.service skipped, unmet condition check ConditionDirectoryNotEmpty=/var/lib/iscsi/nodes Feb 13 23:08:21 fedora systemd[1]: modprobe - Load Kernel Module efi_pstore skipped, unmet condition check ConditionKernelModuleLoaded=!efi_pstore Feb 13 23:08:21 fedora systemd[1]: selinux-autorelabel-mark.service - Mark the need to relabel after reboot skipped, unmet condition check ConditionSecurity=!selinux Feb 13 23:08:21 fedora systemd[1]: systemd-binfmt.service - Set Up Additional Binary Formats skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: systemd-boot-clear-sysfail.service - Clear SysFail Entry If The Boot Is Successful skipped, unmet condition check ConditionPathExists> Feb 13 23:08:21 fedora systemd[1]: systemd-boot-random-seed.service - Update Boot Loader Random Seed skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: systemd-confext.service - Merge System Configuration Images into /etc/ skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: ldconfig.service - Rebuild Dynamic Linker Cache skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: systemd-hibernate-clear.service - Clear Stale Hibernate Storage Info skipped, unmet condition check ConditionPathExists=/sys/firmware> Feb 13 23:08:21 fedora systemd[1]: systemd-hwdb-update.service - Rebuild Hardware Database skipped, unmet condition check ConditionNeedsUpdate=/etc Feb 13 23:08:21 fedora systemd[1]: systemd-journal-catalog-update.service - Rebuild Journal Catalog skipped, unmet condition check ConditionNeedsUpdate=/var Feb 13 23:08:21 fedora systemd[1]: systemd-pcrmachine.service - TPM PCR Machine ID Measurement skipped, unmet condition check ConditionSecurity=measured-uki Feb 13 23:08:21 fedora systemd[1]: systemd-pcrproduct.service - TPM NvPCR Product ID Measurement skipped, unmet condition check ConditionSecurity=measured-uki Feb 13 23:08:21 fedora systemd[1]: systemd-pstore.service - Platform Persistent Storage Archival skipped, unmet condition check ConditionDirectoryNotEmpty=/sys/fs/pstore Feb 13 23:08:21 fedora systemd[1]: systemd-sysext.service - Merge System Extension Images into /usr/ and /opt/ skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: systemd-sysusers.service - Create System Users skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: systemd-tpm2-setup-early.service - Early TPM SRK Setup skipped, unmet condition check ConditionSecurity=measured-uki Feb 13 23:08:21 fedora systemd[1]: systemd-repart.service - Repartition Root Disk skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: systemd-tpm2-setup.service - TPM SRK Setup skipped, unmet condition check ConditionSecurity=measured-uki Feb 13 23:08:21 fedora systemd[1]: systemd-pcrnvdone.service - TPM PCR NvPCR Initialization Separator skipped, unmet condition check ConditionSecurity=measured-uki Feb 13 23:08:21 fedora systemd[1]: systemd-update-done.service - Update is Completed skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: Finished systemd-vconsole-setup.service - Virtual Console Setup. Feb 13 23:08:21 fedora audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-vconsole-setup comm="system> Feb 13 23:08:21 fedora systemd[1]: systemd-firstboot.service - Initial Setup skipped, unmet condition check ConditionFirstBoot=yes Feb 13 23:08:21 fedora systemd[1]: first-boot-complete.target - First Boot Complete skipped, unmet condition check ConditionFirstBoot=yes Feb 13 23:08:21 fedora systemd[1]: systemd-machine-id-commit.service - Save Transient machine-id to Disk skipped, unmet condition check ConditionPathIsMountPoint=/etc/m> Feb 13 23:08:21 fedora systemd[1]: Mounting var-lib-nfs-rpc_pipefs.mount - RPC Pipe File System... Feb 13 23:08:21 fedora kernel: RPC: Registered named UNIX socket transport module. Feb 13 23:08:21 fedora kernel: RPC: Registered udp transport module. Feb 13 23:08:21 fedora kernel: RPC: Registered tcp transport module. Feb 13 23:08:21 fedora kernel: RPC: Registered tcp-with-tls transport module. Feb 13 23:08:21 fedora kernel: RPC: Registered tcp NFSv4.1 backchannel transport module. Feb 13 23:08:21 fedora systemd[1]: Mounted var-lib-nfs-rpc_pipefs.mount - RPC Pipe File System. Feb 13 23:08:21 fedora systemd[1]: Reached target rpc_pipefs.target. there's nothing after that. openQA sat and waited for 15 minutes before rebooting, so clearly it was stuck. I have seen this one time on a local VM while testing something else, but it's not easy to reproduce organically (I tried 10-15 boots of an installed system earlier and didn't hit it). openQA runs a lot of tests, so it's expected that it runs into rare failures fairly often. It's hard to tell when this started exactly, since there's been a ton of other breakage lately which sort of pollutes the history. I can see it at least as far back as Jan 2 - https://openqa.fedoraproject.org/tests/4112855#step/_console_wait_login/12 . It's only easy to access back to Dec 30 before that, so it may have been around longer; it could have been introduced by systemd-259-1.fc44 on Dec 18, possibly, I'll try and look into it. I'm attaching the full journal from an affected boot, and the journal from the subsequent successful boot to compare against. I can try and get a journal with debugging enabled tomorrow or next week.
Created attachment 2129449 [details] full journal from a bad boot (where it got stuck)
Created attachment 2129450 [details] full journal from a good boot
Beta freeze is coming up soon, so nominating for a Beta FE - if we figure this out and get a fix I think it should go through the freeze.
+4 in https://pagure.io/fedora-qa/blocker-review/issue/2035 , marking accepted.
I've been re-running my test cannon all day with the test tweaked to set kernel param systemd.log_level=debug before booting; unfortunately, I haven't yet hit the bug once with that tweak. Either I'm just unlucky, or this is some kinda timing issue and enabling debugging happens to change the timing so the bug no longer happens, or happens very rarely...
grep -n plymouth-read-write goodboot.txt badboot.txt goodboot.txt:911:Feb 13 23:23:27 fedora systemd[1]: Starting plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data... goodboot.txt:920:Feb 13 23:23:27 fedora audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=plymouth-read-write comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' goodboot.txt:921:Feb 13 23:23:27 fedora systemd[1]: Finished plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data. badboot.txt:896:Feb 13 23:08:20 fedora systemd[1]: Starting plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data...
Hmmmm. Hans, any chance maybe https://src.fedoraproject.org/rpms/plymouth/c/a14276e5eeae89ed28fb95022133b59eb4995c6b?branch=rawhide caused this, or something?
(In reply to Adam Williamson from comment #7) > Hmmmm. Hans, any chance maybe > https://src.fedoraproject.org/rpms/plymouth/c/ > a14276e5eeae89ed28fb95022133b59eb4995c6b?branch=rawhide caused this, or > something? I think that is highly unlikely. It would be interesting to try and to reproduce the hang and see if /var/log/boot.log is created (by preserving the root diskimage and looking at it later), that is the main file which get written out on plymouth-read-write.service. plymouth-read-write.service calls "/usr/bin/plymouth update-root-fs --read-write" which makes a plymouthd IPC call, which end up in src/main.c in: static void on_system_initialized (state_t *state) { ply_trace ("system now initialized, opening log"); state->system_initialized = true; #ifdef PLY_ENABLE_SYSTEMD_INTEGRATION if (state->is_attached) tell_systemd_to_print_details (state); #endif prepare_logging (state); } since this also interacts with the console logging a bit, I wonder if these failures where seen with the new kmscon or not?
Thanks. There's also a chance this is still https://github.com/systemd/systemd/issues/35499 - I'm planning to see if the change to terminal size discovery in 260 affects this, but hadn't got round to it yet.
(In reply to Adam Williamson from comment #0) > I have seen this one time on a local VM while testing something else, but > it's not easy to reproduce organically Anecdotal confirmation (and, as anecdotal, have zero real value). I also saw this once on a VM (before branch), and did not any spend time doing any analysis (I expect rawhide to be "special" (where my definition of "special" includes various brokenness)) and just rebooted, but this suggests (as conjectured) some race/timing issue (which as the most annoying of bugs to diagnose).
So it's not totally conclusive, but I looked at Rawhide and Branched results today, and saw several instances of this on Branched (which still has systemd 259) but no instances on Rawhide (systemd 260). So it does look like the change in 260 might have helped here. I'll try and check in a few more times, if the pattern persists it might be worth backporting that change.
I've seen this exact bug, simply pressing the Enter key gets me past the hang.
Update: this is still happening. It's much more common on F44 (systemd 259) than Rawhide (systemd 260), but I definitely *did* see it happen on Rawhide at least once. Upstream have tried various changes to fix it but in openQA scattergun testing, none of those changes reduced the frequency of the problem when backported to 259, so far. See https://github.com/systemd/systemd/pull/41439 for some context.
I have seen it occasionally on real devices as well, sometimes hitting enter/esc will actually make it move forward again. Not sure if it's related but often see repetitive systemd boot messages repeated during boot
Also not sure if having a serial console configured causes issues here, I have a serial console on a lot of my arm devices for easy debug.