I've noticed in openQA tests lately, boot sometimes just gets stuck. The console shows "Starting initrd-switch-root.service", but logs indicate it got a bit further. This most commonly happens on aarch64, but it does sometimes also happen on x86_64. https://openqa.stg.fedoraproject.org/tests/5920121 is an example, see https://openqa.stg.fedoraproject.org/tests/5920121#step/_graphical_wait_login/4 . In that case I hacked the test to notice when it happened and reboot so it could upload logs. The journal from the failed boot just sort of...ends, with no obvious errors: ... Feb 13 23:08:21 fedora systemd[1]: Starting systemd-vconsole-setup.service - Virtual Console Setup... Feb 13 23:08:21 fedora systemd[1]: systemd-ask-password-console.path - Dispatch Password Requests to Console Directory Watch skipped, unmet condition check ConditionPat> Feb 13 23:08:21 fedora systemd[1]: fips-crypto-policy-overlay.service - Bind-mount FIPS crypto-policy in FIPS mode skipped, unmet condition check ConditionKernelCommand> Feb 13 23:08:21 fedora systemd[1]: iscsi-onboot.service - Special handling of early boot iSCSI sessions skipped, unmet condition check ConditionDirectoryNotEmpty=/sys/c> Feb 13 23:08:21 fedora systemd[1]: iscsi-starter.service skipped, unmet condition check ConditionDirectoryNotEmpty=/var/lib/iscsi/nodes Feb 13 23:08:21 fedora systemd[1]: modprobe - Load Kernel Module efi_pstore skipped, unmet condition check ConditionKernelModuleLoaded=!efi_pstore Feb 13 23:08:21 fedora systemd[1]: selinux-autorelabel-mark.service - Mark the need to relabel after reboot skipped, unmet condition check ConditionSecurity=!selinux Feb 13 23:08:21 fedora systemd[1]: systemd-binfmt.service - Set Up Additional Binary Formats skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: systemd-boot-clear-sysfail.service - Clear SysFail Entry If The Boot Is Successful skipped, unmet condition check ConditionPathExists> Feb 13 23:08:21 fedora systemd[1]: systemd-boot-random-seed.service - Update Boot Loader Random Seed skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: systemd-confext.service - Merge System Configuration Images into /etc/ skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: ldconfig.service - Rebuild Dynamic Linker Cache skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: systemd-hibernate-clear.service - Clear Stale Hibernate Storage Info skipped, unmet condition check ConditionPathExists=/sys/firmware> Feb 13 23:08:21 fedora systemd[1]: systemd-hwdb-update.service - Rebuild Hardware Database skipped, unmet condition check ConditionNeedsUpdate=/etc Feb 13 23:08:21 fedora systemd[1]: systemd-journal-catalog-update.service - Rebuild Journal Catalog skipped, unmet condition check ConditionNeedsUpdate=/var Feb 13 23:08:21 fedora systemd[1]: systemd-pcrmachine.service - TPM PCR Machine ID Measurement skipped, unmet condition check ConditionSecurity=measured-uki Feb 13 23:08:21 fedora systemd[1]: systemd-pcrproduct.service - TPM NvPCR Product ID Measurement skipped, unmet condition check ConditionSecurity=measured-uki Feb 13 23:08:21 fedora systemd[1]: systemd-pstore.service - Platform Persistent Storage Archival skipped, unmet condition check ConditionDirectoryNotEmpty=/sys/fs/pstore Feb 13 23:08:21 fedora systemd[1]: systemd-sysext.service - Merge System Extension Images into /usr/ and /opt/ skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: systemd-sysusers.service - Create System Users skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: systemd-tpm2-setup-early.service - Early TPM SRK Setup skipped, unmet condition check ConditionSecurity=measured-uki Feb 13 23:08:21 fedora systemd[1]: systemd-repart.service - Repartition Root Disk skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: systemd-tpm2-setup.service - TPM SRK Setup skipped, unmet condition check ConditionSecurity=measured-uki Feb 13 23:08:21 fedora systemd[1]: systemd-pcrnvdone.service - TPM PCR NvPCR Initialization Separator skipped, unmet condition check ConditionSecurity=measured-uki Feb 13 23:08:21 fedora systemd[1]: systemd-update-done.service - Update is Completed skipped, no trigger condition checks were met. Feb 13 23:08:21 fedora systemd[1]: Finished systemd-vconsole-setup.service - Virtual Console Setup. Feb 13 23:08:21 fedora audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-vconsole-setup comm="system> Feb 13 23:08:21 fedora systemd[1]: systemd-firstboot.service - Initial Setup skipped, unmet condition check ConditionFirstBoot=yes Feb 13 23:08:21 fedora systemd[1]: first-boot-complete.target - First Boot Complete skipped, unmet condition check ConditionFirstBoot=yes Feb 13 23:08:21 fedora systemd[1]: systemd-machine-id-commit.service - Save Transient machine-id to Disk skipped, unmet condition check ConditionPathIsMountPoint=/etc/m> Feb 13 23:08:21 fedora systemd[1]: Mounting var-lib-nfs-rpc_pipefs.mount - RPC Pipe File System... Feb 13 23:08:21 fedora kernel: RPC: Registered named UNIX socket transport module. Feb 13 23:08:21 fedora kernel: RPC: Registered udp transport module. Feb 13 23:08:21 fedora kernel: RPC: Registered tcp transport module. Feb 13 23:08:21 fedora kernel: RPC: Registered tcp-with-tls transport module. Feb 13 23:08:21 fedora kernel: RPC: Registered tcp NFSv4.1 backchannel transport module. Feb 13 23:08:21 fedora systemd[1]: Mounted var-lib-nfs-rpc_pipefs.mount - RPC Pipe File System. Feb 13 23:08:21 fedora systemd[1]: Reached target rpc_pipefs.target. there's nothing after that. openQA sat and waited for 15 minutes before rebooting, so clearly it was stuck. I have seen this one time on a local VM while testing something else, but it's not easy to reproduce organically (I tried 10-15 boots of an installed system earlier and didn't hit it). openQA runs a lot of tests, so it's expected that it runs into rare failures fairly often. It's hard to tell when this started exactly, since there's been a ton of other breakage lately which sort of pollutes the history. I can see it at least as far back as Jan 2 - https://openqa.fedoraproject.org/tests/4112855#step/_console_wait_login/12 . It's only easy to access back to Dec 30 before that, so it may have been around longer; it could have been introduced by systemd-259-1.fc44 on Dec 18, possibly, I'll try and look into it. I'm attaching the full journal from an affected boot, and the journal from the subsequent successful boot to compare against. I can try and get a journal with debugging enabled tomorrow or next week.
Created attachment 2129449 [details] full journal from a bad boot (where it got stuck)
Created attachment 2129450 [details] full journal from a good boot
Beta freeze is coming up soon, so nominating for a Beta FE - if we figure this out and get a fix I think it should go through the freeze.
+4 in https://pagure.io/fedora-qa/blocker-review/issue/2035 , marking accepted.
I've been re-running my test cannon all day with the test tweaked to set kernel param systemd.log_level=debug before booting; unfortunately, I haven't yet hit the bug once with that tweak. Either I'm just unlucky, or this is some kinda timing issue and enabling debugging happens to change the timing so the bug no longer happens, or happens very rarely...
grep -n plymouth-read-write goodboot.txt badboot.txt goodboot.txt:911:Feb 13 23:23:27 fedora systemd[1]: Starting plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data... goodboot.txt:920:Feb 13 23:23:27 fedora audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=plymouth-read-write comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' goodboot.txt:921:Feb 13 23:23:27 fedora systemd[1]: Finished plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data. badboot.txt:896:Feb 13 23:08:20 fedora systemd[1]: Starting plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data...
Hmmmm. Hans, any chance maybe https://src.fedoraproject.org/rpms/plymouth/c/a14276e5eeae89ed28fb95022133b59eb4995c6b?branch=rawhide caused this, or something?
(In reply to Adam Williamson from comment #7) > Hmmmm. Hans, any chance maybe > https://src.fedoraproject.org/rpms/plymouth/c/ > a14276e5eeae89ed28fb95022133b59eb4995c6b?branch=rawhide caused this, or > something? I think that is highly unlikely. It would be interesting to try and to reproduce the hang and see if /var/log/boot.log is created (by preserving the root diskimage and looking at it later), that is the main file which get written out on plymouth-read-write.service. plymouth-read-write.service calls "/usr/bin/plymouth update-root-fs --read-write" which makes a plymouthd IPC call, which end up in src/main.c in: static void on_system_initialized (state_t *state) { ply_trace ("system now initialized, opening log"); state->system_initialized = true; #ifdef PLY_ENABLE_SYSTEMD_INTEGRATION if (state->is_attached) tell_systemd_to_print_details (state); #endif prepare_logging (state); } since this also interacts with the console logging a bit, I wonder if these failures where seen with the new kmscon or not?
Thanks. There's also a chance this is still https://github.com/systemd/systemd/issues/35499 - I'm planning to see if the change to terminal size discovery in 260 affects this, but hadn't got round to it yet.
(In reply to Adam Williamson from comment #0) > I have seen this one time on a local VM while testing something else, but > it's not easy to reproduce organically Anecdotal confirmation (and, as anecdotal, have zero real value). I also saw this once on a VM (before branch), and did not any spend time doing any analysis (I expect rawhide to be "special" (where my definition of "special" includes various brokenness)) and just rebooted, but this suggests (as conjectured) some race/timing issue (which as the most annoying of bugs to diagnose).