Created attachment 1848319
Systemd is stuck and not booting
Description of problem:
Systemd is stuck on cli and not booting
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Download the Fedora 36 Rawhide 20211229.n.1 Base from test results
2. Create a VM with 2 GB, 2 cores on VMware WS Pro
3. Install Fedora to the VM, don't create a user account pre-installation.
Actual results:
The initial setup utility didn't run, and the system failed to boot.
Expected results:
The initial setup utility would run, and upon account creation the system would boot successfully.
The message is from dracut. This appears to be some miscommunication between dracut and how the live image is set up…
Proposed as a Blocker for 36-beta by Fedora user ahmedalmeleh using the blocker tracking app because:
Systemd is stuck on cli and not booting. It's been like that for 20 minutes. The initial setup utility never ran.
I think it failed: Expected_installed_system_boot_behavior
It didn't give me the option to create a user account before the installation or after.
1. A working mechanism to create a user account must be clearly presented during installation and/or first boot of the installed system.
2. A system installed with a release-blocking desktop must boot to a log in screen where it is possible to log in to a working desktop using a user account created during installation or a 'first boot' utility.
3. If a utility for creating user accounts and other configuration is configured to launch, it must be visible within 10 seconds of the first boot reaching the launch point
Zbigniew: note to be clear - the failure happens *on boot of a system installed from the live image*, not on boot of the live image itself. It seems these kernel args are getting passed through to the installed system environment when they should not be.
dracut was built on Dec 20th, but the change does not look like it could be relevant:
grubby hasn't been touched lately, neither has anaconda, and neither, AFAICS, have the relevant kernel bits. This does kinda seem to leave systemd as a candidate. The build that's actually in affected composes is systemd-250~rc3-1.fc36, because later builds got untagged as they were causing compose failures IIRC.
So, I'm still digging into this, but it's definitely not dracut that's doing anything wrong. The problem is in the BLS snippets of the installed system:
[root@localhost-live liveuser]# cat /mnt/sysimage/boot/loader/entries/a69bd9379d6445668e7df3ddbda62f86-5.16.0-0.rc7.20211231git4f3d93c6eaff.52.fc36.x86_64.conf
title Fedora Linux (5.16.0-0.rc7.20211231git4f3d93c6eaff.52.fc36.x86_64) 36 (Workstation Edition Prerelease)
options root=live:CDLABEL=Fedora-WS-Live-rawh-20220103-n-0 rd.live.image quiet rhgb
Note that the "options" line is completely wrong for an installed system. It should look something like this (from my installed Rawhide system):
[root@xps13k nightlies]# cat /boot/loader/entries/cb5ab1ffc4424e98be5c03811cef31b6-5.16.0-0.rc2.18.fc36.x86_64.conf
title Fedora Linux (5.16.0-0.rc2.18.fc36.x86_64) 36 (Workstation Edition Prerelease)
options root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.luks.uuid=luks-6b57ca12-6deb-4032-9293-4748f072ae5f rd.lvm.lv=fedora/swap rhgb quiet pcie_aspm=off
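For anyone wanting to check other installs for the same symptom, here is a minimal sketch that pulls the `options` line out of a BLS snippet and flags arguments that only make sense on a live image. The helper names and the list of "live-only" prefixes are my own assumptions for illustration; this is not code from grub2, dracut or systemd.

```python
# Illustration only: prefixes of kernel arguments that belong to a live image.
# This list is an assumption based on the two snippets quoted above.
LIVE_ONLY_PREFIXES = ("root=live:", "rd.live.")


def bls_options(snippet_text):
    """Return the kernel arguments from the 'options' line of a BLS snippet."""
    for line in snippet_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "options":
            return parts[1].split()
    return []


def live_only_args(snippet_text):
    """Return the arguments that should not appear on an installed system."""
    return [arg for arg in bls_options(snippet_text)
            if arg.startswith(LIVE_ONLY_PREFIXES)]


# The two snippets quoted in this comment.
broken = (
    "title Fedora Linux (5.16.0-0.rc7.20211231git4f3d93c6eaff.52.fc36.x86_64) 36 (Workstation Edition Prerelease)\n"
    "options root=live:CDLABEL=Fedora-WS-Live-rawh-20220103-n-0 rd.live.image quiet rhgb\n"
)
installed = (
    "title Fedora Linux (5.16.0-0.rc2.18.fc36.x86_64) 36 (Workstation Edition Prerelease)\n"
    "options root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root "
    "rd.luks.uuid=luks-6b57ca12-6deb-4032-9293-4748f072ae5f "
    "rd.lvm.lv=fedora/swap rhgb quiet pcie_aspm=off\n"
)

print(live_only_args(broken))     # the two live-image arguments
print(live_only_args(installed))  # empty list
```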
The obvious thing in systemd that might be causing this is the kernel-install script. I'm looking into recent changes to that now.
OK, so I dug into this a bit further. The problem, it seems, is that /etc/machine-id and /etc/machine-info disagree about what the machine ID is.
In the broken case, the machine ID in /etc/machine-id does not match the ID used in the /boot/loader/entries files:
[root@localhost-live /]# cat /etc/machine-id
[root@localhost-live /]# ls /boot/loader/entries
The upshot of this is that when anaconda runs `grub2-mkconfig` after running `kernel-install` - which is expected to edit those files and 'fix' the options lines in them - it does not do this, because of the machine ID mismatch. Specifically, it's `get_sorted_bls()` in `/etc/grub.d/10_linux` that gets tripped up: it reads the machine ID from /etc/machine-id and uses it to find the BLS snippet files, but because the IDs don't match, it finds nothing.
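To make the failure mode concrete, here is a rough Python model of the lookup described above. The real `get_sorted_bls()` is a shell function in `/etc/grub.d/10_linux`; the function names, the two made-up machine IDs and the temporary directory layout below are all assumptions for illustration only.

```python
import tempfile
from pathlib import Path

# Made-up IDs standing in for the two mismatched values.
ID_IN_ENTRY_NAMES = "a" * 32
ID_IN_MACHINE_ID = "b" * 32


def read_machine_id(path):
    """Read a machine ID file and strip the trailing newline."""
    return Path(path).read_text().strip()


def find_bls_entries(entries_dir, machine_id):
    """Return the BLS snippet names whose file name starts with machine_id."""
    return sorted(p.name for p in Path(entries_dir).glob(f"{machine_id}-*.conf"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    entries = root / "boot/loader/entries"
    entries.mkdir(parents=True)
    (root / "etc").mkdir()

    # The snippet was named with one ID, but /etc/machine-id holds another.
    (entries / f"{ID_IN_ENTRY_NAMES}-5.16.0.conf").write_text("title example\n")
    (root / "etc/machine-id").write_text(ID_IN_MACHINE_ID + "\n")

    found_with_machine_id = find_bls_entries(
        entries, read_machine_id(root / "etc/machine-id"))
    found_with_entry_id = find_bls_entries(entries, ID_IN_ENTRY_NAMES)

print(found_with_machine_id)  # nothing found, so nothing gets fixed up
print(found_with_entry_id)    # the snippet is only found under the other ID
```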
So why do we have these different machine IDs? Well, the other one comes from /etc/machine-info:
[root@localhost-live /]# cat /etc/machine-info
and the reason it's now causing us trouble is this commit:
Which falls in exactly the right systemd delta (between 250~rc1 and 250~rc3) and changes kernel-install to prefer the ID in /etc/machine-info (if it exists) over the one in /etc/machine-id. So now systemd and grub2 disagree about where to read the machine ID from, hence the problem.
I don't know *why* the machine ID is different in those two files, yet. But they're both owned by systemd, and the offending commit is in systemd, so this is definitely systemd's fault. :D
Note the grub2 code here is part of a downstream patch:
It's not in upstream grub2.
CCing grub maintainers in case we decide the fix for this is to make grub2 read from /etc/machine-info too...
The new kernel-install behavior is to persist the current machine ID in /etc/machine-info if /etc/machine-id exists when kernel-install is executed. From then on it always uses the machine ID from /etc/machine-info. If no /etc/machine-id file exists when kernel-install is executed, we generate a new UUID and persist that in /etc/machine-info and always use that UUID.
I have two guesses about what could be happening:
1. No /etc/machine-id file is present when kernel-install is executed. kernel-install generates a new UUID and persists it in /etc/machine-info. Later on a machine ID is generated which leads to the mismatch.
2. A /etc/machine-id file is present when kernel-install is executed, but it is subsequently removed and regenerated after kernel-install is executed, leading to the mismatch.
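The behavior described above, and the first guess in particular, can be modeled roughly as follows. This is only a sketch under assumptions: kernel-install itself is a shell script, the key name `KERNEL_INSTALL_MACHINE_ID` is my assumption about how the ID is stored in /etc/machine-info, and `resolve_machine_id()` and the fake root directory are hypothetical.

```python
import tempfile
import uuid
from pathlib import Path

# Assumed name of the key used in /etc/machine-info.
KEY = "KERNEL_INSTALL_MACHINE_ID"


def read_info_id(info_path):
    """Return the persisted ID from a machine-info file, or None."""
    if not info_path.exists():
        return None
    for line in info_path.read_text().splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == KEY and value.strip():
            return value.strip()
    return None


def resolve_machine_id(root):
    """Prefer the persisted ID; otherwise persist machine-id or a new UUID."""
    root = Path(root)
    info = root / "etc/machine-info"
    machine_id = root / "etc/machine-id"

    persisted = read_info_id(info)
    if persisted:
        return persisted

    if machine_id.exists() and machine_id.read_text().strip():
        chosen = machine_id.read_text().strip()
    else:
        chosen = uuid.uuid4().hex  # no machine-id yet: invent one

    info.parent.mkdir(parents=True, exist_ok=True)
    existing = info.read_text() if info.exists() else ""
    if existing and not existing.endswith("\n"):
        existing += "\n"
    info.write_text(existing + f"{KEY}={chosen}\n")
    return chosen


# Guess 1: no /etc/machine-id when kernel-install first runs.
with tempfile.TemporaryDirectory() as tmp:
    first = resolve_machine_id(tmp)
    later_machine_id = "c" * 32  # made-up ID generated afterwards
    (Path(tmp) / "etc/machine-id").write_text(later_machine_id + "\n")
    second = resolve_machine_id(tmp)

print(first == second)             # True: the persisted ID keeps winning
print(second == later_machine_id)  # False: the two files now disagree
```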
I have not followed the issue in detail, but is this resolved by the second commit of https://github.com/systemd/systemd/pull/22013 ? Sorry for the noise, if it is not related.
Yu: I don't think so, no. This isn't about the fstab.
Daan: It's hard to see how either of those could happen at the time we're considering, because the grub2-mkconfig runs *immediately after* the kernel-install command, when both are run at the end of the live install process. I'll see what I can figure out, though.
Ah, I think I can see what's happening. The date on /etc/machine-info in my test install is 2022-01-03, which is the day *before* I did all the testing, but is the day on which the live image was generated. The date on /etc/machine-id is 2022-01-04, the day I did the testing.
So, I guess we're baking /etc/machine-info into the live image at build time and it's getting transferred to the installed system, but we're not doing that with /etc/machine-id - we strip that from the live image and let it get generated on startup (for the live environment) or (I think) during installation (for the installed system).
So I think we're in situation 1) here. One thing we should probably do is strip /etc/machine-info from the live images, the same way we strip /etc/machine-id, otherwise every system that boots or installs from the same live image will have the same machine ID in /etc/machine-info. But I think that still may not solve the problem, as kernel-install may still run and create /etc/machine-info before whatever generates /etc/machine-id in the installed system runs. Perhaps if kernel-install runs before /etc/machine-id exists, it should create it with the same ID as it writes to /etc/machine-info?
In the meantime I'll see if I can find where /etc/machine-id is stripped from live images, do the same for /etc/machine-info, and see if we're lucky with the ordering and that turns out to be enough to fix it.
So, after trying to follow the idea of the upstream PR, it seems to me the best thing to do is not to strip /etc/machine-info from the live image itself, but to have anaconda avoid including it when running the installation. So I wrote a patch to do that:
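The idea of skipping the file during the copy can be sketched like this. To be clear, this is not the anaconda patch and not how anaconda transfers the live image; `copy_image()`, the `EXCLUDED` set and the toy directory tree are hypothetical and only show the principle of excluding one path from a tree copy.

```python
import shutil
import tempfile
from pathlib import Path

# Paths (relative to the image root) that should not reach the installed system.
EXCLUDED = {Path("etc/machine-info")}


def copy_image(src, dst, excluded=EXCLUDED):
    """Copy the tree at src to dst, skipping the excluded relative paths."""
    src = Path(src)

    def ignore(directory, names):
        # copytree calls this once per directory with the names inside it.
        rel = Path(directory).relative_to(src)
        return [name for name in names if rel / name in excluded]

    shutil.copytree(src, dst, ignore=ignore, symlinks=True)


with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp) / "live"
    (live / "etc").mkdir(parents=True)
    (live / "etc/machine-info").write_text("baked in at image build time\n")
    (live / "etc/os-release").write_text("NAME=example\n")

    installed = Path(tmp) / "installed"
    copy_image(live, installed)

    copied_os_release = (installed / "etc/os-release").exists()
    copied_machine_info = (installed / "etc/machine-info").exists()

print(copied_os_release)    # True
print(copied_machine_info)  # False
```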
I also built an updates image with it:
It works in my testing. If you run an install from a current Rawhide live image after booting with `inst.updates=https://www.happyassassin.net/updates/2036199.1.img`, the installed system boots. It looks like kernel-install gets to create both /etc/machine-id and /etc/machine-info, as they have the exact same timestamp. This is still a change from however /etc/machine-id was getting created on live installs before, but hopefully shouldn't affect anything.
*** Bug 2038520 has been marked as a duplicate of this bug. ***
I don't think this should've been closed yet. The change landed in an anaconda build this morning, but we need a working compose to confirm it.
This is confirmed fixed in the compose we just got today.