Bug 2036199 - Systems installed from live images do not boot due to incorrect kernel parameters (root= argument intended for live environment present in installed system)
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: anaconda
Version: rawhide
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Anaconda Maintenance Team
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 2038520
Depends On:
Blocks: BetaBlocker, F36BetaBlocker
 
Reported: 2021-12-30 11:26 UTC by Ahmed Almeleh
Modified: 2022-01-12 06:22 UTC (History)

Fixed In Version: anaconda-36.14-1.fc36
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-12 06:22:09 UTC
Type: Bug


Attachments
Systemd is stuck and not booting (6.31 KB, image/png)
2021-12-30 11:26 UTC, Ahmed Almeleh

Description Ahmed Almeleh 2021-12-30 11:26:23 UTC
Created attachment 1848319
Systemd is stuck and not booting

Description of problem:
Systemd is stuck at the CLI and the system does not boot.

Version-Release number of selected component (if applicable):
36

How reproducible:
very

Steps to Reproduce:
1. Download the Fedora 36 Rawhide 20211229.n.1 Base image from the test results.
2. Create a VM with 2 GB and 2 cores on VMware WS Pro.
3. Install Fedora to the VM. Do not create a user account during installation.

Actual results:
The initial setup utility did not run and the system failed to boot.


Expected results:
The initial setup utility runs and, once an account has been created, the system boots successfully.

Additional info:

Comment 1 Zbigniew Jędrzejewski-Szmek 2021-12-30 11:29:42 UTC
The message is from dracut. This appears to be some miscommunication between dracut and how the live image is set up…

Comment 2 Fedora Blocker Bugs Application 2021-12-30 11:35:03 UTC
Proposed as a Blocker for 36-beta by Fedora user ahmedalmeleh using the blocker tracking app because:

Systemd is stuck at the CLI and the system does not boot. It has been like that for 20 minutes. The initial setup utility never ran.

I think it failed: Expected_installed_system_boot_behavior

It didn't give me the option to create a user account before the installation or after. 

It failed:
1. A working mechanism to create a user account must be clearly presented during installation and/or first boot of the installed system.

2. A system installed with a release-blocking desktop must boot to a log in screen where it is possible to log in to a working desktop using a user account created during installation or a 'first boot' utility.

3. If a utility for creating user accounts and other configuration is configured to launch, it must be visible within 10 seconds of the first boot reaching the launch point

Comment 3 Adam Williamson 2021-12-30 17:25:58 UTC
Zbigniew: note to be clear - the failure happens *on boot of a system installed from the live image*, not on boot of the live image itself. It seems these kernel args are getting passed through to the installed system environment when they should not be.

dracut was built on Dec 20th, but the change does not look like it could be relevant:

https://src.fedoraproject.org/rpms/dracut/c/76eb28fc2ef2f9e43b5ea66d0b9c96f83e124d4b?branch=rawhide

grubby hasn't been touched lately, neither has anaconda, neither AFAICS have the relevant kernel bits. This does kinda seem to leave systemd as a candidate. The build that's actually in affected composes is systemd-250~rc3-1.fc36 , because later builds got untagged as they were causing compose failures IIRC.

Comment 4 Adam Williamson 2022-01-04 21:29:34 UTC
So, I'm still digging into this, but it's definitely not dracut that's doing anything wrong. The problem is in the BLS snippets of the installed system:

[root@localhost-live liveuser]# cat /mnt/sysimage/boot/loader/entries/a69bd9379d6445668e7df3ddbda62f86-5.16.0-0.rc7.20211231git4f3d93c6eaff.52.fc36.x86_64.conf 
title Fedora Linux (5.16.0-0.rc7.20211231git4f3d93c6eaff.52.fc36.x86_64) 36 (Workstation Edition Prerelease)
version 5.16.0-0.rc7.20211231git4f3d93c6eaff.52.fc36.x86_64
linux /vmlinuz-5.16.0-0.rc7.20211231git4f3d93c6eaff.52.fc36.x86_64
initrd /initramfs-5.16.0-0.rc7.20211231git4f3d93c6eaff.52.fc36.x86_64.img
options root=live:CDLABEL=Fedora-WS-Live-rawh-20220103-n-0 rd.live.image quiet rhgb
grub_users $grub_users
grub_arg --unrestricted
grub_class fedora

Note that the "options" line is completely wrong for an installed system. It should look something like this (from my installed Rawhide system):

[root@xps13k nightlies]# cat /boot/loader/entries/cb5ab1ffc4424e98be5c03811cef31b6-5.16.0-0.rc2.18.fc36.x86_64.conf 
title Fedora Linux (5.16.0-0.rc2.18.fc36.x86_64) 36 (Workstation Edition Prerelease)
version 5.16.0-0.rc2.18.fc36.x86_64
linux /vmlinuz-5.16.0-0.rc2.18.fc36.x86_64
initrd /initramfs-5.16.0-0.rc2.18.fc36.x86_64.img
options root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.luks.uuid=luks-6b57ca12-6deb-4032-9293-4748f072ae5f rd.lvm.lv=fedora/swap rhgb quiet pcie_aspm=off
grub_users $grub_users
grub_arg --unrestricted
grub_class fedora

The obvious thing in systemd that might be causing this is the kernel-install script. I'm looking into recent changes to that now.
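
To make the difference between the two "options" lines above concrete, here is a minimal Python sketch that flags live-only arguments in a BLS entry. The helper names and the list of prefixes are my own illustration, based only on the two entries quoted in this comment; this is not code from anaconda, grub2 or systemd.

```python
# Kernel arguments that only make sense in the live environment.
# Illustrative list taken from the broken entry above, not exhaustive.
LIVE_ONLY_PREFIXES = ("root=live:", "rd.live.image")


def bls_options(entry_text):
    """Return the kernel arguments from the 'options' line of a BLS entry."""
    for line in entry_text.splitlines():
        key, _, value = line.strip().partition(" ")
        if key == "options":
            return value.split()
    return []


def live_only_args(entry_text):
    """Return the arguments that should not appear on an installed system."""
    return [arg for arg in bls_options(entry_text)
            if arg.startswith(LIVE_ONLY_PREFIXES)]


# Shortened copies of the two entries quoted above.
broken = """\
title Fedora Linux
options root=live:CDLABEL=Fedora-WS-Live-rawh-20220103-n-0 rd.live.image quiet rhgb
grub_class fedora
"""
good = "options root=/dev/mapper/fedora-root ro rhgb quiet\n"

print(live_only_args(broken))  # the two live-only arguments
print(live_only_args(good))    # empty list
```

Run against the files in /mnt/sysimage/boot/loader/entries at the end of a live install, a check like this would report the leftover root=live: argument before the first reboot.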

Comment 5 Adam Williamson 2022-01-05 01:09:15 UTC
OK, so I dug into this a bit further. The problem, it seems, is that /etc/machine-id and /etc/machine-info disagree about what the machine ID is.

In the broken case, the machine ID in /etc/machine-id does not match the ID used in the /boot/loader/entries files:

[root@localhost-live /]# cat /etc/machine-id
b8d80a4c887c40199c4ea1a8f02aa9b4
[root@localhost-live /]# ls /boot/loader/entries
a69bd9379d6445668e7df3ddbda62f86-0-rescue.conf
a69bd9379d6445668e7df3ddbda62f86-5.16.0-0.rc7.20211231git4f3d93c6eaff.52.fc36.x86_64.conf

The upshot is that when anaconda runs `grub2-mkconfig` after running `kernel-install`, it does not edit those files and 'fix' the options lines in them as it is expected to, because of the machine ID mismatch. Specifically, it's `get_sorted_bls()` in `/etc/grub.d/10_linux` that gets tripped up: it reads the machine ID from /etc/machine-id and uses it to find the BLS snippet files, but because the IDs don't match, it finds nothing.
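
A rough Python model of that lookup, for illustration only. The real code is shell in 10_linux; the assumption here (mine, not checked against the script) is that entries are matched by a `<machine-id>-` file name prefix. `entries_for_machine` is a hypothetical helper.

```python
def entries_for_machine(entry_names, machine_id):
    """Return the entry file names that start with '<machine_id>-'."""
    prefix = machine_id + "-"
    return [name for name in entry_names if name.startswith(prefix)]


# Values copied from the broken install shown above.
entries = [
    "a69bd9379d6445668e7df3ddbda62f86-0-rescue.conf",
    "a69bd9379d6445668e7df3ddbda62f86-5.16.0-0.rc7.20211231git4f3d93c6eaff.52.fc36.x86_64.conf",
]
etc_machine_id = "b8d80a4c887c40199c4ea1a8f02aa9b4"
id_in_file_names = entries[0].split("-", 1)[0]

print(entries_for_machine(entries, etc_machine_id))         # nothing matches
print(len(entries_for_machine(entries, id_in_file_names)))  # both entries match
```

With the ID from /etc/machine-id the list is empty, so there is nothing for the options rewrite to operate on.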

So why do we have these different machine IDs? Well, the other one comes from /etc/machine-info:

[root@localhost-live /]# cat /etc/machine-info 
KERNEL_INSTALL_MACHINE_ID=a69bd9379d6445668e7df3ddbda62f86

and the reason it's now causing us trouble is this commit:

https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257

Which falls in exactly the right systemd delta (between 250~rc1 and 250~rc3) and changes kernel-install to prefer the ID in /etc/machine-info (if it exists) over the one in /etc/machine-id. So now systemd and grub2 disagree about where to read the machine ID from, hence the problem.
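
The disagreement can be sketched in a few lines of Python. This is only a model of the two lookups as described in this comment, not the actual kernel-install or grub2 code (both are shell); the function names are made up for the example.

```python
def parse_machine_info(text):
    """Parse KEY=VALUE lines as found in /etc/machine-info."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key.strip()] = value.strip().strip('"')
    return info


def kernel_install_id(machine_info_text, machine_id_text):
    """Model: prefer KERNEL_INSTALL_MACHINE_ID, fall back to /etc/machine-id."""
    info = parse_machine_info(machine_info_text)
    return info.get("KERNEL_INSTALL_MACHINE_ID") or machine_id_text.strip()


def grub_id(machine_id_text):
    """Model: the downstream grub2 BLS code only looks at /etc/machine-id."""
    return machine_id_text.strip()


# File contents copied from the broken install above.
machine_info = "KERNEL_INSTALL_MACHINE_ID=a69bd9379d6445668e7df3ddbda62f86\n"
machine_id = "b8d80a4c887c40199c4ea1a8f02aa9b4\n"

print(kernel_install_id(machine_info, machine_id))
print(grub_id(machine_id))
```

When /etc/machine-info carries no KERNEL_INSTALL_MACHINE_ID, both functions return the same value, which matches the behaviour before the commit.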

I don't know *why* the machine ID is different in those two files, yet. But they're both owned by systemd, and the offending commit is in systemd, so this is definitely systemd's fault. :D

Comment 6 Adam Williamson 2022-01-05 01:15:27 UTC
Note the grub2 code here is part of a downstream patch:

https://src.fedoraproject.org/rpms/grub2/blob/rawhide/f/0062-Add-BLS-support-to-grub-mkconfig.patch

It's not in upstream grub2.

Comment 7 Adam Williamson 2022-01-05 01:34:44 UTC
CCing grub maintainers in case we decide the fix for this is to make grub2 read from /etc/machine-info too...

Comment 8 Daan De Meyer 2022-01-05 13:56:40 UTC
The new kernel-install behavior is to persist the current machine ID in /etc/machine-info if /etc/machine-id exists when kernel-install is executed. From then on it always uses the machine ID from /etc/machine-info. If no /etc/machine-id file exists when kernel-install is executed, we generate a new UUID and persist that in /etc/machine-info and always use that UUID.

I have two guesses about what could be happening:

1. No /etc/machine-id file is present when kernel-install is executed. kernel-install generates a new UUID and persists it in /etc/machine-info. Later on a machine ID is generated which leads to the mismatch.
2. A /etc/machine-id file is present when kernel-install is executed, but it is subsequently removed and regenerated after kernel-install is executed, leading to the mismatch.
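
Both guesses can be played through with a small Python model of the behaviour described in this comment. This is a sketch, not the kernel-install implementation: a dict stands in for /etc, and `first_boot_generates_machine_id` is a hypothetical stand-in for whatever creates /etc/machine-id later.

```python
import uuid

KEY = "KERNEL_INSTALL_MACHINE_ID"


def kernel_install_machine_id(etc):
    """Return the ID kernel-install would use, persisting it on first use."""
    info = etc.get("machine-info", "")
    for line in info.splitlines():
        if line.startswith(KEY + "="):
            return line.split("=", 1)[1]  # already persisted: always reuse it
    # Nothing persisted yet: take machine-id if present, else a fresh UUID.
    machine_id = etc.get("machine-id", "").strip() or uuid.uuid4().hex
    if info and not info.endswith("\n"):
        info += "\n"
    etc["machine-info"] = info + f"{KEY}={machine_id}\n"
    return machine_id


def first_boot_generates_machine_id(etc):
    """Hypothetical stand-in for whatever writes /etc/machine-id later."""
    etc["machine-id"] = uuid.uuid4().hex + "\n"


# Guess 1: machine-id is missing when kernel-install runs.
etc1 = {}
id1 = kernel_install_machine_id(etc1)
first_boot_generates_machine_id(etc1)

# Guess 2: machine-id exists, then is removed and regenerated.
etc2 = {"machine-id": uuid.uuid4().hex + "\n"}
id2 = kernel_install_machine_id(etc2)
first_boot_generates_machine_id(etc2)

for label, etc, used in (("guess 1", etc1, id1), ("guess 2", etc2, id2)):
    print(label, "mismatch:", used != etc["machine-id"].strip())
```

In both runs the ID persisted in machine-info survives while machine-id changes, so later kernel-install calls keep using an ID that no longer matches /etc/machine-id.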

Comment 9 Yu Watanabe 2022-01-05 14:31:42 UTC
I have not followed the issue in detail, but is this resolved by the second commit of https://github.com/systemd/systemd/pull/22013 ? Sorry for the noise, if it is not related.

Comment 10 Adam Williamson 2022-01-05 16:54:18 UTC
Yu: I don't think so, no. This isn't about the fstab.

Daan: It's hard to see how either of those could happen at the time we're considering, because the grub2-mkconfig runs *immediately after* the kernel-install command, when both are run at the end of the live install process. I'll see what I can figure out, though.

Comment 11 Adam Williamson 2022-01-05 17:22:33 UTC
Ah, I think I can see what's happening. The date on /etc/machine-info in my test install is 2022-01-03, which is the day *before* I did all the testing, but is the day on which the live image was generated. The date on /etc/machine-id is 2022-01-04, the day I did the testing.
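
For anyone who wants to repeat the timestamp comparison, a minimal Python sketch follows. The helper names are made up, and the demo uses temporary stand-in files with the dates from this comment (the 12:00 UTC time of day is an arbitrary assumption) rather than the real files in /etc.

```python
import datetime
import os
import tempfile


def mtime_date(path):
    """Return the modification date of path, in UTC."""
    ts = os.stat(path).st_mtime
    return datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc).date()


def written_same_day(path_a, path_b):
    """True if both files were last modified on the same UTC date."""
    return mtime_date(path_a) == mtime_date(path_b)


def _touch(path, when):
    """Create an empty file and set its modification time to `when`."""
    with open(path, "w"):
        pass
    ts = when.timestamp()
    os.utime(path, (ts, ts))


with tempfile.TemporaryDirectory() as tmp:
    info = os.path.join(tmp, "machine-info")
    mid = os.path.join(tmp, "machine-id")
    utc = datetime.timezone.utc
    _touch(info, datetime.datetime(2022, 1, 3, 12, 0, tzinfo=utc))  # image build day
    _touch(mid, datetime.datetime(2022, 1, 4, 12, 0, tzinfo=utc))   # test day
    dates = (mtime_date(info), mtime_date(mid))
    same = written_same_day(info, mid)

print(dates, same)
```

On an installed system the call would be `written_same_day("/etc/machine-id", "/etc/machine-info")`.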

So, I guess we're baking /etc/machine-info into the live image at build time and it's getting transferred to the installed system, but we're not doing that with /etc/machine-id - we strip that from the live image and let it get generated on startup (for the live environment) or (I think) during installation (for the installed system).

So I think we're in situation 1) here. One thing we should probably do is strip /etc/machine-info from the live images, the same way we strip /etc/machine-id, otherwise every system that boots or installs from the same live image will have the same machine ID in /etc/machine-info. But I think that still may not solve the problem, as kernel-install may still run and create /etc/machine-info before whatever generates /etc/machine-id in the installed system runs. Perhaps if kernel-install runs before /etc/machine-id exists, it should create it with the same ID as it writes to /etc/machine-info ?

In the meantime I'll see if I can find where /etc/machine-id is stripped from live images, do the same for /etc/machine-info , and see if we're lucky with the ordering and that turns out to be enough to fix it.

Comment 12 Adam Williamson 2022-01-05 20:14:12 UTC
So, after trying to follow the idea of the upstream PR, it seems to me the best thing to do is not to strip /etc/machine-info from the live image itself, but to have anaconda avoid including it when running the installation. So I wrote a patch to do that:

https://github.com/rhinstaller/anaconda/pull/3770

I also built an updates image with it:

https://www.happyassassin.net/updates/2036199.1.img

It works in my testing. If you run an install from a current Rawhide live image after booting with `inst.updates=https://www.happyassassin.net/updates/2036199.1.img`, the installed system boots. It looks like kernel-install gets to create both /etc/machine-id and /etc/machine-info , as they have the exact same timestamp. This is still a change from however /etc/machine-id was getting created on live installs before, but hopefully shouldn't affect anything.
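
The idea of leaving /etc/machine-info out when the live root is copied to the target can be illustrated with a short Python sketch. It is not the code from the PR above, and it does not use anaconda's actual copy mechanism; `copy_root` and `EXCLUDED` are hypothetical names, and the demo runs on a throwaway directory tree.

```python
import os
import shutil
import tempfile

EXCLUDED = {"etc/machine-info"}  # paths relative to the source root


def copy_root(src_root, dst_root, excluded=EXCLUDED):
    """Copy src_root to dst_root, skipping the excluded relative paths."""
    def ignore(directory, names):
        rel_dir = os.path.relpath(directory, src_root)
        skipped = set()
        for name in names:
            rel = os.path.normpath(os.path.join(rel_dir, name))
            if rel in excluded:
                skipped.add(name)
        return skipped

    shutil.copytree(src_root, dst_root, ignore=ignore)


with tempfile.TemporaryDirectory() as tmp:
    # Fake live root with one file to keep and one to leave behind.
    live = os.path.join(tmp, "live")
    os.makedirs(os.path.join(live, "etc"))
    for name in ("machine-info", "os-release"):
        with open(os.path.join(live, "etc", name), "w") as handle:
            handle.write("placeholder\n")
    target = os.path.join(tmp, "sysimage")
    copy_root(live, target)
    copied = sorted(os.listdir(os.path.join(target, "etc")))

print(copied)
```

Matching on the path relative to the root, rather than on the bare file name, keeps unrelated files that happen to be called machine-info elsewhere in the tree.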

Comment 13 Chris Murphy 2022-01-09 17:37:43 UTC
*** Bug 2038520 has been marked as a duplicate of this bug. ***

Comment 14 Adam Williamson 2022-01-10 22:53:44 UTC
I don't think this should've been closed yet. The change landed in an anaconda build this morning, but we need a working compose to confirm it.

Comment 15 Adam Williamson 2022-01-12 06:22:09 UTC
This is confirmed fixed in the compose we just got today.

