Bug 2361624 - grubby --add-kernel unpredictably makes the new kernel the default or not
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: grubby
Version: 42
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Peter Jones
QA Contact: Fedora Extras Quality Assurance
URL: https://github.com/linux-system-roles...
Whiteboard:
Depends On:
Blocks:
 
Reported: 2025-04-22 10:29 UTC by Martin Pitt
Modified: 2025-05-13 07:25 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type: ---
Embargoed:



Description Martin Pitt 2025-04-22 10:29:56 UTC
The "bootloader" system role [1] can add a new kernel, and covers this in a test (tests_add_rm.yml). In high-level terms, it copies the current kernel image and initrd, and adds the copy as "Clone1". The role/test expects that the newly added kernel does not automatically become the default. This is true for RHEL 9/10 and earlier Fedoras, but no longer for Fedora 42: there, the newly added kernel *sometimes* becomes the new default and sometimes does not. It's not clear to me what triggers this (file system order? boot ID? something else?).

[1] https://github.com/linux-system-roles/bootloader

Reproducible: Sometimes

Steps to Reproduce:
We found this by running the integration test:

git clone https://github.com/linux-system-roles/bootloader/
cd bootloader
pip install "git+https://github.com/linux-system-roles/tox-lsr@main"
tox -e qemu-ansible-core-2.17 -- --image-name fedora-42 --log-level=debug tests/tests_add_rm.yml 


However, this is easier to reproduce on the CLI, outside of system-roles. Copy the kernel and add it:

cp /boot/vmlinuz-6.14.0-63.fc42.x86_64{,_clone1}
cp /boot/initramfs-6.14.0-63.fc42.x86_64.img{,_clone1}
grubby --initrd=/boot/initramfs-6.14.0-63.fc42.x86_64.img_clone1 --add-kernel=/boot/vmlinuz-6.14.0-63.fc42.x86_64_clone1 --title=Clone1 --args=test=kernel --copy-default

then check the default:

grubby --info=ALL
grubby --default-title

Actual Results:
In some boots, the new kernel becomes the default:

# grubby --info=ALL

index=0
kernel="/boot/vmlinuz-6.14.0-63.fc42.x86_64_clone1"
args="no_timer_check console=tty1 console=ttyS0,115200n8 systemd.firstboot=off rootflags=subvol=root test=kernel"
root="UUID=47f1b394-2df4-43ee-a671-66c16f7a8aeb"
initrd="/boot/initramfs-6.14.0-63.fc42.x86_64.img_clone1"
title="Clone1"
id="73885463b0f04e49a1fe3fe912f0319c-6.14.0-63.fc42.x86_64_clone1"
index=1
kernel="/boot/vmlinuz-6.14.0-63.fc42.x86_64"
args="no_timer_check console=tty1 console=ttyS0,115200n8 systemd.firstboot=off rootflags=subvol=root"
root="UUID=47f1b394-2df4-43ee-a671-66c16f7a8aeb"
initrd="/boot/initramfs-6.14.0-63.fc42.x86_64.img"
title="Fedora Linux (6.14.0-63.fc42.x86_64) 42 (Cloud Edition)"
id="1031471b08b14ad5b1af48a57b33dfed-6.14.0-63.fc42.x86_64"


# grubby --default-title
Clone1

Note that "grubby --default-index" is *always* 0.

Expected Results:
In some other runs, the newly added kernel is not the default, but the old one stays:

# grubby --info=ALL

index=0
kernel="/boot/vmlinuz-6.14.0-63.fc42.x86_64"
args="no_timer_check console=tty1 console=ttyS0,115200n8 systemd.firstboot=off rootflags=subvol=root"
root="UUID=47f1b394-2df4-43ee-a671-66c16f7a8aeb"
initrd="/boot/initramfs-6.14.0-63.fc42.x86_64.img"
title="Fedora Linux (6.14.0-63.fc42.x86_64) 42 (Cloud Edition)"
id="1031471b08b14ad5b1af48a57b33dfed-6.14.0-63.fc42.x86_64"
index=1
kernel="/boot/vmlinuz-6.14.0-63.fc42.x86_64_clone1"
args="no_timer_check console=tty1 console=ttyS0,115200n8 systemd.firstboot=off rootflags=subvol=root test=kernel"
root="UUID=47f1b394-2df4-43ee-a671-66c16f7a8aeb"
initrd="/boot/initramfs-6.14.0-63.fc42.x86_64.img_clone1"
title="Clone1"
id="95c42ca89e374fc39928b0f91c508ad6-6.14.0-63.fc42.x86_64_clone1"

# grubby --default-title 
Fedora Linux (6.14.0-63.fc42.x86_64) 42 (Cloud Edition)


Additional Information:
If I iterate this without a reboot, the order is always stable. I.e. clean up first with

grubby --remove-kernel=/boot/vmlinuz-6.14.0-63.fc42.x86_64_clone1
rm -f /boot/initramfs-6.14.0-63.fc42.x86_64.img_clone1 /boot/vmlinuz-6.14.0-63.fc42.x86_64_clone1
cp /boot/vmlinuz-6.14.0-63.fc42.x86_64{,_clone1}
cp /boot/initramfs-6.14.0-63.fc42.x86_64.img{,_clone1}

Doing this cleanup/add run repeatedly always yields the same order/result. Only a fresh boot of the VM changes it.

grubby-8.40-82.fc42.x86_64

Comment 1 Martin Pitt 2025-04-22 10:36:48 UTC
I tested rebooting the cloud image VM, and that doesn't change the order either. So this must be related to some file system or other noise that happens on first booting a cloud image.

Comment 2 Marta Lewandowska 2025-04-24 10:08:54 UTC
While pleading [almost] complete ignorance of bootloader system roles, I can assure you that in rhel9 & 10 and fedora, installing a new kernel has always made it the default regardless of whether you were up- or downgrading, and it's specified in /etc/sysconfig/kernel. Installation of a new kernel doesn't even invoke grubby AFAIK. Instead, the scripts in /usr/lib/kernel/install.d/ run.

If grubby --default-index always shows 0, then there could be something wrong with/in your grubenv. Here's a snippet of grubby code where the default index is determined:

get_default_index() {
    local default=""
    local index="-1"
    local title=""
    local version=""
    if [[ $bootloader = "grub2" ]]; then
        default="$(grep '^saved_entry=' ${env} | sed -e 's/^saved_entry=//')"
    else
        default="$(grep '^default=' ${zipl_config} | sed -e 's/^default=//')"
    fi

    if [[ -z $default ]]; then
        index=0
    elif [[ $default =~ ^[0-9]+$ ]]; then
        index="$default"
    fi

Also, in your grubby command you're missing --make-default in case you want the new kernel to be the default.

Please let me know if I'm missing or misunderstanding something (:

Comment 3 Martin Pitt 2025-04-24 10:27:06 UTC
Hello Marta!

> While pleading [almost] complete ignorance of bootloader system roles

Heh, me (almost) too -- that's why I created a totally separate reproducer.

> I can assure you that in rhel9 & 10 and fedora, installing a new kernel has always made it the default

No, not in RHEL 9/10 with `grubby --add-kernel`. The current system-roles test relies on that fact, otherwise the test fails (and it reliably passes in RHEL, just not in Fedora 42).

> Also, in your grubby command you're missing --make-default in case you want the new kernel to be the default.

No, the role (or rather, the test) does *not* want that.

> there could be something wrong with/in your grubenv

Maybe, but keep in mind that the test/reproducer never actually reboots or boots into that kernel.

Comment 4 Marta Lewandowska 2025-04-24 12:31:03 UTC
> Hello Marta!

:)
 
> No, not in RHEL 9/10 with `grubby --add-kernel`. The current system-roles
> test relies on that fact, otherwise the test fails (and it reliably passes
> in RHEL, just not in Fedora 42).

ok, so my ignorance is already in the way ;)  grubby has its own system of ordering kernels, with the newest one being first-- not most recently installed, but actually highest version-- and that gets index=0, but unless you --make-default it shouldn't get set to default. (in writing 'should' I'm actually playing around on an f42 VM)
 
> > there could be something wrong with/in your grubenv
> 
> Maybe, but keep in mind that the test/reproducer never actually reboots or
> boots into that kernel.

It doesn't have to; take a look at the code. If grubby doesn't find a saved_entry in grubenv, it defaults to index=0.
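The fallback can be tried in isolation. This is only a minimal sketch modelled on the snippet quoted in comment 2, not grubby's actual code: `default_index_from_env` is a hypothetical helper, and it reads a plain text file of key=value lines standing in for grubenv rather than a real grubenv block. The `lookup:` branch is a placeholder for the step where grubby would match the value against the BLS entries.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of grubby): mimics the saved_entry fallback
# against a plain-text file of key=value lines standing in for grubenv.
default_index_from_env() {
    local env_file="$1"
    local default=""
    default="$(grep '^saved_entry=' "$env_file" | sed -e 's/^saved_entry=//')"

    if [[ -z $default ]]; then
        echo 0                   # no saved_entry at all: fall back to index 0
    elif [[ $default =~ ^[0-9]+$ ]]; then
        echo "$default"          # a numeric saved_entry is used as the index
    else
        echo "lookup:$default"   # placeholder: would be matched against BLS entries
    fi
}
```

With a file containing only `boot_success=0`, as on the fresh cloud image, the helper prints 0, so whichever entry happens to be sorted first is reported as the default.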

Can you reproduce this on a normal VM? I can't so maybe it's something to do with the cloud VM or ... my bet is still on grubenv being screwed up!

Comment 5 Martin Pitt 2025-04-25 11:35:31 UTC
This reproduces it out of thin air, in the current F42 cloud VM image:

curl -o fedora.qcow2 -L https://download.fedoraproject.org/pub/fedora/linux/releases/42/Cloud/x86_64/images/Fedora-Cloud-Base-Generic-42-1.1.x86_64.qcow2
# nothing fancy, just admin:foobar and root:foobar
curl -L -O https://github.com/cockpit-project/bots/raw/main/machine/cloud-init.iso
qemu-system-x86_64 -cpu host -enable-kvm -nographic -m 2048 -drive file=fedora.qcow2,if=virtio -snapshot -cdrom cloud-init.iso -net nic,model=virtio -net user,hostfwd=tcp::2201-:22

Then you can log in as "root:foobar" in the VT (with a slightly crappy terminal, which is enough for the reproducer), or with `ssh -p 2201 admin@localhost` and `sudo -i` (for more elaborate investigation).

Use "Ctrl-a x" to exit the emulator.

I get the same unpredictable behaviour, in about half of the qemu boots Clone1 becomes the default.

Comment 6 Marta Lewandowska 2025-04-30 10:38:08 UTC
Hi Martin,

Thanks for the reproducer! I have not observed the clone sometimes becoming default and sometimes not, but rather, it always becomes the default for me... I tried the same thing using Fedora-Cloud-Base-Generic-41-1.4.x86_64.qcow2 and I get the same exact behavior. 

The reason is what I wrote in c#2: grubby is getting default index by checking what's listed as saved_entry in grubenv, and defaults to 0. If you `grub2-editenv - list` on your fresh image, there is nothing set there. Adding the new entry with grubby puts it at index=0, and since grubenv is unaffected by the grubby command, the index stays 0. (This behavior should be consistent across fedoras, rhels; it's nothing new)

If before the grubby --add-kernel command you first set the installed kernel to default (grubby --set-default /boot/kernel or grubby --set-default-index 0), then grubby will write that to grubenv, and subsequent grubby --add-kernel commands will work as you expect because even though clone1 will get index=0, the default index will reflect what's in grubenv.
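As a rough sketch of that workaround: pin whatever grubby currently reports as the default, then add the clone. The function name `add_kernel_keep_default` and the `GRUBBY` override variable are made up for illustration (the override only exists so the function can be exercised with a stub); the grubby options themselves are the ones already used in this report plus `--default-kernel`.

```shell
#!/usr/bin/env bash
# Sketch only: pin the current default explicitly, then add the new kernel.
# GRUBBY is a hypothetical override for testing; it defaults to the real grubby.
add_kernel_keep_default() {
    local kernel="$1" initrd="$2" title="$3"
    local grubby="${GRUBBY:-grubby}"
    local current

    # Ask grubby which kernel it currently considers the default ...
    current="$("$grubby" --default-kernel)" || return 1
    # ... and set it explicitly, which should record it in grubenv.
    "$grubby" --set-default="$current" || return 1
    # Only then add the clone; the recorded default should stay in place.
    "$grubby" --add-kernel="$kernel" --initrd="$initrd" --title="$title" --copy-default
}
```

It would be called like `add_kernel_keep_default /boot/vmlinuz-6.14.0-63.fc42.x86_64_clone1 /boot/initramfs-6.14.0-63.fc42.x86_64.img_clone1 Clone1`, before any other kernel has been added.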

What any of this has to do with bootloader system roles, I haven't a clue. but maybe something is missing in a config file or some grubby command is missing..?

Comment 7 Martin Pitt 2025-04-30 13:47:01 UTC
Hello Marta,

> maybe something is missing in a config file or some grubby command is missing

No, there's nothing missing. The reproducer is complete, there's no extra config mangling or --set-default etc. command happening anywhere (neither in the reproducer here nor in https://github.com/linux-system-roles/bootloader/). As mentioned, the role does *not* expect or want to make the new kernel the default one.

>If you `grub2-editenv - list` on your fresh image, there is nothing set there.

# grub2-editenv - list
boot_success=0

but I suppose you mean "nothing related to the default kernel".

> This behavior should be consistent across fedoras, rhels; it's nothing new

Comparing this to current CentOS 10:

# grub2-editenv - list
saved_entry=e08c73cba11e4f15b89f8f6532402839-6.12.0-72.el10.x86_64
menu_auto_hide=1
boot_success=0
boot_indeterminate=0

The reproducer:

cp /boot/vmlinuz-6.12.0-72.el10.x86_64{,_clone1}
cp /boot/initramfs-6.12.0-72.el10.x86_64.img{,_clone1}
grubby --initrd=/boot/initramfs-6.12.0-72.el10.x86_64.img_clone1 --add-kernel=/boot/vmlinuz-6.12.0-72.el10.x86_64_clone1 --title=Clone1 --args=test=kernel --copy-default
grubby --info=ALL
grubby --default-title

This consistently keeps the original kernel the default, clone1 never becomes the default.

So the unpredictability is new, and if as you say on your system clone1 consistently becomes the default, this is even easier to investigate then. (Note that it's unpredictable on Richard Megginson's and my laptop, as well as in GitHub actions -- so at least three environments).

Same with RHEL 7/8/9.

Thanks!

Comment 8 Marta Lewandowska 2025-05-02 14:58:28 UTC
Ok, either I'm doing a bad job of explaining, or we're somehow fundamentally misunderstanding each other... so forgive me, but I'll write things as simply as I can and maybe we can get to the root of the issue.

grubenv is a configuration file for GRUB, which keeps information from boot to boot, like the default boot entry, which is called saved_entry. You demonstrate that you have saved_entry set in your centos10 grubenv, but not in the cloud image grubenv. I observe the same, of course.

I don't know why saved_entry is not set, but it is also not set in the same Generic image for f40 and f41, and that's not a change in GRUB or grubby, but in how the image is created / GRUB is installed during image creation. If you install the same distro from rpms, 20-grub.install runs when the bootloader gets installed and adds the default kernel to grubenv as saved_entry.

Now, grubby uses the value of saved_entry to find which kernel is default: it sets its default-index to the correct value by comparing kernels in BLS files to saved_entry. In case this comparison fails, index gets set to 0. That's the grubby code snippet I included in comment#2.

You observed from the beginning that index is always 0, and of course that's because saved_entry is not set in grubenv.

[grubby reverse sorts kernels, so which kernel gets index=0 depends on its name. That might explain the unpredictability you mention. This normally does not matter, though, because default is not determined by number, but by name. The index is only there to make it easier to refer to a kernel, not having to write out its NVR]

Comment 9 Martin Pitt 2025-05-05 05:33:50 UTC
Hello Marta,

So, as I understand it, you consider the bug to be in the cloud image build process, since grubby relies on a `saved_entry` to function correctly? I filed https://pagure.io/fedora-infrastructure/issue/12542 but I suppose it will need some more detailed input from you.

That may be valid, but it is still suspicious that booting the exact same image without other external modifications results in two different behaviours:

> grubby reverse sorts kernels, so which kernel gets index=0 depends on its name. That might explain the unpredictability you mention.

If by "name" you mean file name or title, then no -- these never change. The default kernel ID is also always the same on that image (1031471b08b14ad5b1af48a57b33dfed-6.14.0-63.fc42.x86_64). The ID of the Clone1 kernel changes of course, but it is always asciibetically bigger than 103147... (just because that happens to be a very low number). I.e. whatever it sorts on, it's not a visible property. Perhaps file / inode number, i.e. the problem may be that it is *not* sorted explicitly?
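The "asciibetically bigger" point can be checked with the ids from the two runs in the description. This is just an illustration under the assumption of a plain byte-order reverse sort on the id; what grubby really sorts on is exactly what is unclear here. `first_in_reverse_sort` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints whichever argument comes first under a plain
# byte-order reverse sort (an assumed ordering, not necessarily grubby's).
first_in_reverse_sort() {
    printf '%s\n' "$@" | LC_ALL=C sort -r | head -n 1
}

original="1031471b08b14ad5b1af48a57b33dfed-6.14.0-63.fc42.x86_64"
# Run where Clone1 became the default:
clone_a="73885463b0f04e49a1fe3fe912f0319c-6.14.0-63.fc42.x86_64_clone1"
# Run where the original kernel stayed the default:
clone_b="95c42ca89e374fc39928b0f91c508ad6-6.14.0-63.fc42.x86_64_clone1"

first_in_reverse_sort "$original" "$clone_a"
first_in_reverse_sort "$original" "$clone_b"
```

Both calls print the clone id, so a reverse sort on the id alone would put Clone1 at index 0 in both runs and cannot explain the two different outcomes.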

Thanks!

Comment 10 Martin Pitt 2025-05-05 11:11:34 UTC
fedora-infrastructure was the wrong project. Filed https://pagure.io/cloud-sig/issue/441 instead.

Comment 11 Martin Pitt 2025-05-08 06:39:35 UTC
Still happens on current Fedora-Cloud-Base-Generic-42-20250507.0.x86_64.qcow2 after Neal's BLS fix. As the kernel changed, reproducer for convenience:

cp /boot/vmlinuz-6.14.5-300.fc42.x86_64{,_clone1}
cp /boot/initramfs-6.14.5-300.fc42.x86_64.img{,_clone1}
grubby --initrd=/boot/initramfs-6.14.5-300.fc42.x86_64.img_clone1 --add-kernel=/boot/vmlinuz-6.14.5-300.fc42.x86_64_clone1 --title=Clone1 --args=test=kernel --copy-default
grubby --default-title

The last command says "Clone1" (i.e. new kernel unexpectedly becomes the default) in about half of the boots.

Comment 12 Marta Lewandowska 2025-05-13 07:25:42 UTC
As long as saved_entry=$(machine-id)-$(kernel-nvr) isn't in grubenv, it will continue to fail.
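A quick way to check for this, as a sketch: `has_saved_entry` is a hypothetical helper that reads the key=value lines printed by `grub2-editenv - list` on stdin and succeeds only if a non-empty saved_entry is present.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads key=value lines on stdin (the format shown for
# `grub2-editenv - list` in comment 7) and succeeds only if a non-empty
# saved_entry line is present.
has_saved_entry() {
    grep -q '^saved_entry=.'
}

# Intended use on a real system (not executed here):
#   grub2-editenv - list | has_saved_entry \
#       || echo "saved_entry missing: grubby will fall back to index 0"
```

Fed the Fedora 42 cloud image output from comment 7 (`boot_success=0` only) it fails; fed the CentOS 10 output it succeeds.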

If Neal does not change this or does not think it should be changed, you can run `grubby --set-default-index 0` to get the behavior you expect or install kernels the usual way, with dnf, instead of with grubby.

