Bug 2031640 - grub2-2.06-10.fc35 broke booting into btrfs snapshots (UEFI)...as did other grub2-updates before
Summary: grub2-2.06-10.fc35 broke booting into btrfs snapshots (UEFI)...as did other g...
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: grub2
Version: 35
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Javier Martinez Canillas
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-12-13 06:16 UTC by Tom Gugel
Modified: 2022-12-13 16:05 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-12-13 16:05:50 UTC
Type: Bug
Embargoed:



Description Tom Gugel 2021-12-13 06:16:16 UTC
Hello,

as described already here https://bugzilla.redhat.com/show_bug.cgi?id=2030940#c8 onwards, the recent grub2 update broke booting into btrfs snapshots, yet again.

I don't know what specifically happens on Fedora during grub2 updates that causes this. But it happens when grub2-* packages are upgraded, and it renders the whole system and the concept of snapshots unusable. New snapshots taken after the grub2-* upgrade are not bootable either.
If on the other hand you reinstall the whole system everything works again, also with the new grub2-* versions. That is, until the next grub2-* update.

It seems to me that something happens on Fedora with the EFI partition/files and/or the /boot partition.
It might also be MB/CPU specific, I don't know. I spent hours on that yesterday and did not find a single fix.

The hardware is as follows: AMD Zenith II Extreme, AMD Threadripper 3960X, firmware TPM enabled in BIOS (you cannot disable the TPM completely on AMD).

The boot partition is a btrfs partition itself, separate from the root partition. The EFI partition is another partition (FAT32) and is correctly mounted at /boot/efi. The real grub.cfg is on the /boot partition in /boot/grub2, of course. /boot/efi/EFI/fedora/grub.cfg is the small file that it should be, "linked" to /boot/grub2/grub.cfg.


Steps to reproduce :
Install system with grub2-*-2.06-8, set up system snapshots, make boot entries for the snapshots (e.g. via grub-btrfs). Upgrade to grub2-*-2.06-10. 

Behaviour : You cannot boot into any snapshot anymore; for all of them you get
error: ../../grub-core/commands/efi/tpm.c:148:Unknown TPM error
error: ../../grub-core/loader/i386/efi/linux.c:208:you need to load the kernel first 
If you select a snapshot, get this error, and then try to boot into the active version, you get the same error and have to switch off/reset the whole PC. After that, the only thing you can do is boot into the active version again. Snapshots stay broken and unbootable. Not only old snapshots are affected; new snapshots you create are also unbootable, with the same error.

Delete the whole system, reinstall and set up snapshots with the most recent grub2-*-2.06-10 versions, take new snapshots, reboot into the snapshots -> everything works as if nothing ever happened.

Expected behavior : That the whole system does not have to be reinstalled after a simple grub update.

There must be something happening during these updates that affects the UEFI/TPM connection in some way.

Hope this can be investigated as it is unbearable to have to reinstall the whole system every time and also lose all snapshots.

Comment 1 Tom Gugel 2021-12-13 21:38:46 UTC
Hi, I will make this as short as I can :)
This is one of the craziest bugs/error messages I have ever seen in my IT life, and that life has been pretty long by now.

I tried to recreate this error today as I had to reinstall the prod workstation in order to be able to work. The plan was to go from a 2.06-6 version with a fresh install and then do an update (2.06-8 is not installable via dnf). But as I had a fresh install with 2.06-10 already there I did the following :

I had 2.06-10 installed
downgrade to 2.06-6, reboot ... ok
upgrade to 2.06-10, reboot ... ok
regenerate grub.cfg manually, reboot ... ok
change /etc/default/grub, regenerate, reboot ... ok
change /etc/default/grub differently, regenerate, reboot ... ok
reinstall shim*, reboot ... ok
reinstall grub2-common, reboot ... ok
downgrade grub2* to 2.06-6 again, versionlock grub2*, reboot ... UNKNOWN TPM ERROR, unfixable, out of the blue... but main version bootable
log in, check everything in /boot and /boot/efi against previously saved copies: everything ok except what I had changed in the configs
reinstall shim* grub2*, reboot ... UNKNOWN TPM ERROR, but main version bootable...
rmmod tpm, reboot ... UNKNOWN TPM ERROR

What did the two things have in common then? It happened on the prod machine on the update, it happened here on a downgrade, everything looks fine...
So it has nothing to do with the grub version? Why is the main version still bootable without the Unknown TPM error, but the snapshots are not?

Yeah well, the only thing the two had in common was: really a lot of menu entries built up in the snapshot submenu.
Ok, I deleted half of them manually...
Unknown TPM error gone...for good.
So, if a submenu gets too big, grub reacts with the Unknown TPM error.
I have not checked how many entries it can digest before it dies, but I might find out eventually by allowing one more entry from time to time.
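
To see how many entries each submenu currently holds, something like the following Python sketch could be used. It is only an illustration: the function name count_submenu_entries and the sample config are made up for this example, and the parsing is deliberately naive (it only recognises "submenu"/"menuentry" lines that end in "{" and closing lines that consist of a single "}").

```python
import re
from collections import Counter

# Matches the start of a block line such as:  submenu 'Title' {   or   menuentry "Title" --class x {
BLOCK_START = re.compile(r"""^(submenu|menuentry)\s+(['"])(.*?)\2""")

# Made-up sample config, for illustration only.
SAMPLE = """\
menuentry 'Fedora Linux' {
    linux /vmlinuz
}
submenu 'Snapshots' {
    menuentry 'snapshot 1' {
        linux /vmlinuz
    }
    menuentry 'snapshot 2' {
        linux /vmlinuz
    }
}
"""


def count_submenu_entries(cfg_text):
    """Count menuentry blocks per enclosing submenu (hypothetical helper).

    Entries outside any submenu are counted under "(top level)".
    Limitation: other brace blocks nested inside a submenu (for example a
    function definition) would confuse this simple block tracking.
    """
    counts = Counter()
    open_blocks = []  # stack of (kind, title) for the blocks currently open
    for raw in cfg_text.splitlines():
        line = raw.strip()
        match = BLOCK_START.match(line)
        if match and line.endswith("{"):
            kind, title = match.group(1), match.group(3)
            if kind == "menuentry":
                # Attribute the entry to the innermost open submenu, if any.
                parent = next(
                    (t for k, t in reversed(open_blocks) if k == "submenu"),
                    "(top level)",
                )
                counts[parent] += 1
            open_blocks.append((kind, title))
        elif line == "}" and open_blocks:
            open_blocks.pop()
    return counts


if __name__ == "__main__":
    for title, n in count_submenu_entries(SAMPLE).most_common():
        print(f"{n:5d}  {title}")
```

To run it against a real configuration, pass the file contents instead of SAMPLE, e.g. the text of /boot/grub2/grub.cfg (reading it usually needs root). If the snapshot entries live in a separate file that grub.cfg sources, that file would be the one to inspect.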

Comment 2 Tom Gugel 2021-12-16 19:09:17 UTC
The bug can be easily reproduced even on a fresh Fedora 35 install.
Please see this discussion for further information and also a demonstration of the bug. It can be done with any menu with "enough" entries...

https://github.com/Antynea/grub-btrfs/issues/190#issuecomment-993806674
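
A rough Python sketch for generating such a test menu with an arbitrary number of entries is below. The helper name make_test_submenu, the titles and the dummy "echo" bodies are all made up for illustration, and whether dummy entries like these trigger the error the same way real snapshot entries do is an assumption that has not been verified here.

```python
def make_test_submenu(n, title="Test submenu"):
    """Return GRUB config text for one submenu holding n dummy entries.

    Hypothetical helper: each entry only echoes a message and boots nothing.
    """
    lines = [f"submenu '{title}' {{"]
    for i in range(1, n + 1):
        lines.append(f"    menuentry 'Dummy entry {i}' {{")
        lines.append("        echo 'dummy entry selected'")
        lines.append("    }")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a submenu with 200 entries; the number is arbitrary.
    print(make_test_submenu(200), end="")
```

One way to try it, assuming a standard Fedora layout, would be to append the output to /etc/grub.d/40_custom and regenerate /boot/grub2/grub.cfg with grub2-mkconfig, then raise or lower the count between reboots to narrow down where the error starts.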

Comment 3 Ben Cotton 2022-11-29 17:29:14 UTC
This message is a reminder that Fedora Linux 35 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '35'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 35 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 4 Ben Cotton 2022-12-13 16:05:50 UTC
Fedora Linux 35 entered end-of-life (EOL) status on 2022-12-13.

Fedora Linux 35 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

