Bug 2254370 - GRUB2 Fails to boot off XFS Partition
Summary: GRUB2 Fails to boot off XFS Partition
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: grub2
Version: 39
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Nicolas Frayer
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-12-13 15:40 UTC by Shane Hart
Modified: 2024-05-03 01:34 UTC
CC List: 11 users

Fixed In Version: grub2-2.06-116.fc39 grub2-2.06-114.fc38 grub2-2.06-121.fc40 grub2-2.06-120.fc39 grub2-2.06-118.fc38
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-02-05 01:24:21 UTC
Type: ---
Embargoed:


Attachments
XFS metadump of /boot GRUB 2.06-109 (643.25 KB, application/octet-stream)
2023-12-15 01:20 UTC, Shane Hart
Truncated grub debug="all" log (7.92 KB, text/plain)
2024-01-01 21:10 UTC, Fernando Datz

Description Shane Hart 2023-12-13 15:40:05 UTC
When GRUB loads, it starts filling the screen with:

error: ../../grub-core/fs/xfs.c:541:not a correct XFS inode.

Only way out is to reboot.

Reproducible: Always

Steps to Reproduce:
1. Update GRUB to -110
2. Boot Fedora from XFS /boot partition
3. Sadness

Actual Results:  
The message below fills the screen:

error: ../../grub-core/fs/xfs.c:541:not a correct XFS inode.

I attached the disk to another computer and re-ran grub2-install and grub2-mkconfig, but the message still occurred.

Downgrading to -109 and rerunning grub2-install / grub2-mkconfig fixed the issue.

Expected Results:  
Booted successfully.

Since the -110 revision dealt with XFS patches it seems that something has gone awry.

Just some background: I have 8 Fedora VMs running in an OpenStack environment. 7 of them are basically identical and are used as "job" nodes that people don't usually log in to. The remaining one is an "interactive" style node that interacts with users. All of them were initially Fedora 36 and have been upgraded over the years to 39. They all use MSDOS partition tables on virtualized block devices. On 7 of them /boot is on the same partition as /; on the other one /boot is a separate partition.

All of them experienced this problem.

Comment 1 Nicolas Frayer 2023-12-14 14:34:33 UTC
Hello Shane,

Which fedora image did you use for the initial install (f36) ? Is it cloud or server ?

Comment 2 Shane Hart 2023-12-14 15:49:48 UTC
Nicolas,

I used the Fedora 36 net install ISO (Fedora-Server-netinst-x86_64-36-1.5.iso).  Have been upgrading with dnf since.

Comment 3 Nicolas Frayer 2023-12-14 17:36:05 UTC
Shane,

Would it be possible for you to use xfs_metadump [1] (from the xfsprogs package) to dump the XFS filesystem meta info of the /boot partition ?
You would have to run:
# xfs_metadump -o /dev/$BOOTDEVICE $METADUMPFILENAME; bzip2 $METADUMPFILENAME
and send us the resulting file.

The meta data can only be dumped from a read-only or unmounted filesystem so I would suggest you boot into "rescue" and do it from there.

This tool doesn't dump any data from the files content [2], only the meta data of the filesystem.
"-o" disables filename obfuscation. If you are concerned about confidentiality, you can omit "-o" and we won't see filenames.
I would run this on the VM that has a separate partition for /boot so you won't have to worry about private data.

If you have any issues with the process, please let me know.

Regards,

Nicolas

[1] https://man7.org/linux/man-pages/man8/xfs_metadump.8.html
[2] https://man7.org/linux/man-pages/man8/xfs_metadump.8.html#DESCRIPTION

Comment 4 Shane Hart 2023-12-14 20:33:31 UTC
It might take until the weekend to get the metadata from the small /boot only partition.  I can get the metadata from one of the / (that includes /boot) tomorrow morning.  Those ones I don't care about data, they just have a pretty vanilla Fedora setup with a Ceph mount.

Also, I have a .qcow2 image that I copied, booted into, updated to -110, and it also exhibits the GRUB failure through the same failure mode.  It's also a "unified" / + /boot partition, but maybe I can just dump that?

Comment 5 Shane Hart 2023-12-15 01:19:08 UTC
Nicolas,

I realized I could just unmount /boot, do the dump, and mount it again.  How do I send the gzipped data to you?  It's 644 KB and is of the /boot partition with grub 2.06-109 (not -110).

Comment 6 Shane Hart 2023-12-15 01:20:10 UTC
Created attachment 2004363
XFS metadump of /boot GRUB 2.06-109

XFS metadata dump from /boot partition that exhibits problem.  /boot has GRUB 2.06-109 installed (not -110).

Comment 7 Nicolas Frayer 2023-12-15 08:53:07 UTC
Thanks for sending the dump, I'll take a look.

Comment 8 Nicolas Frayer 2023-12-15 16:10:15 UTC
Hi Shane,

I am still going through the metadata but haven't been able to reproduce the error on my side yet. Could you provide the qcow2 image you mentioned earlier?
Also, it would be great if you could enable the debug output for XFS when grub is running.
To do so, when you are presented with the grub menu, press 'e' on the entry you want to launch.
Next, add "set debug=xfs" then Ctrl+x (to boot).
If you could please capture the output and send it to me that'd be great.
Last but not least, are you using bios or efi boot on your VMs ?

Comment 9 Shane Hart 2023-12-15 17:43:39 UTC
Nicolas,

I'll see what I can do regarding the boot option (I imagine I'll have to set up a screen recorder?  Or is there a way to dump the text?).

As far as the qcow2 image, I can definitely give it to you; it just has Fedora packages and some stuff installed in /usr/local (which I can remove).  However, it isn't small....dozens of GB.  I imagine that can't be uploaded here?

I'm using BIOS to boot the VMs.

Comment 10 Shane Hart 2023-12-15 17:48:30 UTC
Nicolas,

Actually, the debug dump might be possible (at least with the -110 revision).  It gives the "Welcome to GRUB!" message and immediately starts spewing:

error: ../../grub-core/fs/xfs.c:541:not a correct XFS inode.

So there is no boot menu.  Would it be useful to get the -109 debug dump?

Comment 11 Shane Hart 2023-12-15 19:59:51 UTC
I dd'ed out the boot partition, updated to -110 and dd'ed out that boot partition.  I attached both to a simple VM and booted, the -109 shows the GRUB menu and just hangs when you pick something (expected since there's no root file system anymore).  The -110 shows the XFS bug spam for a little bit, then a new error saying it's gone past the partition.  I imagine that's just some artifact of how I set this up.

The image files are about 258 MB each (compressed).  Would they be useful?  I'm trying to find somewhere to upload them.  Box has a limit of 250 MB for a free account.

Comment 12 Shane Hart 2023-12-15 20:07:06 UTC
I think I got it up on google drive:

https://drive.google.com/drive/folders/1AxRy87eyENZ1SZSJepb0lpL2UO_NU7RK?usp=sharing

Comment 13 Marta Lewandowska 2023-12-18 08:20:02 UTC
(In reply to Shane Hart from comment #10)
> Nicolas,
> 
> Actually, the debug dump might be possible (at least with the -110
> revision).  It gives the "Welcome to GRUB!" message and immediately starts
> spewing:
> 
> error: ../../grub-core/fs/xfs.c:541:not a correct XFS inode.
> 
> So there is no boot menu.  Would it be useful to get the -109 debug dump?

Hi Shane,
you can turn on the debugging when you have -109 installed and then update to -110 and reboot. You just need to
# grub2-editenv - set debug=xfs

Comment 14 Shane Hart 2023-12-18 13:30:02 UTC
Nicolas,

Either I did something incorrect, or the debug option isn't working.  In the attached video I boot up a VM with -109, update it, run the commands, and reboot.

https://drive.google.com/file/d/1S-apTzWuGv3h6TsO8wMGB9UbtKPU8Ox8/view?usp=drive_link

Comment 15 Marta Lewandowska 2023-12-18 13:58:14 UTC
Looks like you're doing the right thing. You can try debug=all instead...
# grub2-editenv - unset debug
# grub2-editenv - set debug=all
but since this is an error with the boot fs itself, it may be happening before any files are read (like grubenv).

Comment 16 Shane Hart 2023-12-18 14:44:02 UTC
Yes, even with debug=all nothing shows before the XFS spam.

Comment 17 Marta Lewandowska 2023-12-18 15:51:05 UTC
thanks for trying. (:

Comment 18 Shane Hart 2023-12-20 13:28:29 UTC
One thing I've done to try to work around this. This was done on one of my block devices that has just one partition (i.e., / and /boot on the same device):

1. Clone the disk, so I have two identical virtual devices "disk01" and "disk01-new".
2. Attach devices to running system.  They attach as /dev/vdc and /dev/vdd respectively.
3. "mkfs.xfs -f /dev/vdd1" to create a new XFS filesystem.
4. Mount "disk01" and "disk01-new" as /mnt/disk01 and /mnt/disk01-new respectively.
5. "rsync --progress -atz /mnt/disk01/* /mnt/disk01-new"
6. Unmount the disks and set the filesystem UUID of "disk01-new" to match "disk01" using "xfs_admin -U <UUID> /dev/vdd1".  This is so I can boot from it without changing fstab.
7. Detach drives.
8. Launch a new VM using "disk01-new".  SSH into it and update to -110 of GRUB.  Run grub2-install and grub2-mkconfig.
9. Reboot and see the XFS error still.  :(

I thought maybe making a new XFS filesystem would help, I guess it didn't.  Could this be something with virtual block devices?

Comment 19 Nicolas Frayer 2023-12-20 17:06:40 UTC
Thanks for trying that Shane.
I've done the same with the same result earlier this week.
Today I've done something different, I upgraded the xfsprogs package to 6.5.0 on a rawhide VM (check your current version with mkfs.xfs -V, should be 6.4.0) and re-formatted the partition you put on gdrive (the -110 image). Please note that xfsprogs 6.5.0 is currently only available on rawhide (Fedora 40).
It seems to have fixed it so that could be a workaround if you are stuck for now while I look at the issue.

Comment 20 Shane Hart 2023-12-20 17:11:18 UTC
Yes, I'm not running rawhide so it's 6.4.0.

That's good to know though!  I'm not stuck, it's just GRUB, and I'm just not upgrading it.  :)

Comment 21 Nicolas Frayer 2023-12-21 16:36:49 UTC
Hi Shane,

Correction about my previous comment.
I am able to attach the virtual disk (-110) to another VM, back up the files, re-format it (xfsprogs 6.4.0), copy the files back, and it boots in a VM with this disk being the only disk there (at least grub boots to the boot menu, but I've got no rootfs so it just loads the kernel and then stops).
Also I've run xfs_repair -n on the device in a booted VM with the -110 disk you sent attached and I get a lot of errors reported.
I am wondering if the filesystem is corrupted hence grub can't boot from it.
The -109 disk image doesn't have any errors when running xfs_repair -n.

If the only difference between your -109 and -110 images is the grub update I am not sure what could have caused that ? Especially on several of your VMs.
Can you run xfs_repair -n on the other /boot disks and see if any of them have this issue ?

Comment 22 Shane Hart 2023-12-21 17:08:55 UTC
Nicolas,

Interesting!  The -109 and -110 disk images on GDrive were taken about 3 minutes apart, so it is weird.

On the one image with a separate boot partition I unmounted /boot and checked:

root@machine:~# umount /boot
root@machine:~# blkid
/dev/vdb: UUID="0b7c38d1-a806-4792-903e-3f53addb40cc" BLOCK_SIZE="4096" TYPE="ext4"
/dev/zram0: LABEL="zram0" UUID="361f1568-98ee-4c0c-a4ef-e04ef1f9ce89" TYPE="swap"
/dev/vda2: UUID="86a19005-5ced-4297-87a4-26e18c112c16" BLOCK_SIZE="512" TYPE="xfs" PARTUUID="77590160-02"
/dev/vda1: UUID="9e8c33c8-cead-49ee-93ca-34ca10ae7dda" BLOCK_SIZE="512" TYPE="xfs" PARTUUID="77590160-01"
root@scaledev:~# xfs_repair -n /dev/vda1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 174104, counted 184113
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 2
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

I can put the dump of that from one with a combined / and /boot once I can safely take one of the machines down.

Comment 23 Shane Hart 2023-12-21 17:40:17 UTC
With the combined / and /boot it looks clean:

# xfs_repair -n /dev/vdc1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 2
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

Comment 24 Shane Hart 2023-12-21 17:49:29 UTC
If I take that combined / and /boot image that I just checked above, mount it, update it to grub -110, run grub2-install, power it off (I *do not* boot from it), and rerun xfs_repair -n on it I don't get any errors.

If I try to boot from it I get the XFS error we're combating.  If I turn off the machine, and check that image *again* I still don't seem to get any errors.  Maybe your FS errors were from the procedure I used to create the image and are a red herring...

Comment 25 Nicolas Frayer 2023-12-22 14:57:01 UTC
Yes it might have been corrupted during the image creation process.
Would it be possible for you to extract the metadata of a "-110" /boot partition by attaching it to a working VM (without mounting it) and follow the procedure in https://bugzilla.redhat.com/show_bug.cgi?id=2254370#c3 ?
The best way would be to send the compressed qcow "-110" image of the combined / and /boot disk but I am not sure you can or want to do it :)

Thanks a lot for the great help you've provided so far.

Comment 26 Shane Hart 2023-12-22 18:56:01 UTC
I have some images compressing that I hope are small enough to fit on my GDrive.  Here's what I did:

1. Cloned my problematic image (that was at -109).
2. Booted into this image and removed most of my "big" stuff.  It's a pretty vanilla Fedora install, I removed some of the bigger toolchains I had.  There isn't anything sensitive/important on here.
3. To make the image a sensible size I created a *new* QCOW2 image, formatted with XFS (6.4.0, from Fedora 39), and copied my stuff to that new image.
4. Did some stuff to make it boot (reupdate initrd, reinstall GRUB, set enforcing=0 boot parameter or else I got [!!!!!!]failed to allocate manager object).  Backed this image up as fedora39-109.qcow2.
5. Updated to grub -110.  Forgot to run grub2-install, made a backup as fedora39-110.qcow2.
6. It booted fine.
7. Ran grub2-install, overwrote a new backup as fedora39-110.qcow2.
8. Got the error.

I'll update with the images when I can or post if I can't.

As an aside; since my machines are pretty boring I installed a new Fedora 39 image.  I noticed that it used GPT instead of MSDOS for the partition table.  I did a netinst so it was already at -110 when I booted it for the first time.  As expected, it worked fine.

Comment 27 Shane Hart 2023-12-22 22:23:49 UTC
They fit!  The -109 should be bootable and you can recreate the -110 by just running:

dnf upgrade
grub2-install /dev/<whatever>

-109
https://drive.google.com/file/d/1bFqVz1knogc1dVgitNaCn-WNka-gw97c/view?usp=sharing

-110
https://drive.google.com/file/d/1NtIW_mg8e9K_CjsaVsAtcncMOyHuZ3-I/view?usp=sharing


The drive images are about 17-18 GB uncompressed.

If you have any holidays for the next few days I hope you enjoy them!

Comment 28 Fernando Datz 2024-01-01 21:07:38 UTC
Hi Shane, Nicolas and Marta,

I believe I'm running into this same bug on my machine. It also seems to be the same as the one discussed in this post:

https://discussion.fedoraproject.org/t/grub-broken-after-upgrade-to-f38/100319

I'm running F39, initially a F35 netinstall upgraded with dnf, with a separate /boot XFS partition and MBR. I ran grub2-install (2.06-110) after upgrading from F38 to F39.

I thought it was failing to boot, but after reading that Fedora Discussion post, I managed to get to the boot menu and boot after holding the spacebar to skip some 40 pages of
error: ../../grub-core/fs/xfs.c:541:not a correct inode

I already ran xfs_repair on /boot from a live USB and it seems OK.

If one of you is able to test that it reaches the boot menu by scrolling down the pages, we can find out if it's the same issue. I haven't tested it yet, but
set pager=0
in grub.cfg might save you some keystrokes.

I had to type some of the log out, since I couldn't find a way to dump it, and I'll send it as an attachment. There might be some typos.

The error comes from fs/xfs.c:grub_xfs_read_inode trying to check some magic number for an inode with (ino = 0, block = 0, offset = 0) while opening a file. It looks like it gets stuck in a loop before reaching a timeout after over 250 attempts and then resumes and opens the file properly.

Note that some files don't raise this error. Every grub2/i386-pc/*.mod seems to do, but grub2/grubenv doesn't. It probably has to do with directory traversal and the number of files in a directory, I think, and since xfsprogs-6.5.0 seems to make it disappear, it might be because it fails to check for an older layout. In particular, my partition has nrext64=0, which is related to the latest patches regarding extent parsing.

# xfs_info /dev/sda1
meta-data=/dev/sda1              isize=512    agcount=4, agsize=65536 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0 nrext64=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
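
For anyone comparing affected and unaffected filesystems, the feature flags mentioned above (such as nrext64) can be pulled out of xfs_info output mechanically. The following is only a rough sketch: the helper name `feature_flags` is made up for illustration, and the input is the metadata section of the output pasted above.

```python
import re

# Metadata section of the xfs_info output quoted in this comment.
XFS_INFO = """\
meta-data=/dev/sda1              isize=512    agcount=4, agsize=65536 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0 nrext64=0
"""


def feature_flags(text):
    """Collect key=value pairs from xfs_info output (hypothetical helper)."""
    # A key must touch the '=' sign, so the bare section separators
    # ("         =") do not produce entries.
    return dict(re.findall(r"([\w-]+)=([^\s,]+)", text))


flags = feature_flags(XFS_INFO)
for name in ("crc", "bigtime", "inobtcount", "nrext64"):
    print(name, flags[name])
```

Only the metadata section is parsed here because keys such as bsize and sectsz repeat in the later sections and would overwrite each other in a flat dictionary.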


Also Shane, you can setup early debug logging if you install it with this hidden flag: 
grub2-install --debug-image=all /dev/...
It still skips some early errors, but it will give you some more info before reaching your grub.cfg. It makes it very verbose, though.

Comment 29 Fernando Datz 2024-01-01 21:10:59 UTC
Created attachment 2006750
Truncated grub debug="all" log

Comment 30 Marta Lewandowska 2024-01-02 20:23:08 UTC
Fernando,
thank you very much for the log with debugging enabled. So far we have not been able to reproduce the problem in quite the same way-- VM does not boot, but we also do not see the errors you [all] are seeing. Could you perhaps attach the whole log without cutting anything out?

Shane,
could you please confirm whether what you're seeing is the same as described in the fedora discussion post? Are you also able to boot once you scroll through all the errors? Doing `set pager=0` either in the grub.cfg or adding it to grubenv, as Fernando describes, should make your life easier.

thanks!

Comment 31 Shane Hart 2024-01-03 13:19:19 UTC
All,

I can confirm that after:

* Holding space to skip all the "not a correct inode" messages I get to the GRUB boot menu
* I can select my normal boot kernel
* After holding space to skip all the "not a correct inode" messages again I get a "Press any key to continue..." prompt
* After pressing a key it boots successfully.

So that's a success I suppose!  I can leave a machine with the -110 revision, just can't do unattended reboots.

Comment 32 Helmut K. C. Tessarek 2024-01-08 12:49:57 UTC
All,

I was about to open a new ticket as suggested in https://discussion.fedoraproject.org/t/grub2-install-results-in-not-a-correct-xfs-inode/101103/ but then I saw this one, which seems exactly what I am running into. However, I didn't see a solution, thus I wanted to ask what the status was.

I am also available to collect any info that can help to narrow this down even further. See the above link for the problem description and additional info regarding this issue. I can also paste everything here, if that is better. Please let me know.

Cheers,
  K. C.

Comment 33 Nicolas Frayer 2024-01-08 14:51:45 UTC
All,

We think we have identified the patch that creates this issue.
I would like to do a grub2 build without this patch and wanted to know if you were willing to test it out ?
Please let me know and I'll post it.

Thanks,

Nicolas

Comment 34 Shane Hart 2024-01-08 15:01:20 UTC
Nicolas,

I can definitely test a patch.  All of my machines except two are still running -109.  One is running -110.  One I did a fresh install of F39 and is also running -110.

Can test on any combination of these.

Comment 35 Helmut K. C. Tessarek 2024-01-08 15:24:11 UTC
I am also more than happy to test it. I'm currently on 2.06-110.fc39.

Comment 36 Marcin Struzak 2024-01-08 15:57:41 UTC
Hello everyone,

Just a quick note that I ran into this issue with -108 in F38.  My XFS /boot & / partitions were created by vanilla anaconda in F31, if I recall correctly; the system was always upgraded release-to-release using dnf system-upgrade, but I ran grub2-install (as suggested in [1]) for the first time in F37 (with no issues), then this time in F38. In other words, in my case grub2-install -94 worked & -108 broke. 

[1] https://docs.fedoraproject.org/en-US/quick-docs/upgrading-fedora-offline/#sect-update-grub-bootloader-on-bios

Comment 37 Marta Lewandowska 2024-01-08 16:15:19 UTC
Hi everyone, 
here is a repo with a build without the patch that Nicolas mentioned in comment#33 https://people.redhat.com/mlewando/grub2-2.06-112.fc39/
If one of you wonderful volunteers could test it, we would very much appreciate that (:

Comment 38 Marta Lewandowska 2024-01-08 16:37:45 UTC
(In reply to Marcin Struzak from comment #36)
> Hello everyone,
> 
> Just a quick note that I ran into this issue with -108 in F38.  My XFS /boot
> & / partitions were created by vanilla anaconda in F31, if I recall
> correctly; the system was always upgraded release-to-release using dnf
> system-upgrade, but I ran grub2-install (as suggested in [1]) for the first
> time in F37 (with no issues), then this time in F38. In other words, in my
> case grub2-install -94 worked & -108 broke. 
> 
> [1]
> https://docs.fedoraproject.org/en-US/quick-docs/upgrading-fedora-offline/
> #sect-update-grub-bootloader-on-bios

grub2-2.06-108 on f38 contains the same patches that are in grub2-2.06-110 in later fedoras, so what you're seeing should also get fixed by removing this patch. Once we have confirmation that this patch is the one responsible, you can expect a build for f38 as well.

Comment 39 Shane Hart 2024-01-08 17:32:29 UTC
Marta,

I can confirm that:

* Upgrading from -110 to -112
* Running grub2-install and grub2-mkconfig
* Rebooting

Works without the error message.  Running grub2-install *is* required.

Comment 40 Marta Lewandowska 2024-01-08 17:38:43 UTC
Hi Shane,

great news! thank you so much for testing.

Comment 41 Helmut K. C. Tessarek 2024-01-08 17:52:01 UTC
When running grub2-install I got the following error:

# grub2-install /dev/sda
Installing for i386-pc platform.
grub2-install: error: unable to identify a filesystem in hd0; safety check can't be performed.

Then I ran grub2-mkconfig and rebooted. Problem gone. Awesome! It worked. Thanks so much.

Comment 42 Helmut K. C. Tessarek 2024-01-08 17:54:12 UTC
Btw, if possible, is there a git diff I can see that fixed the issue?

Comment 43 Marcin Struzak 2024-01-08 18:00:15 UTC
(In reply to Marta Lewandowska from comment #38)
> [...]
> grub2-2.06-108 on f38 contains the same patches that are in grub2-2.06-110
> in later fedoras [...]

The submitter stated that downgrading to -109 was fixing the issue.  Are we saying that the defective patches were applied in -108, no longer applied in -109, and applied again in -110?  This may be moot now that the fix is almost ready...

Comment 44 Shane Hart 2024-01-08 18:16:49 UTC
Marcin,

-109 was the problematic revision in fc39.  -108 is the problematic version for fc38.

Comment 45 Shane Hart 2024-01-08 18:17:20 UTC
I mean, -110 is the problematic version in fc39.  -109 is fine, which is why downgrading to it for Fedora 39 is fine.

Comment 46 Marcin Struzak 2024-01-08 19:17:47 UTC
(In reply to Shane Hart from comment #45)
> I mean, -110 is the problematic version in fc39.  -109 is fine, which is why
> downgrading to it for Fedora 39 is fine.

Thanks for clarifying, never realized that -<rev> versions weren't identical between releases or strictly sequential, with later versions always containing all the fixes from earlier ones, even across releases.
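
Since the release number only has meaning together with the dist tag, a check for the builds reported broken in this thread has to key on both. A minimal sketch, assuming only the builds named in the comments above (-110 on fc39, -108 on fc38); the function names are invented for illustration and the table is not a complete list of affected builds.

```python
import re

# Builds reported broken in this thread, keyed by dist tag.
# Anything not listed here is simply "not reported", not "known good".
REPORTED_BROKEN = {"fc39": 110, "fc38": 108}


def parse_build(nvr):
    """Split e.g. 'grub2-2.06-110.fc39' into (version, release, dist)."""
    match = re.fullmatch(r"(?:grub2-)?(\d+(?:\.\d+)*)-(\d+)\.(fc\d+)", nvr)
    if match is None:
        raise ValueError(f"unrecognised build string: {nvr!r}")
    version, release, dist = match.groups()
    return version, int(release), dist


def is_reported_broken(nvr):
    """True only if this exact release/dist pair was reported broken."""
    _version, release, dist = parse_build(nvr)
    return REPORTED_BROKEN.get(dist) == release


for build in ("grub2-2.06-109.fc39", "grub2-2.06-110.fc39", "grub2-2.06-108.fc38"):
    print(build, is_reported_broken(build))
```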

Comment 47 Fernando Datz 2024-01-08 21:36:16 UTC
(In reply to Marta Lewandowska from comment #37)
> Hi everyone, 
> here a repo with a build without the patch that Nicolas mentioned in
> comment#33 https://people.redhat.com/mlewando/grub2-2.06-112.fc39/
> If one of you wonderful volunteers could test it, we would very much
> appreciate that (:

Hi everyone,

Just saw the notification and I can also confirm this build removes those error messages.

Also Marta, sorry for not replying earlier with a complete log. Is it still needed?
I'm not running a VM, so the only solution would have been to try to output through the serial port, which I can't do, or to replicate it in a VM. That previous log had to be manually typed out from some pictures I took.

Comment 48 Marta Lewandowska 2024-01-09 13:55:17 UTC
Everyone,
I'm really glad it's working for you. You can expect fixes soon for f38 and f39 (sorry Nicolas, but I know you're planning to do these!)

Helmut,
this is the patch that was causing the problem: https://github.com/rhboot/grub2/commit/1955d781ba20c1952adc820f3e587878b4f559e1

Fernando,
wow, you're a hero! thanks again, in that case.

If someone else could capture a log of the issue with debugging on, as described in comment#28, that would be great. Even though the issue has been mitigated for now, we will face it later, at least wrt upgrades.

Comment 49 Fedora Update System 2024-01-18 17:11:51 UTC
FEDORA-2024-53d986312e has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2024-53d986312e

Comment 50 Fedora Update System 2024-01-18 17:44:56 UTC
FEDORA-2024-633dc7e183 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2024-633dc7e183

Comment 51 Helmut K. C. Tessarek 2024-01-18 18:04:47 UTC
Nicolas,

I can see the following in the notes:

Mon Jan 8 2024 Nicolas Frayer <nfrayer> - 2.06-113
xfs: some bios systems with /boot partition created with
xfsprog < 6.5.0 can't boot with one of the xfs upstream patches
Resolves: #2254370

The upstream patch was removed again as a workaround, but isn't this patch somewhat important? Won't it be required at some point? And what is the solution? Recreate the partition with xfsprogs >= 6.5.0?
The problem is that even if I remove /boot from fstab (and /boot is not mounted), mkfs.xfs still tells me that the partition is in use, even though it is not and no processes are accessing it (checked with lsof and whatnot). One of my machines is in a DC and HP's remote console is not that great, so booting up with a LiveCD could be rather tricky.

Cheers,
  K. C.

Comment 52 Nicolas Frayer 2024-01-18 18:16:53 UTC
Hi K.C.,

Yes this patch has been removed for now as it seems to brick several bios machines after a grub update.
This is so we can work on a proper fix in the meantime.

A manual workaround would be to attach your drive to another VM/machine (if possible), back up your /boot, format your /boot partition (assuming you have separate partitions for /boot and /) with xfsprogs >= 6.5.0, and copy your backed-up files back onto /boot.
I understand that your machine is in a data center and that could be a problem for this workaround.

Thanks.

Comment 53 Helmut K. C. Tessarek 2024-01-18 18:24:52 UTC
> Yes this patch has been removed for now as it seems to brick several bios machines after a grub update.

One of which was mine. ;-) Luckily I used a test VM on my Proxmox before doing the upgrade on the bare metal machine.

> I understand that your machine is in a data center and that could be a problem for this workaround.

Yep, since it's a bare metal box. Sometimes even the remote console does not work. Which is why I added an ssh server to my initramfs so that I can decrypt my / when booting up without having to use the remote console.
But maybe I can visit the DC at one point...

> A manual workaround....

However, I can try your workaround on my Proxmox. Thanks for your reply and help. Cheers, K. C.

Comment 54 Fedora Update System 2024-01-19 03:33:59 UTC
FEDORA-2024-633dc7e183 has been pushed to the Fedora 38 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-633dc7e183`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-633dc7e183

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 55 Fedora Update System 2024-01-19 18:04:31 UTC
FEDORA-2024-53d986312e has been pushed to the Fedora 39 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-53d986312e`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-53d986312e

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 56 Adam Williamson 2024-01-19 20:42:40 UTC
I suspect the fix for this causes https://bugzilla.redhat.com/show_bug.cgi?id=2259266 - BIOS boot from XFS /boot is broken, at least on qemu, in Rawhide.

Comment 57 Jon DeVree 2024-01-21 15:47:38 UTC
Sorry I'm a little late to the party. I just found out about the bug report.

So to start with:
1. Big thanks to Shane for that disk image
2. None of the bugs have anything to do with the version of xfsprogs used to create the filesystem.

This bug is caused by a directory entry being structured like this: (The output is what xfs_db shows for the directory entry for /boot/grub2/i386-pc from the fedora39-110.qcow2.xz image)

u3.bmx[0-3] = [startoff,startblock,blockcount,extentflag]
0:[0,8461707,1,0]
1:[1,8498650,1,0]
2:[3,9638708,1,0]
3:[8388608,8461706,1,0]

(Note: Don't worry about extent #3 with block #8388608. That is leaf information that GRUB's parser almost entirely ignores.)

So there are 3 valid data blocks: 0, 1, and 3. But GRUB's iterator code doesn't parse this extent list; it infers the number of data blocks from the inode.size field. This inference includes unmapped data blocks like block #2 (although it excludes leaf blocks like #8388608):

xfs_db> dblock 2
file data block is unmapped

So when GRUB reads this directory it proceeds to iterate over blocks 0, 1, 2, and 3. The original XFS parser ended up skipping this unmapped data block as a side effect of its attempt to incorrectly read leaf information from the end of the data block. When I fixed the code that was incorrectly parsing the leaf information, I unknowingly removed the only code preventing GRUB from trying to parse unmapped data blocks.
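To make the mismatch concrete, here is a minimal Python sketch (not GRUB code) that parses the extent list shown above and compares the mapped blocks against a block count inferred from the inode size. The 4096-byte block size, the resulting leaf offset, and the inferred count of 4 blocks are assumptions made for illustration; the function names are hypothetical.

```python
import re

# Extent list copied from the xfs_db output above:
# [startoff, startblock, blockcount, extentflag]
BMX_OUTPUT = """\
0:[0,8461707,1,0]
1:[1,8498650,1,0]
2:[3,9638708,1,0]
3:[8388608,8461706,1,0]
"""

# Assumption: 4096-byte blocks, so the leaf area starts at 32 GiB / 4096.
LEAF_START = (32 * 2**30) // 4096  # 8388608


def parse_extents(text):
    """Return (startoff, startblock, blockcount) tuples from bmx output."""
    extents = []
    for match in re.finditer(r"\[(\d+),(\d+),(\d+),(\d+)\]", text):
        startoff, startblock, count, _flag = map(int, match.groups())
        extents.append((startoff, startblock, count))
    return extents


def mapped_data_blocks(extents):
    """Collect the file offsets of mapped data blocks, ignoring leaf blocks."""
    mapped = set()
    for startoff, _startblock, count in extents:
        if startoff >= LEAF_START:
            continue  # leaf information, not a data block
        mapped.update(range(startoff, startoff + count))
    return mapped


def unmapped_blocks(extents, inferred_count):
    """Blocks an iterator over range(inferred_count) would visit with no mapping."""
    mapped = mapped_data_blocks(extents)
    return [block for block in range(inferred_count) if block not in mapped]


extents = parse_extents(BMX_OUTPUT)
# Assumption: the inode size implies 4 directory blocks (0 through 3).
print(unmapped_blocks(extents, 4))
```

With the extent list from this image, the only block in the inferred range without a mapping is block 2, which matches what xfs_db reports.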

Assuming GRUB returns a zero'ed block of memory for the unmapped data block, the fix could be as simple as checking each block for the magic number. (Warning: there are at least four cromulent magic numbers.) If GRUB is returning garbage data for the unmapped data block then the fix is going to be much more complicated.
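As a rough illustration of the magic-number idea, here is a Python sketch (not the actual GRUB patch) that skips any block whose first four bytes are not a known directory data magic. The four magic values listed are the ones I believe the XFS on-disk format uses for directory block/data headers; treat that list, the 4096-byte block size, and the helper names as assumptions. It only works under the condition stated above, that an unmapped block reads back as zeroes.

```python
# Assumed XFS directory data magics (v4 and v5 block/data forms),
# stored big-endian in the first four bytes of each directory block.
DIR_DATA_MAGICS = {b"XD2B", b"XD2D", b"XDB3", b"XDD3"}

BLOCK_SIZE = 4096  # assumption for this sketch


def has_dir_magic(block: bytes) -> bool:
    """True if the block starts with one of the assumed directory magics."""
    return block[:4] in DIR_DATA_MAGICS


def blocks_to_parse(blocks):
    """Yield the indexes of blocks that look like directory data blocks."""
    for index, block in enumerate(blocks):
        if has_dir_magic(block):
            yield index


def fake_block(magic: bytes) -> bytes:
    """Build a dummy block that carries only a magic number (hypothetical helper)."""
    return magic + bytes(BLOCK_SIZE - len(magic))


# Blocks 0, 1 and 3 are mapped; block 2 is unmapped and reads back as zeroes.
directory = [
    fake_block(b"XDD3"),
    fake_block(b"XDD3"),
    bytes(BLOCK_SIZE),
    fake_block(b"XDD3"),
]
print(list(blocks_to_parse(directory)))
```

If the unmapped block came back as garbage instead of zeroes, a check like this could still pass by accident, which is why the more complicated fix would be needed in that case.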

(In reply to Adam Williamson from comment #56)
> I suspect the fix for this causes
> https://bugzilla.redhat.com/show_bug.cgi?id=2259266 - BIOS boot from XFS
> /boot is broken, at least on qemu, in Rawhide.

Yes, reverting this patch will cause a different XFS parser bug. If Fedora wishes to revert the patch, they also need to revert the original fuzzer patch (https://git.savannah.gnu.org/cgit/grub.git/commit/grub-core/fs/xfs.c?id=ef7850c757fb3dd2462a512cfa0ff19c89fcc0b1) and probably the other followup fix for it (https://git.savannah.gnu.org/cgit/grub.git/commit/grub-core/fs/xfs.c?id=ad7fb8e2e02bb1dd0475ead9919c1c82514d2ef8) as well.

Comment 58 Helmut K. C. Tessarek 2024-01-21 16:16:12 UTC
> When I fixed the code that was incorrectly parsing the leaf information, I unknowingly removed the only code preventing GRUB from trying to parse unmapped data blocks.

Hmm, so putting that code back would fix the issue without having to revert anything, am I correct? If so, is this in the works and what has to be done on Fedora's side?

Comment 59 Jon DeVree 2024-01-21 16:48:38 UTC
(In reply to Helmut K. C. Tessarek from comment #58)
> > When I fixed the code that was incorrectly parsing the leaf information, I unknowingly removed the only code preventing GRUB from trying to parse unmapped data blocks.
> 
> Hmm, so putting that code back would fix the issue without having to revert
> anything, am I correct? If so, is this in the works and what has to be done
> on Fedora's side?

That is effectively the same as reverting the patch. Mis-parsing the XFS data structures may have been beneficial in this case, but it broke parsing directories in other cases. I suggested a fix in the paragraph after the one that you quoted.

Comment 60 Helmut K. C. Tessarek 2024-01-21 16:56:57 UTC
I wasn't able to edit my comment. By "putting that code back", I meant the part that checks for unmapped data blocks... I am not sure which project you are part of: XFS, grub upstream, grub fedora.
What I was asking is whether the code should rather be fixed upstream.

Comment 61 Jon DeVree 2024-01-21 17:34:22 UTC
(In reply to Helmut K. C. Tessarek from comment #60)
> I wasn't able to edit my comment. By "putting that code back", I meant the
> part that checks for unmapped data blocks...

I also wish I had an edit button. After posting comment #59 I realized I might not have been clear in comment #57.

GRUB was not checking for unmapped data blocks; it skipped them as a side effect of incorrectly parsing the XFS data structures. So the code that parsed the structures wrong and the code that "checked" for unmapped blocks are the same code.


> I am not sure which project you are part of: XFS, grub upstream, grub fedora.

None, I am the author of the patch. (I just updated my bugzilla display name to make that more obvious)

I'm just trying to fill in some blanks so that Fedora can make an informed decision on revert or fix.

Comment 62 Helmut K. C. Tessarek 2024-01-21 17:38:32 UTC
Thanks for the info. Now it makes sense to me.

Comment 63 Marta Lewandowska 2024-01-22 14:32:48 UTC
Hi Jon,

Thank you for all of your input. I should have pulled you into this discussion much sooner. 

AFAIK, removing your patch from rawhide was only meant as a temporary fix to mitigate the issues people are seeing. It seems that the patch should not be backported (at least in its current state) to f38 and f39, as this bug indicates, but it really does need to be in rawhide because otherwise: https://bugzilla.redhat.com/show_bug.cgi?id=2259266. Based on testing that we did last week on VMs, it appeared that the patch was causing problems on rawhide as well, but the behavior of the VMs (that Nicolas and I have been using to test) is different from what people are reporting. For us, the VMs just go down and then we don't see any output / serial console, no errors, nothing. They just hang. We will make sure to test on bare metal from now on as well.

Nicolas is out this week, but we can try to make some progress anyway in the meantime..? You know much more about xfs than either of us do, so your input is certainly welcome. (:

thanks again.

Comment 64 Fedora Update System 2024-02-05 01:24:21 UTC
FEDORA-2024-53d986312e has been pushed to the Fedora 39 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 65 Fedora Update System 2024-02-05 01:45:45 UTC
FEDORA-2024-633dc7e183 has been pushed to the Fedora 38 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 66 Shane Hart 2024-03-11 12:52:40 UTC
This problem appears to be back in Fedora 39 1:2.06-118.  I don't know if the proper etiquette is to open a new issue...

Comment 67 Nicolas Frayer 2024-03-11 13:17:15 UTC
Hi Shane,

Yes, unfortunately we had to put the patch causing the issue back. A fix is currently being reviewed upstream, and as soon as it's merged I'll port it downstream to Fedora ...
A workaround for now is to "re-create" the /boot/grub2/i386-pc folder, which should get rid of the unmapped directory blocks that may be present.
Something like:
- mv /boot/grub2/i386-pc /boot/grub2/i386-pc.bak
- cp -Rfp /boot/grub2/i386-pc.bak /boot/grub2/i386-pc

Comment 68 Shane Hart 2024-03-11 13:49:55 UTC
Nicolas,

I did the workaround and it was fine!  Good thing to document here.

Comment 69 Alexon Oliveira 2024-04-13 00:11:44 UTC
(In reply to Nicolas Frayer from comment #67)
> Hi Shane,
> 
> Yes unfortunately we had to put the patch causing the issue back. A fix is
> being currently reviewed upstream and as soon as it's merged I'll port it
> downstream to Fedora ...
> A work-around for now is to "re-create" the /boot/grub2/i386-pc folder,
> which should get rid of the unmapped directory blocks that may be present.
> Something like:
> - mv /boot/grub2/i386-pc /boot/grub2/i386-pc.bak
> - cp -Rfp /boot/grub2/i386-pc.bak /boot/grub2/i386-pc

Just FYI, the issue persists in the latest version of Fedora 39, but the workaround also worked for me. I also posted it on Fedora Discussion as a reference: https://discussion.fedoraproject.org/t/grub2-install-results-in-not-a-correct-xfs-inode/101103/28

Comment 70 Nicolas Frayer 2024-04-15 08:59:16 UTC
The patch made it upstream at the end of last week. I'll build an update for it this week.

Comment 71 Fedora Update System 2024-04-17 11:03:22 UTC
FEDORA-2024-2b545d3085 (grub2-2.06-121.fc40) has been submitted as an update to Fedora 40.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-2b545d3085

Comment 72 Fedora Update System 2024-04-17 11:07:27 UTC
FEDORA-2024-d09797f550 (grub2-2.06-120.fc39) has been submitted as an update to Fedora 39.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-d09797f550

Comment 73 Fedora Update System 2024-04-17 11:10:56 UTC
FEDORA-2024-01f402fae5 (grub2-2.06-118.fc38) has been submitted as an update to Fedora 38.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-01f402fae5

Comment 74 Fedora Update System 2024-04-18 01:03:55 UTC
FEDORA-2024-2b545d3085 has been pushed to the Fedora 40 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-2b545d3085`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-2b545d3085

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 75 Fedora Update System 2024-04-18 01:37:49 UTC
FEDORA-2024-d09797f550 has been pushed to the Fedora 39 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-d09797f550`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-d09797f550

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 76 Fedora Update System 2024-04-18 02:01:23 UTC
FEDORA-2024-01f402fae5 has been pushed to the Fedora 38 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-01f402fae5`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-01f402fae5

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 77 Fedora Update System 2024-04-23 01:15:18 UTC
FEDORA-2024-2b545d3085 (grub2-2.06-121.fc40) has been pushed to the Fedora 40 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 78 Fedora Update System 2024-04-29 01:55:18 UTC
FEDORA-2024-d09797f550 (grub2-2.06-120.fc39) has been pushed to the Fedora 39 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 79 Fedora Update System 2024-05-03 01:34:05 UTC
FEDORA-2024-01f402fae5 (grub2-2.06-118.fc38) has been pushed to the Fedora 38 stable repository.
If problem still persists, please make note of it in this bug report.

