Bug 2209760 - Cannot boot Fedora installed to a GPT-labeled disk on ppc64le qemu (with SLOF)
Keywords:
Status: ASSIGNED
Alias: None
Product: Fedora
Classification: Fedora
Component: SLOF
Version: 39
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Assignee: Fedora Virtualization Maintainers
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: openqa
Depends On:
Blocks: PPCTracker
 
Reported: 2023-05-24 16:46 UTC by Adam Williamson
Modified: 2023-08-16 08:15 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: ---
Embargoed:


Attachments
screenshot of the boot failure (39.95 KB, image/png)
2023-05-24 17:10 UTC, Adam Williamson

Description Adam Williamson 2023-05-24 16:46:06 UTC
The Fedora openQA tests for ppc64le have been showing a lot of failures for some time, but I only recently found time to investigate. Most of the tests fail, every day; the few that pass all seem to be ones that don't involve doing an install, or that install to a pre-created disk image instead of a fresh one.

The failure mode is that the install works fine, but booting the installed system fails on a SLOF screen (see attachment). It shows:

Trying to load:   from: /pci@800000020000000/scsi@7:1 ...
E3405: No such device

E3407: Load failed

I figured out the tests started failing after 2022-08-15, and looking into what changed around then, I noticed: that's when we started using GPT by default when formatting disks, in anaconda-38.1-1.

So I did a scratch build of python-blivet which forces the use of msdos disk labels on ppc64le (it drops 'gpt' from the list of supported disk labels, so anaconda is forced to use 'msdos', the only choice remaining), tested an install with that, and...it works. The installed system boots.
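
To illustrate the idea behind that scratch build, here is a minimal sketch of "first supported label wins" selection. The table and function names are hypothetical and are not blivet's real API; the sketch only assumes that the installer picks the first entry in a per-architecture list of supported label types.

```python
# Hypothetical model of default disk label selection. These names are
# illustrative only and do not correspond to blivet's actual code.
SUPPORTED_LABELS = {
    "x86_64": ["gpt", "msdos"],
    "ppc64le": ["gpt", "msdos"],
}


def default_label(arch, supported=SUPPORTED_LABELS):
    """Return the first (most preferred) label type listed for arch."""
    return supported[arch][0]


def without_label(supported, arch, label):
    """Return a copy of the table with one label type dropped for arch."""
    patched = {a: list(labels) for a, labels in supported.items()}
    patched[arch] = [name for name in patched[arch] if name != label]
    return patched


patched = without_label(SUPPORTED_LABELS, "ppc64le", "gpt")
print(default_label("ppc64le"))           # gpt
print(default_label("ppc64le", patched))  # msdos
```

With 'gpt' dropped for ppc64le only, other architectures keep their existing default.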

I don't know if this is only a problem on qemu with SLOF, or if it also affects bare metal ppc64le systems with real firmwares.

Reproducible: Always

Steps to Reproduce:
1. Install any Fedora since Fedora-Rawhide-20220815.n.0 to a qemu ppc64le VM using a fresh disk, so anaconda will format it using a GPT disk label
2. Try and boot it
Actual Results:  
It fails to boot, showing the SLOF errors described above and attached

Expected Results:  
It should boot successfully

This happens with both our current (old) SLOF build - SLOF-20210217-6.git33a7322d.fc38 - and a build of the most recent SLOF tagged upstream (20220719). I did a build of that just to see if it would help, but it doesn't.

Here's some information on the disk layout, from lsblk:

NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS        PARTTYPE                             FSTYPE
loop0    7:0    0 543.3M  1 loop /run/rootfsbase                                         squashfs
sr0     11:0    1 682.1M  0 rom  /run/install/repo                                       iso9660
zram0  251:0    0   3.8G  0 disk [SWAP]                                                  
vda    252:0    0    10G  0 disk                                                         
├─vda1 252:1    0     4M  0 part                    9e1a2d38-c612-4316-aa26-8b49521e5a8b 
├─vda2 252:2    0     1G  0 part /mnt/sysroot/boot  0fc63daf-8483-4772-8e79-3d69d8477de4 ext4
│                                /mnt/sysimage/boot                                      
└─vda3 252:3    0     9G  0 part /mnt/sysroot/home  0fc63daf-8483-4772-8e79-3d69d8477de4 btrfs
                                 /mnt/sysroot                                            
                                 /mnt/sysimage/home                                      
                                 /mnt/sysimage                                           
vdb    252:16   0    10G  0 disk 
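
For anyone who wants to check this on their own install, a rough sketch of picking the PReP partition out of lsblk's JSON output follows. The sample data is transcribed by hand from the table above and assumes the shape that `lsblk --json -o NAME,PARTTYPE,FSTYPE` produces; the helper name is made up for illustration.

```python
import json

PREP_BOOT = "9e1a2d38-c612-4316-aa26-8b49521e5a8b"

# Sample shaped like `lsblk --json -o NAME,PARTTYPE,FSTYPE` output,
# transcribed by hand from the table in this report (an assumption,
# not captured output).
SAMPLE = """
{"blockdevices": [
  {"name": "vda", "parttype": null, "fstype": null, "children": [
    {"name": "vda1", "parttype": "9e1a2d38-c612-4316-aa26-8b49521e5a8b", "fstype": null},
    {"name": "vda2", "parttype": "0fc63daf-8483-4772-8e79-3d69d8477de4", "fstype": "ext4"},
    {"name": "vda3", "parttype": "0fc63daf-8483-4772-8e79-3d69d8477de4", "fstype": "btrfs"}
  ]},
  {"name": "vdb", "parttype": null, "fstype": null}
]}
"""


def find_by_parttype(lsblk_json, parttype):
    """Return names of all devices whose PARTTYPE matches parttype."""
    found = []

    def walk(devices):
        for dev in devices:
            if (dev.get("parttype") or "").lower() == parttype:
                found.append(dev["name"])
            walk(dev.get("children", []))

    walk(json.loads(lsblk_json)["blockdevices"])
    return found


print(find_by_parttype(SAMPLE, PREP_BOOT))  # ['vda1']
```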

I did wonder if the PARTTYPE for the PReP boot partition or the /boot partition might be wrong, but they don't seem to be. 9e1a2d38-c612-4316-aa26-8b49521e5a8b is correct for a PReP boot partition, 0fc63daf-8483-4772-8e79-3d69d8477de4 is correct for a Linux data partition, and those are the UUIDs that SLOF looks for:

https://github.com/aik/SLOF/blob/master/slof/fs/packages/disk-label.fs#L387
https://github.com/aik/SLOF/blob/master/slof/fs/packages/disk-label.fs#L404
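
As a cross-check that doesn't depend on lsblk, the type GUIDs can be read straight from a raw disk image. This is a minimal sketch, not SLOF's code: it assumes the standard GPT layout (header at LBA 1, little-endian fields, GUIDs stored in mixed-endian form) and skips CRC validation. `gpt_partition_types` and its `sector_size` parameter are names made up for this example.

```python
import struct
import uuid

PREP_BOOT = uuid.UUID("9e1a2d38-c612-4316-aa26-8b49521e5a8b")
LINUX_DATA = uuid.UUID("0fc63daf-8483-4772-8e79-3d69d8477de4")


def gpt_partition_types(image: bytes, sector_size: int = 512):
    """Return the type GUIDs of all used entries in a raw GPT disk image.

    Assumes the primary header sits at LBA 1; checksums are not verified.
    """
    header = image[sector_size:sector_size + 92]
    if header[:8] != b"EFI PART":
        raise ValueError("no GPT header signature at LBA 1")
    # Offset 72: starting LBA of the partition entry array.
    (entries_lba,) = struct.unpack_from("<Q", header, 72)
    # Offsets 80 and 84: number of entries and size of each entry.
    count, entry_size = struct.unpack_from("<II", header, 80)
    types = []
    for index in range(count):
        offset = entries_lba * sector_size + index * entry_size
        raw = image[offset:offset + 16]
        if len(raw) < 16:
            break
        # The first three GUID fields are little-endian on disk.
        guid = uuid.UUID(bytes_le=raw)
        if guid.int != 0:  # an all-zero type GUID marks an unused entry
            types.append(guid)
    return types


# On-disk byte order of the PReP type GUID:
print(PREP_BOOT.bytes_le.hex())  # 382d1a9e12c61643aa268b49521e5a8b
```

Calling it on the first few dozen KiB of the installed disk image and comparing the result against `PREP_BOOT` and `LINUX_DATA` would confirm what the firmware actually sees on disk.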

The most recent relevant commits to SLOF, https://github.com/aik/SLOF/commit/7b1fb8daf911d3d54a1246b69c1d06a6cd8471f5 and https://github.com/aik/SLOF/commit/12b5d02e378a204c986b52d56b4ca8a0dab6ba21, kinda imply an attempt to support booting from GPT, but I'm not sure if it's *complete* or supports our exact setup. So I'm not sure if the problem here is best described as:

* SLOF intends to support our GPT layout but there's a bug preventing it, and we should fix that bug
* SLOF does not yet support our GPT layout, and we should enhance it to do so
* SLOF does not yet support our GPT layout, so we should somehow ensure we use an msdos label when installing on ppc64le (only on virt or when using SLOF? or always?)

It's easy enough to force blivet to always use an msdos disk label on ppc64le, but I'm not sure if that's the *right* or best fix.

Comment 1 Adam Williamson 2023-05-24 17:06:37 UTC
Additional note here: the change that resulted in us using GPT labels on ppc64le installs, arguably, should not have done so, since it's titled "Install Using GPT on x86_64 BIOS by Default". I added some notes on what exactly went on there in https://bugzilla.redhat.com/show_bug.cgi?id=2092091#c6 .

Comment 2 Adam Williamson 2023-05-24 17:10:28 UTC
Created attachment 1966649
screenshot of the boot failure

Comment 3 Adam Williamson 2023-05-24 17:14:36 UTC
Sorry, forgot to note: the reason some tests that install to a pre-created disk image work is that those pre-created disk images happen to use msdos disk labels, not GPT ones. I haven't tested, but I bet if I tweaked createhdds (the tool that creates the images) to use GPT disk labels and re-generated the images, those tests would start failing too.

Comment 4 Adam Williamson 2023-05-26 23:29:57 UTC
https://github.com/storaged-project/blivet/pull/1132 and https://github.com/rhinstaller/anaconda/pull/4795 would avoid this by going back to MBR labels on ppc64le installs. Arguably this would still be a bug/missing feature in SLOF, but it'd be much less important.

Comment 5 Thomas Huth 2023-05-30 10:33:13 UTC
FWIW, SLOF should have had support for GPT since 2013:

 https://github.com/aik/SLOF/commit/a51e46c2384deb95c41b0ff6c3025724d5d6cc08

... but it's likely not tested very well, so I'd expect a bug in SLOF here.

Comment 6 Éric Fintzel 2023-05-30 14:04:01 UTC
Just to let you know: an install to a GPT disk on a P9 or P10 LPAR works as expected, and the system is functional.
Here is an example from today's Rawhide ppc64le Server image on a P9 LPAR.

Information from fdisk
----------------------
Disk /dev/sda: 20 GiB, 21474836480 bytes, 5242880 sectors
Disk model: VDASD
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: CA078465-8FB7-4BA2-82B7-4D479CB71CDE

Device      Start     End Sectors Size Type
/dev/sda1     256    2815    2560  10M PowerPC PReP boot
/dev/sda2    2816  264959  262144   1G Linux filesystem
/dev/sda3  264960 5242622 4977663  19G Linux filesystem
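
A quick sanity check of the fdisk numbers above, as a sketch: note this disk uses 4096-byte sectors, so the GPT header at LBA 1 sits at byte offset 4096. The sector size of the failing qemu disk is not stated in this report, so any comparison with it is an assumption.

```python
SECTOR = 4096  # bytes per sector, from the fdisk output above

# (start, end) sectors, inclusive, copied from the fdisk table.
partitions = {
    "sda1": (256, 2815),
    "sda2": (2816, 264959),
    "sda3": (264960, 5242622),
}


def size_bytes(start, end, sector=SECTOR):
    """Size of an inclusive sector range in bytes."""
    return (end - start + 1) * sector


for name, (start, end) in partitions.items():
    print(name, size_bytes(start, end) // 2**20, "MiB")

print("disk bytes:", 5242880 * SECTOR)   # 21474836480, i.e. 20 GiB
print("GPT header offset:", 1 * SECTOR)  # LBA 1 in bytes
```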

Disk layout from lsblk
----------------------
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS PARTTYPE                             FSTYPE
sda      8:0    0    20G  0 disk
├─sda1   8:1    0    10M  0 part             9e1a2d38-c612-4316-aa26-8b49521e5a8b
├─sda2   8:2    0     1G  0 part /boot       0fc63daf-8483-4772-8e79-3d69d8477de4 ext2
└─sda3   8:3    0    19G  0 part /           0fc63daf-8483-4772-8e79-3d69d8477de4 ext4

Comment 7 Adam Williamson 2023-05-30 16:16:28 UTC
thuth: yeah, there's definitely *some* support there - I linked to some later commits that add some stuff missing from that original implementation - but I couldn't find any clear indication of how full the implementation is, what kinda disk layouts it's intended to work with, what it was tested on etc.

Comment 8 Adam Williamson 2023-06-16 13:20:47 UTC
Good news: we now have a newer anaconda with the changes tagged in Rawhide, and the workaround is confirmed - we're now getting MBR labels on ppc64le installs by default, so the openQA tests don't all fail any more!

https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=Rawhide&build=Fedora-Rawhide-20230616.n.0&groupid=3

The underlying bug still exists and should ideally be addressed, though, so I'm keeping this open but dropping the severity.

Comment 9 Fedora Release Engineering 2023-08-16 08:15:16 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 39 development cycle.
Changing version to 39.

