The Fedora openQA tests for ppc64le have been showing a lot of failures for some time, but I only recently found time to investigate. Most of the tests fail every day; the small number that pass all seem to be ones that don't involve doing an install, or that install to a pre-created disk image instead of a fresh one.

The failure mode is that the install works fine, but booting the installed system fails on a SLOF screen (see attachment). It shows:

  Trying to load:  from: /pci@800000020000000/scsi@7:1 ...
  E3405: No such device
  E3407: Load failed

I figured out the tests started failing after 2022-08-15, and looking into what changed around then, I noticed that's when we started using GPT by default when formatting disks, in anaconda-38.1-1. So I did a scratch build of python-blivet which forces the use of msdos disk labels on ppc64le (it drops 'gpt' from the list of supported disk labels, so anaconda is forced to use 'msdos', the only remaining choice), tested an install with that, and... it works. The installed system boots.

I don't know if this is only a problem on qemu with SLOF, or if it also affects bare metal ppc64le systems with real firmware.

Reproducible: Always

Steps to Reproduce:
1. Install any Fedora since Fedora-Rawhide-20220815.n.0 to a qemu ppc64le VM using a fresh disk, so anaconda will format it using a GPT disk label
2. Try to boot it

Actual Results: It fails to boot, showing the SLOF errors described above and attached

Expected Results: It should boot successfully

This happens with both our current (old) SLOF build - SLOF-20210217-6.git33a7322d.fc38 - and a build of the most recent SLOF tagged upstream (20220719). I did a build of that just to see if it would help, but it doesn't.
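For anyone triaging similar reports: whether an installed disk got a GPT or msdos label can be checked without booting the guest, e.g. with `parted -s disk.img print`, or with a quick script. Here's a minimal sketch (not the tooling openQA uses; it assumes a raw image with 512-byte logical sectors):

```python
def detect_disk_label(image_path):
    """Return 'gpt', 'msdos', or 'unknown' for a raw disk image.

    Rough sketch, assuming 512-byte logical sectors: a GPT disk
    carries the "EFI PART" signature at the start of LBA 1, while a
    plain MBR/msdos label only has the 0x55AA boot signature at the
    end of sector 0. GPT disks also keep a protective MBR, so the
    GPT check has to come first.
    """
    with open(image_path, "rb") as f:
        sector0 = f.read(512)
        sector1 = f.read(512)
    if sector1[:8] == b"EFI PART":
        return "gpt"
    if sector0[510:512] == b"\x55\xaa":
        return "msdos"
    return "unknown"
```

Running this against the image written by an affected install should report 'gpt'; against a pre-2022-08-15 install, 'msdos'.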
Here's some information on the disk layout, from lsblk:

NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS        PARTTYPE                             FSTYPE
loop0    7:0    0 543.3M  1 loop /run/rootfsbase                                         squashfs
sr0     11:0    1 682.1M  0 rom  /run/install/repo                                       iso9660
zram0  251:0    0   3.8G  0 disk [SWAP]
vda    252:0    0    10G  0 disk
├─vda1 252:1    0     4M  0 part                    9e1a2d38-c612-4316-aa26-8b49521e5a8b
├─vda2 252:2    0     1G  0 part /mnt/sysroot/boot  0fc63daf-8483-4772-8e79-3d69d8477de4 ext4
│                                /mnt/sysimage/boot
└─vda3 252:3    0     9G  0 part /mnt/sysroot/home  0fc63daf-8483-4772-8e79-3d69d8477de4 btrfs
                                 /mnt/sysroot
                                 /mnt/sysimage/home
                                 /mnt/sysimage
vdb    252:16   0    10G  0 disk

I did wonder if the PARTTYPE for the PReP boot partition or the /boot partition might be wrong, but they don't seem to be. 9e1a2d38-c612-4316-aa26-8b49521e5a8b is correct for a PReP boot partition, 0fc63daf-8483-4772-8e79-3d69d8477de4 is correct for a Linux data partition, and those are the UUIDs that SLOF looks for:

https://github.com/aik/SLOF/blob/master/slof/fs/packages/disk-label.fs#L387
https://github.com/aik/SLOF/blob/master/slof/fs/packages/disk-label.fs#L404

The most recent relevant commits to SLOF, https://github.com/aik/SLOF/commit/7b1fb8daf911d3d54a1246b69c1d06a6cd8471f5 and https://github.com/aik/SLOF/commit/12b5d02e378a204c986b52d56b4ca8a0dab6ba21 , kinda imply an attempt to support booting from GPT, but I'm not sure if it's *complete* or supports our exact setup. So I'm not sure if the problem here is best described as:

* SLOF intends to support our GPT layout but there's a bug preventing it, and we should fix that bug
* SLOF does not yet support our GPT layout, and we should enhance it to do so
* SLOF does not yet support our GPT layout, so we should somehow ensure we use an msdos label when installing on ppc64le (only on virt or when using SLOF? or always?)

It's easy enough to force blivet to always use an msdos disk label on ppc64le, but I'm not sure if that's the *right* or best fix.
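One wrinkle worth keeping in mind when comparing those PARTTYPE values against what the firmware reads: in a GPT partition entry the type GUID is stored in mixed-endian form (the first three fields little-endian), so the raw on-disk bytes are not in the same order as the string lsblk prints, and a naive byte-for-byte comparison in firmware could go wrong. A quick illustrative sketch of the encoding (this is not SLOF's Forth code, just the same decoding in Python):

```python
import uuid

# The GUIDs SLOF's disk-label.fs compares against, as printed by lsblk
PREP_BOOT_GUID = "9e1a2d38-c612-4316-aa26-8b49521e5a8b"
LINUX_DATA_GUID = "0fc63daf-8483-4772-8e79-3d69d8477de4"

def parse_type_guid(entry):
    """Decode the type GUID from a GPT partition entry.

    The first 16 bytes of a 128-byte entry are the type GUID, stored
    with the first three fields little-endian ("mixed endian").
    """
    return str(uuid.UUID(bytes_le=bytes(entry[:16])))

def encode_type_guid(guid_str):
    """Inverse helper: the 16 bytes as they appear on disk."""
    return uuid.UUID(guid_str).bytes_le
```

For example, the PReP GUID's first field 9e1a2d38 lands on disk as the bytes 38 2d 1a 9e.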
Additional note here: the change that resulted in us using GPT labels on ppc64le installs, arguably, should not have done so, since it's titled "Install Using GPT on x86_64 BIOS by Default". I added some notes on what exactly went on there in https://bugzilla.redhat.com/show_bug.cgi?id=2092091#c6 .
Created attachment 1966649 [details] screenshot of the boot failure
Sorry, forgot to note: the reason some tests that install to a pre-created disk image work is that those pre-created disk images happen to use msdos disk labels, not GPT ones. I haven't tested, but I bet if I tweaked createhdds (the tool that creates the images) to use GPT disk labels and re-generated the images, those tests would start failing too.
https://github.com/storaged-project/blivet/pull/1132 and https://github.com/rhinstaller/anaconda/pull/4795 would avoid this by going back to MBR labels on ppc64le installs. Arguably this would still be a bug/missing feature in SLOF, but it'd be much less important.
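The approach in those PRs amounts to taking 'gpt' out of the candidate label list on ppc64le, so the fallback logic lands on 'msdos'. A simplified sketch of that kind of selection logic (the helper names here are hypothetical, not blivet's actual API):

```python
def candidate_disk_labels(arch, prefer_gpt=True):
    """Hypothetical sketch of disk label selection, not blivet's real code.

    Returns label types in preference order; the installer uses the
    first supported one. On ppc64le we drop 'gpt' entirely, so 'msdos'
    becomes the only remaining choice, sidestepping the SLOF boot
    failure described in this bug.
    """
    labels = ["gpt", "msdos"] if prefer_gpt else ["msdos", "gpt"]
    if arch == "ppc64le":
        labels = [label for label in labels if label != "gpt"]
    return labels

def default_disk_label(arch):
    # First candidate wins, mirroring "use the best supported label"
    return candidate_disk_labels(arch)[0]
```

The blunt-instrument nature of this is why it's a workaround: it also forces msdos on ppc64le hardware whose firmware handles GPT fine, like the LPAR case reported above.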
FWIW, SLOF should have support for GPT since 2013: https://github.com/aik/SLOF/commit/a51e46c2384deb95c41b0ff6c3025724d5d6cc08 ... but it's likely not tested very well, so I'd expect a bug in SLOF here.
Just to let you know: an install on a GPT disk on a P9 or P10 LPAR is working as expected, and the system is functional. Here is an example of today's Rawhide ppc64le Server image on a P9 LPAR.

Information from fdisk
----------------------
Disk /dev/sda: 20 GiB, 21474836480 bytes, 5242880 sectors
Disk model: VDASD
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: CA078465-8FB7-4BA2-82B7-4D479CB71CDE

Device      Start     End Sectors Size Type
/dev/sda1     256    2815    2560  10M PowerPC PReP boot
/dev/sda2    2816  264959  262144   1G Linux filesystem
/dev/sda3  264960 5242622 4977663  19G Linux filesystem

Disk layout from lsblk
----------------------
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS PARTTYPE                             FSTYPE
sda      8:0    0  20G  0 disk
├─sda1   8:1    0  10M  0 part             9e1a2d38-c612-4316-aa26-8b49521e5a8b
├─sda2   8:2    0   1G  0 part /boot       0fc63daf-8483-4772-8e79-3d69d8477de4 ext2
└─sda3   8:3    0  19G  0 part /           0fc63daf-8483-4772-8e79-3d69d8477de4 ext4
thuth: yeah, there's definitely *some* support there - I linked to some later commits that add pieces missing from that original implementation - but I couldn't find any clear indication of how complete the implementation is, what kinds of disk layouts it's intended to work with, what it was tested on, etc.
Good news: we now have a newer anaconda with the changes tagged in Rawhide, and this is confirmed worked around - we're now getting MBR labels on ppc64le installs by default, so the openQA tests no longer all fail!

https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=Rawhide&build=Fedora-Rawhide-20230616.n.0&groupid=3

The underlying bug still exists and should ideally be addressed, though, so I'm keeping this open but dropping the severity.
This bug appears to have been reported against 'rawhide' during the Fedora Linux 39 development cycle. Changing version to 39.