Bug 2209760

Summary: Cannot boot Fedora installed to a GPT-labeled disk on ppc64le qemu (with SLOF)
Product: [Fedora] Fedora Reporter: Adam Williamson <awilliam>
Component: SLOF Assignee: Fedora Virtualization Maintainers <virt-maint>
Status: ASSIGNED --- QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low Docs Contact:
Priority: unspecified    
Version: 42 CC: bugzilla, crobinso, dan, efintzel, orion, pbonzini, rjones, thuth, virt-maint
Target Milestone: ---   
Target Release: ---   
Hardware: ppc64le   
OS: Linux   
Whiteboard: openqa
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1071880    
Attachments:
Description Flags
screenshot of the boot failure none

Description Adam Williamson 2023-05-24 16:46:06 UTC
The Fedora openQA tests for ppc64le have been showing a lot of failures for some time, but I only recently found time to investigate. Most of the tests fail, every day; the small number that pass all seem to be ones that don't involve doing an install, or that install to a pre-created disk image instead of a fresh one.

The failure mode is that the install works fine, but booting the installed system fails on a SLOF screen (see attachment). It shows:

Trying to load:   from: /pci@800000020000000/scsi@7:1 ...
E3405: No such device

E3407: Load failed

I figured out the tests started failing after 2022-08-15, and looking into what changed around then, I noticed: that's when we started using GPT by default when formatting disks, in anaconda-38.1-1.

So I did a scratch build of python-blivet which forces the use of msdos disk labels on ppc64le (it drops 'gpt' from the list of supported disk labels, so anaconda is forced to use 'msdos', the only choice remaining), tested an install with that, and...it works. The installed system boots.
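
To illustrate the idea behind that scratch build, here is a minimal sketch. It is NOT blivet's actual code or API: the function names, the list of label types, and the preference order are all assumptions made for illustration. It only shows how dropping 'gpt' from the supported list leaves 'msdos' as the sole candidate on ppc64le.

```python
# Hypothetical sketch only; names and ordering are assumptions, not blivet code.
DEFAULT_LABEL_TYPES = ["gpt", "msdos"]  # assumed order: most preferred first


def supported_label_types(arch, label_types=DEFAULT_LABEL_TYPES):
    """Return the candidate disk label types for an architecture."""
    if arch == "ppc64le":
        # Drop 'gpt' so that 'msdos' is the only choice remaining.
        return [t for t in label_types if t != "gpt"]
    return list(label_types)


def default_label_type(arch):
    """Pick the most preferred label type that is still supported."""
    return supported_label_types(arch)[0]
```

With this sketch, default_label_type("ppc64le") yields 'msdos' while other architectures keep 'gpt'.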

I don't know if this is only a problem on qemu with SLOF, or if it also affects bare metal ppc64le systems with real firmwares.

Reproducible: Always

Steps to Reproduce:
1. Install any Fedora since Fedora-Rawhide-20220815.n.0 to a qemu ppc64le VM using a fresh disk, so anaconda will format it using a GPT disk label
2. Try and boot it
Actual Results:  
It fails to boot, showing the SLOF errors described above and attached

Expected Results:  
It should boot successfully

This happens with both our current (old) SLOF build - SLOF-20210217-6.git33a7322d.fc38 - and a build of the most recent SLOF tagged upstream (20220719). I did a build of that just to see if it would help, but it doesn't.

Here's some information on the disk layout, from lsblk:

NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS        PARTTYPE                             FSTYPE
loop0    7:0    0 543.3M  1 loop /run/rootfsbase                                         squashfs
sr0     11:0    1 682.1M  0 rom  /run/install/repo                                       iso9660
zram0  251:0    0   3.8G  0 disk [SWAP]                                                  
vda    252:0    0    10G  0 disk                                                         
├─vda1 252:1    0     4M  0 part                    9e1a2d38-c612-4316-aa26-8b49521e5a8b 
├─vda2 252:2    0     1G  0 part /mnt/sysroot/boot  0fc63daf-8483-4772-8e79-3d69d8477de4 ext4
│                                /mnt/sysimage/boot                                      
└─vda3 252:3    0     9G  0 part /mnt/sysroot/home  0fc63daf-8483-4772-8e79-3d69d8477de4 btrfs
                                 /mnt/sysroot                                            
                                 /mnt/sysimage/home                                      
                                 /mnt/sysimage                                           
vdb    252:16   0    10G  0 disk 

I did wonder if the PARTTYPE for the PReP boot partition or the /boot partition might be wrong, but they don't seem to be. 9e1a2d38-c612-4316-aa26-8b49521e5a8b is correct for a PReP boot partition, 0fc63daf-8483-4772-8e79-3d69d8477de4 is correct for a Linux data partition, and those are the UUIDs that SLOF looks for:

https://github.com/aik/SLOF/blob/master/slof/fs/packages/disk-label.fs#L387
https://github.com/aik/SLOF/blob/master/slof/fs/packages/disk-label.fs#L404
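
For reference, the PARTTYPE check above can be sketched in Python. This is only an illustration and is not how SLOF does the comparison (SLOF is written in Forth); the helper names are made up. The one format detail it relies on is that GPT stores the first three GUID fields little-endian on disk, which is what Python's uuid.UUID.bytes_le produces.

```python
import uuid

# Partition type GUIDs as reported in the PARTTYPE column of lsblk above.
PREP_BOOT = uuid.UUID("9e1a2d38-c612-4316-aa26-8b49521e5a8b")
LINUX_DATA = uuid.UUID("0fc63daf-8483-4772-8e79-3d69d8477de4")

KNOWN_TYPES = {
    PREP_BOOT: "PowerPC PReP boot",
    LINUX_DATA: "Linux filesystem",
}


def describe_parttype(text):
    """Map a PARTTYPE string to a human-readable name (hypothetical helper)."""
    return KNOWN_TYPES.get(uuid.UUID(text), "unknown")


def on_disk_bytes(guid):
    """Return the 16 bytes as stored in a GPT partition entry."""
    # The first three fields are byte-swapped; the last eight bytes are not.
    return guid.bytes_le
```

Any firmware comparing raw partition entry bytes has to compare against the byte-swapped form returned by on_disk_bytes(), not the textual order.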

The most recent relevant commits to SLOF, https://github.com/aik/SLOF/commit/7b1fb8daf911d3d54a1246b69c1d06a6cd8471f5 and https://github.com/aik/SLOF/commit/12b5d02e378a204c986b52d56b4ca8a0dab6ba21 , kinda imply an attempt to support booting from GPT, but I'm not sure if it's *complete* or supports our exact setup. So I'm not sure if the problem here is best described as:

* SLOF intends to support our GPT layout but there's a bug preventing it, and we should fix that bug
* SLOF does not yet support our GPT layout, and we should enhance it to do so
* SLOF does not yet support our GPT layout, so we should somehow ensure we use an msdos label when installing on ppc64le (only on virt or when using SLOF? or always?)

It's easy enough to force blivet to always use an msdos disk label on ppc64le, but I'm not sure if that's the *right* or best fix.

Comment 1 Adam Williamson 2023-05-24 17:06:37 UTC
Additional note here: the change that resulted in us using GPT labels on ppc64le installs, arguably, should not have done so, since it's titled "Install Using GPT on x86_64 BIOS by Default". I added some notes on what exactly went on there in https://bugzilla.redhat.com/show_bug.cgi?id=2092091#c6 .

Comment 2 Adam Williamson 2023-05-24 17:10:28 UTC
Created attachment 1966649 [details]
screenshot of the boot failure

Comment 3 Adam Williamson 2023-05-24 17:14:36 UTC
Sorry, forgot to note: the reason some tests that install to a pre-created disk image work is that those pre-created disk images happen to use msdos disk labels, not GPT ones. I haven't tested, but I bet if I tweaked createhdds (the tool that creates the images) to use GPT disk labels and re-generated the images, those tests would start failing too.
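
A quick way to check which kind of label an image carries is to look at its first two sectors. The sketch below is an assumption-laden illustration, not part of createhdds: the function name is made up, and it assumes the caller knows the logical sector size (512 bytes by default). It relies only on two well-known on-disk facts: a GPT header at LBA 1 begins with the signature "EFI PART", and an MBR ends with the bytes 0x55 0xAA at offsets 510-511. GPT is tested first because a GPT disk also carries a protective MBR.

```python
def disk_label_type(data: bytes, sector_size: int = 512) -> str:
    """Classify a disk label from the first two logical sectors of a disk."""
    if len(data) < 2 * sector_size:
        raise ValueError("need at least two sectors of data")
    # GPT: the header at LBA 1 starts with the 8-byte signature "EFI PART".
    if data[sector_size:sector_size + 8] == b"EFI PART":
        return "gpt"
    # msdos (MBR): boot signature 0x55 0xAA at bytes 510-511 of LBA 0.
    if data[510:512] == b"\x55\xaa":
        return "msdos"
    return "unknown"
```

To use it on a real image, read the first 2 * sector_size bytes of the file in binary mode and pass them in.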

Comment 4 Adam Williamson 2023-05-26 23:29:57 UTC
https://github.com/storaged-project/blivet/pull/1132 and https://github.com/rhinstaller/anaconda/pull/4795 would avoid this by going back to MBR labels on ppc64le installs. Arguably this would still be a bug/missing feature in SLOF, but it'd be much less important.

Comment 5 Thomas Huth 2023-05-30 10:33:13 UTC
FWIW, SLOF should have support for GPT since 2013:

 https://github.com/aik/SLOF/commit/a51e46c2384deb95c41b0ff6c3025724d5d6cc08

... but it's likely not tested very well, so I'd expect a bug in SLOF here.

Comment 6 Éric Fintzel 2023-05-30 14:04:01 UTC
Just to let you know that an install to a GPT disk on a P9 or P10 LPAR works as expected, and the system is functional.
Here is an example from today's Rawhide ppc64le Server image on a P9 LPAR.

Information from fdisk
----------------------
Disk /dev/sda: 20 GiB, 21474836480 bytes, 5242880 sectors
Disk model: VDASD
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: CA078465-8FB7-4BA2-82B7-4D479CB71CDE

Device      Start     End Sectors Size Type
/dev/sda1     256    2815    2560  10M PowerPC PReP boot
/dev/sda2    2816  264959  262144   1G Linux filesystem
/dev/sda3  264960 5242622 4977663  19G Linux filesystem
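
As a sanity check on the fdisk numbers above, the sizes follow from the sector counts and the 4096-byte sector size. The sketch below just redoes that arithmetic; the variable and function names are made up for illustration, and the values are copied from the table.

```python
SECTOR_SIZE = 4096  # bytes, from "Units: sectors of 1 * 4096 = 4096 bytes"
DISK_SECTORS = 5242880

# (start, end) sectors, inclusive, copied from the fdisk table.
PARTITIONS = {
    "/dev/sda1": (256, 2815),
    "/dev/sda2": (2816, 264959),
    "/dev/sda3": (264960, 5242622),
}


def size_bytes(start, end, sector_size=SECTOR_SIZE):
    """Size of a partition whose start and end sectors are both inclusive."""
    return (end - start + 1) * sector_size


disk_bytes = DISK_SECTORS * SECTOR_SIZE  # 21474836480 bytes = 20 GiB
sizes = {name: size_bytes(*span) for name, span in PARTITIONS.items()}
```

Here sda1 comes out at exactly 10 MiB and sda2 at exactly 1 GiB; sda3 is slightly under 19 GiB, which fdisk rounds to 19G.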

Disk layout from lsblk
----------------------
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS PARTTYPE                             FSTYPE
sda      8:0    0    20G  0 disk
├─sda1   8:1    0    10M  0 part             9e1a2d38-c612-4316-aa26-8b49521e5a8b
├─sda2   8:2    0     1G  0 part /boot       0fc63daf-8483-4772-8e79-3d69d8477de4 ext2
└─sda3   8:3    0    19G  0 part /           0fc63daf-8483-4772-8e79-3d69d8477de4 ext4

Comment 7 Adam Williamson 2023-05-30 16:16:28 UTC
thuth: yeah, there's definitely *some* support there - I linked to some later commits that add some stuff missing from that original implementation - but I couldn't find any clear indication of how full the implementation is, what kinda disk layouts it's intended to work with, what it was tested on etc.

Comment 8 Adam Williamson 2023-06-16 13:20:47 UTC
Good news: newer anaconda with the changes is now tagged in Rawhide, and the workaround is confirmed - we're now getting MBR labels on ppc64le installs by default, so the openQA tests don't all fail any more!

https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=Rawhide&build=Fedora-Rawhide-20230616.n.0&groupid=3

The underlying bug still exists and should ideally be addressed, though, so I'm keeping this open but dropping the severity.

Comment 9 Fedora Release Engineering 2023-08-16 08:15:16 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 39 development cycle.
Changing version to 39.

Comment 10 Aoife Moloney 2024-11-08 10:52:43 UTC
This message is a reminder that Fedora Linux 39 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 39 on 2024-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '39'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 39 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 11 Adam Williamson 2024-11-08 16:46:57 UTC
SLOF has barely changed since this bug was filed and the changes don't look relevant, so this is likely still valid.

Comment 12 Aoife Moloney 2025-02-26 12:53:42 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 42 development cycle.
Changing version to 42.