Bug 1975375 - Rawhide arm/aarch64 disk images fail to boot on Raspberry Pi SBCs
Summary: Rawhide arm/aarch64 disk images fail to boot on Raspberry Pi SBCs
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: python-blivet
Version: rawhide
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Vojtech Trefny
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: F35BetaBlocker
Reported: 2021-06-23 14:26 UTC by Paul Whalen
Modified: 2021-07-22 19:48 UTC (History)

Fixed In Version: python-blivet-3.4.0-4.fc35
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-21 13:20:03 UTC
Type: Bug
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1930486 1 None None None 2021-07-08 06:19:20 UTC

Internal Links: 2119436

Description Paul Whalen 2021-06-23 14:26:09 UTC
Description of problem:

Rawhide disk images fail to boot on the Raspberry Pi family of SBCs. Looking at the partition layout, the type of the EFI partition has changed and it is now listed as "EFI (FAT-12/16/32)".


fdisk -l Fedora-Minimal-Rawhide-20210622.n.0.aarch64.raw
..

Device                                           Boot   Start      End Sectors  Size Id Type
Fedora-Minimal-Rawhide-20210622.n.0.aarch64.raw1 *       2048  1230847 1228800  600M ef EFI (FAT-12/16/32)
Fedora-Minimal-Rawhide-20210622.n.0.aarch64.raw2      1230848  3327999 2097152    1G 83 Linux
Fedora-Minimal-Rawhide-20210622.n.0.aarch64.raw3      3328000 12582911 9254912  4.4G 83 Linux

Working Fedora 34 disk images:

fdisk -l Fedora-Minimal-34-1.2.aarch64.raw
..
Device                             Boot   Start      End Sectors  Size Id Type
Fedora-Minimal-34-1.2.aarch64.raw1 *       2048  1230847 1228800  600M  6 FAT16
Fedora-Minimal-34-1.2.aarch64.raw2      1230848  3327999 2097152    1G 83 Linux
Fedora-Minimal-34-1.2.aarch64.raw3      3328000 12582911 9254912  4.4G 83 Linux
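
For reference, the Id column that fdisk prints is the single type byte stored in each MBR partition entry. The following is a minimal Python sketch, not taken from fdisk or blivet, that reads those bytes. The helper names (mbr_partition_types, build_demo_mbr) and the in-memory demo sector are assumptions made up for illustration; the start sectors and sector counts are copied from the two listings above.

```python
import struct

MBR_SIZE = 512
TABLE_OFFSET = 446   # the partition table starts at byte 446 of the MBR
ENTRY_SIZE = 16      # four entries, 16 bytes each
ENTRY_FORMAT = "<B3xB3xII"  # status, CHS start (skipped), type, CHS end (skipped), LBA start, sector count
TYPE_NAMES = {0x06: "FAT16", 0x0B: "W95 FAT32", 0x83: "Linux", 0xEF: "EFI (FAT-12/16/32)"}


def mbr_partition_types(sector0):
    """Return (number, bootable, type_code, start_lba, sectors) for each used entry."""
    if len(sector0) < MBR_SIZE or sector0[510:512] != b"\x55\xaa":
        raise ValueError("not an MBR: missing 0x55AA signature")
    entries = []
    for i in range(4):
        offset = TABLE_OFFSET + i * ENTRY_SIZE
        status, ptype, start, count = struct.unpack_from(ENTRY_FORMAT, sector0, offset)
        if ptype != 0:  # type 0 marks an unused slot
            entries.append((i + 1, status == 0x80, ptype, start, count))
    return entries


def build_demo_mbr(first_type):
    """Hypothetical helper: build an in-memory MBR shaped like the images above."""
    layout = [
        (0x80, first_type, 2048, 1228800),
        (0x00, 0x83, 1230848, 2097152),
        (0x00, 0x83, 3328000, 9254912),
    ]
    sector = bytearray(MBR_SIZE)
    for i, fields in enumerate(layout):
        struct.pack_into(ENTRY_FORMAT, sector, TABLE_OFFSET + i * ENTRY_SIZE, *fields)
    sector[510:512] = b"\x55\xaa"
    return bytes(sector)


for label, code in (("Fedora 34", 0x06), ("Rawhide", 0xEF)):
    for number, bootable, ptype, start, count in mbr_partition_types(build_demo_mbr(code)):
        name = TYPE_NAMES.get(ptype, "?")
        print(f"{label} p{number} boot={bootable} start={start} sectors={count} id={ptype:x} {name}")
```

To inspect a real image instead of the demo sector, pass the first 512 bytes of the file to mbr_partition_types. Only the type byte of the first entry differs between the two layouts.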




Version-Release number of selected component (if applicable):
anaconda-35.17-1.fc35

Comment 1 Fedora Blocker Bugs Application 2021-06-29 15:28:10 UTC
Proposed as a Blocker for 35-beta by Fedora user pbrobinson using the blocker tracking app because:

 This stops various Raspberry Pi and probably some other arm hardware from booting. We've explicitly kept the FAT16 partition type, which the UEFI spec allows as an option, because these devices don't detect the ef ID type even if it's the same format on disk.

Comment 2 Adam Williamson 2021-06-29 16:48:31 UTC
Do we know when exactly this changed? Would help with pinning down what caused it.

Comment 3 Paul Whalen 2021-06-30 13:55:26 UTC
(In reply to Adam Williamson from comment #2)
> Do we know when exactly this changed? Would help with pinning down what
> caused it.

I checked the nightlies as far back as I could. Unfortunately, the best I could work out is that it happened sometime between 20210428 and 20210529.

Fedora-Workstation-Rawhide-20210428.n.1.armhfp.raw1 *       2048  1230847  1228800  600M  6 FAT16
Fedora-IoT-35-20210529.0.aarch64.raw1 *       2048 1028095 1026048  501M ef EFI (FAT-12/16/32)

Comment 4 Vendula Poncova 2021-07-07 15:00:01 UTC
Vojta, do you know about any changes that could be related to this issue?

Comment 5 Vojtech Trefny 2021-07-08 06:19:21 UTC
This was actually an intentional change in blivet 3.4.0[1]: we are now using the parted PARTITION_ESP flag for MBR EFI partitions to set the correct partition ID (EF), see https://bugzilla.redhat.com/show_bug.cgi?id=1930486

I'm moving this bug to blivet. We can simply revert the patch if this is really causing the RPi to not boot.

Paul: Did you try to turn the ESP flag off on the EFI partition ("set 1 esp off" in parted)? Does it fix the boot issue?

Chris: You are the reporter of the original bug to change this, can you take a look at this bug? Do you have some examples of systems that need the ESP flag set to be able to boot?


[1] https://github.com/storaged-project/blivet/pull/933

Comment 6 Chris Murphy 2021-07-08 16:38:07 UTC
Oops. Yeah I'm definitely the instigator, but it spreading to pi images is not intended.

In bug 1930486, that is a UEFI x86_64 VM, with an existing MBR disk. And the installer creates an ESP (it has an EFI/ dir, and the assorted bootloaders) with type code 0x06. It's correct that it should be 0xEF.

And in the case of non-UEFI aarch64 images, the partition in question is not an ESP, so it shouldn't get type code 0xEF. I guess ideally there'd be a specific kickstart type, ARMBOOTLDR or whatever, so that the two bootloader volumes can be distinguished and created by blivet correctly.

However, distinguishing between these two cases is probably sufficiently rare that the easiest thing to do in the near term is just revert PR 933. Another alternative is to not support UEFI on MBR, and goto fail. Just because the UEFI spec says you can do it, doesn't obligate anyone to support that scenario.

Comment 7 Paul Whalen 2021-07-08 21:00:41 UTC
(In reply to Vojtech Trefny from comment #5)
> This was actually intentional change in blivet 3.4.0[1], we are now using
> parted PARTITION_ESP flag for MBR EFI partitions to set the correct
> partition id (EF), see https://bugzilla.redhat.com/show_bug.cgi?id=1930486
> 
> I'm moving this bug blivet, we can simply revert the patch, if this is
> really causing the RPi to not boot.
> 
> Paul: Did you try to turn the ESP flag off on the EFI partition ("set 1 esp
> off" in parted)? Does it fix the boot issue?

Unfortunately not, after running parted:

Device     Boot   Start      End  Sectors  Size Id Type
/dev/sdb1  *       2048  1230847  1228800  600M  b W95 FAT32



(In reply to Chris Murphy from comment #6)

> And in the case of non-UEFI aarch64 images, the partition in question is not

The aarch64 and arm images are UEFI. The change to 32-bit arm was made in F34 - https://fedoraproject.org/wiki/Changes/ARMv7UEFI

Comment 8 Paul Whalen 2021-07-14 20:17:35 UTC
Ping - Can we revert this or somehow make it configurable for both use cases? This significantly impacts Rawhide testing for both armhfp and aarch64.

Comment 9 Chris Murphy 2021-07-15 03:40:29 UTC
Are the failing Raspberry Pis UEFI? The UEFI spec says 0xEF is the correct type code for an EFI system partition on an MBR partition map. Wikipedia says 0x06 is Compaq FAT16. If you change the 1st partition to type code 0x06, does that fix things? I'm uncertain I understand what's actually causing the problem. I take it this one image is intended to boot both UEFI and non-UEFI ARM devices, hence MBR instead of GPT? Because if they're all UEFI devices, then maybe part of the confusion is that this is still using the MBR scheme instead of GPT?

Comment 10 Chris Murphy 2021-07-15 04:10:27 UTC
For what it's worth:

>doc/ChangeLog.dosfstools-2.x:12: - mkdosfs: by default, use FAT32 on devices >= 512MB

The images have FAT32, regardless of what the partition type code is. So if the problem is due to some boards not liking FAT32, and you need FAT16, then either (a) the partition size needs to be reduced below 512M (not sure whether that's in SI or IEC units), or (b) use pykickstart 'part --mkfsoptions "-F 16"' to force FAT16.
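
To make the SI versus IEC question concrete, here is a small Python sketch of the arithmetic for the two EFI partition sizes quoted in this bug (1228800 sectors in the 600M images, 1026048 sectors in the 501M IoT image from comment 3). It assumes 512-byte sectors, which is consistent with the sizes fdisk reports. It only compares sizes against the two possible readings of "512MB"; which threshold mkdosfs actually applies is not established in this bug. The function name size_report is made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors, as the fdisk sizes imply


def size_report(sectors):
    """Compare a partition size against both readings of '512MB'."""
    size = sectors * SECTOR_BYTES
    return {
        "bytes": size,
        "MiB": size / 2**20,                   # IEC units
        "MB": size / 10**6,                    # SI units
        "ge_512_MiB": size >= 512 * 2**20,
        "ge_512_MB": size >= 512 * 10**6,
    }


for name, sectors in (("600M EFI partition", 1228800), ("501M IoT EFI partition", 1026048)):
    print(name, size_report(sectors))
```

The 600M partition is above 512M under either reading, so shrinking it is the only size-based way to get below the threshold. The 501M IoT partition falls between the two readings: below 512 MiB but above 512 MB.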

Comment 11 Chris Murphy 2021-07-15 05:17:29 UTC
Downloaded Fedora-Minimal-34-1.2.aarch64.raw and put it on a loop device. Partition 1, type 0x06, is FAT32 according to dosfsck. It has likely been FAT32 ever since the EFI System partition size was bumped to 600M.

Comment 12 Chris Murphy 2021-07-15 06:48:23 UTC
Seems relevant.

Recognize efi partition (0xef) as a valid boot
https://github.com/raspberrypi/rpi-eeprom/issues/126

Comment 13 Vojtech Trefny 2021-07-15 13:59:43 UTC
(In reply to Paul Whalen from comment #7)
> (In reply to Vojtech Trefny from comment #5)
> > This was actually intentional change in blivet 3.4.0[1], we are now using
> > parted PARTITION_ESP flag for MBR EFI partitions to set the correct
> > partition id (EF), see https://bugzilla.redhat.com/show_bug.cgi?id=1930486
> > 
> > I'm moving this bug blivet, we can simply revert the patch, if this is
> > really causing the RPi to not boot.
> > 
> > Paul: Did you try to turn the ESP flag off on the EFI partition ("set 1 esp
> > off" in parted)? Does it fix the boot issue?
> 
> Unfortunately not, after running parted:
> 
> Device     Boot   Start      End  Sectors  Size Id Type
> /dev/sdb1  *       2048  1230847  1228800  600M  b W95 FAT32
> 

So it still doesn't boot after this change? If not, then the partition type is not the problem, but FAT32 probably is, and we should start forcing FAT16 for EFI partitions?

Comment 14 Paul Whalen 2021-07-15 14:34:53 UTC
(In reply to Chris Murphy from comment #11)
> Downloaded Fedora-Minimal-34-1.2.aarch64.raw and put it on a loop device,
> partition 1 type 0x06 according to dosfsck is FAT32. And likely have been
> FAT32 ever since the EFI System partition size was bumped to 600M.

Thanks Chris, confirmed that parted shows the partition as FAT32 in Fedora 34, but the partition ID is 06, so it works as expected (and fdisk thinks it's FAT16).

I changed the Rawhide disk image partition ID to 06 in fdisk and it now boots as well.

Before:
/dev/sdb1  *       2048  1230847  1228800  600M ef EFI (FAT-12/16/32)

After:
/dev/sdb1  *       2048  1230847  1228800  600M  6 FAT16
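
The change made with fdisk above amounts to rewriting one byte: the type code at offset 4 of the first 16-byte partition entry, which starts at byte 446 of the MBR. The following Python sketch shows that on an in-memory copy of a sector only. It is an illustration under those layout assumptions, not the procedure used in this bug; the function name set_partition_type and the demo sector are hypothetical.

```python
TABLE_OFFSET = 446   # start of the MBR partition table
ENTRY_SIZE = 16      # size of one partition entry
TYPE_BYTE = 4        # offset of the type code inside an entry


def set_partition_type(sector0, number, new_type):
    """Return a copy of a 512-byte MBR with partition `number` (1-4) retyped."""
    if len(sector0) != 512 or sector0[510:512] != b"\x55\xaa":
        raise ValueError("not an MBR: missing 0x55AA signature")
    if not 1 <= number <= 4:
        raise ValueError("MBR primary partitions are numbered 1 to 4")
    if not 0 <= new_type <= 0xFF:
        raise ValueError("type code must fit in one byte")
    patched = bytearray(sector0)
    patched[TABLE_OFFSET + (number - 1) * ENTRY_SIZE + TYPE_BYTE] = new_type
    return bytes(patched)


# Demo on a synthetic sector: bootable first partition with type 0xEF.
demo = bytearray(512)
demo[TABLE_OFFSET] = 0x80
demo[TABLE_OFFSET + TYPE_BYTE] = 0xEF
demo[510:512] = b"\x55\xaa"

fixed = set_partition_type(bytes(demo), 1, 0x06)
print(hex(demo[TABLE_OFFSET + TYPE_BYTE]), "->", hex(fixed[TABLE_OFFSET + TYPE_BYTE]))
```

Nothing in the filesystem itself is touched by such a change, which matches the observation that the partition content stays FAT32 while only the reported Id differs.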

Comment 15 Chris Murphy 2021-07-16 01:09:31 UTC
What RPi models are affected? It is pretty weird that they'd explicitly choose 0x06, and surprising that they'd bake that into their firmware as a hard requirement.

OK the next thing to try is whether FAT16 with type code 0xEF is bootable. Based on the discussion in https://github.com/raspberrypi/rpi-eeprom/issues/126 I suppose it won't work, and that there's really a requirement for 0x06 type. Which then makes me wonder if we can reliably just move to GPT for all of these images? It sounds like the problem doesn't happen with GPT when using the correct EFI partition type GUID.

I suppose the path of least resistance for a short-term workaround is to simply revert the 0x06->0xEF change, but that gives me a frowny face. The product isn't complying with a very laboriously documented spec that's been around for well over a decade. So my first position would be: it's busted firmware, and the vendor should fix it. Obviously that has some degree of impracticality, and it would be awesome if Anaconda/Blivet could special case the make/model, using 0x06 only there. Going down this road of doing the wrong thing everywhere just to paper over this sort of firmware defect is folly.

Comment 16 Chris Murphy 2021-07-16 01:15:25 UTC
>(and fdisk thinks its FAT16)

It's obscure knowledge, but fdisk does not check any file system version, magic, or other metadata at all. It says FAT16 because type code 0x06 is ostensibly "reserved", via informal means, by Compaq. It's downright bizarre that 0x06 ended up being picked for this purpose instead of the well documented 0xEF. Maybe there's a more interesting history than "oops someone screwed up", but I kinda doubt it.

Whereas parted *does* check the contents of some partitions, and in particular with FAT it knows whether it's FAT12/16/32. Parted doesn't trust the type code alone.
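
As an illustration of checking the contents rather than the type code, here is a minimal Python sketch that classifies a FAT volume from its boot sector using the cluster-count rule from Microsoft's FAT specification. It is not parted's or dosfsck's actual code. The helper make_bpb and the geometry values passed to it (sectors per cluster, FAT sizes, reserved sectors) are made-up assumptions; only the 1228800-sector volume size comes from this bug.

```python
import struct


def fat_type(boot_sector):
    """Classify a FAT volume as FAT12/16/32 from its BIOS Parameter Block."""
    (byts_per_sec, sec_per_clus, rsvd_sec_cnt, num_fats, root_ent_cnt,
     tot_sec16, _media, fat_sz16) = struct.unpack_from("<HBHBHHBH", boot_sector, 11)
    tot_sec32, fat_sz32 = struct.unpack_from("<II", boot_sector, 32)
    if byts_per_sec == 0 or sec_per_clus == 0:
        raise ValueError("not a FAT boot sector")
    # Sectors used by the fixed root directory (zero on FAT32).
    root_dir_sectors = (root_ent_cnt * 32 + byts_per_sec - 1) // byts_per_sec
    fat_sz = fat_sz16 or fat_sz32
    tot_sec = tot_sec16 or tot_sec32
    data_sec = tot_sec - (rsvd_sec_cnt + num_fats * fat_sz + root_dir_sectors)
    clusters = data_sec // sec_per_clus
    if clusters < 4085:
        return "FAT12"
    if clusters < 65525:
        return "FAT16"
    return "FAT32"


def make_bpb(sec_per_clus, root_ent_cnt, fat_sz16, fat_sz32, tot_sec=1228800):
    """Hypothetical helper: synthesize a boot sector with illustrative values."""
    sector = bytearray(512)
    # bytes/sector, sectors/cluster, reserved sectors, FAT copies,
    # root entries, 16-bit total (0 = use 32-bit field), media, 16-bit FAT size
    struct.pack_into("<HBHBHHBH", sector, 11,
                     512, sec_per_clus, 32, 2, root_ent_cnt, 0, 0xF8, fat_sz16)
    struct.pack_into("<II", sector, 32, tot_sec, fat_sz32)
    return bytes(sector)


print(fat_type(make_bpb(sec_per_clus=8, root_ent_cnt=0, fat_sz16=0, fat_sz32=1200)))
print(fat_type(make_bpb(sec_per_clus=32, root_ent_cnt=512, fat_sz16=150, fat_sz32=0)))
```

The result depends only on the fields in the boot sector, so the same volume is classified the same way whether its MBR type code is 0x06, 0x0b or 0xEF.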

Comment 17 Vojtech Trefny 2021-07-19 09:01:54 UTC
Upstream PR with the revert commit: https://github.com/storaged-project/blivet/pull/967

Comment 18 Paul Whalen 2021-07-22 19:48:20 UTC
Confirmed fixed in python3-blivet-3.4.0-4.fc35

Disk image from the Fedora-Rawhide-20210722.n.0 compose:

Device     Boot   Start      End  Sectors  Size Id Type
/dev/sdb1  *       2048  1230847  1228800  600M  6 FAT16
/dev/sdb2       1230848  3327999  2097152    1G 83 Linux
/dev/sdb3       3328000 31186943 27858944 13.3G 83 Linux

Thank you!

