Bug 2339164 - aarch64: failing to boot with "Data abort: Translation fault, third level"
Summary: aarch64: failing to boot with "Data abort: Translation fault, third level"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: grub2
Version: rawhide
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Nicolas Frayer
QA Contact: Fedora Extras Quality Assurance
URL: https://github.com/coreos/fedora-core...
Whiteboard:
Depends On:
Blocks: ARMTracker
Reported: 2025-01-21 17:19 UTC by Dusty Mabe
Modified: 2025-01-31 17:55 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2025-01-31 17:55:19 UTC
Type: ---
Embargoed:



Description Dusty Mabe 2025-01-21 17:19:17 UTC
Our Fedora CoreOS ISO started failing to boot when we switched to grub2-2.12-18.fc42. The console shows "Data abort: Translation fault, third level".

I'm not sure how to debug much further:



```
UEFI firmware (version edk2-20241117-5.fc41 built at 00:00:00 on Nov 27 2024)
SyncPcrAllocationsAndPcrMask!
PeiDelayedDispatchOnEndOfPei Count of dispatch cycles is 0
Tpm2GetCapabilityPcrs - 00000004
alg - 4
alg - B
alg - C
alg - D
BdsDxe: loading Boot0001 "UEFI Misc Device" from PciRoot(0x0)/Pci(0x3,0x0)
BdsDxe: starting Boot0001 "UEFI Misc Device" from PciRoot(0x0)/Pci(0x3,0x0)


Synchronous Exception at 0x000000013DA8A0E8
PC 0x00013DA8A0E8
PC 0x00013DA8A080
PC 0x00013DA8A32C
PC 0x00013DA8A420
PC 0x00013DF7524C
PC 0x00013DF779E4
PC 0x00013DF8068C
PC 0x00013DF811A8
PC 0x00013E78E574
PC 0x00013E78E624
PC 0x00013E78F9A8
PC 0x00013E78C030
PC 0x00004769B2C0 (0x000047694000+0x000072C0) [ 1] DxeCore.dll
PC 0x00013F14AECC (0x00013F144000+0x00006ECC) [ 2] BdsDxe.dll
PC 0x00013F14CCF0 (0x00013F144000+0x00008CF0) [ 2] BdsDxe.dll
PC 0x00004769DFF8 (0x000047694000+0x00009FF8) [ 3] DxeCore.dll
[ 1] /builddir/build/BUILD/edk2-20241117-build/edk2-0f3867fa6ef0/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Core/Dxe/DxeMain/DEBUG/DxeCore.dll
[ 2] /builddir/build/BUILD/edk2-20241117-build/edk2-0f3867fa6ef0/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Universal/BdsDxe/BdsDxe/DEBUG/BdsDxe.dll
[ 3] /builddir/build/BUILD/edk2-20241117-build/edk2-0f3867fa6ef0/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Core/Dxe/DxeMain/DEBUG/DxeCore.dll

  X0 0x0000000000000000   X1 0x0000000000000066   X2 0x0000000000000A23   X3 0x000000013DF8D9E8
  X4 0x000000013DA8ED80   X5 0x000000000018A857   X6 0x0000000000000200   X7 0x0000000000000000
  X8 0x000000013FFFF848   X9 0x0000000700000000  X10 0x000000013E918000  X11 0x000000013E91DFFF
 X12 0x0000000000000000  X13 0x0000000000026008  X14 0x0000000000000000  X15 0x0000000000000000
 X16 0x000000013F353290  X17 0xCE628BC691F2F420  X18 0x0000000000000011  X19 0x000000013E810000
 X20 0x0000000000000000  X21 0x000000013EC63E98  X22 0x000000013E824038  X23 0x000000013EC63E98
 X24 0x0000000047693A10  X25 0x000000013E824038  X26 0x000000013E824100  X27 0x000000013E824108
 X28 0x000000013E824110   FP 0x0000000047693750   LR 0x000000013DA8A080  

  V0 0xAFAFAFAFAFAFAFAF AFAFAFAFAFAFAFAF   V1 0x3832315F5345415F 534C543A36353241
  V2 0x3A4C4C4100363532 4148535F4D43475F   V3 0x0000000000000000 0000000100000000
  V4 0x4000000000000000 0000000000000000   V5 0x4010040140100401 4010040140100401
  V6 0x4000000000000100 4000000000000100   V7 0x0000000000000000 0000000000000000
  V8 0x0000000000000000 0000000000000000   V9 0x0000000000000000 0000000000000000
 V10 0x0000000000000000 0000000000000000  V11 0x0000000000000000 0000000000000000
 V12 0x0000000000000000 0000000000000000  V13 0x0000000000000000 0000000000000000
 V14 0x0000000000000000 0000000000000000  V15 0x0000000000000000 0000000000000000
 V16 0x0000000000000000 0000000000000000  V17 0x0000000000000000 0000000000000000
 V18 0x0000000000000000 0000000000000000  V19 0x0000000000000000 0000000000000000
 V20 0x0000000000000000 0000000000000000  V21 0x0000000000000000 0000000000000000
 V22 0x0000000000000000 0000000000000000  V23 0x0000000000000000 0000000000000000
 V24 0x0000000000000000 0000000000000000  V25 0x0000000000000000 0000000000000000
 V26 0x0000000000000000 0000000000000000  V27 0x0000000000000000 0000000000000000
 V28 0x0000000000000000 0000000000000000  V29 0x000000013EBD0C38 000000013EBCE780
 V30 0x00000000476936D0 00000000476936D0  V31 0xFFFFFF80FFFFFFE0 00000000476936B0

  SP 0x0000000047693750  ELR 0x000000013DA8A0E8  SPSR 0x20000205  FPSR 0x00000000
 ESR 0x96000007          FAR 0x0000000000000030

 ESR : EC 0x25  IL 0x1  ISS 0x00000007

Data abort: Translation fault, third level

Stack dump:
  0000047693650: 0000000100000000 0000000000000000 AFAFAFAFAFAFAFAF AFAFAFAFAFAFAFAF
  0000047693670: 534C543A36353241 3832315F5345415F 4148535F4D43475F 3A4C4C4100363532
  0000047693690: 0000000100000000 0000000000000000 0000000000000000 4000000000000000
  00000476936B0: 4010040140100401 4010040140100401 4000000000000100 4000000000000100
  00000476936D0: 0000000000000000 0000000000000000 000000013DA8ED80 000000000018A857
  00000476936F0: 0000000000000200 0000000000000000 0000000047693750 000000013DA8A080
  0000047693710: 000000013DA8EE80 000000013DA8ED80 00000000000016A7 000000013DA8A024
  0000047693730: 0000000000000000 000000013DA8ED00 000000013DA8ED80 000000013DF90D30
> 0000047693750: 0000000047693810 000000013DA8A32C 0000000047693820 000000013DA8EE80
  0000047693770: 000000013DF70AB8 00000000476937D0 000000013DA8EE80 0000000000000004
  0000047693790: 0000000047693810 000000013DA8A2E4 000000013E810000 000000013DF78858
  00000476937B0: 00000000476937F0 000000013EAD5718 0000000000000000 000000013EAD56AA
  00000476937D0: 0000000100306463 000000013E845040 000000013DA8EE6C 000000013DA8EE60
  00000476937F0: 000000013DA8EDE0 000000013DA8EE40 000000013DA8ED00 000000003EC63E98
  0000047693810: 0000000047693840 000000013DA8A420 0000000000000000 000000013DA8EE80
  0000047693830: 000000013E845040 0000000000000009 0000000047693860 000000013DF7524C


Synchronous Exception at 0x000000013DA8A0E8
ASSERT [ArmCpuDxe] DefaultExceptionHandler.c(340): ((BOOLEAN)(0==1))

```
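For readers decoding the dump above: the `ESR 0x96000007` value splits into the `EC`, `IL` and `ISS` fields that the firmware prints. The following is a minimal sketch of that split, assuming the standard AArch64 `ESR_ELx` bit layout; the function and struct names are made up for illustration and are not taken from edk2 or GRUB.

```c
#include <stdint.h>

/* Fields of an AArch64 ESR_ELx value (names here are illustrative). */
struct esr_fields {
    unsigned ec;   /* bits [31:26]: exception class */
    unsigned il;   /* bit 25: instruction length flag */
    uint32_t iss;  /* bits [24:0]: instruction-specific syndrome */
};

/* Split a raw ESR value into its three top-level fields. */
struct esr_fields esr_decode(uint32_t esr)
{
    struct esr_fields f;
    f.ec  = (esr >> 26) & 0x3f;
    f.il  = (esr >> 25) & 0x1;
    f.iss = esr & 0x1ffffff;
    return f;
}

/*
 * For data aborts, ISS bits [5:0] hold the data fault status code.
 * Codes 0x04..0x07 are translation faults at levels 0..3.
 * Returns the level, or -1 for any other status code.
 */
int dfsc_translation_level(uint32_t iss)
{
    uint32_t dfsc = iss & 0x3f;
    if (dfsc >= 0x04 && dfsc <= 0x07)
        return (int)(dfsc - 0x04);
    return -1;
}
```

Applied to `0x96000007` this gives EC `0x25` (data abort without a change in exception level), IL `1` and ISS `0x7`, which is a translation fault at level 3, matching the "third level" message. Together with `FAR 0x30` it points at an access to an unmapped page near address zero.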

Reproducible: Always

Steps to Reproduce:
1. Boot ISO
2. Observe error

Comment 1 Dusty Mabe 2025-01-21 17:21:17 UTC
It appears OpenQA tests are also failing: https://openqa.fedoraproject.org/tests/3156457#step/_boot_to_anaconda/3

Comment 2 Marta Lewandowska 2025-01-21 18:29:52 UTC
aarch64 only, right? not x86?

Comment 3 Dusty Mabe 2025-01-21 18:33:44 UTC
correct. All other architectures are passing tests.

Comment 4 Nicolas Frayer 2025-01-21 19:10:50 UTC
Hi Dusty, was it working fine with grub2-2.12-17.fc42 and the same EDK2 version (edk2-20241117-5.fc41) ?

Comment 5 Peter Jones 2025-01-21 19:11:30 UTC
Any chance this firmware has a way to turn on ConErr or edk2 debug output?  I can't even tell if it ran grub or which PC addresses might be grub vs firmware from this output.

Comment 6 Adam Williamson 2025-01-21 19:17:18 UTC
The openQA aarch64 worker hosts actually still have the old edk2 - 20240813 - and they are still hitting this. It started with Fedora-Rawhide-20250110.n.0 , which is exactly when grub went from 2.12-15.fc42 to 2.12-18.fc42 .

Comment 7 Adam Williamson 2025-01-21 19:18:18 UTC
Note that -16 and -17 were never built, so we went straight from -15 to -18. The changelog shows:

  * Wed Nov 27 2024 Marta Lewandowska <mlewando> - 2.12-16
  - 99-grub-mkconfig.install: Disable BLS and run grub2-mkconfig when GRUB_ENABLE_BLSCFG is disable
  - Resolves: #2325960

  * Wed Nov 27 2024 Marta Lewandowska <mlewando> - 2.12-17
  - 99-grub-mkconfig.install: on PPC systems, remove petiboot's version checks

  * Thu Jan 09 2025 Nicolas Frayer <nfrayer> - 2.12-18
  - fs/xfs: fix large extent counters incompat feature support

Comment 8 Dusty Mabe 2025-01-21 21:31:02 UTC
(In reply to Nicolas Frayer from comment #4)
> Hi Dusty, was it working fine with grub2-2.12-17.fc42 and the same EDK2
> version (edk2-20241117-5.fc41) ?

As Adam mentioned the -17 was never built. 


(In reply to Peter Jones from comment #5)
> Any chance this firmware has a way to turn on ConErr or edk2 debug output? 
> I can't even tell if it ran grub or which PC addresses might be grub vs
> firmware from this output.


We're just using virt tools (i.e. QEMU) in a test framework so I assume the answer is yes. Here is a link to an ISO showing the problem if you'd like to poke:

https://dustymabe.fedorapeople.org/fedora-coreos-42.20250121.dev.0-live-iso.aarch64.iso

Comment 9 Gerd Hoffmann 2025-01-23 08:39:40 UTC
(In reply to Peter Jones from comment #5)
> Any chance this firmware has a way to turn on ConErr or edk2 debug output? 

Yes.  It's a compile time option, so there are two images (QEMU_EFI-pflash + QEMU_EFI-silent-pflash).  Use the one not named 'silent' to get the firmware log on the serial console.

Comment 10 Gerd Hoffmann 2025-01-23 08:44:17 UTC
(In reply to Peter Jones from comment #5)
> I can't even tell if it ran grub or which PC addresses might be grub vs
> firmware from this output.

It's grub.  Maybe shim.

(In reply to Dusty Mabe from comment #0)
> Synchronous Exception at 0x000000013DA8A0E8
> PC 0x00013DA8A0E8
> PC 0x00013DA8A080
> PC 0x00013DA8A32C
> PC 0x00013DA8A420
> PC 0x00013DF7524C
> PC 0x00013DF779E4
> PC 0x00013DF8068C
> PC 0x00013DF811A8
> PC 0x00013E78E574
> PC 0x00013E78E624
> PC 0x00013E78F9A8
> PC 0x00013E78C030
> PC 0x00004769B2C0 (0x000047694000+0x000072C0) [ 1] DxeCore.dll
> PC 0x00013F14AECC (0x00013F144000+0x00006ECC) [ 2] BdsDxe.dll
> PC 0x00013F14CCF0 (0x00013F144000+0x00008CF0) [ 2] BdsDxe.dll
> PC 0x00004769DFF8 (0x000047694000+0x00009FF8) [ 3] DxeCore.dll

This is a simple stack trace.  Anything the edk2 exception handler can't map to firmware modules is not firmware code.
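The rule above can be sketched as a range check: a PC belongs to a firmware module only if it falls inside that module's loaded image. This is a rough illustration, not edk2 code. The base addresses come from the trace, but the image sizes are assumptions, since the log only prints base plus offset.

```c
#include <stddef.h>
#include <stdint.h>

/* One loaded firmware image; `size` is an assumed value, not from the log. */
struct fw_module {
    const char *name;
    uint64_t base;
    uint64_t size;
};

/* Bases taken from the trace above; the 1 MiB sizes are placeholders. */
const struct fw_module sample_modules[] = {
    { "DxeCore.dll", 0x000047694000ULL, 0x100000ULL },
    { "BdsDxe.dll",  0x00013F144000ULL, 0x100000ULL },
};

/* Return the module containing `pc`, or NULL if no firmware image claims it. */
const struct fw_module *find_module(const struct fw_module *mods, size_t n,
                                    uint64_t pc)
{
    for (size_t i = 0; i < n; i++) {
        if (pc >= mods[i].base && pc - mods[i].base < mods[i].size)
            return &mods[i];
    }
    return NULL;
}
```

With this table, `0x00004769B2C0` resolves to DxeCore.dll at offset `0x72C0`, as in the trace, while the faulting PC `0x00013DA8A0E8` matches nothing and so belongs to whatever the firmware loaded (grub or shim).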

Comment 11 Marta Lewandowska 2025-01-23 10:14:36 UTC
You're right, Gerd, we're in grub.

kern/device.c:37:device: opening device cd0
kern/disk.c:196:disk: Opening `cd0'...
disk/efi/efidisk.c:495:efidisk: opening cd0
disk/efi/efidisk.c:524:efidisk: m = 0x13eaaa510, last block = 7767a, block size = 800, io align = 0
disk/efi/efidisk.c:542:efidisk: opening cd0 succeeded
kern/disk.c:288:disk: Opening `cd0' succeeded.
kern/disk.c:196:disk: Opening `cd0'...
disk/efi/efidisk.c:495:efidisk: opening cd0
disk/efi/efidisk.c:524:efidisk: m = 0x13eaaa510, last block = 7767a, block size = 800, io align = 0
disk/efi/efidisk.c:542:efidisk: opening cd0 succeeded
kern/disk.c:288:disk: Opening `cd0' succeeded.


Synchronous Exception at 0x000000013D5A00E8
...

And this is at 2.12-15 (see comment#7), so now to revert more patches...

Comment 12 Marta Lewandowska 2025-01-23 11:47:07 UTC
I don't know what the problem is yet, but it looks like grub can't properly open the boot.iso, so I can only guess that something has changed in the way that it's being created? I've taken grub back to 2.12-9 (on a rawhide builder), which was originally built in the beginning of October, and I see that this test was working as recently as 10 Jan ... OTOH I can boot the f-41 iso without any issues.

I am building my boot isos with lorax and then genisoimage on a rawhide VM and they are clearly failing just like the official ones...
[root@fedora ~]# rpm -qa genisoimage lorax
lorax-42.4-1.fc42.aarch64
genisoimage-1.1.11-56.fc41.aarch64

And I'm on an f-41 hypervisor with
[root@ampere-hr330a-13 ~]# rpm -qa edk2-aarch64
edk2-aarch64-20241117-5.fc41.noarch

Comment 13 Adam Williamson 2025-01-23 16:51:56 UTC
I don't believe we've changed anything about how we build netinst and DVD ISOs, no. On openQA the failure is happening with the Server DVD ISO and I don't think we intentionally changed that build process. We have been moving a lot of stuff to Kiwi, but not those ISOs. As I said, the failures started exactly on the compose where grub2-1:2.12-18.fc42 landed - Fedora-Rawhide-20250110.n.0 . There were no changes to pungi-fedora between those composes AFAICS. No packages changed on the compose host in that time frame (there was a change on Jan 6 and then Jan 13). Lorax hasn't changed since November...

Comment 14 Dusty Mabe 2025-01-23 20:55:36 UTC
To further support the argument that it's likely grub and not the way we build things:

1. It happened in Fedora CoreOS and Fedora Server (in openQA) at the same time
2. When we reverted the change in Fedora CoreOS [1] we are now able to build and test aarch64 just fine. This is with everything else staying the same except GRUB getting reverted.

[1] https://github.com/coreos/fedora-coreos-config/pull/3330

Comment 15 Adam Williamson 2025-01-23 22:07:20 UTC
Oh, it's great that you tested that, Dusty, I was about to do the same thing, saves me the time :D

Comment 16 Jeremy Linton 2025-01-24 00:55:38 UTC
I just ran into this with Fedora-Server-dvd-aarch64-Rawhide-20250123.n.0.iso on both an F40 and an F41 qemu too. There have been some similar UTM reports from not long ago about CDROMs triggering similar behavior with Fedora. So, judging from the FAR, it looks like a NULL pointer dereference, presumably something in grub's ISO handling? I'm building a more verbose edk2/grub; let's see if that helps.

Comment 17 Jeremy Linton 2025-01-24 01:21:49 UTC
Sorta as expected, grub2-efi-aa64-2.12-18 and grub2-efi-aa64-cdboot-2.12-18 FC42 boot from a disk.

Comment 18 Jeremy Linton 2025-01-24 02:33:48 UTC
Since I'm having a build issue, and am about to quit for the night, it might be useful to drop `19dcf16 aarch64/macros: Build gnulib with -mbranch-protection=standard` and see if that fixes it, since the behavior changed a bit with gcc15.

Comment 19 Marta Lewandowska 2025-01-24 13:09:09 UTC
After reading your comment yesterday, Dusty, I was feeling a little stupid and confused, but things are making a little bit more sense... maybe...

I built a DVD with f-41, and then tried grub 2.12-15.fc42 and 2.12-18.fc42, and -15 booted while -18 did not, as per Dusty's findings.

So then on an f-41 builder, I started from grub 2.12-15.fc42 and then built -16, -17, -18. I created a DVD with f-41 along with each built grub, and all of them booted.

Previously, what I reported in comment#12, was all built on a rawhide builder, and all of those were hitting the exception regardless of the included patches. So this must be something with builddeps, changes in libraries, or something along those lines.

Nicolas built 2.12-21.fc42 without branch protection that Jeremy mentioned, but that also hit the exception.

I'm gonna keep playing with this.

Comment 20 Jeremy Linton 2025-01-27 06:17:27 UTC
I can't seem to find valid debuginfo's for the shipped grub2-efi image. So if one opens the resulting image like:
gdb grubaa64.efi 
Reading symbols from ./usr/lib/debug/.dwz/grub2-2.12-20.fc42.aarch64...
(gdb) disassemble 0x2c11c0,+300
...
0x00000000002c12a0:  ldr     x0, [x0]
0x00000000002c12a4:  ldr     x0, [x0, #40]
0x00000000002c12a8:  ldr     x0, [x0, #48]<--- crash access 0x030
0x00000000002c12ac:  ldr     x2, [x0, #16]


This is based on the running memory offset/instruction offset into a standalone gdb image. I'm not sure I trust the edk2 stack unwinding in grub though, but that address matches the 0x30 FAR offset and X0=0 in the backtrace I see. So it's likely the faulting instruction; the only question at the moment is why the instruction immediately before it didn't fault too. I was trying to rebuild an identical grub the other day to use its debug symbols, but it wasn't helpful. If someone shoots me a busted image with debug symbols in the grub.efi (or the kernel core with symbols) I can use that to look up the function name.

Comment 21 Marta Lewandowska 2025-01-27 08:16:54 UTC
ok, please disregard all of my other theories above ;) I found the thing that is causing this, although I don't know why it's causing this...
I did a diff between the working and non-working grubs, and I found a couple of surprises, based on the changelog:
[mlewando@loki r10]$ diff -qr 15/ 18/
Only in 18/: 0284-fs-xfs-Fix-large-extent-counters-incompat-feature-su.patch
Files 15/20-grub.install and 18/20-grub.install differ                      <------
Files 15/99-grub-mkconfig.install and 18/99-grub-mkconfig.install differ
Only in 15/: grub2-2.12-15.fc42.src.rpm
Only in 18/: grub2-2.12-18.fc42.src.rpm
Files 15/grub2.spec and 18/grub2.spec differ
Files 15/grub.macros and 18/grub.macros differ                              <------
Files 15/grub.patches and 18/grub.patches differ

It turns out that adding the (new) bli module to the global modules is the culprit. I looked briefly at the bli code, but haven't identified the issue there. I don't know why this isn't an issue on x86 or why it's only an issue on aarch64 when booting from CD/DVD...
[mlewando@loki r10]$ diff -ruN 15/grub.macros 18/grub.macros 
--- 15/grub.macros	2024-11-21 01:00:00.000000000 +0100
+++ 18/grub.macros	2025-01-09 01:00:00.000000000 +0100
@@ -140,7 +140,7 @@
 %{?with_efi_only:%global without_efi_only 1}
 
 %ifarch %{efi_arch}
-%global efi_modules " efi_netfs efifwsetup efinet lsefi lsefimmap connectefi "
+%global efi_modules " efi_netfs efifwsetup efinet lsefi lsefimmap connectefi bli "
 %endif

Anyway, Jeremy, if you want a debug grub, I can build you one. I guess you want the broken version, -18?

Comment 22 Jeremy Linton 2025-01-28 01:05:43 UTC
Every time I debug grub I get PTSD... it shouldn't be this hard to generate debug images.

So, you're right, it is the bli module; it's in get_part_uuid, line 59:

   {
      status = grub_error (grub_errno, N_("cannot open disk: %s"), device_name);
      grub_device_close (device);
      return status;
    }

  if (grub_strcmp (device->disk->partition->partmap->name, "gpt") != 0)
    {
      status = grub_error (GRUB_ERR_BAD_PART_TABLE,
			   N_("this is not a GPT partition table: %s"), device_name);
      goto fail;
    }


Where I think partmap is null, if I'm following the assembly dereferences correctly, since at this point I've given up on the idea of successfully manipulating the build into generating a listing.

  dc:   f94057e0        ldr     x0, [sp, #168] <--- device
  e0:   f9400000        ldr     x0, [x0]       <--- disk
  e4:   f9401400        ldr     x0, [x0, #40]  <--- partition
  e8:   f9401800        ldr     x0, [x0, #48]  <--- partmap (death!)
  ec:   f9400802        ldr     x2, [x0, #16]  <---- name (parm #1 to grub_strcmp after it gets moved to x0 below, this code needs the optimizer...)
  f0:   90000000        adrp    x0, 0 <get_part_uuid> 
  f4:   91000000        add     x0, x0, #0x0
  f8:   f9400001        ldr     x1, [x0]
  fc:   aa0203e0        mov     x0, x2
 100:   94000000        bl      0 <grub_strcmp>
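To relate the load offsets in the listing to the C expression `device->disk->partition->partmap->name`, here is a small sketch using stand-in structs. These are hypothetical simplifications, not copied from the GRUB headers: the field order is an assumption chosen so that, on a 64-bit LP64 target, the offsets come out as the 0, 40, 48 and 16 seen in the `ldr` instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a partition map descriptor: name at offset 16. */
struct partmap_sketch {
    struct partmap_sketch *next;
    struct partmap_sketch **prev;
    const char *name;
};

/* Stand-in for a partition: partmap at offset 48. */
struct partition_sketch {
    int number;
    uint64_t start;
    uint64_t len;
    uint64_t offset;
    int index;
    struct partition_sketch *parent;
    struct partmap_sketch *partmap;
};

/* Stand-in for a disk: partition at offset 40. */
struct disk_sketch {
    const char *name;
    void *dev;
    uint64_t total_sectors;
    unsigned int log_sector_size;
    unsigned int max_agglomerate;
    uint64_t id;
    struct partition_sketch *partition;
};

/* Stand-in for a device: disk at offset 0. */
struct device_sketch {
    struct disk_sketch *disk;
};

/* Address touched by `ldr xN, [base, #imm]`. */
uint64_t load_address(uint64_t base, uint64_t imm)
{
    return base + imm;
}
```

Under these assumptions, the load at `e4` reads the `partition` field of a valid disk and simply gets back a NULL value, so it does not fault. The next load at `e8` uses that NULL as its base, and `0 + 48` is exactly the `FAR 0x30` reported by the firmware, with `X0 = 0` in the register dump.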

Comment 23 Jeremy Linton 2025-01-28 01:21:06 UTC
Right, so if the device name doesn't have a partition element the device->disk->partition doesn't get filled out by grub_disk_open() so at a minimum adding a

if (device->disk->partition && grub_strcmp(device->disk->partition->partmap->name, "gpt") != 0) 

is required.
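A standalone sketch of that kind of guard follows. It uses made-up minimal structs and plain `strcmp` rather than GRUB's types and `grub_strcmp`, and it is not the upstream fix. It also takes a slightly different shape from the one-line condition above: a device without a partition is reported as "not a GPT partition" rather than skipping the check.

```c
#include <stddef.h>
#include <string.h>

/* Minimal hypothetical stand-ins; only the pointer chain matters here. */
struct guard_partmap   { const char *name; };
struct guard_partition { struct guard_partmap *partmap; };
struct guard_disk      { struct guard_partition *partition; };
struct guard_device    { struct guard_disk *disk; };

/*
 * Return 1 only when the device refers to a partition on a GPT
 * partition map. Every pointer in the chain is checked before it is
 * dereferenced, so a whole-disk device such as (cd0), whose partition
 * pointer is NULL, yields 0 instead of faulting.
 */
int device_is_gpt_partition(const struct guard_device *dev)
{
    if (dev == NULL || dev->disk == NULL)
        return 0;
    if (dev->disk->partition == NULL)
        return 0;
    if (dev->disk->partition->partmap == NULL ||
        dev->disk->partition->partmap->name == NULL)
        return 0;
    return strcmp(dev->disk->partition->partmap->name, "gpt") == 0;
}
```

A caller could then report the "not a GPT partition table" error whenever this returns 0, without ever touching a NULL `partition`.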

Comment 24 Jeremy Linton 2025-01-28 01:46:49 UTC
Dope! Upstream commit: 

9537f4403 commands/bli: Fix crash in get_part_uuid()

Comment 25 Jeremy Linton 2025-01-28 01:50:11 UTC
From the bug:


    4. When booting from a CD-ROM, the ESP is a VFAT image indexed by the El
       Torito boot catalog. The boot device is set to (cd0), corresponding
       to the CD-ROM image mounted as an ISO 9660 filesystem.

So, this is probably because the x86 qemu/etc simply isn't crashing on zero page access.

Comment 26 Gerd Hoffmann 2025-01-28 08:06:49 UTC
> So, this is probably because the x86 qemu/etc simply isn't crashing on zero
> page access.

Correct.  That is going to change soon though, see
https://fedoraproject.org/wiki/Changes/Edk2Security

Comment 27 Marta Lewandowska 2025-01-28 08:45:20 UTC
Nice sleuthing Jeremy! I can confirm that https://git.savannah.gnu.org/cgit/grub.git/commit/?id=9537f4403dd836d5a8a1c4e57b165837fc7239cf fixes this issue on aarch64.

Comment 28 Dusty Mabe 2025-01-31 17:55:19 UTC
According to git [1] this should be taken care of with grub2-2.12-21.fc42 

[1] https://src.fedoraproject.org/rpms/grub2/c/20db2b22d863543f0f31b4c54cc7e65923473f62?branch=rawhide

