Bug 2159239 - [aarch64] EFI stub: ERROR: FIRMWARE BUG: efi_loaded_image_t::image_base has bogus value
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: aarch64
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
 
Reported: 2023-01-09 09:15 UTC by Bruno Goncalves
Modified: 2023-06-13 06:30 UTC (History)
Last Closed: 2023-06-13 06:30:52 UTC
Type: Bug

Description Bruno Goncalves 2023-01-09 09:15:06 UTC
1. Please describe the problem:
Provisioning recent Rawhide composes on some aarch64 machines fails with:

EFI stub: Booting Linux Kernel... 
EFI stub: ERROR: FIRMWARE BUG: efi_loaded_image_t::image_base has bogus value 
EFI stub: Using DTB from configuration table 
EFI stub: Exiting boot services... 

2. What is the Version-Release number of the kernel:
kernel-6.2.0-0.rc2.20230105git41c03ba9beea.20.fc38.aarch64

Comment 2 Mark Salter 2023-01-17 22:34:56 UTC
The EFI stub "ERROR: FIRMWARE BUG" is informational and isn't related to the later failure during the install. The arm stub doesn't use image_base.

The later message, "Failed to set new efi boot target. This is most likely a kernel or firmware bug.", indicates that efibootmgr returned an error when anaconda tried to set an EFI boot variable. It probably isn't a firmware problem unless a kernel or efibootmgr change uncovered a latent bug. More likely, something in the kernel efivars code or in efibootmgr changed and is causing the problem.

Comment 3 Don Zickus 2023-01-18 21:16:54 UTC
Thanks @msalter!  I am not that good with efivars and efibootmgr.  As this is hit during install, I am not sure what we can do to diagnose the problem?

Thoughts?

Cheers,
Don

Comment 4 Mark Salter 2023-01-19 12:57:22 UTC
It would be easier to debug outside of the install process, so I'd try installing a known-good Fedora first. Then install the kernel version that was used to PXE boot the failing install and try efibootmgr to see if you can read and set boot variables. Use the version of efibootmgr that was in the PXE initrd. The goal is to find out whether the kernel or efibootmgr is the problem, then bisect to find the first version that fails.

Comment 5 Mark Salter 2023-01-20 16:03:03 UTC
Also, if you are connected to the console, you should be able to answer "yes" to the "Would you like to ignore this and continue with installation?" prompt.

That could get you booting so you could have a look around. At worst, it will drop you into the efi shell, where you can manually boot with something like:

Shell> fs0:
(it may be fs1:, fs2: etc)
Shell> cd efi\fedora
Shell> grubaa64.efi

Comment 6 Mark Salter 2023-02-16 00:16:19 UTC
So this looks like a firmware bug uncovered by:

commit d3549a938b73f203ef522562ae9f2d38aa43d234
Author: Ard Biesheuvel <ardb>
Date:   Fri Sep 16 11:48:30 2022 +0200

    efi/arm64: libstub: avoid SetVirtualAddressMap() when possible
    
    EFI's SetVirtualAddressMap() runtime service is a horrid hack that we'd
    like to avoid using, if possible. For 64-bit architectures such as
    arm64, the user and kernel mappings are entirely disjoint, and given
    that we use the user region for mapping the UEFI runtime regions when
    running under the OS, we don't rely on SetVirtualAddressMap() in the
    conventional way, i.e., to permit kernel mappings of the OS to coexist
    with kernel region mappings of the firmware regions. This means that, in
    principle, we should be able to avoid SetVirtualAddressMap() altogether,
    and simply use the 1:1 mapping that UEFI uses at boot time. (Note that
    omitting SetVirtualAddressMap() is explicitly permitted by the UEFI
    spec).
    
    However, there is a corner case on arm64, which, if configured for
    3-level paging (or 2-level paging when using 64k pages), may not be able
    to cover the entire range of firmware mappings (which might contain both
    memory and MMIO peripheral mappings).
    
    So let's avoid SetVirtualAddressMap() on arm64, but only if the VA space
    is guaranteed to be of sufficient size.
    
    Signed-off-by: Ard Biesheuvel <ardb>

I found this by bisecting and verified it by causing a panic with efi=novamap on an otherwise working version of Fedora 36.
I also see the problem on Ampere Mt Jade and Mt Collins.
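
The "corner case" in the commit message above comes down to page-table arithmetic. As a rough sketch (assuming 4k pages for the 3-level case, which the commit message does not state, and using a helper name, va_bits, that is purely illustrative), the size of the VA space follows from the page size and the number of translation levels:

```python
def va_bits(page_shift: int, levels: int) -> int:
    """Virtual address bits covered by `levels` of translation tables.

    Each table fills one page with 8-byte descriptors, so every level
    resolves (page_shift - 3) address bits; the page offset adds page_shift.
    """
    return page_shift + levels * (page_shift - 3)


# Configurations named in the commit message, plus 4-level 4k for comparison.
for name, shift, levels in [
    ("4k pages, 3 levels", 12, 3),
    ("64k pages, 2 levels", 16, 2),
    ("4k pages, 4 levels", 12, 4),
]:
    print(f"{name}: {va_bits(shift, levels)}-bit VA space")
```

With the smaller configurations, a firmware region placed high in the physical address space may not fit in a 1:1 mapping, which is the case the commit guards against.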

Comment 7 Mark Salter 2023-02-16 13:45:24 UTC
Upstream is aware of the problem and is considering possible solutions, including reverting commit d3549a938b73f. There is no current workaround other than reverting that commit.

Comment 8 Don Zickus 2023-02-16 13:54:21 UTC
@msalter - can you post an MR on ARK with that revert and the above explanation? We won't necessarily merge it, but just tag it as a temp fix (for ark-latest, not os-build) until upstream resolves it. I tried reverting it but ran into conflicts.

Comment 9 Mark Salter 2023-02-16 20:44:50 UTC
Some patches landed upstream which should mitigate this for current kernel-ark.

Fedora, depending on the kernel version, will need backports of the functionality in:

550b33cfd445 "arm64: efi: Force the use of SetVirtualAddressMap() on Altra machines"
190233164cd7 "arm64: efi: Force the use of SetVirtualAddressMap() on eMAG and Altra Max machines"

This is complicated somewhat by some bits being split into a separate file in:

d9ffe524a538 "efi/arm64: libstub: Split off kernel image relocation for builtin stub"

Comment 10 Scott Weaver 2023-02-17 19:31:18 UTC
Hi Mark. I checked kernel-ark and we have those patches but we seem to still be hitting this.
I cloned this job [1] and it failed using 6.2.0-0.rc8.20230214gitf6feea56f66d.58.fc39.aarch64 [2].

[1] https://beaker.engineering.redhat.com/jobs/7525054
[2] https://beaker.engineering.redhat.com/jobs/7546484

Comment 13 Eirik Fuller 2023-03-24 02:30:37 UTC
I believe the patches mentioned in comment 9 have no effect on Gigabyte R272-P30-JG systems (Ampere Mt.Snow) because the value of type1_family in system_needs_vamap is "Server".

This can be seen in dmidecode output.


[root@ampere-mtsnow-altra-01 linux]# dmidecode -H 1
# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.4.0 present.
Table at 0xEE820000.

Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: GIGABYTE
        Product Name: R272-P30-JG
        Version: 100
        Serial Number: 210602110
        UUID: 00000000-0000-4000-8000-18c04d0db698
        Wake-up Type: Power Switch
        SKU Number: 01234567890123456789AB
        Family: Server

[root@ampere-mtsnow-altra-01 linux]# 


The only dmidecode output I found on this system that mentions Altra follows.


[root@ampere-mtsnow-altra-01 linux]# dmidecode -H 0x0065
# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.4.0 present.
Table at 0xEE820000.

Handle 0x0065, DMI type 4, 48 bytes
Processor Information
        Socket Designation: CPU 0
        Type: Central Processor
        Family: ARMv8
        Manufacturer: Ampere(R)
        ID: 01 00 16 0A A1 00 00 00
        Signature: JEP-106 Bank 0x0a Manufacturer 0x16, SoC ID 0x0001, SoC Revision 0x000000a1
        Version: Ampere(R) Altra(R) Processor
        Voltage: 1.0 V
        External Clock: 1650 MHz
        Max Speed: 3000 MHz
        Current Speed: 3000 MHz
        Status: Populated, Enabled
        Upgrade: None
        L1 Cache Handle: Not Provided
        L2 Cache Handle: Not Provided
        L3 Cache Handle: Not Provided
        Serial Number: 0000000000000000914C0904033865B0
        Asset Tag: 00000001
        Part Number: Q80-30
        Core Count: 80
        Core Enabled: 80
        Thread Count: 80
        Characteristics:
                64-bit capable
                Multi-Core
                Power/Performance Control
                128-bit Capable
                Arm64 SoC ID

[root@ampere-mtsnow-altra-01 linux]# 


The Family value is not helpful in that record (the Version value is the one that mentions Altra, but that value would not have matched the upstream patch either).
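
The mismatch can be illustrated with a small sketch that pulls fields out of dmidecode text like the records above. The helper names are hypothetical, and the list of family strings is only assumed from the patch titles in comment 9 (it is not copied from the kernel source, and the exact-match comparison is likewise an assumption):

```python
def dmi_fields(text: str) -> dict:
    """Collect 'Key: value' pairs from the indented body of one dmidecode record."""
    fields = {}
    for line in text.splitlines():
        # Record fields are indented; the handle and record-name lines are not.
        if line[:1] in (" ", "\t") and ":" in line:
            key, _, value = line.strip().partition(":")
            fields[key.strip()] = value.strip()
    return fields


# Assumption: family strings guessed from the patch titles in comment 9.
ASSUMED_FAMILIES = ("eMAG", "Altra", "Altra Max")


def family_matches(family: str) -> bool:
    """Exact-match check against the assumed family strings."""
    return family in ASSUMED_FAMILIES


TYPE1 = """\
Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: GIGABYTE
        Product Name: R272-P30-JG
        Family: Server
"""

TYPE4 = """\
Handle 0x0065, DMI type 4, 48 bytes
Processor Information
        Family: ARMv8
        Version: Ampere(R) Altra(R) Processor
"""

print(family_matches(dmi_fields(TYPE1)["Family"]))   # False: "Server" is not in the list
print("Altra" in dmi_fields(TYPE4)["Version"])       # True: only the type 4 Version mentions Altra
```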

I have a console log with the following messages from a modified kernel.


 EFI stub: Booting Linux Kernel... 
EFI stub: Server 
EFI stub: Using DTB from configuration table 
EFI stub: Exiting boot services... 


It doesn't say "Decompressing Linux Kernel..." because I did not compress the modified kernel. The usual EFI stub messages from an unmodified kernel follow.


EFI stub: Decompressing Linux Kernel... 
EFI stub: Using DTB from configuration table 
EFI stub: Exiting boot services... 


The modification to that kernel, along with reverting commit d3549a938b73f, prints the value of type1_family in system_needs_vamap, and calls system_needs_vamap from check_platform_features but ignores the return value.

On this system, efibootmgr does not work with an unmodified kernel (a kernel with commit d3549a938b73f), but it works with commit d3549a938b73f reverted. When efibootmgr doesn't work, it's because /sys/firmware/efi/efivars/ is empty, and it reports the following message.


[root@ampere-mtsnow-altra-01 ~]# efibootmgr
No BootOrder is set; firmware will attempt recovery
[root@ampere-mtsnow-altra-01 ~]# 


The empty /sys/firmware/efi/efivars/ also causes confusion during installation.

It might make sense to revert commit d3549a938b73f until we find a way to make the upstream patches work.

The efi=novamap kernel command line parameter is no help here because there is no way to clear it, short of code modifications. A way to force novamap to zero might provide a workaround, or maybe something like efi=vomap would make sense (invert the sense of the parameter).
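
For reference, checking whether a given boot already carries efi=novamap can be sketched as below. This only inspects a command line string (for example the contents of /proc/cmdline); it assumes efi= takes a comma-separated option list, and the function name is hypothetical. It does not change what the stub decides on its own:

```python
def efi_options(cmdline: str) -> set:
    """Return the set of options given via efi=... on a kernel command line."""
    opts = set()
    for word in cmdline.split():
        if word.startswith("efi="):
            # Assumption: options after "efi=" are comma-separated.
            opts.update(o for o in word[len("efi="):].split(",") if o)
    return opts


example = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro efi=novamap"
print("novamap" in efi_options(example))
```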

Comment 15 Eirik Fuller 2023-04-30 00:10:12 UTC
Because of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb684408f3ea4, newer kernels no longer have the problem described in comment 13.

Comment 16 Scott Weaver 2023-06-12 12:50:30 UTC
We retried the beaker job from comment#10 and it failed to provision.

https://beaker-archive.hosts.prod.psi.bos.redhat.com/beaker-logs/2023/06/79230/7923076/14017452/console.log
https://beaker.engineering.redhat.com/recipes/14017452#installation


[   27.383436] CPU: 28 PID: 202 Comm: kworker/u64:2 Tainted: G          I       -------  ---  6.4.0-0.rc4.20230529gite338142b39cf.35.fc39.aarch64 #1
[   27.399235] Hardware name: Lenovo HR350A            7X35CTO1WW    /HR350A     , BIOS hve104r-1.15 02/26/2021
[   27.411824] Workqueue: efi_rts_wq efi_call_rts 
[   27.416512] pstate: 00000085 (nzcv daIf -PAN -UAO -TCO -DIT -SSBS BTYPE=--) 
[   27.423467] pc : efi_call_virt_check_flags+0x48/0xb8 
[   27.428424] lr : efi_call_rts+0x3a8/0x4c8 
[   27.432422] sp : ffff80001275bd20 
[   27.435724] x29: ffff80001275bd20 x28: 0000000000000000 x27: 0000000000000000 
[   27.442848] x26: 0000000000000000 x25: ffff80000af9ca28 x24: ffff80001280bd88 
[   27.449973] x23: ffff80001280bd40 x22: ffff800009a6c530 x21: 0000000000000080 
[   27.457097] x20: ffff800009a6c530 x19: 0000000000000000 x18: ffffffffffffffff 
[   27.464221] x17: 0000000000000000 x16: ffff80000c49c000 x15: 0000000000000000 
[   27.471345] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 
[   27.478469] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800008edf2b8 
[   27.485594] x8 : 0000000000000000 x7 : 4a6851c6cc745c00 x6 : ffff80000c49bad0 
[   27.492718] x5 : ffff80000a3a5008 x4 : 000000fff6dc0018 x3 : 0000000000000001 
[   27.499842] x2 : ffff800009a6a627 x1 : ffff800009a6c530 x0 : 0000000000000080 
[   27.506966] Call trace: 
[   27.509401]  efi_call_virt_check_flags+0x48/0xb8 
[   27.514007]  efi_call_rts+0x3a8/0x4c8 
[   27.517658]  process_one_work+0x1e4/0x488 
[   27.521657]  worker_thread+0x74/0x418 
[   27.525306]  kthread+0xf4/0x108 
[   27.528438]  ret_from_fork+0x10/0x20 
[   27.532002] ---[ end trace 0000000000000000 ]--- 
[   27.536606] Disabling lock debugging due to kernel taint 
[   27.541905] efi: [Firmware Bug]: IRQ flags corrupted (0x00000000=>0x00000080) by EFI set_variable 
[   27.550947] ------------[ cut here ]------------ 
[   27.555556] WARNING: CPU: 26 PID: 224 at drivers/firmware/efi/runtime-wrappers.c:341 virt_efi_set_variable+0x194/0x1b0 
[   27.566244] Modules linked in: uas usb_storage nvme dwc3 igb nvme_core crct10dif_ce udc_core ast polyval_ce ulpi polyval_generic ghash_ce mlx4_core(+) sbsa_gwdt nvme_common i2c_algo_bit ahci_platform i2c_xgene_slimpro libahci_platform xgene_hwmon gpio_dwapb xhci_plat_hcd sunrpc lrw dm_crypt dm_round_robin linear raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 scsi_dh_hp_sw squashfs be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath fuse 
[   27.621611] CPU: 26 PID: 224 Comm: kworker/26:1 Tainted: G        W I       -------  ---  6.4.0-0.rc4.20230529gite338142b39cf.35.fc39.aarch64 #1 
[   27.634550] Hardware name: Lenovo HR350A            7X35CTO1WW    /HR350A     , BIOS hve104r-1.15 02/26/2021 
[   27.644362] Workqueue: events refresh_nv_rng_seed 
[   27.649056] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) 
[   27.656005] pc : virt_efi_set_variable+0x194/0x1b0 
[   27.660785] lr : virt_efi_set_variable+0x178/0x1b0 
[   27.665564] sp : ffff80001280bcf0 
[   27.668866] x29: ffff80001280bcf0 x28: 0000000000000000 x27: 0000000000000000 
[   27.675991] x26: ffff000807442674 x25: ffff80000b9a0848 x24: ffff80000b9a0000 
[   27.683116] x23: ffff800009a6a628 x22: ffff80001280bd78 x21: 8000000000000015 
[   27.690240] x20: ffff80000ae7db20 x19: ffff80000b9a07d0 x18: 0000000000000014 
[   27.697365] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 
[   27.704489] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 
[   27.711613] x11: 0000000000000000 x10: 0000000000001da0 x9 : ffff800009263ae8 
[   27.718737] x8 : ffff000807559e00 x7 : 0000000000000000 x6 : 00000000000000b0 
[   27.725861] x5 : 00000000500f0000 x4 : 0000000000000000 x3 : 0000000000000001 
[   27.732985] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 8000000000000015 
[   27.740110] Call trace: 
[   27.742544]  virt_efi_set_variable+0x194/0x1b0 
[   27.746977]  refresh_nv_rng_seed+0x88/0xc8 
[   27.751061]  process_one_work+0x1e4/0x488 
[   27.755059]  worker_thread+0x74/0x418 
[   27.758709]  kthread+0xf4/0x108 
[   27.761839]  ret_from_fork+0x10/0x20 
[   27.765403] ---[ end trace 0000000000000000 ]---
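
The "IRQ flags corrupted" line in the log above can be decoded with a short sketch. On arm64, bit 7 (0x80) of the saved flags is the IRQ mask bit, so a 0x00000000=>0x00000080 change would mean the firmware call returned with IRQs masked. The parser below is illustrative only and its function name is hypothetical:

```python
import re

PSR_I_BIT = 0x80  # arm64 PSTATE/DAIF bit 7: IRQ mask


def parse_flag_corruption(line: str):
    """Extract (before, after, service) from an 'IRQ flags corrupted' log line."""
    m = re.search(
        r"IRQ flags corrupted \((0x[0-9a-fA-F]+)=>(0x[0-9a-fA-F]+)\) by EFI (\w+)",
        line,
    )
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16), m.group(3)


line = ("[   27.541905] efi: [Firmware Bug]: IRQ flags corrupted "
        "(0x00000000=>0x00000080) by EFI set_variable")
before, after, service = parse_flag_corruption(line)
changed = before ^ after  # bits that differ across the firmware call
print(service, hex(changed), bool(changed & PSR_I_BIT))
```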

Comment 17 Eirik Fuller 2023-06-12 14:05:22 UTC
The recipe linked in comment 16 is not a retry of the failed jobs linked in comment 10.

In particular, the hostRequires tag in the comment 16 recipe did not pick up a MtSnow AltraMax. A more specific hostname tag should help dodge Lenovo HR350A systems, and other Ampere systems unrelated to this issue.

Comment 18 Scott Weaver 2023-06-12 15:23:37 UTC
(In reply to Eirik Fuller from comment #17)
> The recipe linked in comment 16 is not a retry of the failed jobs linked in
> comment 10.
> 
> In particular, the hostRequires tag in the comment 16 recipe did not pick up
> a MtSnow AltraMax. A more specific hostname tag should help dodge Lenovo
> HR350A systems, and other Ampere systems unrelated to this issue.

Thanks Eirik,

By clone I mean we cloned the beaker job XML as is but, as you pointed out, that did not mean that we would get the same system.

If RHEL9 is supported for a system then we test rawhide/eln there too. So should I open a new BZ to track this issue or should we not be testing upstream rawhide/eln on Lenovo HR350A systems?

Scott

Comment 19 Eirik Fuller 2023-06-12 15:37:48 UTC
Recent RHEL9 kernels seem to work on Lenovo HR350A systems, so yes, it is good for Rawhide to work on them also.

The kernel messages in comment 16 suggest a firmware issue, but not necessarily the same one tracked by this bug, in which case tracking it elsewhere makes sense.

Comment 20 Eirik Fuller 2023-06-12 15:40:59 UTC
https://beaker.engineering.redhat.com/recipes/14079178 reveals that a newer Rawhide (newer than in comment 16) boots and installs properly on a MtSnow AltraMax system.

Comment 21 Scott Weaver 2023-06-12 18:03:27 UTC
(In reply to Eirik Fuller from comment #20)
> https://beaker.engineering.redhat.com/recipes/14079178 reveals that a newer
> Rawhide (newer than in comment 16) boots and installs properly on a MtSnow
> AltraMax system.

Thank you. I created BZ2214351 to track the Lenovo HR350A systems.

Bruno, do you think we can close this as fixed in CURRENTRELEASE?

Comment 22 Bruno Goncalves 2023-06-13 06:30:52 UTC
Yes, based on comment#20 we can close this as fixed.

