1. Please describe the problem:
Provisioning recent Rawhide composes on some aarch64 machines fails with:

EFI stub: Booting Linux Kernel...
EFI stub: ERROR: FIRMWARE BUG: efi_loaded_image_t::image_base has bogus value
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...

2. What is the Version-Release number of the kernel:
kernel-6.2.0-0.rc2.20230105git41c03ba9beea.20.fc38.aarch64
The EFI stub "ERROR: FIRMWARE BUG" message is informational and isn't related to the later failure during the install. The arm stub doesn't use image_base. The later "Failed to set new efi boot target. This is most likely a kernel or firmware bug." indicates that efibootmgr returned an error when anaconda tried to set an EFI boot variable. It probably isn't a firmware problem unless a kernel- or efibootmgr-related change uncovered a latent bug. More likely, something in the kernel efivars code or in efibootmgr changed and is causing the problem.
Thanks @msalter! I am not that good with efivars and efibootmgr. As this is hit during install, I am not sure what we can do to diagnose the problem? Thoughts? Cheers, Don
It would be easier to debug outside of an install process. So I'd try installing a known good Fedora. Then install the kernel version which was used to PXE boot the failing install, and try efibootmgr to see if you can read/set boot variables. Use the version of efibootmgr that was in the PXE initrd. The goal is to find out whether it is the kernel or efibootmgr which is the problem. Then bisect to see which version is the first to fail.
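Before running efibootmgr on the installed system, it may help to confirm that the kernel exposes any EFI boot variables at all. A minimal sketch in Python, assuming the standard efivarfs mount point /sys/firmware/efi/efivars; the helper name list_boot_vars is hypothetical and not part of any tool mentioned here:

```python
import os

EFIVARS_DIR = "/sys/firmware/efi/efivars"  # standard efivarfs mount point


def list_boot_vars(efivars_dir=EFIVARS_DIR):
    """Return sorted names of Boot* variables visible under efivars_dir.

    Returns an empty list when the directory is missing, unreadable,
    or contains no Boot* entries.
    """
    try:
        names = os.listdir(efivars_dir)
    except OSError:
        return []
    return sorted(n for n in names if n.startswith("Boot"))


if __name__ == "__main__":
    boot_vars = list_boot_vars()
    if boot_vars:
        print("kernel exposes %d Boot* variables" % len(boot_vars))
    else:
        print("no Boot* variables visible under %s" % EFIVARS_DIR)
```

Running this under both the known good kernel and the suspect kernel gives a quick comparison that does not depend on the efibootmgr version.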
Also, if you are connected to console, you should be able to answer "yes" to the "Would you like to ignore this and continue with installation?" That could get you booting so you could have a look around. At worst, it will drop you into the efi shell, where you can manually boot with something like:

Shell> fs0:     (it may be fs1:, fs2: etc)
Shell> cd efi\fedora
Shell> grubaa64.efi
So this looks like a firmware bug uncovered by:

commit d3549a938b73f203ef522562ae9f2d38aa43d234
Author: Ard Biesheuvel <ardb>
Date:   Fri Sep 16 11:48:30 2022 +0200

    efi/arm64: libstub: avoid SetVirtualAddressMap() when possible

    EFI's SetVirtualAddressMap() runtime service is a horrid hack that we'd
    like to avoid using, if possible. For 64-bit architectures such as
    arm64, the user and kernel mappings are entirely disjoint, and given
    that we use the user region for mapping the UEFI runtime regions when
    running under the OS, we don't rely on SetVirtualAddressMap() in the
    conventional way, i.e., to permit kernel mappings of the OS to coexist
    with kernel region mappings of the firmware regions. This means that, in
    principle, we should be able to avoid SetVirtualAddressMap() altogether,
    and simply use the 1:1 mapping that UEFI uses at boot time. (Note that
    omitting SetVirtualAddressMap() is explicitly permitted by the UEFI
    spec).

    However, there is a corner case on arm64, which, if configured for
    3-level paging (or 2-level paging when using 64k pages), may not be able
    to cover the entire range of firmware mappings (which might contain both
    memory and MMIO peripheral mappings).

    So let's avoid SetVirtualAddressMap() on arm64, but only if the VA space
    is guaranteed to be of sufficient size.

    Signed-off-by: Ard Biesheuvel <ardb>

I found this by bisecting and verified it by causing a panic with efi=novamap on an otherwise working version of Fedora 36. I also see the problem on Ampere Mt Jade and Mt Collins.
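The corner case in the commit message comes down to arithmetic on arm64 translation tables: a table fills one page with 8-byte descriptors, so each level resolves (page_shift - 3) address bits on top of the in-page offset. A small worked sketch in Python; the function name and the 48-bit comparison threshold are illustrative assumptions, not taken from the commit text quoted above:

```python
def va_bits(page_shift, levels):
    """Virtual address bits covered by `levels` of translation tables.

    Each table occupies one page of 8-byte descriptors, so every level
    resolves (page_shift - 3) bits in addition to the page offset.
    """
    return page_shift + levels * (page_shift - 3)


configs = {
    "4k pages, 3 levels": va_bits(12, 3),
    "64k pages, 2 levels": va_bits(16, 2),
    "4k pages, 4 levels": va_bits(12, 4),
}

# Assumed threshold, for illustration only: a 48-bit address space.
ASSUMED_MIN_BITS = 48

for name, bits in configs.items():
    verdict = "large enough" if bits >= ASSUMED_MIN_BITS else "may be too small"
    print(f"{name}: {bits} bits, {verdict}")
```

The two configurations named in the commit message (3-level paging with 4k pages, 2-level paging with 64k pages) give 39 and 42 bits, which is why they may not cover every firmware mapping 1:1.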
Upstream is aware of the problem and is considering possible solutions, including reverting commit d3549a938b73f. There is no current workaround other than reverting that commit.
@msalter - can you post an MR on ARK with that revert and the above explanation? We won't necessarily merge it, but just tag it as a temp fix (for ark-latest, not os-build) until upstream resolves it. I tried reverting it but ran into conflicts.
Some patches landed upstream which should mitigate this for current kernel-ark. Fedora, depending on kernel version, will need backports of the functionality in:

550b33cfd445 "arm64: efi: Force the use of SetVirtualAddressMap() on Altra machines"
190233164cd7 "arm64: efi: Force the use of SetVirtualAddressMap() on eMAG and Altra Max machines"

Complicated somewhat by some bits being split into a separate file in:

d9ffe524a538 "efi/arm64: libstub: Split off kernel image relocation for builtin stub"
Hi Mark. I checked kernel-ark and we have those patches, but we still seem to be hitting this. I cloned this job [1] and it failed using 6.2.0-0.rc8.20230214gitf6feea56f66d.58.fc39.aarch64 [2].

[1] https://beaker.engineering.redhat.com/jobs/7525054
[2] https://beaker.engineering.redhat.com/jobs/7546484
I believe the patches mentioned in comment 9 have no effect on Gigabyte R272-P30-JG systems (Ampere Mt.Snow) because the value of type1_family in system_needs_vamap is "Server". This can be seen in dmidecode output.

[root@ampere-mtsnow-altra-01 linux]# dmidecode -H 1
# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.4.0 present.
Table at 0xEE820000.

Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: GIGABYTE
        Product Name: R272-P30-JG
        Version: 100
        Serial Number: 210602110
        UUID: 00000000-0000-4000-8000-18c04d0db698
        Wake-up Type: Power Switch
        SKU Number: 01234567890123456789AB
        Family: Server

[root@ampere-mtsnow-altra-01 linux]#

The only dmidecode output I found on this system that mentions Altra follows.

[root@ampere-mtsnow-altra-01 linux]# dmidecode -H 0x0065
# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.4.0 present.
Table at 0xEE820000.

Handle 0x0065, DMI type 4, 48 bytes
Processor Information
        Socket Designation: CPU 0
        Type: Central Processor
        Family: ARMv8
        Manufacturer: Ampere(R)
        ID: 01 00 16 0A A1 00 00 00
        Signature: JEP-106 Bank 0x0a Manufacturer 0x16, SoC ID 0x0001, SoC Revision 0x000000a1
        Version: Ampere(R) Altra(R) Processor
        Voltage: 1.0 V
        External Clock: 1650 MHz
        Max Speed: 3000 MHz
        Current Speed: 3000 MHz
        Status: Populated, Enabled
        Upgrade: None
        L1 Cache Handle: Not Provided
        L2 Cache Handle: Not Provided
        L3 Cache Handle: Not Provided
        Serial Number: 0000000000000000914C0904033865B0
        Asset Tag: 00000001
        Part Number: Q80-30
        Core Count: 80
        Core Enabled: 80
        Thread Count: 80
        Characteristics:
                64-bit capable
                Multi-Core
                Power/Performance Control
                128-bit Capable
                Arm64 SoC ID

[root@ampere-mtsnow-altra-01 linux]#

The Family value is not helpful in that record (the Version value is the one that mentions Altra, but that value would not have matched the upstream patch either).

I have a console log with the following messages from a modified kernel.

EFI stub: Booting Linux Kernel...
EFI stub: Server
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...

It doesn't say "Decompressing Linux Kernel..." because I did not compress the modified kernel. The usual EFI stub messages from an unmodified kernel follow.

EFI stub: Decompressing Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...

The modification to that kernel, along with reverting commit d3549a938b73f, prints the value of type1_family in system_needs_vamap, and calls system_needs_vamap from check_platform_features but ignores the return value.

On this system, efibootmgr does not work with an unmodified kernel (a kernel with commit d3549a938b73f), but it works with commit d3549a938b73f reverted. When efibootmgr doesn't work, it's because /sys/firmware/efi/efivars/ is empty, and it reports the following message.

[root@ampere-mtsnow-altra-01 ~]# efibootmgr
No BootOrder is set; firmware will attempt recovery
[root@ampere-mtsnow-altra-01 ~]#

The empty /sys/firmware/efi/efivars/ also causes confusion during installation.

It might make sense to revert commit d3549a938b73f until we find a way to make the upstream patches work. The efi=novamap kernel command line parameter is no help here because there is no way to clear it, short of code modifications. A way to force novamap to zero might provide a workaround, or maybe something like efi=vomap would make sense (invert the sense of the parameter).
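The mismatch on this system can be illustrated with a short sketch. It assumes, based only on the patch titles quoted in comment 9, that the upstream quirk compares the SMBIOS type 1 Family string against eMAG, Altra, and Altra Max; the exact strings and comparison used in the kernel may differ, and the Python names below are hypothetical:

```python
# Family strings assumed from the upstream patch titles, not from kernel source.
ASSUMED_VAMAP_FAMILIES = {"eMAG", "Altra", "Altra Max"}


def family_from_dmidecode(text):
    """Return the first 'Family:' value found in dmidecode output, or None."""
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "Family":
            return value.strip()
    return None


def needs_vamap(family):
    """Hypothetical stand-in for the quirk check: exact match on Family."""
    return family in ASSUMED_VAMAP_FAMILIES


# Excerpt of the type 1 record shown in this comment.
sample = """System Information
        Manufacturer: GIGABYTE
        Product Name: R272-P30-JG
        Family: Server
"""

family = family_from_dmidecode(sample)
print(family, needs_vamap(family))
```

With a Family of "Server", a check of this shape never fires, which would be consistent with the quirk having no effect on the R272-P30-JG.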
Because of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb684408f3ea4, newer kernels no longer have the problem described in comment 13.
We retried the beaker job from comment#10 and it failed to provision.

https://beaker-archive.hosts.prod.psi.bos.redhat.com/beaker-logs/2023/06/79230/7923076/14017452/console.log
https://beaker.engineering.redhat.com/recipes/14017452#installation

[ 27.383436] CPU: 28 PID: 202 Comm: kworker/u64:2 Tainted: G I ------- --- 6.4.0-0.rc4.20230529gite338142b39cf.35.fc39.aarch64 #1 m - D-Bus System Message Bus.
[ 27.399235] Hardware name: Lenovo HR350A 7X35CTO1WW /HR350A , BIOS hve104r-1.15 02/26/2021
[ 27.411824] Workqueue: efi_rts_wq efi_call_rts
[ 27.416512] pstate: 00000085 (nzcv daIf -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 27.423467] pc : efi_call_virt_check_flags+0x48/0xb8
[ 27.428424] lr : efi_call_rts+0x3a8/0x4c8
[ 27.432422] sp : ffff80001275bd20
[ 27.435724] x29: ffff80001275bd20 x28: 0000000000000000 x27: 0000000000000000
[ 27.442848] x26: 0000000000000000 x25: ffff80000af9ca28 x24: ffff80001280bd88
[ 27.449973] x23: ffff80001280bd40 x22: ffff800009a6c530 x21: 0000000000000080
[ 27.457097] x20: ffff800009a6c530 x19: 0000000000000000 x18: ffffffffffffffff
[ 27.464221] x17: 0000000000000000 x16: ffff80000c49c000 x15: 0000000000000000
[ 27.471345] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 27.478469] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800008edf2b8
[ 27.485594] x8 : 0000000000000000 x7 : 4a6851c6cc745c00 x6 : ffff80000c49bad0
[ 27.492718] x5 : ffff80000a3a5008 x4 : 000000fff6dc0018 x3 : 0000000000000001
[ 27.499842] x2 : ffff800009a6a627 x1 : ffff800009a6c530 x0 : 0000000000000080
[ 27.506966] Call trace:
[ 27.509401] efi_call_virt_check_flags+0x48/0xb8
[ 27.514007] efi_call_rts+0x3a8/0x4c8
[ 27.517658] process_one_work+0x1e4/0x488
[ 27.521657] worker_thread+0x74/0x418
[ 27.525306] kthread+0xf4/0x108
[ 27.528438] ret_from_fork+0x10/0x20
[ 27.532002] ---[ end trace 0000000000000000 ]---
[ 27.536606] Disabling lock debugging due to kernel taint
[ 27.541905] efi: [Firmware Bug]: IRQ flags corrupted (0x00000000=>0x00000080) by EFI set_variable
[ 27.550947] ------------[ cut here ]------------
[ 27.555556] WARNING: CPU: 26 PID: 224 at drivers/firmware/efi/runtime-wrappers.c:341 virt_efi_set_variable+0x194/0x1b0
[ 27.566244] Modules linked in: uas usb_storage nvme dwc3 igb nvme_core crct10dif_ce udc_core ast polyval_ce ulpi polyval_generic ghash_ce mlx4_core(+) sbsa_gwdt nvme_common i2c_algo_bit ahci_platform i2c_xgene_slimpro libahci_platform xgene_hwmon gpio_dwapb xhci_plat_hcd sunrpc lrw dm_crypt dm_round_robin linear raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 scsi_dh_hp_sw squashfs be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath fuse
[ 27.621611] CPU: 26 PID: 224 Comm: kworker/26:1 Tainted: G W I ------- --- 6.4.0-0.rc4.20230529gite338142b39cf.35.fc39.aarch64 #1
[ 27.634550] Hardware name: Lenovo HR350A 7X35CTO1WW /HR350A , BIOS hve104r-1.15 02/26/2021
[ 27.644362] Workqueue: events refresh_nv_rng_seed
[ 27.649056] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 27.656005] pc : virt_efi_set_variable+0x194/0x1b0
[ 27.660785] lr : virt_efi_set_variable+0x178/0x1b0
[ 27.665564] sp : ffff80001280bcf0
[ 27.668866] x29: ffff80001280bcf0 x28: 0000000000000000 x27: 0000000000000000
[ 27.675991] x26: ffff000807442674 x25: ffff80000b9a0848 x24: ffff80000b9a0000
[ 27.683116] x23: ffff800009a6a628 x22: ffff80001280bd78 x21: 8000000000000015
[ 27.690240] x20: ffff80000ae7db20 x19: ffff80000b9a07d0 x18: 0000000000000014
[ 27.697365] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 27.704489] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 27.711613] x11: 0000000000000000 x10: 0000000000001da0 x9 : ffff800009263ae8
[ 27.718737] x8 : ffff000807559e00 x7 : 0000000000000000 x6 : 00000000000000b0
[ 27.725861] x5 : 00000000500f0000 x4 : 0000000000000000 x3 : 0000000000000001
[ 27.732985] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 8000000000000015
[ 27.740110] Call trace:
[ 27.742544] virt_efi_set_variable+0x194/0x1b0
[ 27.746977] refresh_nv_rng_seed+0x88/0xc8
[ 27.751061] process_one_work+0x1e4/0x488
[ 27.755059] worker_thread+0x74/0x418
[ 27.758709] kthread+0xf4/0x108
[ 27.761839] ret_from_fork+0x10/0x20
[ 27.765403] ---[ end trace 0000000000000000 ]---
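The flag values in the "IRQ flags corrupted" message can be decoded by hand. On arm64 the D, A, I, and F exception mask bits occupy bits 9 through 6 of PSTATE, so 0x80 corresponds to the IRQ mask bit. A short Python sketch of that decoding; the helper name is hypothetical, and reading the result as "the runtime service returned with IRQs masked" is an interpretation rather than something the log states:

```python
# arm64 PSTATE exception mask bits (D, A, I, F).
DAIF_BITS = {0x200: "D", 0x100: "A", 0x080: "I", 0x040: "F"}


def changed_masks(before, after):
    """Return the DAIF mask letters whose bits differ between two flag values."""
    diff = before ^ after
    return [name for bit, name in DAIF_BITS.items() if diff & bit]


# Values taken from "IRQ flags corrupted (0x00000000=>0x00000080)".
print(changed_masks(0x00000000, 0x00000080))
```

That reading matches the "daIf" in the pstate line of the first trace, where only the I flag is shown in upper case.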
The recipe linked in comment 16 is not a retry of the failed jobs linked in comment 10. In particular, the hostRequires tag in the comment 16 recipe did not pick up a MtSnow AltraMax. A more specific hostname tag should help dodge Lenovo HR350A systems, and other Ampere systems unrelated to this issue.
(In reply to Eirik Fuller from comment #17)
> The recipe linked in comment 16 is not a retry of the failed jobs linked in
> comment 10.
>
> In particular, the hostRequires tag in the comment 16 recipe did not pick up
> a MtSnow AltraMax. A more specific hostname tag should help dodge Lenovo
> HR350A systems, and other Ampere systems unrelated to this issue.

Thanks Eirik,

By clone I mean we cloned the beaker job XML as is, but as you pointed out, that did not mean that we would get the same system. If RHEL9 is supported for a system, then we test rawhide/eln there too. So should I open a new BZ to track this issue, or should we not be testing upstream rawhide/eln on Lenovo HR350A systems?

Scott
Recent RHEL9 kernels seem to work on Lenovo HR350A systems, so yes, it is good for Rawhide to work on them also. The kernel messages in comment 16 suggest a firmware issue, but not necessarily the same one tracked by this bug, in which case tracking it elsewhere makes sense.
https://beaker.engineering.redhat.com/recipes/14079178 reveals that a newer Rawhide (newer than in comment 16) boots and installs properly on a MtSnow AltraMax system.
(In reply to Eirik Fuller from comment #20)
> https://beaker.engineering.redhat.com/recipes/14079178 reveals that a newer
> Rawhide (newer than in comment 16) boots and installs properly on a MtSnow
> AltraMax system.

Thank you. I created BZ2214351 to track the Lenovo HR350A systems.

Bruno, do you think we can close this as fixed in CURRENTRELEASE?
Yes, based on comment#20 we can close this as fixed.