Bug 2157656 - [edk2] [aarch64] Unable to initialize EFI firmware when using edk2-aarch64-20221207gitfff6d81270b5-1.el9 in some hardwares
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: edk2
Version: 9.2
Hardware: aarch64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Oliver Steffen
QA Contact: Yihuang Yu
URL:
Whiteboard:
Duplicates: 2165623
Depends On:
Blocks:
Reported: 2023-01-02 14:03 UTC by Yihuang Yu
Modified: 2023-05-09 07:58 UTC (History)
CC List: 19 users

Fixed In Version: edk2-20221207gitfff6d81270b5-5.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-09 07:23:58 UTC
Type: ---
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Gitlab | redhat/centos-stream/src edk2 merge_requests 24 | 0 | None | opened | Backport Cavium ThunderX(2) Erratum | 2023-01-25 12:54:53 UTC
Gitlab | redhat/centos-stream/src edk2 merge_requests 25 | 0 | None | opened | Revert "ArmVirtPkg/ArmVirtQemu: enable initial ID map at early boot" | 2023-01-31 14:55:34 UTC
Red Hat Issue Tracker | RHELPLAN-143428 | 0 | None | None | None | 2023-01-02 14:07:35 UTC
Red Hat Product Errata | RHSA-2023:2165 | 0 | None | None | None | 2023-05-09 07:25:22 UTC

Description Yihuang Yu 2023-01-02 14:03:09 UTC
Description of problem:
After edk2 was rebased to 202211, when I use edk2-aarch64-20221207gitfff6d81270b5-1.el9 on a virtlab machine, the qemu process can be launched, but it is unable to initialize the EFI firmware, so the guest hangs.

Version-Release number of selected component (if applicable):
edk2 version: edk2-aarch64-20221207gitfff6d81270b5-1.el9.noarch
qemu version: qemu-kvm-7.2.0-2.el9.aarch64
host kernel: kernel-5.14.0-226.el9.aarch64

How reproducible:
Always, but right now I can only reproduce it on one virtlab machine (virtlab-arm02.virt.lab.eng.bos.redhat.com).

Steps to Reproduce:
1. Try to install a guest using the following qemu command line:
MALLOC_PERTURB_=1  /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -blockdev node-name=file_aavmf_code,driver=file,filename=/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw,auto-read-only=on,discard=unmap \
    -blockdev node-name=drive_aavmf_code,driver=raw,read-only=on,file=file_aavmf_code \
    -blockdev node-name=file_aavmf_vars,driver=file,filename=/root/avocado/data/avocado-vt/avocado-vt-vm1_rhel920-aarch64-virtio-scsi_qcow2_filesystem_VARS.fd,auto-read-only=on,discard=unmap \
    -blockdev node-name=drive_aavmf_vars,driver=raw,read-only=off,file=file_aavmf_vars \
    -machine virt,gic-version=host,memory-backend=mem-machine_mem,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device virtio-gpu-pci,bus=pcie-root-port-1,addr=0x0 \
    -m 13312 \
    -object '{"qom-type": "memory-backend-ram", "size": 13958643712, "id": "mem-machine_mem"}'  \
    -smp 112,maxcpus=112,cores=56,threads=1,clusters=1,sockets=2  \
    -cpu 'host' \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-2,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device '{"id": "virtio_scsi_pci0", "driver": "virtio-scsi-pci", "bus": "pcie-root-port-3", "addr": "0x0"}' \
    -blockdev '{"node-name": "file_image1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/images/rhel920-aarch64-virtio-scsi.qcow2", "cache": {"direct": true, "no-flush": false}}' \
    -blockdev '{"node-name": "drive_image1", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_image1"}' \
    -device '{"driver": "scsi-hd", "id": "image1", "drive": "drive_image1", "write-cache": "on"}' \
    -device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x1.0x4,bus=pcie.0,chassis=5 \
    -device virtio-net-pci,mac=9a:19:04:e8:73:ea,rombar=0,id=idjJMOvi,netdev=idd1GEZm,bus=pcie-root-port-4,addr=0x0  \
    -netdev tap,id=idd1GEZm,vhost=on \
    -blockdev '{"node-name": "file_cd1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/iso/linux/RHEL-9.2.0-20230102.0-aarch64-dvd1.iso", "cache": {"direct": true, "no-flush": false}}' \
    -blockdev '{"node-name": "drive_cd1", "driver": "raw", "read-only": true, "cache": {"direct": true, "no-flush": false}, "file": "file_cd1"}' \
    -device '{"driver": "scsi-cd", "id": "cd1", "drive": "drive_cd1", "write-cache": "on"}' \
    -blockdev '{"node-name": "file_unattended", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/images/rhel920-aarch64/ks.iso", "cache": {"direct": true, "no-flush": false}}' \
    -blockdev '{"node-name": "drive_unattended", "driver": "raw", "read-only": true, "cache": {"direct": true, "no-flush": false}, "file": "file_unattended"}' \
    -device '{"driver": "scsi-cd", "id": "unattended", "drive": "drive_unattended", "write-cache": "on"}'  \
    -kernel '/home/kvm_autotest_root/images/rhel920-aarch64/vmlinuz'  \
    -append 'inst.sshd inst.repo=cdrom inst.ks=cdrom:/ks.cfg net.ifnames=0 console=ttyAMA0,38400'  \
    -initrd '/home/kvm_autotest_root/images/rhel920-aarch64/initrd.img'  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -no-shutdown \
    -enable-kvm \
    -serial stdio

Actual results:
The serial console does not display any output and the UEFI firmware is not initialized.

Expected results:
After downgrading to edk2-aarch64-20220826gitba0e0e4c6a-2.el9.noarch, the guest can be installed and the serial console shows the firmware starting:
UEFI firmware starting.
^@^@
SyncPcrAllocationsAndPcrMask!
Tpm2GetCapabilityPcrs - 00000004
...
...

Additional info:
I bisected the source code with "git bisect", and the issue seems to come from this pull request: https://github.com/tianocore/edk2/pull/3538. If I revert this series of commits, the edk2 I build works.
And I think the bad commit is this one: https://github.com/tianocore/edk2/commit/07be1d34d95460a238fcd0f6693efb747c28b329
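
For readers unfamiliar with the workflow, the following is a minimal sketch of how "git bisect run" narrows a regression down to a single commit. It does not touch edk2: the throwaway repository, the "state" file, and the function name bisect_demo are hypothetical stand-ins. In a real run, the test command would build the firmware and check whether a guest boots.

```shell
#!/bin/sh
# Sketch: automate "find the first bad commit" with `git bisect run`.
# Everything here (repository, files, function name) is a hypothetical
# stand-in and is not taken from this bug report.

bisect_demo() (
    set -eu
    repo=$(mktemp -d)
    trap 'rm -rf "$repo"' EXIT
    cd "$repo"
    git init -q .
    git config user.email "demo@example.invalid"
    git config user.name "bisect demo"
    git config commit.gpgsign false

    # Six commits; commit 4 introduces the regression and it persists.
    for i in 1 2 3 4 5 6; do
        if [ "$i" -ge 4 ]; then echo bad > state; else echo good > state; fi
        echo "$i" > counter
        git add state counter
        git commit -q -m "commit $i"
    done

    # Known bad: HEAD. Known good: the root commit.
    first=$(git rev-list --max-parents=0 HEAD)
    git bisect start HEAD "$first" >/dev/null

    # Exit status 0 marks a commit good; 1-127 (except 125) marks it bad.
    git bisect run sh -c 'test "$(cat state)" = good' >/dev/null

    # refs/bisect/bad now points at the first bad commit.
    git log -1 --format=%s refs/bisect/bad
    git bisect reset >/dev/null 2>&1
)
```

Calling bisect_demo prints the subject of the first bad commit in the demo repository. When a regression arrives as a multi-commit series, as described above, reverting the whole series is a common follow-up check.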

Comment 2 Yihuang Yu 2023-01-05 11:19:33 UTC
I can now also reproduce this issue on other hardware (ampere-mtsnow-altramax) with the basic ksm and nvdimm test cases.

 (1/2) Host_RHEL.m9.u2.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.2.0.aarch64.io-github-autotest-qemu.ksm_base.base.arm64-pci: STARTED
 (1/2) Host_RHEL.m9.u2.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.2.0.aarch64.io-github-autotest-qemu.ksm_base.base.arm64-pci: ERROR: No ipv4 DHCP lease for MAC 9a:02:d1:f1:27:ef (756.68 s)
 (2/2) Host_RHEL.m9.u2.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.2.0.aarch64.io-github-autotest-qemu.nvdimm.nvdimm_basic.arm64-pci: STARTED
 (2/2) Host_RHEL.m9.u2.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.2.0.aarch64.io-github-autotest-qemu.nvdimm.nvdimm_basic.arm64-pci: ERROR: No ipv4 DHCP lease for MAC 9a:bf:40:b2:65:2a (515.71 s)

serial log:

2023-01-05 05:19:03: Ncat: Connection reset by peer.
2023-01-05 05:19:03: (Process terminated with status 1)

Comment 3 Gerd Hoffmann 2023-01-05 14:20:39 UTC
https://edk2.groups.io/g/devel/message/97864

Comment 4 Gerd Hoffmann 2023-01-05 14:23:17 UTC
https://edk2.groups.io/g/devel/topic/96054879#97947 (patch, note reply with fix)

Comment 5 Gerd Hoffmann 2023-01-09 14:21:58 UTC
Oliver, can you do a scratch build for QE to test with the proposed patch (including the incremental fix, see comment 4)?

Comment 6 Oliver Steffen 2023-01-09 14:39:37 UTC
Hi Gerd,

In the meantime, we have continued the discussion in Jira:
https://issues.redhat.com/browse/RHELPLAN-143428

Comment 7 Oliver Steffen 2023-01-10 14:01:04 UTC
Copying from Jira:

--------------------

Oliver Steffen added a comment:

Hi Yihuang Yu,

Here is a scratch build with the patch:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=49896954
Repo file: http://brew-task-repos.usersys.redhat.com/repos/scratch/osteffen/edk2/20221207gitfff6d81270b5/2.el9.test/edk2-20221207gitfff6d81270b5-2.el9.test-scratch.repo

Can you try this out and see if it fixes the issue?

Let me know if I can do anything to help.

Thanks,
   Oliver

--------------------

Yihuang Yu added a comment

Hello Oliver Steffen,

I tested again with the scratch build on a ThunderX machine, but it does not seem to fix the issue. After the qemu process was started, the serial console still had no output. My automated test case uses ncat to connect to the serial console. The following is the whole content of the log file, which contains only timeout info.

2023-01-09 04:42:57: Ncat: Connection reset by peer.
2023-01-09 04:42:57: (Process terminated with status 1) 

Besides, on another machine, an A64FX (vendor: Fujitsu), I can also trigger the issue when the qemu command line contains an nvdimm device. I think this should be a generic reproducer, but I am not sure whether it is the same problem.

[stdlog]     -machine virt,nvdimm=on,memory-backend=mem-machine_mem,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars \
[stdlog]     -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
[stdlog]     -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
[stdlog]     -nodefaults \
[stdlog]     -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
[stdlog]     -device virtio-gpu-pci,bus=pcie-root-port-1,addr=0x0 \
[stdlog]     -m 14336,maxmem=32G,slots=4 \
[stdlog]     -object memory-backend-file,size=1G,mem-path=/tmp/nvdimm0,share=yes,id=mem-mem1 \
[stdlog]     -device nvdimm,id=dimm-mem1,memdev=mem-mem1 \ 

--------------------

Comment 8 Gerd Hoffmann 2023-01-11 11:29:58 UTC
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=49896954
> Repo file: http://brew-task-repos.usersys.redhat.com/repos/scratch/osteffen/edk2/20221207gitfff6d81270b5/2.el9.test/edk2-20221207gitfff6d81270b5-2.el9.test-scratch.repo

Any chance you forgot to flip the switch?  The workaround is not enabled by
default, so you have to add a line to the build config to turn it on.

Comment 9 Oliver Steffen 2023-01-12 08:43:48 UTC
Should be on, see these two commits:

Ard's patch, incl. fix:
https://gitlab.com/osteffen/edk2/-/commit/f7be6f884d1db372e4a79946f8e6351e0aac710a

Switch it on in the build:
https://gitlab.com/osteffen/edk2/-/commit/efd9b2d98e30c4bc338c53d88ba77c2f017daf37

Comment 10 Gerd Hoffmann 2023-01-12 16:18:12 UTC
> Switch it on in the build:
> https://gitlab.com/osteffen/edk2/-/commit/efd9b2d98e30c4bc338c53d88ba77c2f017daf37

Looks correct.

Merged meanwhile:
https://github.com/tianocore/edk2/commit/ec54ce1f1ab41b92782b37ae59e752fff0ef9c41

Comment 11 Li Xiaohui 2023-01-13 12:32:46 UTC
When testing postcopy migration with edk2-aarch64-20221207gitfff6d81270b5-1.el9.noarch, I hit one issue:
Start a postcopy migration from the source host to the destination host. After the migration completes, reboot the guest on the destination host. The guest then hangs during reboot.

I also tried edk2-aarch64-20220826gitba0e0e4c6a-2.el9.noarch and did not hit the above issue.

So the issue I hit seems to be the same as this bug. I will track the fix for this bug and verify postcopy migration once verification is possible.

Comment 12 Oliver Steffen 2023-01-13 13:54:03 UTC
The upstream patch was modified since the scratch build.

I started a new build, based on the patches that were
actually merged upstream.

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=50054555

Please retry with this. Thanks.

-Oliver

Comment 13 Yihuang Yu 2023-01-16 02:52:58 UTC
Hmm, it still does not work. The serial log contains no guest output.

# rpm -qa edk2-aarch64
edk2-aarch64-20221207gitfff6d81270b5-2.el9.test.noarch

# cat serial-serial0-avocado-vt-vm1.log
2023-01-14 09:46:48: Ncat: Connection reset by peer.
2023-01-14 09:46:48: (Process terminated with status 1)

Oliver, I am not sure if I missed something. If you need the host for debugging, please let me know. Thanks.
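
The check done by eye above can be sketched as a small shell helper. This assumes the log layout seen in this report, where the test harness writes timestamped "Ncat:" and "(Process terminated with status N)" lines into the same file as the guest's serial output. The function name serial_log_has_output is hypothetical.

```shell
#!/bin/sh
# Sketch: decide whether a serial log holds any guest output at all.
# Assumption: harness noise lines look like the ones quoted in this report,
# i.e. "<timestamp>: Ncat: ..." and "<timestamp>: (Process terminated with status N)".

serial_log_has_output() {
    # $1: path to the serial log file
    # Drop harness noise and blank lines; succeed only if something remains.
    grep -v \
        -e ': Ncat: ' \
        -e ': (Process terminated with status ' \
        -e '^[[:space:]]*$' \
        "$1" | grep -q .
}
```

Usage: `serial_log_has_output serial-serial0-avocado-vt-vm1.log || echo "firmware produced no serial output"`.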

Comment 14 Oliver Steffen 2023-01-17 13:26:29 UTC
I tried it out again on the machine and can confirm that the patch from upstream does not work on this hardware.

Asking about that on the upstream list.

In the meantime, I'll prepare a scratch build with the problematic commit reverted (as a fallback).

Comment 15 Oliver Steffen 2023-01-18 07:46:15 UTC
Scratch build with the revert:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=50131423

Code: https://gitlab.com/osteffen/edk2/-/tree/bz-215656-revert

I did a quick test of the revert commit on the machine and it seems to
work. No need to test this scratch build right away.

Comment 16 Oliver Steffen 2023-01-19 13:08:23 UTC
The upstream fix works with the latest kernel:
6.1.7-200.fc37.aarch64 (https://koji.fedoraproject.org/koji/buildinfo?buildID=2112315)

Apparently this commit is required to make it work:
https://github.com/torvalds/linux/commit/406504c7b0405d74d74c15a667cd4c4620c3e7a9

I think we have two options now:

A) More work
  - Backport the kernel patch
  - and backport EDK2 fix
    (see Comment 12)

B) revert the breaking commit (see Comment 15)

Comment 17 Yihuang Yu 2023-01-19 14:32:23 UTC
Got it. So if we want solution "A", we also need the kernel fix. Since we are in the middle of the 9.2.0 kvm rebase, maybe we can backport this fix as well. @cohuck, what do you think? Alternatively, we could report a separate kernel bug to avoid blocking the rebase process.

Solution "B" is an easy fix and I can accept it, but I still prefer solution "A".

Comment 18 Eirik Fuller 2023-01-19 15:16:46 UTC
Bug 2162404 is already requesting the kernel patch linked in comment 16.

Comment 19 Cornelia Huck 2023-01-19 15:25:31 UTC
I'd prefer to fix it via that separate bz (and not hold up the rebase.)

Comment 20 Oliver Steffen 2023-01-25 12:49:52 UTC
See bug 2162404 comment 9
-> going ahead with the backport of the fix.

For reference, this is the upstream PR:
https://github.com/tianocore/edk2/pull/3878/commits

scratch build:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=50281185

MR:
https://gitlab.com/redhat/centos-stream/src/edk2/-/merge_requests/24

Comment 32 Yihuang Yu 2023-02-07 05:54:07 UTC
Verified this bug with edk2-aarch64-20221207gitfff6d81270b5-5.el9.noarch and kernel-5.14.0-256.el9.aarch64.

Env:
kernel-5.14.0-256.el9.aarch64
qemu-kvm-7.2.0-6.el9.aarch64
edk2-aarch64-20221207gitfff6d81270b5-5.el9.noarch

Of 89 tests executed, 88 passed and 0 warned: a success rate of 98.88% (excluding SKIP and CANCEL).
1 test case failed due to another known issue.

http://10.0.136.47/7486271/results.html
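
The quoted success rate follows from 88 / 89. A minimal sketch of the arithmetic, with a hypothetical helper name pass_rate:

```shell
#!/bin/sh
# Sketch: success rate in percent, rounded to two decimals.
# pass_rate is a hypothetical helper, not part of any test framework.

pass_rate() {
    # $1: passed tests, $2: executed tests (SKIP and CANCEL already excluded)
    LC_ALL=C awk -v p="$1" -v t="$2" 'BEGIN {
        if (t <= 0) { print "n/a"; exit }
        printf "%.2f\n", 100 * p / t
    }'
}

pass_rate 88 89   # prints 98.88
```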

Comment 33 Li Xiaohui 2023-02-07 09:08:25 UTC
Verified the postcopy cases on edk2-aarch64-20221207gitfff6d81270b5-5.el9.noarch; they all pass.

[root@ampere-hr330a-07 ipa]# python3 Start2Run.py --test_requirement=VIRT_49060_aarch64_blockdev --src_host_ip=10.19.241.172 --dst_host_ip=10.19.241.182 --share_images_dir=/mnt/xiaohli --sys_image_name=rhel920-aarch64-virtio-scsi.qcow2
**********************************************************************************************
RESULTS [VIRT-49060-AARCH64-BLOCKDEV]:
==>TOTAL : 11
==>PASS : 11 
   1: BASE-TEST-POSTCOPY-Migration basic precopy test without setting downtime and speed (5 min 4 sec)
   2: VIRT-49062-[postcopy] Migration finishes only with postcopy under high stress (rhel only) (15 min 25 sec)
   3: VIRT-58670-[postcopy] Cancel migration during the precopy phase (1 min 28 sec)
   4: VIRT-58672-[postcopy] Source should recovers when fail the destination during the precopy phase (1 min 28 sec)
   5: VIRT-85702-[postcopy] Post-copy migration with XBZRLE compression (3 min 16 sec)
   6: VIRT-86251-[postcopy] live migration post-copy support file-backed memory (3 min 44 sec)
   7: VIRT-93722-[postcopy] Postcopy migration with Numa pinned and Hugepage pinned guest (3 min 20 sec)
   8: VIRT-294886-[migration] Postcopy migration recover after migrate-pause (2 min 20 sec)
   9: RHEL-150076-[postcopy] Set postcopy migration speed(max-postcopy-bandwidth) (4 min 36 sec)
   10: RHEL-186017-[postcopy] Basic postcopy migration (3 min 12 sec)
   11: RHEL-189930-[postcopy] Post-copy migration with enabling auto-converge (3 min 24 sec)
==>ERROR : 0 
==>FAIL : 0 
==>CANCEL : 0 
==>SKIP : 0 
==>WARN : 0 
==>RUN TIME : 60 min 23 sec 
==>TEST LOG : /home/ipa/test_logs/VIRT_49060_aarch64_blockdev-2023-02-07-02:21:32 
**********************************************************************************************

Comment 34 lijin 2023-02-16 08:46:45 UTC
*** Bug 2165623 has been marked as a duplicate of this bug. ***

Comment 36 errata-xmlrpc 2023-05-09 07:23:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: edk2 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2165

