Bug 2157656
Summary: | [edk2] [aarch64] Unable to initialize EFI firmware when using edk2-aarch64-20221207gitfff6d81270b5-1.el9 in some hardwares | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 9 | Reporter: | Yihuang Yu <yihyu> |
Component: | edk2 | Assignee: | Oliver Steffen <osteffen> |
Status: | CLOSED ERRATA | QA Contact: | Yihuang Yu <yihyu> |
Severity: | urgent | Docs Contact: | |
Priority: | high | ||
Version: | 9.2 | CC: | alougovs, berrange, cohuck, coli, eauger, efuller, gshan, jinzhao, juzhang, kraxel, lijin, osteffen, pbonzini, pbunyan, vgoyal, virt-maint, xiaohli, xuwei, zhenyzha |
Target Milestone: | rc | Keywords: | Regression, Triaged |
Target Release: | --- | ||
Hardware: | aarch64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | edk2-20221207gitfff6d81270b5-5.el9 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2023-05-09 07:23:58 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Yihuang Yu
2023-01-02 14:03:09 UTC
Well, now I can also reproduce this issue on another HW (ampere-mtsnow-altramax) with basic ksm and nvdimm test cases.

(1/2) Host_RHEL.m9.u2.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.2.0.aarch64.io-github-autotest-qemu.ksm_base.base.arm64-pci: STARTED
(1/2) Host_RHEL.m9.u2.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.2.0.aarch64.io-github-autotest-qemu.ksm_base.base.arm64-pci: ERROR: No ipv4 DHCP lease for MAC 9a:02:d1:f1:27:ef (756.68 s)
(2/2) Host_RHEL.m9.u2.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.2.0.aarch64.io-github-autotest-qemu.nvdimm.nvdimm_basic.arm64-pci: STARTED
(2/2) Host_RHEL.m9.u2.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.2.0.aarch64.io-github-autotest-qemu.nvdimm.nvdimm_basic.arm64-pci: ERROR: No ipv4 DHCP lease for MAC 9a:bf:40:b2:65:2a (515.71 s)

serial log:
2023-01-05 05:19:03: Ncat: Connection reset by peer.
2023-01-05 05:19:03: (Process terminated with status 1)

https://edk2.groups.io/g/devel/topic/96054879#97947 (patch, note reply with fix)

Oliver, can you do a scratch build for QE to test with the proposed patch (including the incremental fix, see comment 4)?

Hi Gerd, we have continued the discussion meanwhile in Jira: https://issues.redhat.com/browse/RHELPLAN-143428

Copying from Jira:
--------------------
Oliver Steffen added a comment:

Hi Yihuang Yu,

Here is a scratch build with the patch: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=49896954
Repo file: http://brew-task-repos.usersys.redhat.com/repos/scratch/osteffen/edk2/20221207gitfff6d81270b5/2.el9.test/edk2-20221207gitfff6d81270b5-2.el9.test-scratch.repo

Can you try this out and see if it fixes the issue? Let me know if I can do anything to help.

Thanks,
Oliver
--------------------
Yihuang Yu added a comment

Hello Oliver Steffen,

I used the scratch build and tested again on a ThunderX machine, but it seems it cannot really fix the issue. After the qemu process was started, the serial console still had no output.
My auto test case uses ncat to connect to the serial console. The following is the whole file content, which only contains timeout info.

2023-01-09 04:42:57: Ncat: Connection reset by peer.
2023-01-09 04:42:57: (Process terminated with status 1)

Besides, on another A64FX (Fujitsu vendor) machine, I can also trigger the issue when the qemu command line contains an nvdimm device. This should be a generic reproducer I think, but I am not sure if they are the same problem.

[stdlog] -machine virt,nvdimm=on,memory-backend=mem-machine_mem,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars \
[stdlog] -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
[stdlog] -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
[stdlog] -nodefaults \
[stdlog] -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
[stdlog] -device virtio-gpu-pci,bus=pcie-root-port-1,addr=0x0 \
[stdlog] -m 14336,maxmem=32G,slots=4 \
[stdlog] -object memory-backend-file,size=1G,mem-path=/tmp/nvdimm0,share=yes,id=mem-mem1 \
[stdlog] -device nvdimm,id=dimm-mem1,memdev=mem-mem1 \
--------------------

> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=49896954
> Repo file:
> http://brew-task-repos.usersys.redhat.com/repos/scratch/osteffen/edk2/20221207gitfff6d81270b5/2.el9.test/edk2-20221207gitfff6d81270b5-2.el9.test-scratch.repo
Any chance you forgot to flip the switch? The workaround is not enabled by
default so you have to add a line to the build config to turn it on.

Should be on, see these two commits:
Ard's patch, incl. fix: https://gitlab.com/osteffen/edk2/-/commit/f7be6f884d1db372e4a79946f8e6351e0aac710a
Switch it on in the build: https://gitlab.com/osteffen/edk2/-/commit/efd9b2d98e30c4bc338c53d88ba77c2f017daf37

> Switch it on in the build:
> https://gitlab.com/osteffen/edk2/-/commit/efd9b2d98e30c4bc338c53d88ba77c2f017daf37
Looks correct.

Merged meanwhile: https://github.com/tianocore/edk2/commit/ec54ce1f1ab41b92782b37ae59e752fff0ef9c41

When I tested postcopy migration on edk2-aarch64-20221207gitfff6d81270b5-1.el9.noarch, I hit one issue:
Start postcopy migration from the source to the destination host; after postcopy migration, reboot the guest on the destination host; the guest then hangs at the reboot stage.
I also tried edk2-aarch64-20220826gitba0e0e4c6a-2.el9.noarch and didn't hit the above issue. So it seems the issue I met is the same as this bug. I will track the fix of this bug and verify postcopy migration when we can do verification.

The upstream patch was modified since the scratch build. I started a new build, based on the patches that were actually merged upstream.
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=50054555
Please retry with this.
Thanks.
-Oliver

Hum, it still does not work. The serial log is empty.
# rpm -qa edk2-aarch64
edk2-aarch64-20221207gitfff6d81270b5-2.el9.test.noarch
# cat serial-serial0-avocado-vt-vm1.log
2023-01-14 09:46:48: Ncat: Connection reset by peer.
2023-01-14 09:46:48: (Process terminated with status 1)
Oliver, I am not sure if I missed something. If you need the host for debugging, please let me know, thanks.

I tried it out again on the machine and can confirm that the patch from upstream does not work on this hardware. Asking about that on the upstream list.
In the meantime, I'll prepare a scratch build with the problematic commit reverted (as a fallback).
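
For anyone triaging similar runs: the "serial log is empty" symptom reported above (a log that holds nothing but the ncat disconnect lines) can be checked mechanically. The following is only a minimal sketch. The timestamp prefix and the two noise messages are copied from the logs quoted in this bug; the function name `serial_log_has_guest_output` is made up for illustration and is not part of avocado-vt or any other tool mentioned here.

```python
import re

# Assumed line shape, taken from the logs quoted in this bug:
#   "YYYY-MM-DD HH:MM:SS: <message>"
_STAMPED = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}: (.*)$")

# Messages produced by the ncat connection itself, not by the guest firmware.
_NCAT_NOISE = (
    "Ncat: Connection reset by peer.",
    "(Process terminated with status 1)",
)


def serial_log_has_guest_output(text: str) -> bool:
    """Return True if any non-blank line is something other than ncat noise."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _STAMPED.match(line)
        # Strip the timestamp prefix when present, otherwise use the line as-is.
        message = match.group(1) if match else line
        if message not in _NCAT_NOISE:
            return True
    return False
```

A log for which this returns False matches the failure pattern seen here: the firmware never wrote anything to the serial console.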

Scratch build with the revert: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=50131423
Code: https://gitlab.com/osteffen/edk2/-/tree/bz-215656-revert

I did a quick test of the revert-commit in the machine and it seems to work. No need to test this scratch build right away.

The upstream fix works with the latest kernel: 6.1.7-200.fc37.aarch64 (https://koji.fedoraproject.org/koji/buildinfo?buildID=2112315)
Apparently this commit is required to make it work: https://github.com/torvalds/linux/commit/406504c7b0405d74d74c15a667cd4c4620c3e7a9

I think we have two options now:
A) More work
   - Backport the kernel patch
   - and backport EDK2 fix (see Comment 12)
B) revert the breaking commit (see Comment 15)

Got it. So if we want the "A" solution, we also need the kernel fix. Since we are in the 9.2.0 kvm rebase, maybe we can also backport this fix. @cohuck, what do you think? Or report a kernel bug to avoid blocking the rebase process.
The "B" solution is an easy fix, I can accept it, but I still prefer the solution "A".

Bug 2162404 is already requesting the kernel patch linked in comment 16. I'd prefer to fix it via that separate bz (and not hold up the rebase.)

see comments in bug 2162404 comment 9 -> going ahead with the backport of the fix.

For reference, this is the upstream PR: https://github.com/tianocore/edk2/pull/3878/commits
scratch build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=50281185
MR: https://gitlab.com/redhat/centos-stream/src/edk2/-/merge_requests/24

Verify this bug with edk2-aarch64-20221207gitfff6d81270b5-5.el9.noarch and kernel-5.14.0-256.el9.aarch64

Env:
kernel-5.14.0-256.el9.aarch64
qemu-kvm-7.2.0-6.el9.aarch64
edk2-aarch64-20221207gitfff6d81270b5-5.el9.noarch

From 89 tests executed, 88 passed and 0 warned - success rate of 98.88% (excluding SKIP and CANCEL)
1 test case failed by another known issue
http://10.0.136.47/7486271/results.html

Verify postcopy cases on edk2-aarch64-20221207gitfff6d81270b5-5.el9.noarch, they all pass.

[root@ampere-hr330a-07 ipa]# python3 Start2Run.py --test_requirement=VIRT_49060_aarch64_blockdev --src_host_ip=10.19.241.172 --dst_host_ip=10.19.241.182 --share_images_dir=/mnt/xiaohli --sys_image_name=rhel920-aarch64-virtio-scsi.qcow2
**********************************************************************************************
RESULTS [VIRT-49060-AARCH64-BLOCKDEV]:
==>TOTAL : 11
==>PASS : 11
   1: BASE-TEST-POSTCOPY-Migration basic precopy test without setting downtime and speed (5 min 4 sec)
   2: VIRT-49062-[postcopy] Migration finishes only with postcopy under high stress (rhel only) (15 min 25 sec)
   3: VIRT-58670-[postcopy] Cancel migration during the precopy phase (1 min 28 sec)
   4: VIRT-58672-[postcopy] Source should recovers when fail the destination during the precopy phase (1 min 28 sec)
   5: VIRT-85702-[postcopy] Post-copy migration with XBZRLE compression (3 min 16 sec)
   6: VIRT-86251-[postcopy] live migration post-copy support file-backed memory (3 min 44 sec)
   7: VIRT-93722-[postcopy] Postcopy migration with Numa pinned and Hugepage pinned guest (3 min 20 sec)
   8: VIRT-294886-[migration] Postcopy migration recover after migrate-pause (2 min 20 sec)
   9: RHEL-150076-[postcopy] Set postcopy migration speed(max-postcopy-bandwidth) (4 min 36 sec)
   10: RHEL-186017-[postcopy] Basic postcopy migration (3 min 12 sec)
   11: RHEL-189930-[postcopy] Post-copy migration with enabling auto-converge (3 min 24 sec)
==>ERROR : 0
==>FAIL : 0
==>CANCEL : 0
==>SKIP : 0
==>WARN : 0
==>RUN TIME : 60 min 23 sec
==>TEST LOG : /home/ipa/test_logs/VIRT_49060_aarch64_blockdev-2023-02-07-02:21:32
**********************************************************************************************

*** Bug 2165623 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: edk2 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2165
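
As a convenience for checking hosts against the "Fixed In Version" field of this bug, here is a rough sketch that compares an installed package string (as printed by `rpm -qa edk2-aarch64`) with the fixed build. It assumes the `name-version-release.arch` shape seen in this report and only gives an answer for builds of the same upstream snapshot, 20221207gitfff6d81270b5. The function name `has_fix` and the constants are illustrative, not part of any Red Hat tooling. It covers the firmware package only; the verification above was done together with kernel-5.14.0-256.el9.

```python
import re
from typing import Optional

# Taken from the "Fixed In Version" field: edk2-20221207gitfff6d81270b5-5.el9
FIXED_VERSION = "20221207gitfff6d81270b5"
FIXED_RELEASE = 5

# Assumed shape: edk2[-aarch64]-<version>-<release number>.<dist and arch>
_NVR = re.compile(
    r"^edk2(?:-aarch64)?-(?P<version>[^-]+)-(?P<release>\d+)\.(?P<rest>.+)$"
)


def has_fix(nvr: str) -> Optional[bool]:
    """Return True/False for builds of the affected snapshot, else None."""
    match = _NVR.match(nvr.strip())
    if match is None or match.group("version") != FIXED_VERSION:
        # Different snapshot or unparseable string: this sketch cannot tell.
        return None
    return int(match.group("release")) >= FIXED_RELEASE
```

For example, the `-1.el9` build from the summary and the `-2.el9.test` scratch builds come out as False, the `-5.el9` build as True, and the older 20220826 build as None because it predates the affected snapshot.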