Bug 1678713
| Summary: | implement missing reset handler for cfi.pflash01 - fast train | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | Yanan Fu <yfu> | ||||
| Component: | qemu-kvm | Assignee: | Philippe Mathieu-Daudé <philmd> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Xueqiang Wei <xuwei> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 8.0 | CC: | berrange, chayang, ddepaula, jinzhao, juzhang, kraxel, pbonzini, philmd, rbalakri, virt-maint, xfu, xuwei, yfu, yuhuang | ||||
| Target Milestone: | rc | Flags: | knoel:
mirror+
|
||||
| Target Release: | 8.0 | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | qemu-kvm-4.1.0-3.module+el8.1.0+4019+dbfaedf9 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | |||||||
| : | 1707192 (view as bug list) | Environment: | |||||
| Last Closed: | 2019-11-06 07:12:59 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1707192 | ||||||
| Attachments: |
|
||||||
Created attachment 1536326 [details]
related debug log and seabios log, screendump
(Yanan, next time, please compress the tarball before attaching it. The
"xz -9e" command can save 96.89% of the required storage in this case.
We should be frugal with Bugzilla storage.)
Reviewing the file
ovmf-black-screen/1-system_reset-balck_screen/seabios-avocado-vt-vm1.log
suggests a QEMU bug, namely in the "cfi.pflash01" device model.
(Assuming that the QEMU command line *never* changed between the 21
boots that are captured in the avocado log file mentioned above. If the
QEMU command line did change, then the log is unusable.)
Consider:
$ grep -h -o 'QEMU Flash: Attempting flash detection at .*$' xx* \
| sort \
| uniq -c
3 QEMU Flash: Attempting flash detection at FFC00000
18 QEMU Flash: Attempting flash detection at FFC00010
In three boots out of 21 (namely in #9, #13 and #14), the flash
detection is attempted at a wrong address.
Here's the QemuFlashDetected() function from
"OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlash.c", which first
determines the probe address, and then probes for ROM / RAM / pflash at
that address:
> STATIC
> BOOLEAN
> QemuFlashDetected (
> VOID
> )
> {
> BOOLEAN FlashDetected;
> volatile UINT8 *Ptr;
>
> UINTN Offset;
> UINT8 OriginalUint8;
> UINT8 ProbeUint8;
>
> FlashDetected = FALSE;
> Ptr = QemuFlashPtr (0, 0);
>
> for (Offset = 0; Offset < mFdBlockSize; Offset++) {
> Ptr = QemuFlashPtr (0, Offset);
> ProbeUint8 = *Ptr;
> if (ProbeUint8 != CLEAR_STATUS_CMD &&
> ProbeUint8 != READ_STATUS_CMD &&
> ProbeUint8 != CLEARED_ARRAY_STATUS) {
> break;
> }
> }
>
> if (Offset >= mFdBlockSize) {
> DEBUG ((EFI_D_INFO, "QEMU Flash: Failed to find probe location\n"));
> return FALSE;
> }
>
> DEBUG ((EFI_D_INFO, "QEMU Flash: Attempting flash detection at %p\n", Ptr));
>
> OriginalUint8 = *Ptr;
> *Ptr = CLEAR_STATUS_CMD;
> ProbeUint8 = *Ptr;
> if (OriginalUint8 != CLEAR_STATUS_CMD &&
> ProbeUint8 == CLEAR_STATUS_CMD) {
> DEBUG ((EFI_D_INFO, "QemuFlashDetected => FD behaves as RAM\n"));
> *Ptr = OriginalUint8;
> } else {
> *Ptr = READ_STATUS_CMD;
> ProbeUint8 = *Ptr;
> if (ProbeUint8 == OriginalUint8) {
> DEBUG ((EFI_D_INFO, "QemuFlashDetected => FD behaves as ROM\n"));
> } else if (ProbeUint8 == READ_STATUS_CMD) {
> DEBUG ((EFI_D_INFO, "QemuFlashDetected => FD behaves as RAM\n"));
> *Ptr = OriginalUint8;
> } else if (ProbeUint8 == CLEARED_ARRAY_STATUS) {
> DEBUG ((EFI_D_INFO, "QemuFlashDetected => FD behaves as FLASH\n"));
> FlashDetected = TRUE;
> *Ptr = READ_ARRAY_CMD;
> }
> }
>
> DEBUG ((EFI_D_INFO, "QemuFlashDetected => %a\n",
> FlashDetected ? "Yes" : "No"));
> return FlashDetected;
> }
Quite obviously, the first loop, which determines the probe address,
should *always* produce the same address. The logs prove that this is
not the case, implying that the "cfi.pflash01" reset logic in QEMU is
incomplete.
Presumably, if "system_reset" is injected via the monitor at the wrong
time (possibly while the firmware is writing, in SMM, a non-volatile
UEFI variable in pflash), the pflash chip remains in a bad state well
into the next boot.
More analysis of the boot logs that support this below. (Again, this is
assuming that the QEMU configuration / cmdline never changed in the
middle.)
* Boots #01 through #08, and #10, #11, and #12, determine the right
probe address (FFC00010), and the probe successfully determines
pflash.
> QEMU Flash: Attempting flash detection at FFC00010
> QemuFlashDetected => FD behaves as FLASH
> QemuFlashDetected => Yes
> Installing QEMU flash SMM FVB
* Boot #09 determines a wrong probe address (FFC00000), but (strangely
enough) it "succeeds" in identifying pflash:
> QEMU Flash: Attempting flash detection at FFC00000
> QemuFlashDetected => FD behaves as FLASH
> QemuFlashDetected => Yes
> Installing QEMU flash SMM FVB
* Boot #13 determines the wrong probe address (FFC00000), which -- not
surprisingly -- fails the probe itself as well:
> QEMU Flash: Attempting flash detection at FFC00000
> QemuFlashDetected => FD behaves as ROM
> QemuFlashDetected => No
> ASSERT
> /builddir/build/BUILD/edk2-ee3198e672e2/OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlash.c(257):
> !_gPcd_FixedAtBuild_PcdSmmSmramRequire
* Boot #14 is similar, except OVMF's flash driver is led to believe the
board has RAM (not ROM) at FFC00000:
> QEMU Flash: Attempting flash detection at FFC00000
> QemuFlashDetected => FD behaves as RAM
> QemuFlashDetected => No
> ASSERT
> /builddir/build/BUILD/edk2-ee3198e672e2/OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlash.c(257):
> !_gPcd_FixedAtBuild_PcdSmmSmramRequire
* Boot #15 appears to detect flash, and at the right address; however,
the internal checksum of the firmware volume header is incorrect:
> QEMU Flash: Attempting flash detection at FFC00010
> QemuFlashDetected => FD behaves as FLASH
> QemuFlashDetected => Yes
> FV@FFC00000 Checksum is 0xB8AF, expected 0xB85F
> Variable FV header is not valid. It will be reinitialized.
> Installing QEMU flash SMM FVB
The fact that boot #15 then goes on to crash with
> Variable Store header is corrupted
>
> ASSERT_EFI_ERROR (Status = Volume Corrupt)
> ASSERT
> /builddir/build/BUILD/edk2-ee3198e672e2/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableSmm.c(936):
> !EFI_ERROR (Status)
is basically irrelevant / devoid of information, after the checksum
error message makes it evident that pflash isn't working correctly.
* The rest of the boots (#16 through #21) *seem* to detect flash
alright:
> QEMU Flash: Attempting flash detection at FFC00010
> QemuFlashDetected => FD behaves as FLASH
> QemuFlashDetected => Yes
> Installing QEMU flash SMM FVB
and then crash "out of the blue" with:
> Variable Store header is corrupted
>
> ASSERT_EFI_ERROR (Status = Volume Corrupt)
> ASSERT
> /builddir/build/BUILD/edk2-ee3198e672e2/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableSmm.c(936):
> !EFI_ERROR (Status)
However, given the previous boot results, I simply cannot trust the
seemingly successful flash detections here. They certainly happened by
chance only.
Moving the BZ to QEMU for further investigation.
... In fact, the "cfi.pflash01" device model doesn't seem to have any
reset handler at all!
(The only asynchronous callback that seems *remotely* relevant for
resetting the pflash chip is pflash_timer(). However, the timer whose
expiry would call pflash_timer() is never actually *armed*. It is
basically dead code.)
IOW, a reset handler for pflash is needed in QEMU. It should probably
do:
trace_pflash_reset();
memory_region_rom_device_set_romd(&pfl->mem, true);
pfl->wcycle = 0;
pfl->cmd = 0;
pfl->status = 0x0;
In addition, we should likely kill pflash_timer(), which is dead code.
*** Bug 1704584 has been marked as a duplicate of this bug. *** Series posted upstream: https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg00962.html Hit the same issue with edk2-ovmf-20190308git89910a39dcfd-2.el8.noarch and qemu-kvm-4.0.0-2.module+el8.1.0+3258+4c45705b.x86_64. But ovmf log is different. This is part of log. .... 2019-05-26 08:25:23: Loading SMM driver at 0x0007FEC4000 EntryPoint=0x0007FED3361 VariableSmm.efi 2019-05-26 08:25:23: mSmmMemLibInternalMaximumSupportAddress = 0xFFFFFFFFF 2019-05-26 08:25:23: VarCheckLibRegisterSetVariableCheckHandler - 0x7FED16FC Success 2019-05-26 08:25:23: ASSERT /builddir/build/BUILD/edk2-89910a39dcfd/MdeModulePkg/Universal/Variable/RuntimeDxe/Variable.c(3870): VariableStore->Size == VariableStoreLength 2019-05-26 08:38:33: (Process terminated with status 0) (In reply to FuXiangChun from comment #8) > Hit the same issue with edk2-ovmf-20190308git89910a39dcfd-2.el8.noarch and > qemu-kvm-4.0.0-2.module+el8.1.0+3258+4c45705b.x86_64. But ovmf log is > different. This is part of log. > > .... > 2019-05-26 08:25:23: Loading SMM driver at 0x0007FEC4000 > EntryPoint=0x0007FED3361 VariableSmm.efi > 2019-05-26 08:25:23: mSmmMemLibInternalMaximumSupportAddress = 0xFFFFFFFFF > 2019-05-26 08:25:23: VarCheckLibRegisterSetVariableCheckHandler - 0x7FED16FC > Success > 2019-05-26 08:25:23: ASSERT > /builddir/build/BUILD/edk2-89910a39dcfd/MdeModulePkg/Universal/Variable/ > RuntimeDxe/Variable.c(3870): VariableStore->Size == VariableStoreLength > 2019-05-26 08:38:33: (Process terminated with status 0) Yes, this seems the same issue. How many times did you reset your VM? Do you have a reproducer handy? We should be able to have a 100% stable reproducer by resetting the machine in specific time frames (in the middle of a OVMF write access to the pflash device). Keeping record of the versions submitted to qemu-devel: v1: https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg00962.html v2: https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg00395.html v3: https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg01668.html v4: https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg02785.html v5: https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg03366.html v6: https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg03929.html v7: https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg04294.html @QA: QA_ACK, please? Fixed upstream in qemu-4.1 Hit this issue by automation case "system_reset_during_boot" in avocado, so retested with avocado. Versions: Host: kernel-4.18.0-131.el8.x86_64 qemu-kvm-4.1.0-3.module+el8.1.0+4019+dbfaedf9 edk2-ovmf-20190308git89910a39dcfd-6.el8.noarch Guest: rhel8.0.0 windows10 Both guest rhel8.0.0 and guest windows10 are all work well after 40 times system_reset. So set status to VERIFIED. cmd lines: # python3 ConfigTest.py --testcase=system_reset_during_boot --imageformat=qcow2 --guestname=RHEL.8.0.0,Win10 --driveformat=virtio_scsi --nicmodel=virtio_net --machines=q35 --clone=yes --firmware=ovmf --platform=x86_64 --nrepeat=2 (1/7) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: PASS (1242.26 s) (2/7) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.rh_kernel_update.q35: PASS (190.16 s) (3/7) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: PASS (2134.97 s) (4/7) repeat1.Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.system_reset_during_boot.q35: PASS (684.81 s) (5/7) repeat1.Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.system_reset_during_boot.q35: PASS (1373.77 s) (6/7) repeat2.Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.system_reset_during_boot.q35: PASS (773.31 s) (7/7) repeat2.Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.system_reset_during_boot.q35: PASS (1063.44 s) boot guest after system_reset: (1/2) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.boot.q35: PASS (68.22 s) (2/2) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.boot.q35: PASS (155.04 s) Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3723 |
Description of problem: Boot a VM (RHEL7.6, Win10) with ovmf, keep doing "system_reset" during guest reboot every few seconds. After several times, guest will get black screen sometimes. If shut down and then boot the vm again, screen show: "Guest has not initialized the display(yet)" Version-Release number of selected component (if applicable): kernel: kernel-4.18.0-67.el8.x86_64 qemu: qemu-kvm-3.1.0-15.module+el8+2792+e33e01a0.x86_64 edk2-ovmf: edk2-ovmf-20180508gitee3198e672e2-9.el8.noarch How reproducible: 1/50 Steps to Reproduce: 1. Boot a vm with ovmf 2. Keep doing "system_reset" during guest reboot every few seconds 3. Hit black screen 4. shutdown the vm then boot it again, hit "Guest has not initialized the display(yet)" Actual results: guest can not boot up normally Expected results: guest should boot up normally and work well. Additional info: 1. Both linux(RHEL7.6) and windows(Win10) guest can hit this problem 2. qemu command line: MALLOC_PERTURB_=1 /usr/libexec/qemu-kvm \ -S \ -name 'avocado-vt-vm1' \ -machine q35 \ -nodefaults \ -device qxl-vga,bus=pcie.0,addr=0x1 \ -device pcie-root-port,id=pcie_root_port_0,slot=2,chassis=2,addr=0x2,bus=pcie.0 \ -device pcie-root-port,id=pcie_root_port_1,slot=3,chassis=3,addr=0x3,bus=pcie.0 \ -device pcie-root-port,id=pcie_root_port_2,slot=4,chassis=4,addr=0x4,bus=pcie.0 \ -device intel-hda,bus=pcie.0,addr=0x5 \ -device hda-duplex \ -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_e7tf57io/monitor-qmpmonitor1-20190214-153256-aQHs84JQ,server,nowait \ -mon chardev=qmp_id_qmpmonitor1,mode=control \ -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_e7tf57io/monitor-catch_monitor-20190214-153256-aQHs84JQ,server,nowait \ -mon chardev=qmp_id_catch_monitor,mode=control \ -device pvpanic,ioport=0x505,id=idZS2mRy \ -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_e7tf57io/serial-serial0-20190214-153256-aQHs84JQ,server,nowait \ -device isa-serial,chardev=serial_id_serial0 \ -device pcie-root-port,id=pcie.0-root-port-6,slot=6,chassis=6,addr=0x6,bus=pcie.0 \ -device virtio-serial-pci,id=virtio_serial_pci0,bus=pcie.0-root-port-6,addr=0x0 \ -chardev socket,id=idSts8g6,nowait,path=/var/tmp/avocado_e7tf57io/virtio_port-vs-20190214-153256-aQHs84JQ,server \ -device virtserialport,id=idTZJ3HL,name=vs,bus=virtio_serial_pci0.0,chardev=idSts8g6 \ -object rng-random,filename=/dev/random,id=passthrough-A4WgoRm1 \ -device pcie-root-port,id=pcie.0-root-port-7,slot=7,chassis=7,addr=0x7,bus=pcie.0 \ -device virtio-rng-pci,id=virtio-rng-pci-kmqRRWUS,rng=passthrough-A4WgoRm1,bus=pcie.0-root-port-7,addr=0x0 \ -chardev socket,id=seabioslog_id_20190214-153256-aQHs84JQ,path=/var/tmp/avocado_e7tf57io/seabios-20190214-153256-aQHs84JQ,server,nowait \ -device isa-debugcon,chardev=seabioslog_id_20190214-153256-aQHs84JQ,iobase=0x402 \ -device ich9-usb-ehci1,id=usb1,addr=0x1d.7,multifunction=on,bus=pcie.0 \ -device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0,firstport=0,bus=pcie.0 \ -device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.2,firstport=2,bus=pcie.0 \ -device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.4,firstport=4,bus=pcie.0 \ -device pcie-root-port,id=pcie.0-root-port-8,slot=8,chassis=8,addr=0x8,bus=pcie.0 \ -device qemu-xhci,id=usb2,bus=pcie.0-root-port-8,addr=0x0 \ -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel76-64-virtio.qcow2 \ -device pcie-root-port,id=pcie.0-root-port-9,slot=9,chassis=9,addr=0x9,bus=pcie.0 \ -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pcie.0-root-port-9,addr=0x0 \ -device pcie-root-port,id=pcie.0-root-port-10,slot=10,chassis=10,addr=0xa,bus=pcie.0 \ -device virtio-net-pci,mac=9a:a5:a6:a7:a8:a9,id=id8x4rHt,vectors=4,netdev=idGronRn,bus=pcie.0-root-port-10,addr=0x0 \ -netdev tap,id=idGronRn,vhost=on,vhostfd=22,fd=9 \ -m 14336 \ -smp 10,maxcpus=10,cores=5,threads=1,sockets=2 \ -cpu 'Broadwell',+kvm_pv_unhalt \ -device usb-tablet,id=usb-tablet1,bus=usb2.0,port=1 \ -spice port=3000,password=123456,addr=0,tls-port=3200,x509-dir=/tmp/spice_x509d,tls-channel=main,tls-channel=inputs,image-compression=auto_glz,zlib-glz-wan-compression=auto,streaming-video=all,agent-mouse=on,playback-compression=on,ipv4 \ -rtc base=utc,clock=host,driftfix=slew \ -boot order=cdn,once=c,menu=off,strict=off \ -no-hpet \ -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.secboot.fd \ -drive if=pflash,format=raw,file=/home/kvm_autotest_root/images/rhel76-64-virtio.qcow2.fd \ -enable-kvm \ -watchdog i6300esb \ -watchdog-action reset \ -device pcie-root-port,id=pcie.0-root-port-11,slot=11,chassis=11,addr=0xb,bus=pcie.0 \ 3. When hit this problem, bios log show: 2019-02-14 15:38:15: ASSERT_EFI_ERROR (Status = Volume Corrupt) 2019-02-14 15:38:15: ASSERT /builddir/build/BUILD/edk2-ee3198e672e2/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableSmm.c(936): !EFI_ERROR (Status) 4. Full BIOS log will be added to the attachment, please refer.