Description of problem:
Boot a VM (RHEL7.6, Win10) with OVMF and keep issuing "system_reset" every few seconds during guest reboot. After several iterations, the guest sometimes ends up with a black screen. If the VM is shut down and booted again, the screen shows: "Guest has not initialized the display (yet)".

Version-Release number of selected component (if applicable):
kernel: kernel-4.18.0-67.el8.x86_64
qemu: qemu-kvm-3.1.0-15.module+el8+2792+e33e01a0.x86_64
edk2-ovmf: edk2-ovmf-20180508gitee3198e672e2-9.el8.noarch

How reproducible:
1/50

Steps to Reproduce:
1. Boot a VM with OVMF.
2. Keep issuing "system_reset" via the monitor every few seconds during guest reboot (see the QMP sketch at the end of this comment).
3. Hit the black screen.
4. Shut down the VM and boot it again; hit "Guest has not initialized the display (yet)".

Actual results:
The guest cannot boot up normally.

Expected results:
The guest should boot up normally and work well.

Additional info:
1. Both Linux (RHEL7.6) and Windows (Win10) guests can hit this problem.
2. QEMU command line:

MALLOC_PERTURB_=1 /usr/libexec/qemu-kvm \
-S \
-name 'avocado-vt-vm1' \
-machine q35 \
-nodefaults \
-device qxl-vga,bus=pcie.0,addr=0x1 \
-device pcie-root-port,id=pcie_root_port_0,slot=2,chassis=2,addr=0x2,bus=pcie.0 \
-device pcie-root-port,id=pcie_root_port_1,slot=3,chassis=3,addr=0x3,bus=pcie.0 \
-device pcie-root-port,id=pcie_root_port_2,slot=4,chassis=4,addr=0x4,bus=pcie.0 \
-device intel-hda,bus=pcie.0,addr=0x5 \
-device hda-duplex \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_e7tf57io/monitor-qmpmonitor1-20190214-153256-aQHs84JQ,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_e7tf57io/monitor-catch_monitor-20190214-153256-aQHs84JQ,server,nowait \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pvpanic,ioport=0x505,id=idZS2mRy \
-chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_e7tf57io/serial-serial0-20190214-153256-aQHs84JQ,server,nowait \
-device isa-serial,chardev=serial_id_serial0 \
-device pcie-root-port,id=pcie.0-root-port-6,slot=6,chassis=6,addr=0x6,bus=pcie.0 \
-device virtio-serial-pci,id=virtio_serial_pci0,bus=pcie.0-root-port-6,addr=0x0 \
-chardev socket,id=idSts8g6,nowait,path=/var/tmp/avocado_e7tf57io/virtio_port-vs-20190214-153256-aQHs84JQ,server \
-device virtserialport,id=idTZJ3HL,name=vs,bus=virtio_serial_pci0.0,chardev=idSts8g6 \
-object rng-random,filename=/dev/random,id=passthrough-A4WgoRm1 \
-device pcie-root-port,id=pcie.0-root-port-7,slot=7,chassis=7,addr=0x7,bus=pcie.0 \
-device virtio-rng-pci,id=virtio-rng-pci-kmqRRWUS,rng=passthrough-A4WgoRm1,bus=pcie.0-root-port-7,addr=0x0 \
-chardev socket,id=seabioslog_id_20190214-153256-aQHs84JQ,path=/var/tmp/avocado_e7tf57io/seabios-20190214-153256-aQHs84JQ,server,nowait \
-device isa-debugcon,chardev=seabioslog_id_20190214-153256-aQHs84JQ,iobase=0x402 \
-device ich9-usb-ehci1,id=usb1,addr=0x1d.7,multifunction=on,bus=pcie.0 \
-device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0,firstport=0,bus=pcie.0 \
-device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.2,firstport=2,bus=pcie.0 \
-device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.4,firstport=4,bus=pcie.0 \
-device pcie-root-port,id=pcie.0-root-port-8,slot=8,chassis=8,addr=0x8,bus=pcie.0 \
-device qemu-xhci,id=usb2,bus=pcie.0-root-port-8,addr=0x0 \
-drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel76-64-virtio.qcow2 \
-device pcie-root-port,id=pcie.0-root-port-9,slot=9,chassis=9,addr=0x9,bus=pcie.0 \
-device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pcie.0-root-port-9,addr=0x0 \
-device pcie-root-port,id=pcie.0-root-port-10,slot=10,chassis=10,addr=0xa,bus=pcie.0 \
-device virtio-net-pci,mac=9a:a5:a6:a7:a8:a9,id=id8x4rHt,vectors=4,netdev=idGronRn,bus=pcie.0-root-port-10,addr=0x0 \
-netdev tap,id=idGronRn,vhost=on,vhostfd=22,fd=9 \
-m 14336 \
-smp 10,maxcpus=10,cores=5,threads=1,sockets=2 \
-cpu 'Broadwell',+kvm_pv_unhalt \
-device usb-tablet,id=usb-tablet1,bus=usb2.0,port=1 \
-spice port=3000,password=123456,addr=0,tls-port=3200,x509-dir=/tmp/spice_x509d,tls-channel=main,tls-channel=inputs,image-compression=auto_glz,zlib-glz-wan-compression=auto,streaming-video=all,agent-mouse=on,playback-compression=on,ipv4 \
-rtc base=utc,clock=host,driftfix=slew \
-boot order=cdn,once=c,menu=off,strict=off \
-no-hpet \
-drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.secboot.fd \
-drive if=pflash,format=raw,file=/home/kvm_autotest_root/images/rhel76-64-virtio.qcow2.fd \
-enable-kvm \
-watchdog i6300esb \
-watchdog-action reset \
-device pcie-root-port,id=pcie.0-root-port-11,slot=11,chassis=11,addr=0xb,bus=pcie.0 \

3. When hitting this problem, the BIOS log shows:

2019-02-14 15:38:15: ASSERT_EFI_ERROR (Status = Volume Corrupt)
2019-02-14 15:38:15: ASSERT /builddir/build/BUILD/edk2-ee3198e672e2/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableSmm.c(936): !EFI_ERROR (Status)

4. The full BIOS log will be added as an attachment; please refer to it.
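For reference, here is a minimal reproducer sketch (not part of the original test framework) that drives step 2 above through the QMP monitor socket created by the "-chardev socket,id=qmp_id_qmpmonitor1,..." option: it connects to the socket path given on its command line, negotiates QMP capabilities, and injects "system_reset" every few seconds. The 5-second interval and the 50-iteration count are illustrative assumptions; in the actual runs this interaction is performed by avocado itself.

    /* Hypothetical reproducer sketch: inject "system_reset" via QMP in a loop. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    static void send_cmd(int fd, const char *json)
    {
        char buf[4096];
        ssize_t n;

        if (write(fd, json, strlen(json)) < 0) {
            perror("write");
            return;
        }
        /* Drain whatever QEMU answers so the socket buffer never fills up. */
        n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("QMP: %s", buf);
        }
    }

    int main(int argc, char **argv)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        char greeting[4096];
        int fd, i;

        if (argc < 2) {
            fprintf(stderr, "usage: %s /path/to/qmp-monitor-socket\n", argv[0]);
            return 1;
        }
        strncpy(addr.sun_path, argv[1], sizeof(addr.sun_path) - 1);

        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }
        (void)read(fd, greeting, sizeof(greeting));        /* QMP greeting banner */
        send_cmd(fd, "{\"execute\":\"qmp_capabilities\"}\n");

        for (i = 0; i < 50; i++) {                          /* assumed iteration count */
            sleep(5);                                       /* "every few seconds"     */
            send_cmd(fd, "{\"execute\":\"system_reset\"}\n");
        }
        close(fd);
        return 0;
    }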
Created attachment 1536326 [details] related debug log and seabios log, screendump
(Yanan, next time, please compress the tarball before attaching it. The "xz -9e" command can save 96.89% of the required storage in this case. We should be frugal with Bugzilla storage.)

Reviewing the file

  ovmf-black-screen/1-system_reset-balck_screen/seabios-avocado-vt-vm1.log

suggests a QEMU bug, namely in the "cfi.pflash01" device model. (Assuming that the QEMU command line *never* changed between the 21 boots that are captured in the avocado log file mentioned above. If the QEMU command line did change, then the log is unusable.)

Consider:

$ grep -h -o 'QEMU Flash: Attempting flash detection at .*$' xx* \
  | sort \
  | uniq -c

      3 QEMU Flash: Attempting flash detection at FFC00000
     18 QEMU Flash: Attempting flash detection at FFC00010

In three boots out of 21 (namely in #9, #13 and #14), the flash detection is attempted at a wrong address.

Here's the QemuFlashDetected() function from "OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlash.c", which first determines the probe address, and then probes for ROM / RAM / pflash at that address:

> STATIC
> BOOLEAN
> QemuFlashDetected (
>   VOID
>   )
> {
>   BOOLEAN  FlashDetected;
>   volatile UINT8  *Ptr;
>
>   UINTN  Offset;
>   UINT8  OriginalUint8;
>   UINT8  ProbeUint8;
>
>   FlashDetected = FALSE;
>   Ptr = QemuFlashPtr (0, 0);
>
>   for (Offset = 0; Offset < mFdBlockSize; Offset++) {
>     Ptr = QemuFlashPtr (0, Offset);
>     ProbeUint8 = *Ptr;
>     if (ProbeUint8 != CLEAR_STATUS_CMD &&
>         ProbeUint8 != READ_STATUS_CMD &&
>         ProbeUint8 != CLEARED_ARRAY_STATUS) {
>       break;
>     }
>   }
>
>   if (Offset >= mFdBlockSize) {
>     DEBUG ((EFI_D_INFO, "QEMU Flash: Failed to find probe location\n"));
>     return FALSE;
>   }
>
>   DEBUG ((EFI_D_INFO, "QEMU Flash: Attempting flash detection at %p\n", Ptr));
>
>   OriginalUint8 = *Ptr;
>   *Ptr = CLEAR_STATUS_CMD;
>   ProbeUint8 = *Ptr;
>   if (OriginalUint8 != CLEAR_STATUS_CMD &&
>       ProbeUint8 == CLEAR_STATUS_CMD) {
>     DEBUG ((EFI_D_INFO, "QemuFlashDetected => FD behaves as RAM\n"));
>     *Ptr = OriginalUint8;
>   } else {
>     *Ptr = READ_STATUS_CMD;
>     ProbeUint8 = *Ptr;
>     if (ProbeUint8 == OriginalUint8) {
>       DEBUG ((EFI_D_INFO, "QemuFlashDetected => FD behaves as ROM\n"));
>     } else if (ProbeUint8 == READ_STATUS_CMD) {
>       DEBUG ((EFI_D_INFO, "QemuFlashDetected => FD behaves as RAM\n"));
>       *Ptr = OriginalUint8;
>     } else if (ProbeUint8 == CLEARED_ARRAY_STATUS) {
>       DEBUG ((EFI_D_INFO, "QemuFlashDetected => FD behaves as FLASH\n"));
>       FlashDetected = TRUE;
>       *Ptr = READ_ARRAY_CMD;
>     }
>   }
>
>   DEBUG ((EFI_D_INFO, "QemuFlashDetected => %a\n",
>     FlashDetected ? "Yes" : "No"));
>   return FlashDetected;
> }

Quite obviously, the first loop, which determines the probe address, should *always* produce the same address. The logs prove that this is not the case, implying that the "cfi.pflash01" reset logic in QEMU is incomplete. Presumably, if "system_reset" is injected via the monitor at the wrong time (possibly while the firmware is writing, in SMM, a non-volatile UEFI variable in pflash), the pflash chip remains in a bad state well into the next boot.

More analysis of the boot logs that support this below. (Again, this is assuming that the QEMU configuration / cmdline never changed in the middle.)

* Boots #01 through #08, and #10, #11, and #12, determine the right probe address (FFC00010), and the probe successfully determines pflash:
> QEMU Flash: Attempting flash detection at FFC00010
> QemuFlashDetected => FD behaves as FLASH
> QemuFlashDetected => Yes
> Installing QEMU flash SMM FVB

* Boot #09 determines a wrong probe address (FFC00000), but (strangely enough) it "succeeds" in identifying pflash:

> QEMU Flash: Attempting flash detection at FFC00000
> QemuFlashDetected => FD behaves as FLASH
> QemuFlashDetected => Yes
> Installing QEMU flash SMM FVB

* Boot #13 determines the wrong probe address (FFC00000), which -- not surprisingly -- fails the probe itself as well:

> QEMU Flash: Attempting flash detection at FFC00000
> QemuFlashDetected => FD behaves as ROM
> QemuFlashDetected => No
> ASSERT
> /builddir/build/BUILD/edk2-ee3198e672e2/OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlash.c(257):
> !_gPcd_FixedAtBuild_PcdSmmSmramRequire

* Boot #14 is similar, except OVMF's flash driver is led to believe the board has RAM (not ROM) at FFC00000:

> QEMU Flash: Attempting flash detection at FFC00000
> QemuFlashDetected => FD behaves as RAM
> QemuFlashDetected => No
> ASSERT
> /builddir/build/BUILD/edk2-ee3198e672e2/OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlash.c(257):
> !_gPcd_FixedAtBuild_PcdSmmSmramRequire

* Boot #15 appears to detect flash, and at the right address; however, the internal checksum of the firmware volume header is incorrect:

> QEMU Flash: Attempting flash detection at FFC00010
> QemuFlashDetected => FD behaves as FLASH
> QemuFlashDetected => Yes
> FV@FFC00000 Checksum is 0xB8AF, expected 0xB85F
> Variable FV header is not valid. It will be reinitialized.
> Installing QEMU flash SMM FVB

The fact that boot #15 then goes on to crash with

> Variable Store header is corrupted
>
> ASSERT_EFI_ERROR (Status = Volume Corrupt)
> ASSERT
> /builddir/build/BUILD/edk2-ee3198e672e2/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableSmm.c(936):
> !EFI_ERROR (Status)

is basically irrelevant / devoid of information, after the checksum error message makes it evident that pflash isn't working correctly.

* The rest of the boots (#16 through #21) *seem* to detect flash alright:

> QEMU Flash: Attempting flash detection at FFC00010
> QemuFlashDetected => FD behaves as FLASH
> QemuFlashDetected => Yes
> Installing QEMU flash SMM FVB

and then crash "out of the blue" with:

> Variable Store header is corrupted
>
> ASSERT_EFI_ERROR (Status = Volume Corrupt)
> ASSERT
> /builddir/build/BUILD/edk2-ee3198e672e2/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableSmm.c(936):
> !EFI_ERROR (Status)

However, given the previous boot results, I simply cannot trust the seemingly successful flash detections here. They certainly happened by chance only.

Moving the BZ to QEMU for further investigation.

...

In fact, the "cfi.pflash01" device model doesn't seem to have any reset handler at all!

(The only asynchronous callback that seems *remotely* relevant for resetting the pflash chip is pflash_timer(). However, the timer whose expiry would call pflash_timer() is never actually *armed*. It is basically dead code.)

IOW, a reset handler for pflash is needed in QEMU. It should probably do:

  trace_pflash_reset();
  memory_region_rom_device_set_romd(&pfl->mem, true);
  pfl->wcycle = 0;
  pfl->cmd = 0;
  pfl->status = 0x0;

In addition, we should likely kill pflash_timer(), which is dead code.
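To make the suggestion above concrete, here is a minimal sketch of how such a handler might be wired up in hw/block/pflash_cfi01.c through the standard DeviceClass reset callback. The handler body is exactly the sequence proposed above; the PFlashCFI01 / PFLASH_CFI01 names and the class_init hook are assumptions about the surrounding QEMU code of that era, not the patch that was eventually merged (see the upstream series linked in later comments for the real thing).

    /*
     * Sketch only, not the committed fix: a system-reset handler for the
     * cfi.pflash01 device model, registered through DeviceClass::reset.
     * Lives inside hw/block/pflash_cfi01.c, after the usual includes.
     */
    static void pflash_cfi01_system_reset(DeviceState *dev)
    {
        PFlashCFI01 *pfl = PFLASH_CFI01(dev);

        trace_pflash_reset();
        /* Map the device back as ROMD, i.e. return to "read array" mode. */
        memory_region_rom_device_set_romd(&pfl->mem, true);
        /* Forget any half-finished write/erase command sequence. */
        pfl->wcycle = 0;
        pfl->cmd = 0;
        pfl->status = 0x0;
    }

    static void pflash_cfi01_class_init(ObjectClass *klass, void *data)
    {
        DeviceClass *dc = DEVICE_CLASS(klass);

        /* New: run the handler above on every system reset. */
        dc->reset = pflash_cfi01_system_reset;

        /* ... existing realize / props / vmsd assignments stay unchanged ... */
    }

With a handler like this registered, a "system_reset" injected in the middle of a pflash write cycle would put the chip back into read-array mode before the next OVMF boot starts probing it, which is the state QemuFlashDetected() expects.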
https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg05792.html
*** Bug 1704584 has been marked as a duplicate of this bug. ***
Series posted upstream: https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg00962.html
Hit the same issue with edk2-ovmf-20190308git89910a39dcfd-2.el8.noarch and qemu-kvm-4.0.0-2.module+el8.1.0+3258+4c45705b.x86_64, but the OVMF log is different. This is part of the log:

....
2019-05-26 08:25:23: Loading SMM driver at 0x0007FEC4000 EntryPoint=0x0007FED3361 VariableSmm.efi
2019-05-26 08:25:23: mSmmMemLibInternalMaximumSupportAddress = 0xFFFFFFFFF
2019-05-26 08:25:23: VarCheckLibRegisterSetVariableCheckHandler - 0x7FED16FC Success
2019-05-26 08:25:23: ASSERT /builddir/build/BUILD/edk2-89910a39dcfd/MdeModulePkg/Universal/Variable/RuntimeDxe/Variable.c(3870): VariableStore->Size == VariableStoreLength
2019-05-26 08:38:33: (Process terminated with status 0)
(In reply to FuXiangChun from comment #8)
> Hit the same issue with edk2-ovmf-20190308git89910a39dcfd-2.el8.noarch and
> qemu-kvm-4.0.0-2.module+el8.1.0+3258+4c45705b.x86_64. But ovmf log is
> different. This is part of log.
>
> ....
> 2019-05-26 08:25:23: Loading SMM driver at 0x0007FEC4000
> EntryPoint=0x0007FED3361 VariableSmm.efi
> 2019-05-26 08:25:23: mSmmMemLibInternalMaximumSupportAddress = 0xFFFFFFFFF
> 2019-05-26 08:25:23: VarCheckLibRegisterSetVariableCheckHandler - 0x7FED16FC
> Success
> 2019-05-26 08:25:23: ASSERT
> /builddir/build/BUILD/edk2-89910a39dcfd/MdeModulePkg/Universal/Variable/
> RuntimeDxe/Variable.c(3870): VariableStore->Size == VariableStoreLength
> 2019-05-26 08:38:33: (Process terminated with status 0)

Yes, this seems to be the same issue. How many times did you reset your VM? Do you have a reproducer handy?

We should be able to have a 100% stable reproducer by resetting the machine in specific time frames (in the middle of an OVMF write access to the pflash device).
Keeping record of the versions submitted to qemu-devel:

v1: https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg00962.html
v2: https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg00395.html
v3: https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg01668.html
v4: https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg02785.html
v5: https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg03366.html
v6: https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg03929.html
v7: https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg04294.html
@QA: QA_ACK, please?
Fixed upstream in qemu-4.1
Hit this issue via the automation case "system_reset_during_boot" in avocado, so retested with avocado.

Versions:
Host:
kernel-4.18.0-131.el8.x86_64
qemu-kvm-4.1.0-3.module+el8.1.0+4019+dbfaedf9
edk2-ovmf-20190308git89910a39dcfd-6.el8.noarch
Guest:
rhel8.0.0
windows10

Both the RHEL 8.0.0 guest and the Windows 10 guest work well after 40 rounds of system_reset, so setting the status to VERIFIED.

cmd lines:
# python3 ConfigTest.py --testcase=system_reset_during_boot --imageformat=qcow2 --guestname=RHEL.8.0.0,Win10 --driveformat=virtio_scsi --nicmodel=virtio_net --machines=q35 --clone=yes --firmware=ovmf --platform=x86_64 --nrepeat=2

(1/7) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: PASS (1242.26 s)
(2/7) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.rh_kernel_update.q35: PASS (190.16 s)
(3/7) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: PASS (2134.97 s)
(4/7) repeat1.Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.system_reset_during_boot.q35: PASS (684.81 s)
(5/7) repeat1.Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.system_reset_during_boot.q35: PASS (1373.77 s)
(6/7) repeat2.Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.system_reset_during_boot.q35: PASS (773.31 s)
(7/7) repeat2.Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.system_reset_during_boot.q35: PASS (1063.44 s)

Boot guest after system_reset:
(1/2) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.boot.q35: PASS (68.22 s)
(2/2) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.boot.q35: PASS (155.04 s)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3723