Bug 1678713 - implement missing reset handler for cfi.pflash01 - fast train
Summary: implement missing reset handler for cfi.pflash01 - fast train
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.0
Hardware: x86_64
OS: Unspecified
medium
medium
Target Milestone: rc
: 8.0
Assignee: Philippe Mathieu-Daudé
QA Contact: Xueqiang Wei
URL:
Whiteboard:
: 1704584 (view as bug list)
Depends On:
Blocks: 1707192
TreeView+ depends on / blocked
 
Reported: 2019-02-19 13:03 UTC by Yanan Fu
Modified: 2019-11-06 07:13 UTC (History)
14 users (show)

Fixed In Version: qemu-kvm-4.1.0-3.module+el8.1.0+4019+dbfaedf9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1707192 (view as bug list)
Environment:
Last Closed: 2019-11-06 07:12:59 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
related debug log and seabios log, screendump (16.82 MB, application/x-tar)
2019-02-19 13:12 UTC, Yanan Fu
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:3723 0 None None None 2019-11-06 07:13:39 UTC

Description Yanan Fu 2019-02-19 13:03:32 UTC
Description of problem:
Boot a VM (RHEL7.6, Win10) with ovmf, keep doing "system_reset" during guest reboot every few seconds. After several times, guest will get black screen sometimes.
If shut down and then boot the vm again, screen show: "Guest has not initialized the display(yet)"


Version-Release number of selected component (if applicable):
kernel: kernel-4.18.0-67.el8.x86_64
qemu: qemu-kvm-3.1.0-15.module+el8+2792+e33e01a0.x86_64
edk2-ovmf: edk2-ovmf-20180508gitee3198e672e2-9.el8.noarch


How reproducible:
1/50

Steps to Reproduce:
1. Boot a vm with ovmf
2. Keep doing "system_reset" during guest reboot every few seconds
3. Hit black screen
4. shutdown the vm then boot it again, hit "Guest has not initialized the display(yet)"


Actual results:
guest can not boot up normally

Expected results:
guest should boot up normally and work well.

Additional info:
1. Both linux(RHEL7.6) and windows(Win10) guest can hit this problem
2. qemu command line:
MALLOC_PERTURB_=1  /usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1' \
    -machine q35  \
    -nodefaults \
    -device qxl-vga,bus=pcie.0,addr=0x1 \
    -device pcie-root-port,id=pcie_root_port_0,slot=2,chassis=2,addr=0x2,bus=pcie.0 \
    -device pcie-root-port,id=pcie_root_port_1,slot=3,chassis=3,addr=0x3,bus=pcie.0 \
    -device pcie-root-port,id=pcie_root_port_2,slot=4,chassis=4,addr=0x4,bus=pcie.0 \
    -device intel-hda,bus=pcie.0,addr=0x5 \
    -device hda-duplex  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_e7tf57io/monitor-qmpmonitor1-20190214-153256-aQHs84JQ,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_e7tf57io/monitor-catch_monitor-20190214-153256-aQHs84JQ,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idZS2mRy  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_e7tf57io/serial-serial0-20190214-153256-aQHs84JQ,server,nowait \
    -device isa-serial,chardev=serial_id_serial0 \
    -device pcie-root-port,id=pcie.0-root-port-6,slot=6,chassis=6,addr=0x6,bus=pcie.0 \
    -device virtio-serial-pci,id=virtio_serial_pci0,bus=pcie.0-root-port-6,addr=0x0 \
    -chardev socket,id=idSts8g6,nowait,path=/var/tmp/avocado_e7tf57io/virtio_port-vs-20190214-153256-aQHs84JQ,server \
    -device virtserialport,id=idTZJ3HL,name=vs,bus=virtio_serial_pci0.0,chardev=idSts8g6 \
    -object rng-random,filename=/dev/random,id=passthrough-A4WgoRm1 \
    -device pcie-root-port,id=pcie.0-root-port-7,slot=7,chassis=7,addr=0x7,bus=pcie.0 \
    -device virtio-rng-pci,id=virtio-rng-pci-kmqRRWUS,rng=passthrough-A4WgoRm1,bus=pcie.0-root-port-7,addr=0x0  \
    -chardev socket,id=seabioslog_id_20190214-153256-aQHs84JQ,path=/var/tmp/avocado_e7tf57io/seabios-20190214-153256-aQHs84JQ,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20190214-153256-aQHs84JQ,iobase=0x402 \
    -device ich9-usb-ehci1,id=usb1,addr=0x1d.7,multifunction=on,bus=pcie.0 \
    -device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0,firstport=0,bus=pcie.0 \
    -device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.2,firstport=2,bus=pcie.0 \
    -device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.4,firstport=4,bus=pcie.0 \
    -device pcie-root-port,id=pcie.0-root-port-8,slot=8,chassis=8,addr=0x8,bus=pcie.0 \
    -device qemu-xhci,id=usb2,bus=pcie.0-root-port-8,addr=0x0 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel76-64-virtio.qcow2 \
    -device pcie-root-port,id=pcie.0-root-port-9,slot=9,chassis=9,addr=0x9,bus=pcie.0 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pcie.0-root-port-9,addr=0x0 \
    -device pcie-root-port,id=pcie.0-root-port-10,slot=10,chassis=10,addr=0xa,bus=pcie.0 \
    -device virtio-net-pci,mac=9a:a5:a6:a7:a8:a9,id=id8x4rHt,vectors=4,netdev=idGronRn,bus=pcie.0-root-port-10,addr=0x0  \
    -netdev tap,id=idGronRn,vhost=on,vhostfd=22,fd=9 \
    -m 14336  \
    -smp 10,maxcpus=10,cores=5,threads=1,sockets=2  \
    -cpu 'Broadwell',+kvm_pv_unhalt \
    -device usb-tablet,id=usb-tablet1,bus=usb2.0,port=1  \
    -spice port=3000,password=123456,addr=0,tls-port=3200,x509-dir=/tmp/spice_x509d,tls-channel=main,tls-channel=inputs,image-compression=auto_glz,zlib-glz-wan-compression=auto,streaming-video=all,agent-mouse=on,playback-compression=on,ipv4  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot order=cdn,once=c,menu=off,strict=off  \
    -no-hpet \
    -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.secboot.fd \
    -drive if=pflash,format=raw,file=/home/kvm_autotest_root/images/rhel76-64-virtio.qcow2.fd \
    -enable-kvm  \
    -watchdog i6300esb \
    -watchdog-action reset \
    -device pcie-root-port,id=pcie.0-root-port-11,slot=11,chassis=11,addr=0xb,bus=pcie.0 \


3. When hit this problem, bios log show:
2019-02-14 15:38:15: ASSERT_EFI_ERROR (Status = Volume Corrupt)
2019-02-14 15:38:15: ASSERT /builddir/build/BUILD/edk2-ee3198e672e2/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableSmm.c(936): !EFI_ERROR (Status)

4. Full BIOS log will be added to the attachment, please refer.

Comment 1 Yanan Fu 2019-02-19 13:12:07 UTC
Created attachment 1536326 [details]
related debug log and seabios log, screendump

Comment 2 Laszlo Ersek 2019-02-19 17:02:06 UTC
(Yanan, next time, please compress the tarball before attaching it. The
"xz -9e" command can save 96.89% of the required storage in this case.
We should be frugal with Bugzilla storage.)

Reviewing the file

  ovmf-black-screen/1-system_reset-balck_screen/seabios-avocado-vt-vm1.log

suggests a QEMU bug, namely in the "cfi.pflash01" device model.

(Assuming that the QEMU command line *never* changed between the 21
boots that are captured in the avocado log file mentioned above. If the
QEMU command line did change, then the log is unusable.)

Consider:

$ grep -h -o 'QEMU Flash: Attempting flash detection at .*$' xx* \
  | sort \
  | uniq -c
      3 QEMU Flash: Attempting flash detection at FFC00000
     18 QEMU Flash: Attempting flash detection at FFC00010

In three boots out of 21 (namely in #9, #13 and #14), the flash
detection is attempted at a wrong address.

Here's the QemuFlashDetected() function from
"OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlash.c", which first
determines the probe address, and then probes for ROM / RAM / pflash at
that address:

> STATIC
> BOOLEAN
> QemuFlashDetected (
>   VOID
>   )
> {
>   BOOLEAN  FlashDetected;
>   volatile UINT8  *Ptr;
>
>   UINTN Offset;
>   UINT8 OriginalUint8;
>   UINT8 ProbeUint8;
>
>   FlashDetected = FALSE;
>   Ptr = QemuFlashPtr (0, 0);
>
>   for (Offset = 0; Offset < mFdBlockSize; Offset++) {
>     Ptr = QemuFlashPtr (0, Offset);
>     ProbeUint8 = *Ptr;
>     if (ProbeUint8 != CLEAR_STATUS_CMD &&
>         ProbeUint8 != READ_STATUS_CMD &&
>         ProbeUint8 != CLEARED_ARRAY_STATUS) {
>       break;
>     }
>   }
>
>   if (Offset >= mFdBlockSize) {
>     DEBUG ((EFI_D_INFO, "QEMU Flash: Failed to find probe location\n"));
>     return FALSE;
>   }
>
>   DEBUG ((EFI_D_INFO, "QEMU Flash: Attempting flash detection at %p\n", Ptr));
>
>   OriginalUint8 = *Ptr;
>   *Ptr = CLEAR_STATUS_CMD;
>   ProbeUint8 = *Ptr;
>   if (OriginalUint8 != CLEAR_STATUS_CMD &&
>       ProbeUint8 == CLEAR_STATUS_CMD) {
>     DEBUG ((EFI_D_INFO, "QemuFlashDetected => FD behaves as RAM\n"));
>     *Ptr = OriginalUint8;
>   } else {
>     *Ptr = READ_STATUS_CMD;
>     ProbeUint8 = *Ptr;
>     if (ProbeUint8 == OriginalUint8) {
>       DEBUG ((EFI_D_INFO, "QemuFlashDetected => FD behaves as ROM\n"));
>     } else if (ProbeUint8 == READ_STATUS_CMD) {
>       DEBUG ((EFI_D_INFO, "QemuFlashDetected => FD behaves as RAM\n"));
>       *Ptr = OriginalUint8;
>     } else if (ProbeUint8 == CLEARED_ARRAY_STATUS) {
>       DEBUG ((EFI_D_INFO, "QemuFlashDetected => FD behaves as FLASH\n"));
>       FlashDetected = TRUE;
>       *Ptr = READ_ARRAY_CMD;
>     }
>   }
>
>   DEBUG ((EFI_D_INFO, "QemuFlashDetected => %a\n",
>                       FlashDetected ? "Yes" : "No"));
>   return FlashDetected;
> }

Quite obviously, the first loop, which determines the probe address,
should *always* produce the same address. The logs prove that this is
not the case, implying that the "cfi.pflash01" reset logic in QEMU is
incomplete.

Presumably, if "system_reset" is injected via the monitor at the wrong
time (possibly while the firmware is writing, in SMM, a non-volatile
UEFI variable in pflash), the pflash chip remains in a bad state well
into the next boot.

More analysis of the boot logs that support this below. (Again, this is
assuming that the QEMU configuration / cmdline never changed in the
middle.)

* Boots #01 through #08, and #10, #11, and #12, determine the right
  probe address (FFC00010), and the probe successfully determines
  pflash.

> QEMU Flash: Attempting flash detection at FFC00010
> QemuFlashDetected => FD behaves as FLASH
> QemuFlashDetected => Yes
> Installing QEMU flash SMM FVB

* Boot #09 determines a wrong probe address (FFC00000), but (strangely
  enough) it "succeeds" in identifying pflash:

> QEMU Flash: Attempting flash detection at FFC00000
> QemuFlashDetected => FD behaves as FLASH
> QemuFlashDetected => Yes
> Installing QEMU flash SMM FVB

* Boot #13 determines the wrong probe address (FFC00000), which -- not
  surprisingly -- fails the probe itself as well:

> QEMU Flash: Attempting flash detection at FFC00000
> QemuFlashDetected => FD behaves as ROM
> QemuFlashDetected => No
> ASSERT
> /builddir/build/BUILD/edk2-ee3198e672e2/OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlash.c(257):
> !_gPcd_FixedAtBuild_PcdSmmSmramRequire

* Boot #14 is similar, except OVMF's flash driver is led to believe the
  board has RAM (not ROM) at FFC00000:

> QEMU Flash: Attempting flash detection at FFC00000
> QemuFlashDetected => FD behaves as RAM
> QemuFlashDetected => No
> ASSERT
> /builddir/build/BUILD/edk2-ee3198e672e2/OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlash.c(257):
> !_gPcd_FixedAtBuild_PcdSmmSmramRequire

* Boot #15 appears to detect flash, and at the right address; however,
  the internal checksum of the firmware volume header is incorrect:

> QEMU Flash: Attempting flash detection at FFC00010
> QemuFlashDetected => FD behaves as FLASH
> QemuFlashDetected => Yes
> FV@FFC00000 Checksum is 0xB8AF, expected 0xB85F
> Variable FV header is not valid. It will be reinitialized.
> Installing QEMU flash SMM FVB

  The fact that boot #15 then goes on to crash with

> Variable Store header is corrupted
>
> ASSERT_EFI_ERROR (Status = Volume Corrupt)
> ASSERT
> /builddir/build/BUILD/edk2-ee3198e672e2/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableSmm.c(936):
> !EFI_ERROR (Status)

  is basically irrelevant / devoid of information, after the checksum
  error message makes it evident that pflash isn't working correctly.

* The rest of the boots (#16 through #21) *seem* to detect flash
  alright:

> QEMU Flash: Attempting flash detection at FFC00010
> QemuFlashDetected => FD behaves as FLASH
> QemuFlashDetected => Yes
> Installing QEMU flash SMM FVB

  and then crash "out of the blue" with:

> Variable Store header is corrupted
>
> ASSERT_EFI_ERROR (Status = Volume Corrupt)
> ASSERT
> /builddir/build/BUILD/edk2-ee3198e672e2/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableSmm.c(936):
> !EFI_ERROR (Status)

  However, given the previous boot results, I simply cannot trust the
  seemingly successful flash detections here. They certainly happened by
  chance only.

Moving the BZ to QEMU for further investigation.

... In fact, the "cfi.pflash01" device model doesn't seem to have any
reset handler at all!

(The only asynchronous callback that seems *remotely* relevant for
resetting the pflash chip is pflash_timer(). However, the timer whose
expiry would call pflash_timer() is never actually *armed*. It is
basically dead code.)

IOW, a reset handler for pflash is needed in QEMU. It should probably
do:

    trace_pflash_reset();
    memory_region_rom_device_set_romd(&pfl->mem, true);
    pfl->wcycle = 0;
    pfl->cmd = 0;
    pfl->status = 0x0;

In addition, we should likely kill pflash_timer(), which is dead code.

Comment 6 Philippe Mathieu-Daudé 2019-04-30 13:59:22 UTC
*** Bug 1704584 has been marked as a duplicate of this bug. ***

Comment 7 Philippe Mathieu-Daudé 2019-05-05 21:47:20 UTC
Series posted upstream:
https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg00962.html

Comment 8 FuXiangChun 2019-05-27 09:41:43 UTC
Hit the same issue with edk2-ovmf-20190308git89910a39dcfd-2.el8.noarch and qemu-kvm-4.0.0-2.module+el8.1.0+3258+4c45705b.x86_64. But ovmf log is different. This is part of log. 

....
2019-05-26 08:25:23: Loading SMM driver at 0x0007FEC4000 EntryPoint=0x0007FED3361 VariableSmm.efi
2019-05-26 08:25:23: mSmmMemLibInternalMaximumSupportAddress = 0xFFFFFFFFF
2019-05-26 08:25:23: VarCheckLibRegisterSetVariableCheckHandler - 0x7FED16FC Success
2019-05-26 08:25:23: ASSERT /builddir/build/BUILD/edk2-89910a39dcfd/MdeModulePkg/Universal/Variable/RuntimeDxe/Variable.c(3870): VariableStore->Size == VariableStoreLength
2019-05-26 08:38:33: (Process terminated with status 0)

Comment 9 Philippe Mathieu-Daudé 2019-05-27 12:20:22 UTC
(In reply to FuXiangChun from comment #8)
> Hit the same issue with edk2-ovmf-20190308git89910a39dcfd-2.el8.noarch and
> qemu-kvm-4.0.0-2.module+el8.1.0+3258+4c45705b.x86_64. But ovmf log is
> different. This is part of log. 
> 
> ....
> 2019-05-26 08:25:23: Loading SMM driver at 0x0007FEC4000
> EntryPoint=0x0007FED3361 VariableSmm.efi
> 2019-05-26 08:25:23: mSmmMemLibInternalMaximumSupportAddress = 0xFFFFFFFFF
> 2019-05-26 08:25:23: VarCheckLibRegisterSetVariableCheckHandler - 0x7FED16FC
> Success
> 2019-05-26 08:25:23: ASSERT
> /builddir/build/BUILD/edk2-89910a39dcfd/MdeModulePkg/Universal/Variable/
> RuntimeDxe/Variable.c(3870): VariableStore->Size == VariableStoreLength
> 2019-05-26 08:38:33: (Process terminated with status 0)

Yes, this seems the same issue.
How many times did you reset your VM? Do you have a reproducer handy?
We should be able to have a 100% stable reproducer by resetting the machine in specific time frames (in the middle of a OVMF write access to the pflash device).

Comment 14 Danilo de Paula 2019-07-29 14:45:58 UTC
@QA: QA_ACK, please?

Comment 15 Danilo de Paula 2019-08-19 23:54:45 UTC
Fixed upstream in qemu-4.1

Comment 17 Xueqiang Wei 2019-08-21 02:30:42 UTC
Hit this issue by automation case "system_reset_during_boot" in avocado, so retested with avocado.

Versions:
Host:
kernel-4.18.0-131.el8.x86_64
qemu-kvm-4.1.0-3.module+el8.1.0+4019+dbfaedf9
edk2-ovmf-20190308git89910a39dcfd-6.el8.noarch

Guest:
rhel8.0.0
windows10
 

Both guest rhel8.0.0 and guest windows10 are all work well after 40 times system_reset. So set status to VERIFIED.


cmd lines:
# python3 ConfigTest.py --testcase=system_reset_during_boot --imageformat=qcow2 --guestname=RHEL.8.0.0,Win10 --driveformat=virtio_scsi --nicmodel=virtio_net --machines=q35 --clone=yes --firmware=ovmf --platform=x86_64 --nrepeat=2

 (1/7) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: PASS (1242.26 s)
 (2/7) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.rh_kernel_update.q35: PASS (190.16 s)
 (3/7) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: PASS (2134.97 s)
 (4/7) repeat1.Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.system_reset_during_boot.q35: PASS (684.81 s)
 (5/7) repeat1.Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.system_reset_during_boot.q35: PASS (1373.77 s)
 (6/7) repeat2.Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.system_reset_during_boot.q35: PASS (773.31 s)
 (7/7) repeat2.Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.system_reset_during_boot.q35: PASS (1063.44 s)

boot guest after system_reset:
 (1/2) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.0.0.x86_64.io-github-autotest-qemu.boot.q35: PASS (68.22 s)
 (2/2) Host_RHEL.m8.u1.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.boot.q35: PASS (155.04 s)

Comment 19 errata-xmlrpc 2019-11-06 07:12:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3723


Note You need to log in before you can comment on or make changes to this bug.