Bug 1346237

Summary: win 10.x86_64 guest coredump when execute avocado test case: win_virtio_update.install_driver
Product: Red Hat Enterprise Linux 7
Reporter: Yanan Fu <yfu>
Component: qemu-kvm-rhev
Assignee: Stefan Hajnoczi <stefanha>
Status: CLOSED ERRATA
QA Contact: FuXiangChun <xfu>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 7.3
CC: chayang, jsnow, juzhang, knoel, mrezanin, stefanha, virt-maint, yfu
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Windows
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.6.0-11.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-07 21:17:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
avocado test log for this bug (flags: none)

Description Yanan Fu 2016-06-14 10:47:13 UTC
Created attachment 1167829
avocado test log for this bug

Description of problem:
This issue was hit with the avocado test case win_virtio_update.install_driver.
After the "Install drivers balloon" step, rebooting the guest causes a coredump.
Both Intel and AMD hosts hit this issue.

Version-Release number of selected component (if applicable):
kernel:3.10.0-422.el7.x86_64
qemu:qemu-kvm-rhev-2.6.0-5.el7.x86_64
virtio-win: virtio-win-1.8.0-4.iso

How reproducible:
100%

Steps to Reproduce:
1. Boot a win10.x86_64 guest
2. Install the balloon driver from virtio-win

Actual results:
guest coredump

Expected results:
win_virtio_update.install_driver should finish successfully.

Additional info:
# gdb /usr/libexec/qemu-kvm core.10236

(gdb) bt
#0  0x00007ff893b935f7 in raise () from /lib64/libc.so.6
#1  0x00007ff893b94ce8 in abort () from /lib64/libc.so.6
#2  0x00007ff89c13b3d6 in bdrv_aio_cancel (acb=0x7ff8a441c0a0) at block/io.c:2048
#3  0x00007ff89c130535 in blk_aio_cancel (acb=<optimized out>) at block/block-backend.c:1044
#4  0x00007ff89c040d5a in ide_bus_reset (bus=bus@entry=0x7ff8a28e7480) at hw/ide/core.c:2326
#5  0x00007ff89c044088 in piix3_reset (opaque=0x7ff8a28e6c00) at hw/ide/piix.c:115
#6  0x00007ff89bfd18dd in qemu_devices_reset () at vl.c:1738
#7  0x00007ff89bf4d216 in pc_machine_reset () at /usr/src/debug/qemu-2.6.0/hw/i386/pc.c:1936
#8  0x00007ff89bfd1946 in qemu_system_reset (report=report@entry=true) at vl.c:1751
#9  0x00007ff89bec879b in main_loop_should_exit () at vl.c:1898
#10 main_loop () at vl.c:1938
#11 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4667
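A backtrace like the one above can also be captured without an interactive gdb session. The following is a minimal sketch, not part of the original report: the function name print_backtrace is hypothetical, and the default paths are simply the ones quoted in this bug.

```shell
#!/bin/sh
# Hypothetical helper: print a backtrace from a core file non-interactively.
# Defaults are the binary and core file named in this report.
print_backtrace() {
    binary=${1:-/usr/libexec/qemu-kvm}
    core=${2:-core.10236}

    # Fail early with a clear message if gdb is not installed.
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found" >&2
        return 127
    fi

    # Fail early if the core file is missing or unreadable.
    if [ ! -r "$core" ]; then
        echo "core file not readable: $core" >&2
        return 1
    fi

    # -batch exits after running the commands; -ex bt prints the backtrace.
    gdb -batch -ex bt "$binary" "$core"
}
```

Calling print_backtrace with no arguments uses the paths from this report; pass a different binary and core file as the first and second arguments.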

Comment 3 Gal Hammer 2016-06-16 11:49:19 UTC
1. I'm unable to download the core file ("Forbidden. You don't have permission to access /pub/section2/coredump/var/crash/yfu/bug-1346237/core.10236 on this server.").

2. Is the Windows guest a new installation or a prepared one? Using the same qemu command line, I'm unable to start a Windows installation on my host.

Comment 5 Stefan Hajnoczi 2016-06-20 16:26:48 UTC
Please try this build:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11223356

It includes a fix for IDE/DMA helpers which should solve the bdrv_aio_cancel() abort(3) you experienced.

This bug appears to occur when Windows issues an IDE TRIM request.  I'm not sure how often Windows does this so it may be hard to reproduce again.

Comment 10 Miroslav Rezanina 2016-07-01 08:24:10 UTC
Fix included in qemu-kvm-rhev-2.6.0-11.el7
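To check whether a given build is at or above the fixed one, a rough comparison can be sketched as below. This is an illustration only: the function name has_fix is hypothetical, and it relies on GNU sort -V, which approximates but is not identical to RPM's own version comparison.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given VERSION-RELEASE string is at or
# above the fixed build named in this report (2.6.0-11.el7).
# Assumption: GNU "sort -V" ordering is close enough to RPM ordering here.
has_fix() {
    fixed="2.6.0-11.el7"
    # The lower of the two versions sorts first; if that is the fixed
    # version, the candidate is the same or newer.
    lowest=$(printf '%s\n%s\n' "$fixed" "$1" | sort -V | head -n 1)
    [ "$lowest" = "$fixed" ]
}
```

On a host with the package installed, the argument could come from rpm -q --qf '%{VERSION}-%{RELEASE}' qemu-kvm-rhev.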

Comment 12 Yanan Fu 2016-07-06 10:56:11 UTC
I reran the avocado case "win_virtio_update.install_driver" many times, but it is always blocked by an autotest bug:
Bug 1233526 - [KVM-AUTOTEST] win_virtio_update.install_driver:Unhandled ShellTimeoutError: Timeout expired while waiting for shell command to complete: 'cmd /c E:\\install_driver.bat F:\\NetKVM\\w7\\amd64' (WIn7, Win8)

So I have not reproduced this bug yet.
The autotest bug is being handled in the avocado framework. I will update the test result later.

Comment 13 Yanan Fu 2016-09-09 10:50:24 UTC
Hi Stefan,
As you said in comment 5, this bug is very hard to reproduce.
The autotest case "win_virtio_update.install_driver" has been replaced by "single_driver_install"; I have rerun this case 50 times and cannot hit this issue.

With the fixed version:
kernel-3.10.0-501.el7.x86_64
qemu-kvm-rhev-2.6.0-23.el7.x86_64

I reran "single_driver_install" 10*5 = 50 times and could not hit this issue either.
job link:
http://10.66.4.244/kvm_autotest_job_log/?jobid=1495497

Do you have a reproducer that can easily trigger this bug? Or can we verify it based on the result of my job?

Comment 14 Stefan Hajnoczi 2016-09-14 11:47:11 UTC
Here is a deterministic reproducer using a Linux guest:

$ qemu-img create -f qcow2 -o preallocation=full test.qcow2 128M

This is a fully preallocated image so all clusters have been created and are filled with zeroes.  We will send an IDE TRIM request to discard a cluster and then reboot the guest to trigger the same code path as Windows.

$ qemu-system-x86_64 -enable-kvm -m 1024 -drive if=ide,id=ide-drive,discard=unmap,file=blkdebug::test.qcow2,format=qcow2 -drive if=none,id=virtio-drive,file=rhel72.img,format=raw -device virtio-blk-pci,drive=virtio-drive,bootindex=0

The IDE drive is for the test and the virtio-blk device is just there to boot a Linux guest.  Notice that the IDE drive has discard=unmap so the TRIM request will be handled instead of ignored.  The blkdebug protocol in the filename enables debugging support that lets us suspend the TRIM request to make this reproducer reliable and not based on timing.

(qemu) qemu-io ide-drive "break cluster_free A"

This adds an I/O request breakpoint.  When qcow2 processes a discard request it will free a cluster and the breakpoint suspends the request at that time.  It's as if we have an infinitely slow disk.  This way we avoid race conditions in the test steps.

guest# blkdiscard -l 65536 /dev/sda

Submit an IDE TRIM request for a full 64 KB qcow2 cluster.  This causes qcow2 to free a cluster and triggers our I/O request breakpoint.  blkdiscard(8) should hang inside the guest because it is waiting for the suspended IDE TRIM request to complete.
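The length 65536 is one whole cluster at the qcow2 default cluster size of 64 KiB. As a small sketch (the helper name discard_length is hypothetical), the length for a whole number of clusters can be computed like this; if the image was created with a non-default cluster size, "qemu-img info" reports the actual value as cluster_size.

```shell
#!/bin/sh
# Hypothetical helper: byte length covering N whole qcow2 clusters.
# Assumption: 64 KiB clusters (the qcow2 default) unless overridden.
discard_length() {
    clusters=${1:-1}
    cluster_kib=${2:-64}
    echo $((clusters * cluster_kib * 1024))
}

discard_length 1    # prints 65536
```

Inside the guest this could be used as blkdiscard -l "$(discard_length 1)" /dev/sda.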

(qemu) system_reset

Now reboot the guest to trigger the ide_bus_reset()/bdrv_aio_cancel() code path.

Expected behavior (qemu-kvm-rhev-2.6.0-11.el7):

QEMU hangs because we didn't provide a way to resume the suspended IDE TRIM request:
blkdebug: Suspended request 'A'

(In the Windows scenario QEMU will not hang because the IDE TRIM request isn't suspended.  As soon as the request completes the guest will reboot and be responsive.)

Actual behavior (qemu-kvm-rhev-2.6.0-10.el7):

QEMU calls abort(3):
blkdebug: Suspended request 'A'
Aborted (core dumped)

Comment 15 Yanan Fu 2016-09-15 03:42:00 UTC
Thanks for stefanha's reproducer.

------------------------reproduce-------------------
Test version:
qemu: qemu-kvm-rhev-2.6.0-10.el7.x86_64
guest: rhel7.3

Test steps:
1. Create a preallocated image.
    # qemu-img create -f qcow2 -o preallocation=full test.qcow2 128M

2. Boot a guest with the following options:
    -drive id=ide-drive,if=ide,discard=unmap,file=blkdebug::/home/test.qcow2,format=qcow2 \
    -drive id=virtio-drive,if=none,file=/home/RHEL-Server-7.3-64-virtio.qcow2,format=qcow2 \
    -device virtio-blk-pci,drive=virtio-drive,id=virtio-blk-disk,bootindex=0

3. In the QEMU monitor on the host, execute:
    (qemu) qemu-io ide-drive "break cluster_free A"

4. In the guest, execute:
    # blkdiscard -l 65536 /dev/sda
    The command blocks in the guest, and QEMU prints "blkdebug: Suspended request 'A'".

5. In the QEMU monitor, execute:
    (qemu) system_reset
    QEMU dumps core: (qemu) *****  Aborted       (core dumped)

The bug is reproduced successfully.


------------------------verification-------------------
Test version:
qemu: qemu-kvm-rhev-2.6.0-25.el7.x86_64
guest: rhel7.3

Test steps:
Same test steps as above.
After system_reset in step 5, the guest hangs and the QEMU monitor accepts no input; only "kill -9 $QEMU_PID" terminates QEMU.

Based on comment 14 and the test results above, moving this bug to VERIFIED.
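The two outcomes above can be told apart from the exit status of the QEMU process when it is launched from a shell. This is a sketch under assumptions: the function name classify_qemu_exit is hypothetical, and it assumes a shell such as bash or dash on Linux, where a process killed by signal N is reported as status 128 + N (SIGABRT is 6, SIGKILL is 9).

```shell
#!/bin/sh
# Hypothetical helper: interpret the exit status of the QEMU process
# after running the reproducer from comment 14.
# Assumption: the shell reports death by signal N as 128 + N.
classify_qemu_exit() {
    case "$1" in
        134) echo "aborted: bug reproduced" ;;              # 128 + SIGABRT (6)
        137) echo "killed: hung until kill -9, expected with the fix" ;;  # 128 + SIGKILL (9)
        0)   echo "clean exit" ;;
        *)   echo "unexpected exit status $1" ;;
    esac
}
```

For example, run the QEMU command line in the foreground and call classify_qemu_exit "$?" right after it returns.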


CLI:
/usr/libexec/qemu-kvm \
    -enable-kvm \
    -m 2048 \
    -drive id=ide-drive,if=ide,discard=unmap,file=blkdebug::/home/test.qcow2,format=qcow2 \
    -drive id=virtio-drive,if=none,file=/home/win2012r2-virtio-blk.qcow2,format=qcow2 \
    -device virtio-blk-pci,drive=virtio-drive,id=virtio-blk-disk,bootindex=0 \
    -usb \
    -device usb-tablet  \
    -vnc :0 \
    -monitor stdio

Do not add any other command-line options, because:
"there is a chance that some options could involve a call to bdrv_drain_all() inside QEMU.  This function waits until all I/O requests have completed. That would hang QEMU (including the monitor)". (analysis from stefanha)
In that case "system_reset" cannot be entered in step 5, and this bug needs system_reset to trigger.

Comment 17 errata-xmlrpc 2016-11-07 21:17:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html