Bug 1370703

Summary: [Balloon] WHQL job "Common scenario stress with IO" failed on 2008-32/64
Product: Red Hat Enterprise Linux 7
Reporter: Peixiu Hou <phou>
Component: qemu-kvm-rhev
Assignee: Stefan Hajnoczi <stefanha>
Status: CLOSED ERRATA
QA Contact: Yumei Huang <yuhuang>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 7.3
CC: ailan, chayang, ghammer, jen, juzhang, knoel, lijin, lmiksik, michen, mrezanin, phou, rbalakri, stefanha, virt-maint, yfu, yuhuang
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.6.0-25.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-07 21:32:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
125BLN200832 none

Description Peixiu Hou 2016-08-27 04:37:33 UTC
Created attachment 1194598
125BLN200832

Description of problem:
The balloon WHQL job "Common scenario stress with IO" failed on 2008-32/64.
When running the job Common_scenario_stress_with_IO, the guest quit with the QEMU error message "virtqueue size exceeded".


Version-Release number of selected component (if applicable):
kernel-3.10.0-493.el7.x86_64
qemu-kvm-rhev-2.6.0-22.el7.x86_64
virtio-win-prewhql-125

How reproducible:
100%

Steps to Reproduce:
1. Boot a client guest:
/usr/libexec/qemu-kvm \
    -name 125BLN200832EYJ \
    -enable-kvm \
    -m 4G \
    -smp 4 \
    -uuid 267a6a93-2c55-42b3-b5e5-c1cd984ad009 \
    -nodefconfig \
    -nodefaults \
    -chardev socket,id=charmonitor,path=/tmp/125BLN200832EYJ,server,nowait \
    -mon chardev=charmonitor,id=monitor,mode=control \
    -rtc base=localtime,driftfix=slew \
    -boot order=cd,menu=on \
    -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
    -drive file=125BLN200832EYJ,if=none,id=drive-ide0-0-0,format=raw,serial=mike_cao,cache=none \
    -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 \
    -drive file=en_windows_server_2008_datacenter_enterprise_standard_sp2_x86_dvd_342333.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw \
    -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 \
    -drive file=125BLN200832EYJ.vfd,if=none,id=drive-fdc0-0-0,format=raw,cache=none \
    -global isa-fdc.driveA=drive-fdc0-0-0 \
    -netdev tap,script=/etc/qemu-ifup,downscript=no,id=hostnet0 \
    -device e1000,netdev=hostnet0,id=net0,mac=00:52:1f:43:46:da,bus=pci.0,addr=0x3 \
    -chardev pty,id=charserial0 \
    -device isa-serial,chardev=charserial0,id=isa_serial0 \
    -device usb-tablet,id=input0 \
    -vnc 0.0.0.0:1 \
    -vga cirrus \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,disable-legacy=off,disable-modern=off

2. Run the job "Common scenario stress with IO"
3. Check the guest state

Actual results:
The guest quits and the job fails.

Expected results:
The job passes.

Additional info:
1. Also tried without virtio 1.0; it failed as well.
2. The WLK log is attached as 125BLN200832.cpk.

Comment 2 Peixiu Hou 2016-08-30 15:49:42 UTC
Isolation:
1. Retested this case with virtio-win-prewhql-126 + qemu (RHEL 7.2) on a RHEL 7.2 host: it passed.
qemu-kvm-rhev-2.3.0-31.el7_2.3.x86_64
kernel-3.10.0-327.3.1.el7.x86_64
virtio-win-prewhql-126

2. Retested this case with virtio-win-prewhql-126 + qemu (RHEL 7.3) on a RHEL 7.2 host: it failed.
qemu-kvm-rhev-2.6.0-22.el7.x86_64
kernel-3.10.0-327.3.1.el7.x86_64
virtio-win-prewhql-126

According to the above, this should be a RHEL 7.3 QEMU issue.


Best Regards~
Peixiu Hou

Comment 4 Peixiu Hou 2016-08-31 09:23:57 UTC
Retested this case with qemu-kvm-rhev-2.6.0-14.el7.x86_64: it passes.

Best Regards~
Peixiu Hou

Comment 5 Gal Hammer 2016-08-31 09:42:10 UTC
The relevant QEMU commit is afd9096e.

Are you using the new HCK to test the driver? I can't find this test and it seems it belongs to the old WHQL test suite.

Comment 6 Peixiu Hou 2016-09-01 07:36:32 UTC
(In reply to Gal Hammer from comment #5)
> The relevant QEMU commit is afd9096e.
> 
> Are you using the new HCK to test the driver? I can't find this test and it
> seems it belongs to the old WHQL test suite.

Hi Gal,

We used WLK to test the 2008-32 and 2008-64 WHQL jobs; the WLK version is Windows Logo Kit 1.6. You can use this tool to find the "Common scenario stress with IO" job.
https://www.microsoft.com/en-us/download/details.aspx?id=39359

We used HCK to test the 2008-R2 through 2012-R2 WHQL jobs. It doesn't include the "Common scenario stress with IO" job.


Best Regards~
Peixiu Hou

Comment 7 Gal Hammer 2016-09-05 13:00:49 UTC
(In reply to Peixiu Hou from comment #6)
> (In reply to Gal Hammer from comment #5)
> > The relevant QEMU commit is afd9096e.
> > 
> > Are you using the new HCK to test the driver? I can't find this test and it
> > seems it belongs to the old WHQL test suite.
> 
> Hi Gal,
> 
> We used WLK to test the 2008-32 and 2008-64 WHQL jobs; the WLK version is
> Windows Logo Kit 1.6. You can use this tool to find the "Common scenario
> stress with IO" job.
> https://www.microsoft.com/en-us/download/details.aspx?id=39359
> 
> We used HCK to test the 2008-R2 through 2012-R2 WHQL jobs. It doesn't
> include the "Common scenario stress with IO" job.
> 
> 
> Best Regards~
> Peixiu Hou

I was unable to reproduce with the same qemu/kernel/driver versions.

Did you install the balloon service on the client machine?

Comment 12 Gal Hammer 2016-09-07 13:49:18 UTC
A quicker way to reproduce this bug is to use the devcon.exe utility in an administrator command prompt window:

FOR /L %i IN (1,1,130) DO devcon.exe restart "PCI\VEN_1AF4&DEV_1045"

Comment 13 Stefan Hajnoczi 2016-09-12 15:44:05 UTC
I'm concerned this bug can be triggered by rebooting guests.  Therefore it could affect customers and become urgent in RHEL 7.2.z/7.3.

Please try this simplified reproducer as root in a Linux guest:

  host# qemu-system-x86_64 -enable-kvm -m 1024 -cpu host -drive if=virtio,file=rhel72.img,format=raw -device virtio-balloon-pci,id=virtio-balloon0,guest-stats-polling-interval=5
  guest# for ((i = 0; i < 129; i++)); do rmmod virtio_balloon; modprobe virtio_balloon; done

Expected result:
The for loop completes successfully.

Actual result:
The VM terminates and QEMU prints "Virtqueue size exceeded".

Bug description:
The problem is that the vq->inuse counter is not zeroed when the device resets.  This causes virtqueue_pop() to abort with the error message when the counter exceeds the virtqueue size.  A real-life scenario would be rebooting a guest with virtio-balloon (and stats polling enabled) 129 times.

I will backport two patches from upstream that address this issue.
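
To illustrate the bug description above, here is a toy Python model of the counter behaviour. It is only a sketch, not QEMU code: the names ToyVirtqueue and first_failing_reload are hypothetical, and the queue size of 128 is an assumption (it is consistent with the 129-iteration reproducer, but this report does not state it). The model also assumes the device holds exactly one popped buffer at the moment of each reset.

```python
class VirtqueueSizeExceeded(Exception):
    """Stands in for QEMU terminating with "Virtqueue size exceeded"."""


class ToyVirtqueue:
    """Hypothetical model of a virtqueue's in-use accounting only."""

    def __init__(self, size, zero_inuse_on_reset):
        self.size = size
        self.inuse = 0
        self.zero_inuse_on_reset = zero_inuse_on_reset

    def pop(self):
        # Refuse to hand out more elements than the queue can hold.
        if self.inuse >= self.size:
            raise VirtqueueSizeExceeded("Virtqueue size exceeded")
        self.inuse += 1

    def reset(self):
        # Buggy behaviour: the counter survives the device reset.
        # Fixed behaviour: the counter is zeroed on reset.
        if self.zero_inuse_on_reset:
            self.inuse = 0


def first_failing_reload(reloads, size=128, zero_inuse_on_reset=False):
    """Return the 1-based reload on which pop() fails, or None if none fails.

    size=128 is an assumed queue size, chosen to match the reproducer.
    """
    vq = ToyVirtqueue(size, zero_inuse_on_reset)
    for i in range(1, reloads + 1):
        try:
            vq.pop()      # device takes one buffer and holds on to it
        except VirtqueueSizeExceeded:
            return i
        vq.reset()        # driver reload resets the device
    return None


if __name__ == "__main__":
    print("without zeroing:", first_failing_reload(129))
    print("with zeroing:   ", first_failing_reload(129, zero_inuse_on_reset=True))
```

Under these assumptions, the model without zeroing fails on reload 129, and the model that zeroes the counter on reset never fails.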

Comment 14 lijin 2016-09-13 01:29:33 UTC
Gal and Ladi,
According to comment #13, a Linux guest hits the same issue.
Could you help confirm whether this is a QEMU or a virtio-win bug, so that we can change it to the correct component?

Comment 16 Ladi Prosek 2016-09-13 06:27:24 UTC
Hi Li Jin,

(In reply to lijin from comment #14)
> Gal and Ladi,
> According to comment #13, a Linux guest hits the same issue.
> Could you help confirm whether this is a QEMU or a virtio-win bug, so that
> we can change it to the correct component?

This is a QEMU bug. It was found on Windows because the "Common scenario stress with IO" test repeatedly restarts the driver, but is not virtio-win specific.

Comment 18 lijin 2016-09-13 07:06:11 UTC
Changed the component to qemu according to comment #16 and comment #17.

Comment 19 Miroslav Rezanina 2016-09-13 12:49:14 UTC
Fix included in qemu-kvm-rhev-2.6.0-25.el7

Comment 21 Yanan Fu 2016-09-14 09:46:16 UTC
------------------------reproduce-----------------------
Test version:
kernel: kernel-3.10.0-418.el7.x86_64
qemu:   qemu-kvm-rhev-2.6.0-24.el7.x86_64

1. Boot one Linux guest with:
  -drive id=virtio-blk-drive,if=virtio,format=qcow2,file=/home/rhel7.3.qcow2 \
  -device virtio-balloon-pci,id=virtio-balloon0,guest-stats-polling-interval=5 \

2. After the guest boots up, run in the guest:
   for ((i = 0; i < 129; i++)); do rmmod virtio_balloon; modprobe virtio_balloon; done

The guest terminates immediately, and qemu outputs:
(qemu) qemu-kvm: Virtqueue size exceeded

Reproduced this bug successfully.

------------------------verification-----------------------
Test version:
kernel: kernel-3.10.0-418.el7.x86_64
qemu:   qemu-kvm-rhev-2.6.0-25.el7.x86_64

1. Boot one Linux guest with:
  -drive id=virtio-blk-drive,if=virtio,format=qcow2,file=/home/rhel7.3.qcow2 \
  -device virtio-balloon-pci,id=virtio-balloon0,guest-stats-polling-interval=5 \

2. After the guest boots up, run in the guest:
   for ((i = 0; i < 129; i++)); do rmmod virtio_balloon; modprobe virtio_balloon; done

   The loop completes successfully.

So, moving this bug to VERIFIED according to the test results above.


cmd:
/usr/libexec/qemu-kvm \
    -name 'VM1'  \
    -sandbox off  \
    -machine pc  \
    -nodefaults  \
    -vga qxl \
    -global kvm-pit.lost_tick_policy=delay \
    -chardev socket,id=qmp_monitor,path=/var/tmp/qmpmonitor,server,nowait \
    -mon chardev=qmp_monitor,mode=control  \
    -device pvpanic,ioport=0x505,id=idkP1Yip  \
    -device nec-usb-xhci,id=usb1,bus=pci.0,addr=05 \
    -drive id=virtio-blk-drive,if=virtio,format=qcow2,file=/home/rhel7.3.qcow2 \
    -device virtio-balloon-pci,id=virtio-balloon0,guest-stats-polling-interval=5 \
    -m 2048  \
    -smp 4,maxcpus=8,cores=4,threads=1,sockets=2  \
    -cpu host \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :0 \
    -boot order=cdn,once=c,menu=on,strict=off \
    -enable-kvm \
    -monitor stdio \
    -qmp tcp:0:4444,server,nowait  \
    -netdev tap,id=hostnet,vhost=on  \
    -device virtio-net-pci,netdev=hostnet,id=virtio-net

Comment 23 errata-xmlrpc 2016-11-07 21:32:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html