Bug 843000

Summary: [balloon]Guest BSOD during 10000 times balloon device hotplug/unplug
Product: Red Hat Enterprise Linux 7
Reporter: Mike Cao <bcao>
Component: virtio-win
Assignee: Vadim Rozenfeld <vrozenfe>
Status: CLOSED WONTFIX
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: medium
Version: 7.0
CC: areis, bcao, bsarathy, juzhang, lijin, michen, mkenneth, qzhang, rbalakri, rhod, shuyu, virt-bugs, virt-maint, vrozenfe
Target Milestone: rc
Target Release: 7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-11-09 12:30:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  BSOD screenshot (flags: none)
  memory dump file and windbg analyze file (flags: none)
  Memory dump file on rhel6.5 host (flags: none)

Description Mike Cao 2012-07-25 09:17:07 UTC
Description of problem:
The guest hits a BSOD while the balloon device is hot-plugged and unplugged 10000 times.

Version-Release number of selected component (if applicable):
# uname -r
2.6.32-279.el6.x86_64
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.295.el6.x86_64
virtio-win-prewhql-30
win2k8R2 guests


How reproducible:
Seen only once.

Steps to Reproduce:
1. Start the guest with virtio-balloon-pci:
CLI:/usr/libexec/qemu-kvm -M rhel6.3.0 -enable-kvm -m 14G -smp 4,sockets=4,cores=1,threads=1 -cpu SandyBridge,+xsave,+x2apic,check -name win2k8R2 -uuid 4254eff9-1c7c-a3e0-8186-96c479395380 -rtc base=localtime,driftfix=slew -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/home/win2k8R2.qcow2,if=none,id=drive-ide0-0-0,format=qcow2,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive file=/home/en_windows_server_2008_r2_standard_enterprise_datacenter_and_web_with_sp1_x64_dvd_617601.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:7d:d7:db,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -spice port=5910,disable-ticketing -vga qxl -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -bios /usr/share/seabios/bios-pm.bin -monitor unix:/tmp/tt,server,nowait
2. Hot-unplug and hot-plug the PCI device 10000 times:
for ((i=1;i<=10000;i++))
do
  echo device_del balloon0 |nc -U /tmp/tt
  sleep 1
  echo device_add virtio-balloon-pci,id=balloon0,addr=0x5 |nc -U /tmp/tt
  sleep 1
done


Actual results:
During step 2, the guest usually hangs because of the large number of rundll32.exe processes. The guest hit a BSOD once.

Expected results:
No BSOD occurs.

Additional info:

Comment 1 Mike Cao 2012-07-25 09:18:07 UTC
The context is partially valid. Only x86 user-mode context is available.
The wow64exts extension must be loaded to access 32-bit state.
.load wow64exts will do this if you haven't loaded it already.
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 9F, {4, 258, fffffa800af61680, fffff800013da3d0}

Implicit thread is now fffffa80`0af61680
Probably caused by : Unknown_Image ( ANALYSIS_INCONCLUSIVE )

Followup: MachineOwner
---------

16.0: kd:x86> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DRIVER_POWER_STATE_FAILURE (9f)
A driver has failed to complete a power IRP within a specific time (usually 10 minutes).
Arguments:
Arg1: 0000000000000004, The power transition timed out waiting to synchronize with the Pnp
	subsystem.
Arg2: 0000000000000258, Timeout in seconds.
Arg3: fffffa800af61680, The thread currently holding on to the Pnp lock.
Arg4: fffff800013da3d0, nt!TRIAGE_9F_PNP on Win7

Debugging Details:
------------------

Implicit thread is now fffffa80`0af61680

DRVPOWERSTATE_SUBCODE:  4

FAULTING_THREAD:  fffffa800af61680

DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT

BUGCHECK_STR:  0x9F

CURRENT_IRQL:  0

LAST_CONTROL_TRANSFER:  from 0000000000000000 to 0000000000000000

STACK_TEXT:  
00000000 00000000 00000000 00000000 00000000 0x0


STACK_COMMAND:  kb

SYMBOL_NAME:  ANALYSIS_INCONCLUSIVE

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: Unknown_Module

IMAGE_NAME:  Unknown_Image

DEBUG_FLR_IMAGE_TIMESTAMP:  0

BUCKET_ID:  INVALID_KERNEL_CONTEXT

Followup: MachineOwner
---------
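
As a quick check on the arguments above: Arg2 is printed in hexadecimal, and converting it shows the timeout that the bugcheck description refers to. A minimal sketch in shell arithmetic (the variable names are illustrative and not part of the dump):

```shell
# Arg2 of the 0x9F bugcheck is 0x258 (hexadecimal, in seconds).
# Shell arithmetic accepts 0x-prefixed constants.
timeout_seconds=$((0x258))
timeout_minutes=$((timeout_seconds / 60))

echo "timeout: ${timeout_seconds} s (${timeout_minutes} min)"
```

This agrees with the "usually 10 minutes" wording in the bugcheck text.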

Comment 3 Mike Cao 2012-07-26 02:40:50 UTC
Hit the BSOD once more when shutting down the guest after 10000 balloon hotplug/unplug cycles.

Comment 5 Vadim Rozenfeld 2012-08-02 11:14:14 UTC
Hi Mike,
Do you have the balloon service running during this test?

Thank you,
Vadim.

Comment 6 Mike Cao 2012-08-02 11:36:45 UTC
(In reply to comment #5)
> Hi Mike,
> Do you have the balloon service running during this test?
No. I only do hotplug and hot-unplug in a loop.
> 
> Thank you,
> Vadim.

Comment 10 lijin 2013-04-03 09:50:41 UTC
Reproduced a similar issue on RHEL 7.0 (qemu-kvm-1.4.0-1.el7.x86_64 and kernel-3.8.0-0.40.el7.x86_64).

steps:
1.boot guest:
/usr/libexec/qemu-kvm  \
-drive file=/home/whql-test/win7-32-virtio.qcow2,if=none,cache=writethrough,media=disk,format=qcow2,id=disk1 -device ide-drive,id=ide0-0-0,drive=disk1,bootindex=0 \
-netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:7f:f9:56,bus=pci.0 \
-monitor unix:/tmp/tt,server,nowait  \
-boot menu=on \
-spice port=5900,disable-ticketing -vga qxl \
-chardev file,path=/root/console.log,id=serial1 \
-device isa-serial,chardev=serial1,id=s1 \
-usb -device usb-tablet,id=tablet1 \
-M pc-i440fx-1.4 -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 -m 2G \
-enable-kvm \
-fda /usr/share/virtio-win/virtio-win-1.6.3_x86.vfd \
-cdrom /usr/share/virtio-win/virtio-win-1.6.3.iso

2. Hot-unplug and hot-plug the PCI device 10000 times:
for ((i=1;i<=10000;i++))
do
  echo device_del balloon0 |nc -U /tmp/tt
  sleep 1
  echo device_add virtio-balloon-pci,id=balloon0,addr=0x5 |nc -U /tmp/tt
  sleep 1
done

3. Do S3 (suspend to RAM); the guest is still alive.
4. Shut down the guest; the 32-bit Windows 7 guest hits a BSOD.

The attachment "bsod.png" is the BSOD screenshot.
The attachment "memory dump file&analyze" contains the dump file and the windbg analysis file.

Comment 11 lijin 2013-04-03 09:51:46 UTC
Created attachment 731085 [details]
BSOD screenshot

Comment 12 lijin 2013-04-03 09:56:17 UTC
Created attachment 731089 [details]
memory dump file and windbg analyze file

Comment 13 Qunfang Zhang 2013-05-09 04:38:58 UTC
Still reproducible with the following versions when rebooting the guest after the hotplug/unplug loop.

kernel-2.6.32-358.6.1.el6.x86_64
qemu-kvm-0.12.1.2-2.362.el6.x86_64
virtio-win-prewhql-59

Comment 14 Qunfang Zhang 2013-05-09 04:51:43 UTC
Created attachment 745534 [details]
Memory dump file on rhel6.5 host

Comment 16 Ronen Hod 2014-08-06 09:02:04 UTC
QE, can you please check again with the latest drivers and QEMU?

Comment 17 Shuang Yu 2014-08-13 07:43:23 UTC
Retested this issue on the latest RHEL 6.6 host with Windows 2008 R2. During 10000 balloon device hotplug/unplug cycles the guest works well, and after the 1000 hotplug/unplug cycles the guest can reboot and shut down without any error.

qemu-kvm-rhev-0.12.1.2-2.434.el6.x86_64
kernel-2.6.32-495.el6.x86_64
seabios-0.6.1.2-28.el6.x86_64
virtio-win-prewhql-86

steps:
1.boot guest:
/usr/libexec/qemu-kvm -m 2G -smp 2,maxcpus=2,cores=2,threads=1,sockets=1 -netdev tap,id=hostnet1,script=/etc/qemu-ifup -device e1000,netdev=hostnet1,id=net1,mac=00:52:00:00:11:22 -usb -device usb-tablet,id=tablet1 -drive file=win2008r2.raw,format=raw,if=none,id=drive1 -device ide-drive,drive=drive1,id=disk1 -cdrom en_windows_server_2008_r2_standard_enterprise_datacenter_and_web_with_sp1_x64_dvd_617601.iso -uuid 6adb29a6-4e36-46df-84eb-c463ecfdc2ba -name win2008R2 -device virtio-balloon-pci,id=balloon,addr=0x9 -boot menu=on -spice port=5900,disable-ticketing -vga qxl -monitor unix:/tmp/tt,server,nowait

2. Hot-unplug and hot-plug the PCI device 1000 times:
for((i=1;i<=1000;i++)); do echo device_del balloon | nc -U /tmp/tt; sleep 5; echo device_add virtio-balloon-pci,id=balloon,addr=0x9 | nc -U /tmp/tt; sleep 5; done

3.reboot guest successfully
4.shutdown guest successfully

Based on the above, the issue has already been fixed.

Comment 18 Shuang Yu 2014-08-13 11:22:42 UTC
(In reply to shuyu from comment #17)
> retest this issue on latest rhel6.6 host w/ windows 2008R2,during 10000

s/10000/1000/


Retested this issue on a RHEL 6.6 host with Windows 2008 R2 and virtio-win-prewhql-89. During 1000 balloon device hotplug/unplug cycles the guest works well, and after the 1000 cycles the guest can reboot and shut down without any error.

qemu-kvm-rhev-0.12.1.2-2.434.el6.x86_64
kernel-2.6.32-495.el6.x86_64
seabios-0.6.1.2-28.el6.x86_64
virtio-win-prewhql-89

The steps are the same as in comment 17.

Comment 19 Mike Cao 2014-08-14 07:58:46 UTC
Retested this issue with virtio-win-prewhql-89 on a RHEL 7.0 host; the guest eventually hit a BSOD.

Packages:
3.10.0-121.el7.x86_64
qemu-kvm-1.5.3-62.el7.x86_64

Steps:
1.Start VM
 /usr/libexec/qemu-kvm -drive file=089BLNWIN732EBK,if=none,id=drive-ide0-0-0,format=raw,serial=mike_cao,cache=writethrough,media=disk -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -monitor unix:/tmp/tt,server,nowait -boot menu=on -spice port=5900,disable-ticketing -vga qxl -chardev file,path=/root/console.log,id=serial1 -device isa-serial,chardev=serial1,id=s1 -usb -device usb-tablet,id=tablet1 -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 -m 2G -enable-kvm
2.hotplug/unplug in a loop
for ((i=1;i<=10000;i++)); do   echo device_del balloon0 |nc -U /tmp/tt; sleep 5;  echo device_add virtio-balloon-pci,id=balloon0,addr=0x5 |nc -U /tmp/tt; sleep 5; done

Actual Results:
Guest BSOD occurs 
CRITICAL_OBJECT_TERMINATION (f4)
A process or thread crucial to system operation has unexpectedly exited or been
terminated.
Several processes and threads are necessary for the operation of the
system; when they are terminated (for any reason), the system can no
longer function.
Arguments:
Arg1: 00000003, Process
Arg2: 8545bb08, Terminating object
Arg3: 8545bc74, Process image file name
Arg4: 82866cf0, Explanatory message (ascii)

Debugging Details:
------------------

Page 13102 not present in the dump file. Type ".hh dbgerr004" for details

KERNEL_LOG_FAILING_PROCESS:  

PROCESS_OBJECT: 8545bb08

IMAGE_NAME:  csrss.exe

DEBUG_FLR_IMAGE_TIMESTAMP:  0

MODULE_NAME: csrss

FAULTING_MODULE: 00000000 

PROCESS_NAME:  csrss.exe

EXCEPTION_CODE: (NTSTATUS) 0xc0000006 - The instruction at 0x%p referenced memory at 0x%p. The required data was not placed into memory because of an I/O error status of 0x%x.

BUGCHECK_STR:  0xF4_IOERR

DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT

CURRENT_IRQL:  0

ANALYSIS_VERSION: 6.3.9600.16384 (debuggers(dbg).130821-1623) amd64fre

STACK_TEXT:  
8e9d5c9c 8292c067 000000f4 00000003 8545bb08 nt!KeBugCheckEx+0x1e
8e9d5cc0 828a9c1e 82866cf0 8545bc74 8545bd78 nt!PspCatchCriticalBreak+0x71
8e9d5cf0 828a9b61 8545bb08 85efe5f8 c0000006 nt!PspTerminateAllThreads+0x2d
8e9d5d24 8268b1ea ffffffff c0000006 0170f5c4 nt!NtTerminateProcess+0x1a2
8e9d5d24 779470b4 ffffffff c0000006 0170f5c4 nt!KiFastCallEntry+0x12a
WARNING: Frame IP not in any known module. Following frames may be wrong.
0170f5c4 00000000 00000000 00000000 00000000 0x779470b4


STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

IMAGE_VERSION:  

FAILURE_BUCKET_ID:  0xF4_IOERR_IMAGE_csrss.exe

BUCKET_ID:  0xF4_IOERR_IMAGE_csrss.exe

ANALYSIS_SOURCE:  KM

FAILURE_ID_HASH_STRING:  km:0xf4_ioerr_image_csrss.exe

FAILURE_ID_HASH:  {2b68738d-6c37-fd75-d711-1229511b3eea}

Followup: MachineOwner
---------

Comment 21 Shuang Yu 2014-08-18 01:40:35 UTC
Retested this issue on a RHEL 6.6 host with win7-32 and virtio-win-prewhql-86. During 10000 balloon device hotplug/unplug cycles the guest works well, and after the 10000 cycles the guest can reboot and shut down without any error.


kernel-2.6.32-495.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.434.el6.x86_64
seabios-0.6.1.2-28.el6.x86_64
virtio-win-prewhql-86

Steps:

1.boot guest:
/usr/libexec/qemu-kvm -m 2G -smp 2,maxcpus=2,cores=2,threads=1,sockets=1 -netdev tap,id=hostnet1,script=/etc/qemu-ifup -device e1000,netdev=hostnet1,id=net1,mac=00:52:00:00:11:22 -usb -device usb-tablet,id=tablet1 -drive file=win7-32-balloon.raw,format=raw,if=none,id=drive1 -device ide-drive,drive=drive1,id=disk1 -cdrom en_windows_7_ultimate_x86_dvd_x15-65921.iso -uuid 56200569-1761-4a09-94ff-383cfd9e2e01 -name win7-32-balloon -spice port=5900,disable-ticketing -vga qxl -monitor unix:/tmp/tt,server,nowait -device virtio-balloon-pci,id=balloon,addr=0x9

2.hotplug/unplug in a loop
for((i=1;i<=10000;i++)); do echo device_del balloon | nc -U /tmp/tt; sleep 7; echo device_add virtio-balloon-pci,id=balloon,addr=0x9 | nc -U /tmp/tt; sleep 7; done

3.reboot guest successfully

4.shutdown guest successfully

Comment 23 Ronen Hod 2014-11-03 14:14:44 UTC
Based on comment 19 and comment 10, it looks like RHEL7 might still have this issue.
QE, since it does not seem to reproduce on RHEL 6.6, can you please also verify it on RHEL 7?
Thanks.

Comment 24 Mike Cao 2014-11-06 01:11:26 UTC
Retest this issue on RHEL7.1 

Packages 
3.10.0-186.el7.x86_64
qemu-kvm-rhev-2.1.2-1.el7.x86_64
seabios-1.7.5-4.el7.x86_64

Steps:
With a 2-second sleep between each hot-unplug/hot-plug cycle:

Actual results: the guest responds slowly and fails to shut down (shutdown -t 0 -s -f does not work).

With a 7-second sleep between each hotplug/unplug:
Actual results: the guest works fine after 18 hours.

Based on the above, Vadim, can you provide QE with a standard latency for each round of hot-unplug/plug operations?

Thanks,
Mike
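
The two runs above differ only in the delay between operations. A minimal sketch of the same loop with the delay as a parameter, so runs at different delays can be compared directly. The names MONITOR_SOCK, send_hmp, hotplug_cycle and hotplug_loop are hypothetical; only the monitor socket path and the device_del/device_add commands come from this report.

```shell
# Sketch only. MONITOR_SOCK, send_hmp, hotplug_cycle and hotplug_loop are
# hypothetical names; /tmp/tt is the monitor socket used in this report.
MONITOR_SOCK=${MONITOR_SOCK:-/tmp/tt}

# Send one HMP command to the QEMU monitor over its UNIX socket.
send_hmp() {
  echo "$1" | nc -U "$MONITOR_SOCK"
}

# One unplug/plug round. Usage: hotplug_cycle ID ADDR DELAY_SECONDS
hotplug_cycle() {
  send_hmp "device_del $1"
  sleep "$3"
  send_hmp "device_add virtio-balloon-pci,id=$1,addr=$2"
  sleep "$3"
}

# Repeat the round. Usage: hotplug_loop COUNT ID ADDR DELAY_SECONDS
hotplug_loop() {
  i=1
  while [ "$i" -le "$1" ]; do
    hotplug_cycle "$2" "$3" "$4"
    i=$((i + 1))
  done
}
```

For example, `hotplug_loop 10000 balloon0 0x5 7` would correspond to the 7-second run, and `hotplug_loop 10000 balloon0 0x5 2` to the 2-second run.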

Comment 25 Vadim Rozenfeld 2014-11-06 08:27:30 UTC
(In reply to Mike Cao from comment #24)
> Retest this issue on RHEL7.1 
> 
> Packages 
> 3.10.0-186.el7.x86_64
> qemu-kvm-rhev-2.1.2-1.el7.x86_64
> seabios-1.7.5-4.el7.x86_64
> 
> Steps :
> With a 2-second sleep between each hot-unplug/hot-plug cycle:
> 
> Actual results: the guest responds slowly and fails to shut down (shutdown
> -t 0 -s -f does not work).
> 
> With a 7-second sleep between each hotplug/unplug:
> Actual results: the guest works fine after 18 hours.
> 
> Based on the above, Vadim, can you provide QE with a standard latency for each
> round of hot-unplug/plug operations?
> 

Hi Mike,
I don't think I can give any exact numbers. PCI device plug/unplug is a very complicated process from all sides: HW (emulated by the host), the OS, and the device driver itself. Add more load to the host and the latency will change. I think we can close this bug, but let's run this test from time to time in addition to the HCK PnP tests.

Best regards,
Vadim.
> Thanks,
> Mike
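
Since no fixed latency can be given, one alternative to a fixed sleep is to poll until the unplug has visibly completed, with an upper bound. This is a sketch under assumptions, not something used in this report: wait_until, device_present and device_absent are hypothetical helpers, and the check assumes that the monitor's "info qtree" output lists a device by its id until the guest has released it. It says nothing about when the guest driver has finished loading after device_add.

```shell
# Sketch only: wait_until, device_present and device_absent are
# hypothetical helpers.
# Usage: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds;
# returns 1 if it still fails after TIMEOUT_SECONDS retries.
wait_until() {
  remaining=$1
  shift
  while ! "$@"; do
    [ "$remaining" -gt 0 ] || return 1
    sleep 1
    remaining=$((remaining - 1))
  done
}

# Assumption: "info qtree" prints a line containing id "<ID>" for each
# present device. An unreachable monitor also reads as "absent".
device_present() {
  echo "info qtree" | nc -U "${MONITOR_SOCK:-/tmp/tt}" | grep -q "id \"$1\""
}

device_absent() {
  ! device_present "$1"
}
```

A possible use after sending device_del would be `wait_until 60 device_absent balloon0 || echo "unplug not finished after 60 s"`, which adapts to host load instead of assuming a fixed delay.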

Comment 26 Mike Cao 2014-11-06 08:42:28 UTC
(In reply to Vadim Rozenfeld from comment #25)
> (In reply to Mike Cao from comment #24)
> > Retest this issue on RHEL7.1 
> > 
> > Packages 
> > 3.10.0-186.el7.x86_64
> > qemu-kvm-rhev-2.1.2-1.el7.x86_64
> > seabios-1.7.5-4.el7.x86_64
> > 
> > Steps :
> > With a 2-second sleep between each hot-unplug/hot-plug cycle:
> > 
> > Actual results: the guest responds slowly and fails to shut down (shutdown
> > -t 0 -s -f does not work).
> > 
> > With a 7-second sleep between each hotplug/unplug:
> > Actual results: the guest works fine after 18 hours.
> > 
> > Based on the above, Vadim, can you provide QE with a standard latency for each
> > round of hot-unplug/plug operations?
> > 
> 
> Hi Mike,
> I don't think I can give any exact numbers. PCI device plug/unplug is a very
> complicated process from all sides: HW (emulated by the host), the OS, and the
> device driver itself. Add more load to the host and the latency will change.
> I think we can close this bug, but let's run this test from time to time in
> addition to the HCK PnP tests.
I agree to closing the bug.
Regarding the HCK PnP job, I think it is similar to clicking "eject" in the taskbar. Is that the same as running device_del in the QEMU monitor?

Thanks,
Mike

Comment 27 Vadim Rozenfeld 2014-11-06 09:22:37 UTC
(In reply to Mike Cao from comment #26)
> (In reply to Vadim Rozenfeld from comment #25)
> > (In reply to Mike Cao from comment #24)
> > > Retest this issue on RHEL7.1 
> > > 
> > > Packages 
> > > 3.10.0-186.el7.x86_64
> > > qemu-kvm-rhev-2.1.2-1.el7.x86_64
> > > seabios-1.7.5-4.el7.x86_64
> > > 
> > > Steps :
> > > With a 2-second sleep between each hot-unplug/hot-plug cycle:
> > > 
> > > Actual results: the guest responds slowly and fails to shut down (shutdown
> > > -t 0 -s -f does not work).
> > > 
> > > With a 7-second sleep between each hotplug/unplug:
> > > Actual results: the guest works fine after 18 hours.
> > > 
> > > Based on the above, Vadim, can you provide QE with a standard latency for each
> > > round of hot-unplug/plug operations?
> > > 
> > 
> > Hi Mike,
> > I don't think I can give any exact numbers. PCI device plug/unplug is a very
> > complicated process from all sides: HW (emulated by the host), the OS, and the
> > device driver itself. Add more load to the host and the latency will change.
> > I think we can close this bug, but let's run this test from time to time in
> > addition to the HCK PnP tests.
> I agree to closing the bug.
> Regarding the HCK PnP job, I think it is similar to clicking "eject" in the
> taskbar. Is that the same as running device_del in the QEMU monitor?

No, they are not the same. Eject is a gentle way to ask the system to tear the device stack down and remove the device, while device_del is a sort of brute-force action, similar to pulling the device out of its PCI slot, which activates the surprise removal path.

Cheers,
Vadim.
> 
> Thanks,
> Mike