Bug 843000 - [balloon]Guest BSOD during 10000 times balloon device hotplug/unplug
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: virtio-win
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Target Release: 7.0
Assigned To: Vadim Rozenfeld
QA Contact: Virtualization Bugs
Depends On:
Blocks:
Reported: 2012-07-25 05:17 EDT by Mike Cao
Modified: 2015-11-22 22:35 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-11-09 07:30:15 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
BSOD screenshot (16.43 KB, image/png)
2013-04-03 05:51 EDT, lijin
memory dump file and windbg analyze file (40.04 MB, application/zip)
2013-04-03 05:56 EDT, lijin
Memory dump file on rhel6.5 host (44.65 MB, application/x-gzip)
2013-05-09 00:51 EDT, Qunfang Zhang

Description Mike Cao 2012-07-25 05:17:07 EDT
Description of problem:
The guest hits a BSOD while the balloon device is hotplugged/unplugged 10000 times.

Version-Release number of selected component (if applicable):
# uname -r
2.6.32-279.el6.x86_64
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.295.el6.x86_64
virtio-win-prewhql-30
win2k8R2 guests


How reproducible:
Reproduced only once.

Steps to Reproduce:
1. Start the guest with virtio-balloon-pci:
CLI:/usr/libexec/qemu-kvm -M rhel6.3.0 -enable-kvm -m 14G -smp 4,sockets=4,cores=1,threads=1 -cpu SandyBridge,+xsave,+x2apic,check -name win2k8R2 -uuid 4254eff9-1c7c-a3e0-8186-96c479395380 -rtc base=localtime,driftfix=slew -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/home/win2k8R2.qcow2,if=none,id=drive-ide0-0-0,format=qcow2,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive file=/home/en_windows_server_2008_r2_standard_enterprise_datacenter_and_web_with_sp1_x64_dvd_617601.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:7d:d7:db,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -spice port=5910,disable-ticketing -vga qxl -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -bios /usr/share/seabios/bios-pm.bin -monitor unix:/tmp/tt,server,nowait
2. Repeat PCI hotplug/unplug 10000 times:
for ((i=1;i<=10000;i++))
do
  echo device_del balloon0 | nc -U /tmp/tt
  sleep 1
  # Re-add under the same id so that the next device_del finds the device.
  echo device_add virtio-balloon-pci,id=balloon0,addr=0x5 | nc -U /tmp/tt
  sleep 1
done
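
The loop above can also be written with the device id, PCI address and delay as parameters, which keeps the id passed to device_add identical to the one device_del removes. This is only a sketch: MONITOR_SOCK, DEV_ID, DEV_ADDR, DRY_RUN, send_monitor and hotplug_loop are illustrative names that do not appear in the original report, and the DRY_RUN switch is an assumption added so the command sequence can be inspected without a running guest.

```shell
# Sketch only: variable and function names are illustrative, not from the report.
MONITOR_SOCK=${MONITOR_SOCK:-/tmp/tt}   # matches -monitor unix:/tmp/tt,server,nowait
DEV_ID=${DEV_ID:-balloon0}              # must match the id on the qemu-kvm command line
DEV_ADDR=${DEV_ADDR:-0x5}
DRY_RUN=${DRY_RUN:-0}                   # 1 = print commands instead of sending them

# Send one HMP command to the monitor socket, or just print it in dry-run mode.
send_monitor() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$1"
  else
    echo "$1" | nc -U "$MONITOR_SOCK"
  fi
}

# hotplug_loop COUNT DELAY: run COUNT unplug/plug cycles,
# sleeping DELAY seconds after each monitor command.
hotplug_loop() {
  count=$1
  delay=$2
  i=1
  while [ "$i" -le "$count" ]; do
    send_monitor "device_del $DEV_ID"
    sleep "$delay"
    send_monitor "device_add virtio-balloon-pci,id=$DEV_ID,addr=$DEV_ADDR"
    sleep "$delay"
    i=$((i + 1))
  done
}
```

For example, `hotplug_loop 10000 1` would correspond to the loop in step 2, and setting `DRY_RUN=1` first prints the monitor commands without touching the socket.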


Actual results:
During step 2, the guest usually hangs because of a large number of rundll32.exe processes. The guest hit a BSOD once.


Expected results:
No BSOD occurs.

Additional info:
Comment 1 Mike Cao 2012-07-25 05:18:07 EDT
The context is partially valid. Only x86 user-mode context is available.
The wow64exts extension must be loaded to access 32-bit state.
.load wow64exts will do this if you haven't loaded it already.
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 9F, {4, 258, fffffa800af61680, fffff800013da3d0}

Implicit thread is now fffffa80`0af61680
Probably caused by : Unknown_Image ( ANALYSIS_INCONCLUSIVE )

Followup: MachineOwner
---------

16.0: kd:x86> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DRIVER_POWER_STATE_FAILURE (9f)
A driver has failed to complete a power IRP within a specific time (usually 10 minutes).
Arguments:
Arg1: 0000000000000004, The power transition timed out waiting to synchronize with the Pnp
	subsystem.
Arg2: 0000000000000258, Timeout in seconds.
Arg3: fffffa800af61680, The thread currently holding on to the Pnp lock.
Arg4: fffff800013da3d0, nt!TRIAGE_9F_PNP on Win7

Debugging Details:
------------------

Implicit thread is now fffffa80`0af61680

DRVPOWERSTATE_SUBCODE:  4

FAULTING_THREAD:  fffffa800af61680

DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT

BUGCHECK_STR:  0x9F

CURRENT_IRQL:  0

LAST_CONTROL_TRANSFER:  from 0000000000000000 to 0000000000000000

STACK_TEXT:  
00000000 00000000 00000000 00000000 00000000 0x0


STACK_COMMAND:  kb

SYMBOL_NAME:  ANALYSIS_INCONCLUSIVE

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: Unknown_Module

IMAGE_NAME:  Unknown_Image

DEBUG_FLR_IMAGE_TIMESTAMP:  0

BUCKET_ID:  INVALID_KERNEL_CONTEXT

Followup: MachineOwner
---------
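
As a quick check of the bugcheck arguments above, Arg2 is printed in hexadecimal, so the timeout can be converted to decimal seconds. The snippet below is a small sketch; hex_to_dec is a hypothetical helper name, not part of WinDbg or the original report.

```shell
# Convert a bugcheck argument printed in hex by WinDbg to decimal.
# hex_to_dec is an illustrative helper, not a debugger command.
hex_to_dec() {
  printf '%d\n' "0x$1"
}

# Arg2 from the 0x9F bugcheck above.
timeout_seconds=$(hex_to_dec 0000000000000258)
echo "timeout: ${timeout_seconds} s ($((timeout_seconds / 60)) min)"
```

0x258 is 600 seconds, which agrees with the "usually 10 minutes" stated in the DRIVER_POWER_STATE_FAILURE description.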
Comment 3 Mike Cao 2012-07-25 22:40:50 EDT
Hit the BSOD once more when shutting down the guest after 10000 rounds of balloon hotplug/unplug.
Comment 5 Vadim Rozenfeld 2012-08-02 07:14:14 EDT
Hi Mike,
Do you have the balloon service running during this test?

Thank you,
Vadim.
Comment 6 Mike Cao 2012-08-02 07:36:45 EDT
(In reply to comment #5)
> Hi Mike,
> Do you have the balloon service running during this test?
No, I only ran hotplug and hot-unplug in a loop.
Comment 10 lijin 2013-04-03 05:50:41 EDT
Reproduced this issue on RHEL7.0 (qemu-kvm-1.4.0-1.el7.x86_64 and kernel-3.8.0-0.40.el7.x86_64); a similar failure occurred.

steps:
1.boot guest:
/usr/libexec/qemu-kvm  \
-drive file=/home/whql-test/win7-32-virtio.qcow2,if=none,cache=writethrough,media=disk,format=qcow2,id=disk1 -device ide-drive,id=ide0-0-0,drive=disk1,bootindex=0 \
-netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:7f:f9:56,bus=pci.0 \
-monitor unix:/tmp/tt,server,nowait  \
-boot menu=on \
-spice port=5900,disable-ticketing -vga qxl \
-chardev file,path=/root/console.log,id=serial1 \
-device isa-serial,chardev=serial1,id=s1 \
-usb -device usb-tablet,id=tablet1 \
-M pc-i440fx-1.4 -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 -m 2G \
-enable-kvm \
-fda /usr/share/virtio-win/virtio-win-1.6.3_x86.vfd \
-cdrom /usr/share/virtio-win/virtio-win-1.6.3.iso

2. Repeat PCI hotplug/unplug 10000 times:
for ((i=1;i<=10000;i++))
do
  echo device_del balloon0 | nc -U /tmp/tt
  sleep 1
  # Re-add under the same id so that the next device_del finds the device.
  echo device_add virtio-balloon-pci,id=balloon0,addr=0x5 | nc -U /tmp/tt
  sleep 1
done

3. Do S3 (suspend to RAM): the guest is still alive.
4. Shut down the guest: the Win7 32-bit guest hits a BSOD.

The attachment "bsod.png" is the BSOD screenshot.
The attachment "memory dump file&analyze" contains the dump file and the WinDbg analysis file.
Comment 11 lijin 2013-04-03 05:51:46 EDT
Created attachment 731085 [details]
BSOD screenshot
Comment 12 lijin 2013-04-03 05:56:17 EDT
Created attachment 731089 [details]
memory dump file and windbg analyze file
Comment 13 Qunfang Zhang 2013-05-09 00:38:58 EDT
This can still be reproduced with the following versions when rebooting the guest after the hotplug/unplug loop.

kernel-2.6.32-358.6.1.el6.x86_64
qemu-kvm-0.12.1.2-2.362.el6.x86_64
virtio-win-prewhql-59
Comment 14 Qunfang Zhang 2013-05-09 00:51:43 EDT
Created attachment 745534 [details]
Memory dump file on rhel6.5 host
Comment 16 Ronen Hod 2014-08-06 05:02:04 EDT
QE, can you please check again with the latest drivers and QEMU?
Comment 17 Shuang Yu 2014-08-13 03:43:23 EDT
Retested this issue on the latest RHEL6.6 host with Windows 2008R2. During 10000 rounds of balloon device hotplug/unplug the guest worked well, and after the 1000 rounds of hotplug/unplug the guest could reboot and shut down without any error.

qemu-kvm-rhev-0.12.1.2-2.434.el6.x86_64
kernel-2.6.32-495.el6.x86_64
seabios-0.6.1.2-28.el6.x86_64
virtio-win-prewhql-86

steps:
1.boot guest:
/usr/libexec/qemu-kvm -m 2G -smp 2,maxcpus=2,cores=2,threads=1,sockets=1 -netdev tap,id=hostnet1,script=/etc/qemu-ifup -device e1000,netdev=hostnet1,id=net1,mac=00:52:00:00:11:22 -usb -device usb-tablet,id=tablet1 -drive file=win2008r2.raw,format=raw,if=none,id=drive1 -device ide-drive,drive=drive1,id=disk1 -cdrom en_windows_server_2008_r2_standard_enterprise_datacenter_and_web_with_sp1_x64_dvd_617601.iso -uuid 6adb29a6-4e36-46df-84eb-c463ecfdc2ba -name win2008R2 -device virtio-balloon-pci,id=balloon,addr=0x9 -boot menu=on -spice port=5900,disable-ticketing -vga qxl -monitor unix:/tmp/tt,server,nowait

2. Repeat PCI hotplug/unplug 1000 times:
for((i=1;i<=1000;i++)); do echo device_del balloon | nc -U /tmp/tt; sleep 5; echo device_add virtio-balloon-pci,id=balloon,addr=0x9 | nc -U /tmp/tt; sleep 5; done

3. Reboot the guest: succeeded.
4. Shut down the guest: succeeded.

Based on the above, the issue has already been fixed.
Comment 18 Shuang Yu 2014-08-13 07:22:42 EDT
(In reply to shuyu from comment #17)
> retest this issue on latest rhel6.6 host w/ windows 2008R2,during 10000

s/10000/1000/


Retested this issue on a RHEL6.6 host with Windows 2008R2 and virtio-win-prewhql-89. During 1000 rounds of balloon device hotplug/unplug the guest worked well, and afterwards the guest could reboot and shut down without any error.

qemu-kvm-rhev-0.12.1.2-2.434.el6.x86_64
kernel-2.6.32-495.el6.x86_64
seabios-0.6.1.2-28.el6.x86_64
virtio-win-prewhql-89

The steps are the same as in comment 17.
Comment 19 Mike Cao 2014-08-14 03:58:46 EDT
Retested this issue with virtio-win-prewhql-89 on a RHEL7.0 host; the guest eventually hit a BSOD.

Packages:
3.10.0-121.el7.x86_64
qemu-kvm-1.5.3-62.el7.x86_64

Steps:
1.Start VM
 /usr/libexec/qemu-kvm -drive file=089BLNWIN732EBK,if=none,id=drive-ide0-0-0,format=raw,serial=mike_cao,cache=writethrough,media=disk -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -monitor unix:/tmp/tt,server,nowait -boot menu=on -spice port=5900,disable-ticketing -vga qxl -chardev file,path=/root/console.log,id=serial1 -device isa-serial,chardev=serial1,id=s1 -usb -device usb-tablet,id=tablet1 -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 -m 2G -enable-kvm
2.hotplug/unplug in a loop
for ((i=1;i<=10000;i++)); do   echo device_del balloon0 |nc -U /tmp/tt; sleep 5;  echo device_add virtio-balloon-pci,id=balloon0,addr=0x5 |nc -U /tmp/tt; sleep 5; done

Actual Results:
Guest BSOD occurs 
CRITICAL_OBJECT_TERMINATION (f4)
A process or thread crucial to system operation has unexpectedly exited or been
terminated.
Several processes and threads are necessary for the operation of the
system; when they are terminated (for any reason), the system can no
longer function.
Arguments:
Arg1: 00000003, Process
Arg2: 8545bb08, Terminating object
Arg3: 8545bc74, Process image file name
Arg4: 82866cf0, Explanatory message (ascii)

Debugging Details:
------------------

Page 13102 not present in the dump file. Type ".hh dbgerr004" for details

KERNEL_LOG_FAILING_PROCESS:  

PROCESS_OBJECT: 8545bb08

IMAGE_NAME:  csrss.exe

DEBUG_FLR_IMAGE_TIMESTAMP:  0

MODULE_NAME: csrss

FAULTING_MODULE: 00000000 

PROCESS_NAME:  csrss.exe

EXCEPTION_CODE: (NTSTATUS) 0xc0000006 - The instruction at 0x%p referenced memory at 0x%p. The required data was not placed into memory because of an I/O error status of 0x%x.

BUGCHECK_STR:  0xF4_IOERR

DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT

CURRENT_IRQL:  0

ANALYSIS_VERSION: 6.3.9600.16384 (debuggers(dbg).130821-1623) amd64fre

STACK_TEXT:  
8e9d5c9c 8292c067 000000f4 00000003 8545bb08 nt!KeBugCheckEx+0x1e
8e9d5cc0 828a9c1e 82866cf0 8545bc74 8545bd78 nt!PspCatchCriticalBreak+0x71
8e9d5cf0 828a9b61 8545bb08 85efe5f8 c0000006 nt!PspTerminateAllThreads+0x2d
8e9d5d24 8268b1ea ffffffff c0000006 0170f5c4 nt!NtTerminateProcess+0x1a2
8e9d5d24 779470b4 ffffffff c0000006 0170f5c4 nt!KiFastCallEntry+0x12a
WARNING: Frame IP not in any known module. Following frames may be wrong.
0170f5c4 00000000 00000000 00000000 00000000 0x779470b4


STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

IMAGE_VERSION:  

FAILURE_BUCKET_ID:  0xF4_IOERR_IMAGE_csrss.exe

BUCKET_ID:  0xF4_IOERR_IMAGE_csrss.exe

ANALYSIS_SOURCE:  KM

FAILURE_ID_HASH_STRING:  km:0xf4_ioerr_image_csrss.exe

FAILURE_ID_HASH:  {2b68738d-6c37-fd75-d711-1229511b3eea}

Followup: MachineOwner
---------
Comment 21 Shuang Yu 2014-08-17 21:40:35 EDT
Retested this issue on a RHEL6.6 host with win7-32 and virtio-win-prewhql-86. During 10000 rounds of balloon device hotplug/unplug the guest worked well, and afterwards the guest could reboot and shut down without any error.


kernel-2.6.32-495.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.434.el6.x86_64
seabios-0.6.1.2-28.el6.x86_64
virtio-win-prewhql-86

Steps:

1.boot guest:
/usr/libexec/qemu-kvm -m 2G -smp 2,maxcpus=2,cores=2,threads=1,sockets=1 -netdev tap,id=hostnet1,script=/etc/qemu-ifup -device e1000,netdev=hostnet1,id=net1,mac=00:52:00:00:11:22 -usb -device usb-tablet,id=tablet1 -drive file=win7-32-balloon.raw,format=raw,if=none,id=drive1 -device ide-drive,drive=drive1,id=disk1 -cdrom en_windows_7_ultimate_x86_dvd_x15-65921.iso -uuid 56200569-1761-4a09-94ff-383cfd9e2e01 -name win7-32-balloon -spice port=5900,disable-ticketing -vga qxl -monitor unix:/tmp/tt,server,nowait -device virtio-balloon-pci,id=balloon,addr=0x9

2.hotplug/unplug in a loop
for((i=1;i<=10000;i++)); do echo device_del balloon | nc -U /tmp/tt; sleep 7; echo device_add virtio-balloon-pci,id=balloon,addr=0x9 | nc -U /tmp/tt; sleep 7; done

3. Reboot the guest: succeeded.

4. Shut down the guest: succeeded.
Comment 23 Ronen Hod 2014-11-03 09:14:44 EST
Based on comment 19 and comment 10, it looks like RHEL7 might still have this issue.
QE, since it does not seem to reproduce on RHEL6.6, can you please also verify it on RHEL7?
Thanks.
Comment 24 Mike Cao 2014-11-05 20:11:26 EST
Retest this issue on RHEL7.1 

Packages 
3.10.0-186.el7.x86_64
qemu-kvm-rhev-2.1.2-1.el7.x86_64
seabios-1.7.5-4.el7.x86_64

Steps:
With a 2-second sleep between every cycle of hot-unplug/hotplug:

Actual results: the guest responds slowly and fails to shut down (shutdown -t 0 -s -f does not work).

With a 7-second sleep between each hotplug/unplug:
Actual results: the guest still works fine after 18 hours.

Based on the above, Vadim, can you give QE a standard latency for each round of hot-unplug/plug?

Thanks,
Mike
Comment 25 Vadim Rozenfeld 2014-11-06 03:27:30 EST
(In reply to Mike Cao from comment #24)
> Based on the above, Vadim, can you give QE a standard latency for each
> round of hot-unplug/plug?
> 

Hi Mike,
I don't think I can give any exact numbers. PCI device plug/unplug is a very complicated process involving several sides: HW (emulated by the host), the OS, and the device driver itself. Add more load to the host and the latency will change. I think we can close this bug, but let's run this test from time to time in addition to the HCK PnP tests.

Best regards,
Vadim.
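
Since no fixed delay can be recommended, one option for the test loop is to poll the monitor until the device has actually disappeared before re-adding it, instead of sleeping for a fixed time. The following is only a sketch under assumptions: query_monitor, device_present and wait_until_gone are hypothetical helper names, and it assumes that the HMP "info pci" output contains a line of the form id "<device id>" for devices created with an id. It has not been validated against the QEMU versions listed in this bug.

```shell
# Sketch only: helper names are illustrative, not from the original report.
MONITOR_SOCK=${MONITOR_SOCK:-/tmp/tt}

# Send one HMP command and print whatever the monitor replies.
query_monitor() {
  echo "$1" | nc -U "$MONITOR_SOCK"
}

# Succeeds while "info pci" still lists a device with the given id.
# Assumption: the monitor prints the id as: id "balloon0"
# Note: an unreachable monitor also looks like "device gone".
device_present() {
  query_monitor "info pci" | grep -q "id \"$1\""
}

# wait_until_gone DEV_ID TIMEOUT: poll once per second until the device
# is no longer listed; return 1 if it is still there after TIMEOUT seconds.
wait_until_gone() {
  dev_id=$1
  timeout=$2
  waited=0
  while device_present "$dev_id"; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

In a loop this could be used as `echo device_del balloon0 | nc -U /tmp/tt; wait_until_gone balloon0 60 || break` before the next device_add. It only covers the host's view of the device; the guest's PnP processing may still be in progress when the device vanishes from "info pci".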
Comment 26 Mike Cao 2014-11-06 03:42:28 EST
(In reply to Vadim Rozenfeld from comment #25)
> Hi Mike,
> I don't think I can give any exact numbers. PCI device plug/unplug is a very
> complicated process involving several sides: HW (emulated by the host), the OS,
> and the device driver itself. Add more load to the host and the latency will
> change. I think we can close this bug, but let's run this test from time to
> time in addition to the HCK PnP tests.
I agree with closing the bug.
Regarding the HCK PnP job, I think it is similar to clicking "Eject" in the taskbar. Is that the same as running device_del in the QEMU monitor?

Thanks,
Mike
Comment 27 Vadim Rozenfeld 2014-11-06 04:22:37 EST
(In reply to Mike Cao from comment #26)
> I agree with closing the bug.
> Regarding the HCK PnP job, I think it is similar to clicking "Eject" in the
> taskbar. Is that the same as running device_del in the QEMU monitor?

No, they are not the same. Eject is a gentle way to ask the system to tear the device stack down and remove the device, while device_del is a brute-force action similar to pulling the device out of its PCI slot, which triggers the surprise-removal path.

Cheers,
Vadim.
