Bug 1352517

Summary: [virtio-win][balloon][whql]windows guest BSOD when run several WHQL jobs
Product: Red Hat Enterprise Linux 7 Reporter: damchen <damchen>
Component: virtio-win Assignee: Gal Hammer <ghammer>
virtio-win sub component: virtio-win-prewhql QA Contact: Virtualization Bugs <virt-bugs>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: juzhang, lmiksik, vrozenfe, wyu
Version: 7.3 Keywords: Regression, TestBlocker
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
NO_DOCS
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-04 08:55:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
virtio(not virtio1.0)-balloon-(win864 and win732) none

Description damchen 2016-07-04 09:41:36 UTC
Description of problem:
Windows guest hits a BSOD when running any of the following WHQL jobs:
*3202 DF - Concurrent Hardware And Operating System (CHAOS) Test (Certification)
*3070 DF - PNP Stop (Rebalance) Device Test (Certification)	
*3071 DF - PNP Rebalance Request New Resources Device Test (Certification)	
*3072 DF - PNP Rebalance Fail Restart Device Test (Certification)
*3073 DF - PNP Surprise Remove Device Test (Certification)	

Version-Release number of selected component (if applicable):
kernel-3.10.0-450.el7.x86_64
qemu-kvm-rhev-2.6.0-8.el7.x86_64
seabios-1.9.1-3.el7.x86_64
virtio-win-prewhql-121

How reproducible:
100%

Steps to Reproduce:
1. Use a command line with a virtio 1.0 balloon device.
2. Boot the guest with:
/usr/libexec/qemu-kvm -name 121BLNWIN832TGS -enable-kvm -m 3G -smp 4 -uuid 7360d5ed-57c9-48ec-bba1-db4bfe5f85ae -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/tmp/121BLNWIN832TGS,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime,driftfix=slew -boot order=cd,menu=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=121BLNWIN832TGS,if=none,id=drive-ide0-0-0,format=raw,serial=mike_cao,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive file=en_windows_8_enterprise_x86_dvd_917587.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=121BLNWIN832TGS.vfd,if=none,id=drive-fdc0-0-0,format=raw,cache=none -global isa-fdc.driveA=drive-fdc0-0-0 -netdev tap,script=/etc/qemu-ifup,downscript=no,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=00:52:34:70:75:04,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=isa_serial0 -device usb-tablet,id=input0 -vnc 0.0.0.0:5 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,disable-legacy=on,disable-modern=off
3. Run the WHQL job DF - PNP Stop (Rebalance) Device Test (Certification) (3070).
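
For reference, the part of this command line that selects virtio 1.0 is the option list on the `-device virtio-balloon-pci` argument. Below is a minimal Python sketch for pulling that argument out of a long qemu-kvm command line. The helper name `balloon_device_options` is made up for illustration, and the command line in the example is an abbreviated form of the one in step 2.

```python
import shlex

def balloon_device_options(cmdline):
    """Return the key=value options of the first virtio-balloon-pci -device argument.

    Hypothetical helper for illustration; returns None if no such device is present.
    """
    args = shlex.split(cmdline)
    for flag, value in zip(args, args[1:]):
        if flag == "-device" and value.startswith("virtio-balloon-pci"):
            # The first field is the device model; the rest are key=value pairs.
            fields = value.split(",")[1:]
            return dict(field.partition("=")[::2] for field in fields)
    return None

# Abbreviated form of the command line from step 2.
cmd = ("/usr/libexec/qemu-kvm -enable-kvm -m 3G -smp 4 "
       "-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,"
       "disable-legacy=on,disable-modern=off")

opts = balloon_device_options(cmd)
# With legacy disabled and modern enabled, the device is exposed as virtio 1.0 only.
is_virtio_1_0_only = (opts.get("disable-legacy") == "on"
                      and opts.get("disable-modern") == "off")
print(opts, is_virtio_1_0_only)
```

Running the same helper on a command line that has plain `-device virtio-balloon-pci,id=balloon0,...` with no `disable-legacy`/`disable-modern` keys gives `is_virtio_1_0_only == False`, which makes it easy to tell the two test configurations apart.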

Actual results:
Guest BSODs with bugcheck 0xD5 (DRIVER_PAGE_FAULT_IN_FREED_SPECIAL_POOL).

Expected results:
The job passes.

Additional info:

1. win7-32/64, win2008R2 and win8-32/64 all hit the same issue.
2. kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DRIVER_PAGE_FAULT_IN_FREED_SPECIAL_POOL (d5)
Memory was referenced after it was freed.
This cannot be protected by try-except.
When possible, the guilty driver's name (Unicode string) is printed on
the bugcheck screen and saved in KiBugCheckDriver.
Arguments:
Arg1: 9b072dc8, memory referenced
Arg2: 00000000, value 0 = read operation, 1 = write operation
Arg3: 9c5af4d7, if non-zero, the address which referenced memory.
Arg4: 00000000, (reserved)

Debugging Details:
------------------


READ_ADDRESS:  9b072dc8 Special pool

FAULTING_IP: 
balloon+24d7
9c5af4d7 8b4e08          mov     ecx,dword ptr [esi+8]

MM_INTERNAL_CODE:  0

IMAGE_NAME:  balloon.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  57725163

MODULE_NAME: balloon

FAULTING_MODULE: 9c5ad000 balloon

DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

BUGCHECK_STR:  0xD5

PROCESS_NAME:  System

CURRENT_IRQL:  0

ANALYSIS_VERSION: 6.3.9600.16384 (debuggers(dbg).130821-1623) amd64fre

TRAP_FRAME:  871a89a0 -- (.trap 0xffffffff871a89a0)
ErrCode = 00000000
eax=00000000 ebx=00000001 ecx=00000000 edx=00000000 esi=9b072dc0 edi=871a8a54
eip=9c5af4d7 esp=871a8a14 ebp=871a8a20 iopl=0         nv up ei pl zr na pe nc
cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010246
balloon+0x24d7:
9c5af4d7 8b4e08          mov     ecx,dword ptr [esi+8] ds:0023:9b072dc8=????????
Resetting default scope

LAST_CONTROL_TRANSFER:  from 8160057b to 81554cb0

STACK_TEXT:  
871a88ac 8160057b 00000050 9b072dc8 00000000 nt!KeBugCheckEx
871a88fc 81497585 00000000 9b072dc8 871a89a0 nt! ?? ::FNODOBFM::`string'+0x31116
871a8988 815cb654 00000000 9b072dc8 00000000 nt!MmAccessFault+0x408
871a8988 9c5af4d7 00000000 9b072dc8 00000000 nt!KiTrap0E+0xdc
WARNING: Stack unwind information not available. Following frames may be wrong.
871a8a20 9c5ae669 9b072dc0 871a8a54 00000001 balloon+0x24d7
871a8a68 9c5b57d0 5d445318 871a8a9c 81a4b075 balloon+0x1669
871a8a74 81a4b075 7405f060 9972ef98 910c6ee8 balloon+0x87d0
871a8a9c 81a4ae8d a3d4ce90 a2bbada4 a3d4ce90 Wdf01000!FxPkgGeneral::OnClose+0xc8
871a8abc 81a43bc2 a3d4ce90 8ab01020 a3d4ce90 Wdf01000!FxPkgGeneral::Dispatch+0xc0
871a8ae4 81a43a33 8ab01020 a3d4ce90 8ab01020 Wdf01000!FxDevice::Dispatch+0x155
871a8b00 818eff4b 8ab01020 a3d4ce90 a3d4ce90 Wdf01000!FxDevice::DispatchWithLock+0x77
871a8b20 81495a9f 81907565 a3d4cf88 a3d4cfac nt!IovCallDriver+0x2e3
871a8b34 81907565 871a8b5c 8190765c 8ab01020 nt!IofCallDriver+0x62
871a8b3c 8190765c 8ab01020 a3d4ce90 8aad2680 nt!ViFilterIoCallDriver+0x10
871a8b5c 818eff4b 8aad2738 a3d4ce90 8ae05320 nt!ViFilterDispatchGeneric+0x5e
871a8b7c 81495a9f 9c5bb353 a3d4cfac a3d4cfd0 nt!IovCallDriver+0x2e3
871a8b90 9c5bb353 a3d4ce90 8ae05268 00000000 nt!IofCallDriver+0x62
871a8ba8 9c5ba074 8ae05268 a3d4ce90 8ae05268 MSDMFilt+0x2353
871a8bc8 818eff4b 8ae05268 a3d4ce90 a3d4ce90 MSDMFilt+0x1074
871a8be8 81495a9f 81907565 a3d4cfd0 a3d4cff4 nt!IovCallDriver+0x2e3
871a8bfc 81907565 871a8c24 8190765c 8ae05268 nt!IofCallDriver+0x62
871a8c04 8190765c 8ae05268 a3d4ce90 8aa7ec58 nt!ViFilterIoCallDriver+0x10
871a8c24 818eff4b 8aa7ed10 a3d4ce90 a3d4ce90 nt!ViFilterDispatchGeneric+0x5e
871a8c44 81495a9f 8169fdd3 00000000 8ae403e8 nt!IovCallDriver+0x2e3
871a8c58 8169fdd3 84b75e10 8ae403d0 8ae40300 nt!IofCallDriver+0x62
871a8c8c 81417796 871a8cac 8169fa2d 8ae403e8 nt!IopDeleteFile+0xef
871a8cac 814918f6 00000000 9c5bc502 a2b78ff0 hal!KfLowerIrql+0x2c
871a8cc0 81491882 8ae403e8 9c5bc515 8aa7ec58 nt!ObfDereferenceObjectWithTag+0x5c
871a8cc8 9c5bc515 8aa7ec58 00000000 871a8d1c nt!ObfDereferenceObject+0xd
871a8cd8 814de737 8aa7ec58 a2b78ff0 816344b8 MSDMFilt+0x3515
871a8d1c 814de854 996aafd0 84ba4300 00000000 nt!IopProcessWorkItem+0xa1
871a8d74 81521415 00010000 0f0c34e2 00000000 nt!ExpWorkerThread+0x111
871a8db0 815cd039 814de747 00010000 00000000 nt!PspSystemThreadStartup+0x4a
00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x19


STACK_COMMAND:  kb

FOLLOWUP_IP: 
balloon+24d7
9c5af4d7 8b4e08          mov     ecx,dword ptr [esi+8]

SYMBOL_STACK_INDEX:  4

SYMBOL_NAME:  balloon+24d7

FOLLOWUP_NAME:  MachineOwner

FAILURE_BUCKET_ID:  0xD5_VRF_balloon+24d7

BUCKET_ID:  0xD5_VRF_balloon+24d7

ANALYSIS_SOURCE:  KM

FAILURE_ID_HASH_STRING:  km:0xd5_vrf_balloon+24d7

FAILURE_ID_HASH:  {fa7a74dd-8b71-e1b3-1f89-ae1a9c88a064}

Followup: MachineOwner
---------


************* Symbol Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       SRV*c:\symbols\*http://msdl.microsoft.com/download/symbols
0: kd> .reload
Loading Kernel Symbols
...............................................................
................................................................
.............
Loading User Symbols

Loading unloaded module list
..............
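
As a cross-check of the numbers in the analysis above, the short Python sketch below recomputes the faulting offset inside balloon.sys and the offset of the read relative to `esi`. All values are copied from the `!analyze -v` output; nothing here comes from the driver source.

```python
# Values copied from the !analyze -v output in this report.
module_base = 0x9C5AD000   # FAULTING_MODULE: balloon
faulting_ip = 0x9C5AF4D7   # Arg3 / FAULTING_IP
esi         = 0x9B072DC0   # esi in the TRAP_FRAME
read_addr   = 0x9B072DC8   # Arg1 / READ_ADDRESS (special pool)

# Offset of the faulting instruction from the start of the module.
ip_offset = faulting_ip - module_base
# Offset of the read from the pointer held in esi.
field_offset = read_addr - esi

print(f"balloon+{ip_offset:#x}")    # balloon+0x24d7
print(f"[esi+{field_offset:#x}]")   # [esi+0x8]
```

So the instruction at balloon+0x24d7 reads a dword at offset 8 from a pointer into freed special pool, and the stack shows it being reached through Wdf01000!FxPkgGeneral::OnClose. That is consistent with the driver touching an already-freed object while a file handle is being closed, although the dump alone does not identify which object it is.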

Comment 3 damchen 2016-07-05 02:04:07 UTC
Hi,

For *3072 DF - PNP Rebalance Fail Restart Device Test (Certification), win8-32 and win2008R2 pass, while win7-32/64 and win8-64 BSOD.

Thanks
Damei Chen

Comment 4 damchen 2016-07-05 06:08:35 UTC
Created attachment 1176257 [details]
virtio(not virtio1.0)-balloon-(win864 and win732)

Error:
WDTF_TEST : No devices were found for testing using the provided SDEL device query: IsPhantom=False AND ((HardwareIds='PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01' OR DeviceId='PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01\3&13C0B0C5&0&38')) 
File:

Comment 5 Yu Wang 2016-07-05 07:29:21 UTC
(In reply to damchen from comment #4)
> Created attachment 1176257 [details]
> virtio(not virtio1.0)-balloon-(win864 and win732)
> 
> Error:
> WDTF_TEST : No devices were found for testing using the provided SDEL device
> query: IsPhantom=False AND
> ((HardwareIds='PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01' OR
> DeviceId='PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01\3&13C0B0C5&0&38')) 
> File:

When tested without virtio 1.0, the job reports a "no devices found" error and does not run the test. See the attachment for the job logs.
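
One possible reading of this error, stated as an assumption rather than something confirmed in this bug: the SDEL query is pinned to DEV_1045, which is the virtio 1.0 PCI device ID of the balloon (0x1040 plus virtio device type 5), whereas a legacy/transitional balloon is exposed as DEV_1002. A guest booted without virtio 1.0 would then have no device matching the query. The Python sketch below classifies a Windows hardware ID along those lines; the function name `classify` and the shortened legacy ID in the example are illustrative only.

```python
import re

VIRTIO_VENDOR = 0x1AF4
MODERN_BASE = 0x1040      # virtio 1.0: PCI device ID = 0x1040 + virtio device type
BALLOON_TYPE = 5          # virtio device type of the memory balloon
LEGACY_BALLOON = 0x1002   # transitional/legacy balloon PCI device ID

def classify(hardware_id):
    """Classify a Windows PCI hardware ID string (hypothetical helper)."""
    m = re.search(r"VEN_([0-9A-Fa-f]{4})&DEV_([0-9A-Fa-f]{4})", hardware_id)
    if not m:
        return "not a PCI hardware ID"
    vendor, device = int(m.group(1), 16), int(m.group(2), 16)
    if vendor != VIRTIO_VENDOR:
        return "not virtio"
    if device == MODERN_BASE + BALLOON_TYPE:
        return "virtio 1.0 balloon"
    if device == LEGACY_BALLOON:
        return "legacy balloon"
    return "other virtio device"

# Hardware ID taken from the SDEL query in the error message.
print(classify(r"PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01"))
# Shortened, illustrative ID for a legacy balloon device.
print(classify(r"PCI\VEN_1AF4&DEV_1002"))
```

If this reading is right, the "no devices found" result is a test-targeting mismatch and is separate from the BSOD seen with the virtio 1.0 device.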

Comment 7 Gal Hammer 2016-07-05 13:57:34 UTC
(In reply to damchen from comment #4)
> Created attachment 1176257 [details]
> virtio(not virtio1.0)-balloon-(win864 and win732)
> 
> Error:
> WDTF_TEST : No devices were found for testing using the provided SDEL device
> query: IsPhantom=False AND
> ((HardwareIds='PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01' OR
> DeviceId='PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01\3&13C0B0C5&0&38')) 
> File:

Is this report relevant to the original WHQL BSOD bug?

Comment 8 Gal Hammer 2016-07-05 13:59:00 UTC
A patch that fixed the pnp test failures was posted.

Comment 9 damchen 2016-07-06 02:19:07 UTC
(In reply to Gal Hammer from comment #7)
> (In reply to damchen from comment #4)
> > Created attachment 1176257 [details]
> > virtio(not virtio1.0)-balloon-(win864 and win732)
> > 
> > Error:
> > WDTF_TEST : No devices were found for testing using the provided SDEL device
> > query: IsPhantom=False AND
> > ((HardwareIds='PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01' OR
> > DeviceId='PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01\3&13C0B0C5&0&38')) 
> > File:
> 
> Is this report relevant to the original WHQL BSOD bug?

Yes. I first ran the cases with virtio 1.0 and hit the BSOD, then ran the same cases with virtio (not virtio 1.0) to check whether the virtio version was the cause. The non-1.0 run also fails, as shown above.

Comment 10 Gal Hammer 2016-07-06 12:24:57 UTC
(In reply to damchen from comment #9)
> (In reply to Gal Hammer from comment #7)
> > (In reply to damchen from comment #4)
> > > Created attachment 1176257 [details]
> > > virtio(not virtio1.0)-balloon-(win864 and win732)
> > > 
> > > Error:
> > > WDTF_TEST : No devices were found for testing using the provided SDEL device
> > > query: IsPhantom=False AND
> > > ((HardwareIds='PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01' OR
> > > DeviceId='PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01\3&13C0B0C5&0&38')) 
> > > File:
> > 
> > Is this report relevant to the original WHQL BSOD bug?
> 
> Yes. Firstly,I run the cases in virtio1.0 the BSOD bugs occur,then I run the
> cases in virtio(not virtio1.0) to make sure whether the version is a problem
> or not? But the results of virtio also go wrong as shown.

What command line option for the balloon device did you use when you saw this error message?

Which test failed?

Comment 11 damchen 2016-07-07 05:24:18 UTC
(In reply to Gal Hammer from comment #10)
> (In reply to damchen from comment #9)
> > (In reply to Gal Hammer from comment #7)
> > > (In reply to damchen from comment #4)
> > > > Created attachment 1176257 [details]
> > > > virtio(not virtio1.0)-balloon-(win864 and win732)
> > > > 
> > > > Error:
> > > > WDTF_TEST : No devices were found for testing using the provided SDEL device
> > > > query: IsPhantom=False AND
> > > > ((HardwareIds='PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01' OR
> > > > DeviceId='PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01\3&13C0B0C5&0&38')) 
> > > > File:
> > > 
> > > Is this report relevant to the original WHQL BSOD bug?
> > 
> > Yes. Firstly,I run the cases in virtio1.0 the BSOD bugs occur,then I run the
> > cases in virtio(not virtio1.0) to make sure whether the version is a problem
> > or not? But the results of virtio also go wrong as shown.
> 
> What command line option for the balloon device did you use when you saw
> this error message?
> 
> Which test failed?


/usr/libexec/qemu-kvm -name 121BLNWIN864MFA -enable-kvm -m 3G -smp 4 -uuid 546efd0d-7724-4890-a8ae-0430ff82c5b5 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/tmp/121BLNWIN864MFA,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime,driftfix=slew -boot order=cd,menu=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=121BLNWIN864MFA,if=none,id=drive-ide0-0-0,format=raw,serial=mike_cao,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive file=en_windows_8_enterprise_x64_dvd_917522.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=121BLNWIN864MFA.vfd,if=none,id=drive-fdc0-0-0,format=raw,cache=none -global isa-fdc.driveA=drive-fdc0-0-0 -netdev tap,script=/etc/qemu-ifup,downscript=no,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=00:52:32:6a:44:47,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=isa_serial0 -device usb-tablet,id=input0 -vnc 0.0.0.0:2 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7

All of the following tests fail. The DeviceStatusCheck case reports an error.
*3202 DF - Concurrent Hardware And Operating System (CHAOS) Test (Certification)
*3070 DF - PNP Stop (Rebalance) Device Test (Certification)	
*3071 DF - PNP Rebalance Request New Resources Device Test (Certification)	
*3072 DF - PNP Rebalance Fail Restart Device Test (Certification)
*3073 DF - PNP Surprise Remove Device Test (Certification)

Comment 12 Vadim Rozenfeld 2016-07-12 08:22:11 UTC
Should be fixed in build 122
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=503003

Comment 14 Yu Wang 2016-07-14 05:27:36 UTC
Re-tested with build 122; all of the jobs above passed. This bug is fixed, so I am changing the status to VERIFIED.

Thanks
Yu Wang

Comment 17 errata-xmlrpc 2016-11-04 08:55:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2609.html