Bug 836467 - [whql][Netkvm] 0x9F BSOD happened on Job of "Sleep and PNP (disable and enable) with IO Before and After(Certification)" on HCK for win2k8-32
Status: CLOSED CANTFIX
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: virtio-win
Version: 6.4
Hardware: Unspecified Unspecified
Priority: unspecified Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Dmitry Fleytman
QA Contact: Virtualization Bugs
Depends On:
Blocks: 896495
Reported: 2012-06-29 03:31 EDT by dawu
Modified: 2013-09-15 07:26 EDT
CC List: 21 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-09-15 07:26:31 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
the latest one (32.60 MB, application/zip)
2013-01-28 05:01 EST, Min Deng
dump file-win2k8-32-61 (30.90 MB, application/x-zip-compressed)
2013-05-27 23:23 EDT, guo jiang

Description dawu 2012-06-29 03:31:29 EDT
Description of problem:
BSOD with error 9F happened when running task "Run Test" for job 2003 of "Sleep and PNP (disable and enable) with IO Before and After(Certification)" on HCK for virtual NIC

Version-Release number of selected component (if applicable):
kernel-2.6.32-278.el6.x86_64
qemu-kvm-0.12.1.2-2.295.el6.x86_64
virtio-win-prewhql-0.1-29

How reproducible:
always

Steps to Reproduce:
1. Boot guest with CLI:

/usr/libexec/qemu-kvm -m 6G -smp 4 -cpu cpu64-rhel6,+x2apic -usbdevice tablet -drive file=win2k8-32-nic1.raw,if=none,format=raw,id=drive-ide0-0-0,werror=stop,rerror=stop,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -netdev tap,sndbuf=0,id=hostnet0,vhost=on,script=/etc/qemu-ifup-private,downscript=no -device virtio-net-pci,netdev=hostnet0,mac=00:10:06:06:49:10,bus=pci.0,addr=0x4,id=virtio-net-pci0 -netdev tap,sndbuf=0,id=hostnet1,vhost=on,script=/etc/qemu-ifup-private,downscript=no -device virtio-net-pci,netdev=hostnet1,mac=00:10:06:06:49:11,bus=pci.0,addr=0x5,id=virtio-net-pci1 -netdev tap,sndbuf=0,id=hostnet2,script=/etc/qemu-ifup,downscript=no -device e1000,netdev=hostnet2,mac=00:10:06:16:49:12,bus=pci.0,addr=0x6 -uuid f661f828-9f97-4a87-a38b-161ae882a8c5 -rtc base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection -chardev socket,id=111a,path=/tmp/monitor-win2k8-32-nic1,server,nowait -mon chardev=111a,mode=readline -spice disable-ticketing,port=5931 -vga qxl -bios /usr/share/seabios/bios-pm.bin -monitor stdio

2. Run job 2003 of "Sleep and PNP (disable and enable) with IO Before and After(Certification)"

3. Run task "Run Test"

Actual results:
BSOD with error 9F happened.

Expected results:
The job should pass without any errors.

Additional info:
Comment 2 dawu 2012-06-29 03:39:45 EDT
---------

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DRIVER_POWER_STATE_FAILURE (9f)
A driver has failed to complete a power IRP within a specific time (usually 10 minutes).
Arguments:
Arg1: 00000003, A device object has been blocking an Irp for too long a time
Arg2: 896cb6b0, Physical Device Object of the stack
Arg3: 89bb2030, nt!TRIAGE_9F_POWER on Win7, otherwise the Functional Device Object of the stack
Arg4: 9aaeeed8, The blocked IRP

Debugging Details:
------------------


DRVPOWERSTATE_SUBCODE:  3

IMAGE_NAME:  pci.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  49e01a44

MODULE_NAME: pci

FAULTING_MODULE: 81e5c000 pci

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

BUGCHECK_STR:  0x9F

PROCESS_NAME:  System

CURRENT_IRQL:  2

STACK_TEXT:  
8192aacc 818663bb 0000009f 00000003 896cb6b0 nt!KeBugCheckEx+0x1e
8192ab28 81865fd8 8192ab94 86950078 86950000 nt!PopCheckIrpWatchdog+0x1ad
8192ab68 818df26b 819434e0 00000000 504ec640 nt!PopCheckForIdleness+0x343
8192ac88 818dee2b 8192acd0 8acd2402 8192acd8 nt!KiTimerListExpire+0x367
8192ace8 818df595 00000000 00000000 0004d67e nt!KiTimerExpiration+0x22a
8192ad50 818dd7dd 00000000 0000000e 00000000 nt!KiRetireDpcList+0xba
8192ad54 00000000 0000000e 00000000 00000000 nt!KiIdleLoop+0x49


STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

FAILURE_BUCKET_ID:  0x9F_VRF_3_netkvm_IMAGE_pci.sys

BUCKET_ID:  0x9F_VRF_3_netkvm_IMAGE_pci.sys

Followup: MachineOwner
---------
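For comparing this dump with later ones in the thread, the argument lines above can be extracted mechanically. The following is a minimal sketch, assuming the analysis output has been saved as plain text; the helper name `parse_bugcheck_args` is hypothetical and is not part of WinDbg or any other tool.

```python
import re

# Hypothetical helper: matches the "ArgN: <hex>, <description>" lines that
# `!analyze -v` prints under "Arguments:".
ARG_RE = re.compile(r"^Arg(\d):\s*([0-9a-fA-F]+),\s*(.*)$", re.MULTILINE)


def parse_bugcheck_args(text):
    """Return {arg_number: (integer_value, description)} for each ArgN line."""
    return {int(n): (int(v, 16), d.strip()) for n, v, d in ARG_RE.findall(text)}


# Argument lines copied from the analysis above.
SAMPLE = """\
DRIVER_POWER_STATE_FAILURE (9f)
Arguments:
Arg1: 00000003, A device object has been blocking an Irp for too long a time
Arg2: 896cb6b0, Physical Device Object of the stack
Arg3: 89bb2030, nt!TRIAGE_9F_POWER on Win7, otherwise the Functional Device Object of the stack
Arg4: 9aaeeed8, The blocked IRP
"""

args = parse_bugcheck_args(SAMPLE)
for number in sorted(args):
    value, description = args[number]
    print(f"Arg{number} = {value:#010x}  {description}")
```

Arg1 is the subcode (3 here, matching DRVPOWERSTATE_SUBCODE), so a quick check on `args[1][0]` is enough to tell a blocked-IRP dump from other 0x9F variants.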
Comment 4 dawu 2012-06-29 03:42:32 EDT
This issue is only found on win2k8-32; the job passes without error on other OSes.
Comment 5 Mike Cao 2012-10-19 02:42:49 EDT
Yan,

Could you help check the dump? If it is related to our driver, I don't think we can pass the tests even with WLK.
Comment 6 Yan Vugenfirer 2012-12-20 11:07:31 EST
(In reply to comment #5)
> Yan ,
> 
> Could you help to check the dump ? If it is related to our driver ,I don't
> think we can pass the tests even with WLK

Might be our problem. Can you please run the test while DebugView is running in the background and provide the dump and the log?

Thanks.
Comment 8 Yan Vugenfirer 2012-12-23 07:24:22 EST
Probably related bug:
https://bugzilla.redhat.com/show_bug.cgi?id=832395
Comment 9 Yan Vugenfirer 2012-12-23 07:24:38 EST
Probably related bug:
https://bugzilla.redhat.com/show_bug.cgi?id=832395
Comment 10 Dmitry Fleytman 2013-01-02 12:23:14 EST
Hi Guys

We are trying to reproduce this issue on our side and see different sleep-related problems. We believe this could be connected to the BIOS version on our setup.
Could you please attach your BIOS binary to the issue (/usr/share/seabios/bios-pm.bin)?

Thanks in advance,
Dmitry.
Comment 12 Mike Cao 2013-01-04 00:54:58 EST
dengmin, pls re-test this issue on the latest seabios.
Comment 13 Min Deng 2013-01-04 01:39:10 EST
(In reply to comment #12)
> dengmin, pls re-test this issue on the latest seabios.
Testing is ongoing.
Comment 16 Dmitry Fleytman 2013-01-15 02:46:39 EST
Problem reproduces without WHQL as well. Just start the Guest with 4G of RAM and do hibernate.

It looks like the root cause of the problem is extremely slow hibernation on Windows 2008 32 bit.

For a guest with 1G of memory, the system takes around 10 minutes to hibernate. When it takes less than 10 minutes, the system hibernates successfully; when it takes more, the system crashes.

With 4G of memory the crash happens every time; with 512M of memory the crash has never been observed.

Another observation is that the crash happens on different hosts. IO performance tests also don't show slow storage access rates, so it doesn't look like a storage issue.
Also, when the guest is going into the hibernate state all its VCPUs are 100% busy, so it doesn't look like a storage latency problem.

The crash also happens without RedHat network/storage drivers so it looks like a host issue.

Unusual system activity during hibernation needs to be investigated.
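The relationship between memory size and the 10-minute limit can be put into numbers. The sketch below is a back-of-the-envelope model only: it assumes the hibernation image is roughly the size of guest RAM and is written at a constant rate, which is a simplification and not something measured in this bug. The function names are hypothetical.

```python
# The power IRP limit quoted in the bugcheck text ("usually 10 minutes").
WATCHDOG_SECONDS = 600
GIB = 1024 ** 3
MIB = 1024 ** 2


def required_write_rate(image_bytes, timeout_s=WATCHDOG_SECONDS):
    """Minimum sustained write rate (bytes/s) needed to finish before the timeout."""
    return image_bytes / timeout_s


def finishes_in_time(image_bytes, write_rate_bps, timeout_s=WATCHDOG_SECONDS):
    """True if writing image_bytes at write_rate_bps takes less than timeout_s."""
    return image_bytes / write_rate_bps < timeout_s


# Assumption: image size is approximated by guest RAM size.
for ram_gib in (0.5, 1, 4):
    rate = required_write_rate(ram_gib * GIB)
    print(f"{ram_gib} GiB image needs more than {rate / MIB:.2f} MiB/s")
```

Under these assumptions, a guest that needs about 10 minutes for 1G is writing at well under 2 MiB/s, which is far below what the host storage delivers in the IO tests mentioned above.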
Comment 17 Dmitry Fleytman 2013-01-15 05:02:15 EST
The prime suspect is Windows making too many cache flushes during hibernate.
Indeed, with the disk option "cache=unsafe" in the QEMU command line, hibernate takes a few seconds.

Please try to repeat the tests with the "cache=unsafe" disk option added.
Comment 18 Michael S. Tsirkin 2013-01-21 05:49:31 EST
It says an IRP is blocked for too long.
Do we know which IRP it is?
Comment 19 Yan Vugenfirer 2013-01-21 06:06:05 EST
(In reply to comment #18)
> It says an Irp is blocked for too long.
> Do we know which Irp is it?

MJ_POWER with S4.
Comment 20 Ronen Hod 2013-01-21 13:10:23 EST
needinfo from Asias and Stefan.

A summary of my understanding.
- Windows issues a hibernation IRP (MJ_POWER(S4)).
- If this IRP finishes in less than 10 minutes then everything is fine.
- Inside this IRP Windows is writing the S4 data to its disk (emulated IDE).
- In comment 17 Dima wrote "Prime suspect is Windows making too much cache flushes"

So according to this, the issue is that IDE is extremely slow in this scenario, and probably the IDE emulation needs to be optimized.

Does anybody see it differently?
For related/duplicates of this bug see the tracker bug 896495
Comment 21 Mike Cao 2013-01-22 05:18:51 EST
(In reply to comment #17)
> Prime suspect is Windows making too much cache flushes during hibernate.
> Indeed with disk option "cache=unsafe" in QEMU command line hibernate takes
> a few seconds.
> 
> Please, try to repeat tests with "cache=unsafe" disk option added.

Tried an IDE disk with "cache=unsafe"; still hit the same issue.
Comment 22 Stefan Hajnoczi 2013-01-22 10:27:44 EST
A kvm_stat or full "perf record -a -e kvm:*" trace would be useful during the vcpu 100% busy period.  That way we can spot unusual operations coming from the guest as well as figure out IDE latency (it can be measured from the IDE controller register access pio and interrupt events in the perf trace).

Also, has anyone tried -smp 1 to simplify the guest configuration?
Comment 23 Asias He 2013-01-24 01:06:38 EST
I am not seeing the BSOD (probably because hibernation takes less than 10 minutes), but I am seeing huge differences between RHEL and upstream qemu.

RHEL6.4   4 vcpu 4GB ram ide disk, win2008-32bit, hibernate time = 5 minutes
Upstream  4 vcpu 4GB ram ide disk, win2008-32bit, hibernate time = 5 seconds
Comment 24 Asias He 2013-01-24 01:10:51 EST
(In reply to comment #22)
> A kvm_stat or full "perf record -a -e kvm:*" trace would be useful during
> the vcpu 100% busy period.  That way we can spot unusual operations coming
> from the guest as well as figure out IDE latency (it can be measured from
> the IDE controller register access pio and interrupt events in the perf
> trace).
> 
> Also, has anyone tried -smp 1 to simplify the guest configuration?

Changing from -smp 4 to -smp 1 does not help for me.
Comment 25 Asias He 2013-01-24 04:01:03 EST
Tried linux guest on RHEL6.4, it is very fast.

RHEL6.4   4 vcpu 4GB ram ide disk, rhel7-64, hibernate time = 10 seconds
Comment 26 Stefan Hajnoczi 2013-01-24 08:39:46 EST
(In reply to comment #23)
> I am not seeing the BSOD (probably it less than 10 minutes) but seeing huge
> differences between RHEL and upstream qemu.
> 
> RHEL6.4   4 vcpu 4GB ram ide disk, win2008-32bit, hibernate time = 5 minutes
> Upstream  4 vcpu 4GB ram ide disk, win2008-32bit, hibernate time = 5 seconds

This is interesting.  It should be possible to bisect this.

The RHEL6.4 qemu-kvm is based on upstream qemu-kvm-0.12.1.2.

If qemu-kvm-0.12.1.2 shows poor hibernation performance we can bisect qemu-kvm-0.12.1.2..master to find out which change fixed performance.

If qemu-kvm-0.12.1.2 shows good hibernation performance we can bisect qemu-kvm-0.12.1.2..qemu-kvm-0.12.1.2-2.352.el6 to find out which change broke performance.
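The bisection idea above is a binary search over an ordered history for the point where the behaviour flips. A minimal sketch follows, assuming the behaviour changes exactly once along the range; the commit names and the `is_slow` predicate are hypothetical stand-ins for "build this revision and time a hibernate". In practice `git bisect` automates the same search.

```python
def first_changed(commits, is_slow):
    """Return the index of the first commit whose behaviour differs from commits[0].

    Assumes is_slow() flips exactly once along the list. Returns None when
    the first and last commits behave the same, so there is nothing to find.
    """
    base = is_slow(commits[0])
    lo, hi = 0, len(commits) - 1
    if is_slow(commits[hi]) == base:
        return None
    # Invariant: commits[lo] behaves like the base, commits[hi] does not.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_slow(commits[mid]) == base:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical history: hibernation is slow up to c5 and fast from c6 on.
history = [f"c{i}" for i in range(10)]
tested = []


def is_slow(commit):
    tested.append(commit)          # record how many builds had to be tested
    return int(commit[1:]) < 6


index = first_changed(history, is_slow)
print(history[index], "after testing", len(tested), "revisions")
```

The same function covers both cases in the comment: whether the search is for the change that fixed performance or the one that broke it only depends on how the endpoints behave.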
Comment 27 Karen Noel 2013-01-24 21:00:32 EST
-drive file=win2k8-32-nic1.raw,if=none,format=raw,id=drive-ide0-0-0,werror=stop,rerror=stop,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1

Can this be avoided if the boot disk is on the PCI bus? If so, this would seem like a good workaround on 6.4.
Comment 29 Dmitry Fleytman 2013-01-26 14:42:44 EST
Commit that fixes long hibernation issue is:

    commit 7cdd481cdf15d610f83e38f15c7e7979420c6ac0
    Author: Paolo Bonzini <pbonzini@redhat.com>
    Date:   Wed Jun 6 00:04:54 2012 +0200

        ide: support enable/disable write cache

        Enabling or disabling the write cache is done with the SET FEATURES
        command.  The command can be issued with sg_sat_set_features from
        sg3-utils.

        Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
        Signed-off-by: Kevin Wolf <kwolf@redhat.com>

It is present in https://brewweb.devel.redhat.com/taskinfo?taskID=5321509 build,
so let's give it a try.

(In reply to comment #26)
> (In reply to comment #23)
> > I am not seeing the BSOD (probably it less than 10 minutes) but seeing huge
> > differences between RHEL and upstream qemu.
> > 
> > RHEL6.4   4 vcpu 4GB ram ide disk, win2008-32bit, hibernate time = 5 minutes
> > Upstream  4 vcpu 4GB ram ide disk, win2008-32bit, hibernate time = 5 seconds
> 
> This is interesting.  It should be possible to bisect this.
> 
> The RHEL6.4 qemu-kvm is based on upstream qemu-kvm-0.12.1.2.
> 
> If qemu-kvm-0.12.1.2 shows poor hibernation performance we can bisect
> qemu-kvm-0.12.1.2..master to find out which change fixed performance.
> 
> If qemu-kvm-0.12.1.2 shows good hibernation performance we can bisect
> qemu-kvm-0.12.1.2..qemu-kvm-0.12.1.2-2.352.el6 to find out which change
> broke performance.
Comment 30 Ronen Hod 2013-01-27 05:10:28 EST
Paolo,

Do you see a reason not to include this patch in snapshot-5?
Comment 31 Ronen Hod 2013-01-27 05:13:55 EST
QE,

Please also test Asias's build with all the bugs under the tracker bug 896495.
We would like to include this patch in RHEL6.4 snapshot 5 (the last snapshot).
Please also run a sanity check (IDE), since we do not have any spare time.
Comment 32 Ronen Hod 2013-01-27 13:46:35 EST
QE,

Yan reports that on his machine it indeed improves the situation, but not completely. Please provide dumps for bugs.

Thanks, Ronen.
Comment 33 Asias He 2013-01-28 01:01:03 EST
(In reply to comment #30)
> Paolo,
> 
> Do you see a reason not to include this patch in snapshot-5?

Paolo, I backported the following three patches, which make hibernation much faster in RHEL6. Do we need to backport more dependencies to make 'ide: support enable/disable write cache' really work?

1) 
    ide: support enable/disable write cache
    
    Enabling or disabling the write cache is done with the SET FEATURES
    command.  The command can be issued with sg_sat_set_features from
    sg3-utils.
    
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    (cherry picked from commit 7cdd481cdf15d610f83e38f15c7e7979420c6ac0)

2) 
    block: always open drivers in writeback mode
    
    Formats are entirely in charge of flushes for metadata writes.  For
    guest-initiated writes, a writethrough cache is faked in the block layer.
    So we can always open in writeback mode.
    
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    (cherry picked from commit e1e9b0aca05747be9e2174a53205bd904c10da49)
    
    Conflicts:
    
        block.c

3) 
    block: add bdrv_set_enable_write_cache
    
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    (cherry picked from commit 425b01487a8072c3b16fa4b3fca30d8ecd06e0ca)
Comment 34 Min Deng 2013-01-28 03:39:07 EST
Tested with the scratch build from comment 29 and virtio-win-prewhql-0.1-49, and got the following results so far.
seabios&kernel
seabios-0.6.1.2-26.el6.x86_64
kernel-2.6.32-356.el6.x86_64
Steps,
1.boot up guest with the CLI -
  /usr/libexec/qemu-kvm -M rhel6.4.0 -m 6G -smp 4 -cpu cpu64-rhel6,+x2apic -usbdevice tablet -drive file=win2k8-32-nic1.raw,if=none,format=raw,id=drive-ide0-0-0,werror=stop,rerror=stop,cache=*unsafe* -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -netdev tap,sndbuf=0,id=hostnet0,script=/etc/qemu-ifup,downscript=no -device e1000,netdev=hostnet0,mac=00:52:17:36:23:1c,bus=pci.0,addr=0x4 -uuid dad05b08-9f12-44e7-bdc7-222bb0cba101 -rtc base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection -chardev socket,id=111a,path=/tmp/win2k8-32-nic-49,server,nowait -mon chardev=111a,mode=readline -name win8-32-nic-49-1 -netdev tap,sndbuf=0,id=hostnet1,script=/etc/qemu-ifup-private,downscript=no -device virtio-net-pci,netdev=hostnet1,id=net1,mac=00:12:42:36:53:28,bus=pci.0,addr=0x7 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -monitor stdio -spice disable-ticketing,port=5931 -vga qxl
2.Run the Sleep and PNP job (disable and enable)
Actual results,
The job passed without BSOD.
Comment 35 Min Deng 2013-01-28 04:56:41 EST
(In reply to comment #34)
> Tried the bug with above scratch build-comments 29,virtio-win-prewhql-0.1-49
> and got the following testing results so far,
> seabios&kernel
> seabios-0.6.1.2-26.el6.x86_64
> kernel-2.6.32-356.el6.x86_64
> Steps,
> 1.boot up guest with the CLI -
>   /usr/libexec/qemu-kvm -M rhel6.4.0 -m 6G -smp 4 -cpu cpu64-rhel6,+x2apic
> -usbdevice tablet -drive
> file=win2k8-32-nic1.raw,if=none,format=raw,id=drive-ide0-0-0,werror=stop,
> rerror=stop,cache=*unsafe* -device
> ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
> -netdev tap,sndbuf=0,id=hostnet0,script=/etc/qemu-ifup,downscript=no -device
> e1000,netdev=hostnet0,mac=00:52:17:36:23:1c,bus=pci.0,addr=0x4 -uuid
> dad05b08-9f12-44e7-bdc7-222bb0cba101 -rtc
> base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection -chardev
> socket,id=111a,path=/tmp/win2k8-32-nic-49,server,nowait -mon
> chardev=111a,mode=readline -name win8-32-nic-49-1 -netdev
> tap,sndbuf=0,id=hostnet1,script=/etc/qemu-ifup-private,downscript=no -device
> virtio-net-pci,netdev=hostnet1,id=net1,mac=00:12:42:36:53:28,bus=pci.0,
> addr=0x7 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0
> -monitor stdio -spice disable-ticketing,port=5931 -vga qxl
> 2.Run the Sleep and PNP job (disable and enable)
> Actual results,
> The job passed without BSOD.

  With cache=*none*, the issue can still be reproduced consistently. QE will upload the latest dump file.
Comment 36 Min Deng 2013-01-28 05:01:46 EST
Created attachment 688832 [details]
the latest one
Comment 37 Paolo Bonzini 2013-01-28 10:29:28 EST
Perhaps there is another bug, because it is not really clear why the patch should have any effect.  If the machine is started with cache=none, it should have (from the beginning) its write cache enabled.

Does it work if you only patch like this: 

         switch(s->feature) {
+        case 0x02: /* write cache enable */
         case 0xcc: /* reverting to power-on defaults enable */

or like this:

         switch(s->feature) {
+        case 0x82: /* write cache disable */
+        case 0x02: /* write cache enable */
         case 0xcc: /* reverting to power-on defaults enable */

or like this:

         switch(s->feature) {
+        case 0x02: /* write cache enable */
+            identify_data = (uint16_t *)s->identify_data;
+            put_le16(identify_data + 85, (1 << 14) | (1 << 5) | 1);
+            s->status = READY_STAT | SEEK_STAT;
+            ide_set_irq(s->bus);
+            break;
+        case 0x82: /* write cache disable */
+            identify_data = (uint16_t *)s->identify_data;
+            put_le16(identify_data + 85, (1 << 14) | 1);
+            ide_flush_cache(s);
+            break;
         case 0xcc: /* reverting to power-on defaults enable */

That would explain why the patch makes a difference.

The first would be a very acceptable patch for 6.4.

The others wouldn't really be acceptable; Asias's full backport (and also commits c4a248a+f05fa4a) would be needed, but at least we'd know what Windows is expecting.
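To make the third variant easier to read, the two values it stores in IDENTIFY word 85 differ only in bit 5, which is the bit the guest would see change when it enables or disables the write cache. The sketch below reproduces just that arithmetic in Python; `word85` and this `put_le16` are illustrative re-implementations and not QEMU code.

```python
import struct

# Bit that differs between the two put_le16() calls in the proposed patch.
WRITE_CACHE_BIT = 1 << 5


def word85(write_cache_enabled):
    """Value the proposed patch stores in IDENTIFY word 85."""
    value = (1 << 14) | 1
    if write_cache_enabled:
        value |= WRITE_CACHE_BIT
    return value


def put_le16(buf, word_index, value):
    """Store a 16-bit little-endian value at a word index (2 bytes per word)."""
    struct.pack_into("<H", buf, word_index * 2, value)


identify = bytearray(512)                 # IDENTIFY data is 256 16-bit words
put_le16(identify, 85, word85(True))      # what "case 0x02" would write
enabled_bytes = bytes(identify[170:172])
put_le16(identify, 85, word85(False))     # what "case 0x82" would write
disabled_bytes = bytes(identify[170:172])

print(f"enabled:  {word85(True):#06x}  disabled: {word85(False):#06x}")
```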
Comment 38 Asias He 2013-01-28 22:01:01 EST
(In reply to comment #16)
> Problem reproduces without WHQL as well. Just start the Guest with 4G of RAM
> and do hibernate.
> 
> It looks like the root cause of the problem is extremely slow hibernation on
> Windows 2008 32 bit.
> 
> For guest with 1G of memory it takes system around 10 minutes to hibernate.
> When it takes less then 10 minutes system hibernates successfully, when it
> takes more - system crashes.
> 
> With 4G of memory crash happens every time, with 512M of memory the crash
> has never been observed.
> 
> Another phenomena observed is that the crash happens on different hosts,
> also IO performance tests don't show slow storage access rates, so it
> doesn't look like the storage issue.
> Also when guest is going into hibernate state all it's VCPUs are 100% busy
> so it doesn't look like a storage latency problem.
> 
> The crash also happens without RedHat network/storage drivers so it looks
> like a host issue.
> 
> Unusual system activity during hibernation needs to be investigated.

Hi Dmitry,

Can you post the command line you used in comment #16? Did you use cache=writethrough?
Comment 39 Asias He 2013-01-29 03:17:25 EST
Some updates:

Only applying 'block: always open drivers in writeback mode' will achieve the same result as in comment 33. 'ide: support enable/disable write cache' does not really fix any issue.

The slowness of hibernation should not be related to 'flush' but to the slowness of 'write'. Very few flush operations (fewer than 20) are observed during hibernation with 4G of RAM.

Hibernation slowness is only observed with 'cache=writethrough', e.g. ' -drive file=$OS,if=none,cache=writethrough,id=hd0  -device ide-drive,drive=hd0,id=hd0' using the stock RHEL6.4 qemu-kvm build.
Comment 40 Dmitry Fleytman 2013-01-29 04:28:57 EST
We used defaults, i.e. with no explicit cache specification.

> 
> Can you post your command line you used in comment #16. You used
> cache=writethrough ?
Comment 41 Paolo Bonzini 2013-01-29 06:13:07 EST
The default is cache=writethrough in RHEL6.4 and cache=writeback upstream.  It is possible that Windows tries to toggle the cache mode at startup; this would explain the result of the bisection.

But if the issue still can be reproduced with HCK and cache=none, we need to bisect with HCK and the exact same cache parameters.
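Because several test runs in this bug omit cache= entirely, it helps to spell out which mode each command line actually ends up with. A minimal sketch follows, assuming only the two defaults stated above; `drive_cache_mode` is a hypothetical helper, and the simple comma split does not handle values that contain escaped commas.

```python
# Defaults as stated in this comment.
RHEL64_DEFAULT = "writethrough"
UPSTREAM_DEFAULT = "writeback"


def drive_cache_mode(drive_opts, default):
    """Return the cache= value of a -drive option string, or default if absent."""
    for opt in drive_opts.split(","):
        key, _, value = opt.partition("=")
        if key == "cache":
            return value
    return default


# -drive string from the original report (explicit cache=none).
reported = ("file=win2k8-32-nic1.raw,if=none,format=raw,id=drive-ide0-0-0,"
            "werror=stop,rerror=stop,cache=none")
# Hypothetical -drive string with no cache= option at all.
unspecified = "file=guest.raw,if=none,id=hd0"

print(drive_cache_mode(reported, RHEL64_DEFAULT))
print(drive_cache_mode(unspecified, RHEL64_DEFAULT))
print(drive_cache_mode(unspecified, UPSTREAM_DEFAULT))
```

A run with no cache= option therefore tests writethrough on the RHEL6.4 build and writeback upstream, so results from the two are only comparable when the option is given explicitly.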
Comment 42 Ronen Hod 2013-01-29 09:16:10 EST
Following a conversation with Vadim, Yan, Dima, and Gal:

The tracker bug 896495 shows several cases, not all of which are related to IDE. All of them are related to "power" (S3/S4), but we think that this is not the direct cause, and that it could be the result of a secondary effect, such as Flush.
Still, we believe that all of them are due to the same issue.
- This is an old problem that we have carried since RHEL 5.
- The problem is that a request to a device is stuck in QEMU.
- A good trigger for this is the power-off of a device (S3/S4) during massive storage activity.
- Although the power-off is stuck, it is not clear that it is the source of the problem. Probably a stuck standard I/O request is the real issue.

Vadim is looking at it. Although the request is stuck inside QEMU, Vadim prefers to use virtio drivers over IDE, since he can tell exactly what happens in the driver's code.

Ronen.
Comment 44 dawu 2013-03-11 01:51:26 EDT
Tried 5 times with the latest seabios version, seabios-0.6.1.2-26.el6.x86_64: hit the 9F BSOD 4 times and passed only once. The CLI was:
/usr/libexec/qemu-kvm -m 6G -smp 4 -cpu cpu64-rhel6,+x2apic -usbdevice tablet -drive file=win2k8-32-nic.raw,if=none,id=drive-ide0-0-0,werror=stop,rerror=stop,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,sndbuf=0,id=hostnet0,vhost=on,script=/etc/qemu-ifup-private,downscript=no -device virtio-net-pci,netdev=hostnet0,mac=00:14:31:26:86:21,bus=pci.0,addr=0x4,id=virtio-net-pci0 -netdev tap,sndbuf=0,id=hostnet1,script=/etc/qemu-ifup,downscript=no -device e1000,netdev=hostnet1,mac=00:12:17:10:86:22,bus=pci.0,addr=0x6 -uuid a3e8540f-fea5-4b43-a3cd-9c86f519ce77 -no-kvm-pit-reinjection -chardev socket,id=111a,path=/tmp/monitor-win2k8-32-nic2,server,nowait -mon chardev=111a,mode=readline -vnc :2 -vga cirrus -rtc base=localtime,clock=host,driftfix=slew -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -monitor stdio 

Tried once with the bios.bin from https://bugzilla.redhat.com/show_bug.cgi?id=912561 and still hit the 9F BSOD. The CLI was:

/usr/libexec/qemu-kvm -m 6G -smp 4 -cpu cpu64-rhel6,+x2apic -usbdevice tablet -drive file=win2k8-32-nic.raw,if=none,id=drive-ide0-0-0,werror=stop,rerror=stop -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,sndbuf=0,id=hostnet0,vhost=on,script=/etc/qemu-ifup-private,downscript=no -device virtio-net-pci,netdev=hostnet0,mac=00:14:31:26:86:21,bus=pci.0,addr=0x4,id=virtio-net-pci0 -netdev tap,sndbuf=0,id=hostnet1,script=/etc/qemu-ifup,downscript=no -device e1000,netdev=hostnet1,mac=00:12:17:10:86:22,bus=pci.0,addr=0x6 -uuid a3e8540f-fea5-4b43-a3cd-9c86f519ce77 -no-kvm-pit-reinjection -chardev socket,id=111a,path=/tmp/monitor-win2k8-32-nic2,server,nowait -mon chardev=111a,mode=readline -vnc :2 -vga cirrus -rtc base=localtime,clock=host,driftfix=slew -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -monitor stdio -bios /home/bios.bin

Thanks
Best Regards,
Dawn
Comment 45 lijin 2013-04-24 05:18:02 EDT
Tested this issue on win2k8-32 with:
qemu-img-rhev-0.12.1.2-2.359.el6.x86_64
kernel-2.6.32-358.el6.x86_64
seabios-0.6.1.2-27.el6.x86_64
spice-server-0.12.0-12.el6.x86_64
virtio-win-prewhql-0.1-59

Booted the guest with different combinations of cache and S4 settings:
1.when using "-global PIIX4_PM.disable_s4=0" and "cache=none"--->guest BSOD
2.when using "-global PIIX4_PM.disable_s4=0" and delete "cache=none"--->guest BSOD
3.when using "-global PIIX4_PM.disable_s4=1" and "cache=none"--->guest pass the job

So it seems S4 causes the guest BSOD.
Comment 52 lijin 2013-05-08 07:07:48 EDT
win2k8-R2 hit the same issue when running the netkvm job “Sleep and PNP (disable and enable) with IO Before and After(Certification)” on HCK with virtio-win-prewhql-0.1-59.
The memory.dump file will be uploaded later.
Comment 53 lijin 2013-05-08 07:11:12 EDT
(In reply to comment #52)
> win2k8-R2 hit the same issue when run the netkvm job “Sleep and PNP (disable
> and enable) with IO Before and After(Certification)” on HCK with
> virtio-win-prewhql-0.1-59.
> The memory.dump file will be upload later.

windbg info:

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DRIVER_POWER_STATE_FAILURE (9f)
A driver has failed to complete a power IRP within a specific time (usually 10 minutes).
Arguments:
Arg1: 0000000000000004, The power transition timed out waiting to synchronize with the Pnp
	subsystem.
Arg2: 0000000000000258, Timeout in seconds.
Arg3: fffffa8004ef8b50, The thread currently holding on to the Pnp lock.
Arg4: fffff8000131a510, nt!TRIAGE_9F_PNP on Win7

Debugging Details:
------------------

Implicit thread is now fffffa80`04ef8b50

DRVPOWERSTATE_SUBCODE:  4

IMAGE_NAME:  pci.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  4ce7928f

MODULE_NAME: pci

FAULTING_MODULE: fffff880011b6000 pci

DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT

BUGCHECK_STR:  0x9F

PROCESS_NAME:  System

CURRENT_IRQL:  2

STACK_TEXT:  
fffff800`0131a4d8 fffff800`01561e86 : 00000000`0000009f 00000000`00000004 00000000`00000258 fffffa80`04ef8b50 : nt!KeBugCheckEx
fffff800`0131a4e0 fffff800`0171334c : 00000000`00000000 fffff800`00000000 fffff880`009b8100 fffff800`014de81a : nt!PnpBugcheckPowerTimeout+0x76
fffff800`0131a540 fffff800`014e3e3c : 00000000`00000000 00000000`0000000f 00000000`40b11200 00000000`00000000 : nt!PopBuildDeviceNotifyListWatchdog+0x1c
fffff800`0131a570 fffff800`014e3cd6 : fffffa80`06ff5718 fffffa80`06ff5718 00000000`00000000 00000000`00000000 : nt!KiProcessTimerDpcTable+0x6c
fffff800`0131a5e0 fffff800`014e3bbe : 000000fd`a6f19f7e fffff800`0131ac58 00000000`006a63b0 fffff800`0164f888 : nt!KiProcessExpiredTimerList+0xc6
fffff800`0131ac30 fffff800`014e39a7 : 0000005a`cbdc68c8 0000005a`006a63b0 0000005a`cbdc68a2 00000000`000000b0 : nt!KiTimerExpiration+0x1be
fffff800`0131acd0 fffff800`014d0eca : fffff800`0164be80 fffff800`01659cc0 00000000`00000000 fffff880`00e10a00 : nt!KiRetireDpcList+0x277
fffff800`0131ad80 00000000`00000000 : fffff800`0131b000 fffff800`01315000 fffff800`0131ad40 00000000`00000000 : nt!KiIdleLoop+0x5a


STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

FAILURE_BUCKET_ID:  X64_0x9F_VRF_4_netkvm_IMAGE_pci.sys

BUCKET_ID:  X64_0x9F_VRF_4_netkvm_IMAGE_pci.sys

Followup: MachineOwner
---------
Comment 54 lijin 2013-05-08 08:14:08 EDT
(In reply to comment #52)
> win2k8-R2 hit the same issue when run the netkvm job “Sleep and PNP (disable
> and enable) with IO Before and After(Certification)” on HCK with
> virtio-win-prewhql-0.1-59.
> The memory.dump file will be upload later.

build 59 dump file located in \\smamit.eng.lab.tlv.redhat.com\win-team\Public\QE\Bug836467\build-59-MEMORY_9F_Sleep&PNP&D.DMP
Comment 55 Dmitry Fleytman 2013-05-16 02:04:44 EDT
We don't see this bug anymore on the latest code.
Please retest with build virtio-win-prewhql-0.1-61 and attach the crash dump if BSOD appears again.

Thanks,
Dmitry
Comment 57 guo jiang 2013-05-27 23:05:10 EDT
win2k8-32 hit the same issue with virtio-win-prewhql-0.1-61.

Package:
    * Red Hat Enterprise Linux Server release 6.4 (Santiago)
    * kernel-2.6.32-369.el6.x86_64    
    * qemu-img-rhev-0.12.1.2-2.359.el6.x86_64
    * virtio-win-prewhql-0.1-61
    * spice-server-0.12.0-12.el6.x86_64
    * seabios-0.6.1.2-27.el6.x86_64
    * vgabios-0.6b-3.7.el6.noarch

Windbg info:

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DRIVER_POWER_STATE_FAILURE (9f)
A driver has failed to complete a power IRP within a specific time (usually 10 minutes).
Arguments:
Arg1: 00000003, A device object has been blocking an Irp for too long a time
Arg2: 8961b6b0, Physical Device Object of the stack
Arg3: 89efa040, nt!TRIAGE_9F_POWER on Win7, otherwise the Functional Device Object of the stack
Arg4: 8b80ced8, The blocked IRP

Debugging Details:
------------------


DRVPOWERSTATE_SUBCODE:  3

IMAGE_NAME:  pci.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  49e01a44

MODULE_NAME: pci

FAULTING_MODULE: 81e62000 pci

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

BUGCHECK_STR:  0x9F

PROCESS_NAME:  System

CURRENT_IRQL:  2

STACK_TEXT:  
8190aacc 818463bb 0000009f 00000003 8961b6b0 nt!KeBugCheckEx+0x1e
8190ab28 81845fd8 8190ab94 9280e180 9280e100 nt!PopCheckIrpWatchdog+0x1ad
8190ab68 818bf26b 819234e0 00000000 f5035640 nt!PopCheckForIdleness+0x343
8190ac88 818bee2b 8190acd0 89fc8802 8190acd8 nt!KiTimerListExpire+0x367
8190ace8 818bf595 00000000 00000000 000403ff nt!KiTimerExpiration+0x22a
8190ad50 818bd7dd 00000000 0000000e 00000000 nt!KiRetireDpcList+0xba
8190ad54 00000000 0000000e 00000000 00000000 nt!KiIdleLoop+0x49


STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

FAILURE_BUCKET_ID:  0x9F_VRF_3_vgapnp_IMAGE_pci.sys

BUCKET_ID:  0x9F_VRF_3_vgapnp_IMAGE_pci.sys

Followup: MachineOwner
---------
Comment 58 Mike Cao 2013-05-27 23:10:47 EDT
Based on comment #57, re-assigning this issue.
Comment 59 guo jiang 2013-05-27 23:23:17 EDT
Created attachment 753717 [details]
dump file-win2k8-32-61
Comment 60 Dmitry Fleytman 2013-05-28 02:34:05 EDT
Hello,

Unfortunately it is not clear which BIOS was used for the last test cycle.
We believe this problem occurs due to a BIOS bug, did you use BIOS from here: https://bugzilla.redhat.com/show_bug.cgi?id=912561#c3

Thanks,
Dmitry
Comment 61 Paolo Bonzini 2013-05-28 08:29:18 EDT
Or alternatively: seabios-0.6.1.2-27.el6 also has the patch.
Comment 62 Mike Cao 2013-05-28 22:58:20 EDT
(In reply to Dmitry Fleytman from comment #60)
> Hello,
> 
> Unfortunately it is not clear which BIOS was used for the last test cycle.
> We believe this problem occurs due to a BIOS bug, did you use BIOS from
> here: https://bugzilla.redhat.com/show_bug.cgi?id=912561#c3
> 
> Thanks,
> Dmitry

Hi, Dima

We use  seabios-0.6.1.2-27.el6
Comment 63 Mike Cao 2013-05-28 22:59:15 EDT
Re-Assigned this issue according to comment #57
Comment 69 guo jiang 2013-09-03 04:30:31 EDT
win2k8-32 still hit this issue with build-67.
other package version:
Red Hat Enterprise Linux Server release 6.4 (Santiago)
kernel-2.6.32-414.el6.x86_64    
qemu-kvm-rhev-0.12.1.2-2.397.el6.x86_64
spice-server-0.12.4-2.el6.x86_64
seabios-0.6.1.2-28.el6.x86_64
vgabios-0.6b-3.7.el6.noarch
Comment 71 Dmitry Fleytman 2013-09-03 05:24:25 EDT
According to MS requirements Windows server 2008 32/64 (not R2) should be tested with WLK. Could you please try to repeat this test on WLK?

Thanks,
Dmitry
Comment 72 Mike Cao 2013-09-03 05:37:23 EDT
(In reply to Dmitry Fleytman from comment #71)
> According to MS requirements Windows server 2008 32/64 (not R2) should be
> tested with WLK. Could you please try to repeat this test on WLK?

We tested it on WLK1.6 and it passed, which means this is not a test blocker.
We would like to defer this bug to the next release in case it is needed in the future.
> 
> Thanks,
> Dmitry
Comment 73 Yan Vugenfirer 2013-09-03 07:52:06 EDT
(In reply to Mike Cao from comment #72)
> (In reply to Dmitry Fleytman from comment #71)
> > According to MS requirements Windows server 2008 32/64 (not R2) should be
> > tested with WLK. Could you please try to repeat this test on WLK?
> 
> We test it on WLK1.6 and the result is pass which means it is not a
> testblocker.
> We would like to deferring this bug to next release in case need it in
> future 

As we know that Win2008 will be tested with WLK in the future - I think we can close it.
Yan.

> > 
> > Thanks,
> > Dmitry
Comment 74 Yan Vugenfirer 2013-09-15 07:26:31 EDT
Closing based on comment #73 (certification for Windows 2008 server will be done with WLK test kit).

If we still want to pass with HCK, a support case with MS should be opened.
