Bug 651912 - BSOD 0x101 or hang for Windows guests running with emulated IDE devices
Summary: BSOD 0x101 or hang for Windows guests running with emulated IDE devices
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.5
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Paolo Bonzini
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 514500 518435 642903 712412
 
Reported: 2010-11-10 15:42 UTC by Paolo Bonzini
Modified: 2011-07-21 11:59 UTC (History)

Fixed In Version: xen-3.0.3-121.el5
Doc Type: Bug Fix
Doc Text:
Previously, the cache flush for IDE devices emulated via the qemu emulator was performed synchronously. When the flush process took too much time, the virtual CPU was stuck while the fsync utility was running. This behavior sometimes caused guests to terminate unexpectedly on the Windows operating system. With this update, cache flushes for IDE devices are done asynchronously, and a crash no longer occurs in the described scenario.
Clone Of: 642903
Environment:
Last Closed: 2011-07-21 09:17:55 UTC
Target Upstream Version:
Embargoed:


Attachments
Screen Shot of BSOD (32.87 KB, image/png)
2010-11-22 03:07 UTC, Rita Wu
possible patch (3.27 KB, patch)
2010-11-23 19:38 UTC, Paolo Bonzini


Links
System: Red Hat Product Errata
ID: RHBA-2011:1070
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: xen bug fix and enhancement update
Last Updated: 2011-07-21 09:12:56 UTC

Description Paolo Bonzini 2010-11-10 15:42:40 UTC
--- Additional comment from rwu on 2010-11-09 05:46:17 EST ---

(In reply to comment #12)
> It shouldn't matter because the job only touches the single disk that you
> specify on the command line---anyway, yes, I'm using 3 for everything except
> boot test and now crashdump too.

According to our latest test results:
When it is set to '3', I get a BSOD within a short time.
When it is set to '1', the job runs for 3 days but eventually fails.

Any idea?

--- Additional comment from pbonzini on 2010-11-09 14:37:58 EST ---

Can I get the BSOD screenshot and memory dump?

--- Additional comment from rwu on 2010-11-10 01:55:44 EST ---

Created attachment 459331 [details]
screen shot of BSOD when set EnumerateDevicesOverride=3

You can get MEMORY.DMP from ftp://10.66.93.232/tmp/ (I cannot access file.nay.redhat.com with my today)

---

The bug happens when an IDE flush takes too much time.  Unlike real hardware, QEMU's IDE emulation flushes synchronously and the VCPU is stuck while fsync runs.  This triggers Windows's watchdog and a BSOD.
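
The difference between the two flushing models can be sketched as follows. This is only an illustration in Python, not the QEMU code: flush_sync, flush_async, on_complete and demo are hypothetical names, and a worker thread stands in for whatever asynchronous I/O mechanism the real fix uses. In the synchronous model the caller (the VCPU) cannot proceed until fsync returns; in the asynchronous model it proceeds at once and is notified on completion.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor


def flush_sync(fd):
    """Synchronous model: the caller (the emulated VCPU) blocks inside fsync."""
    os.fsync(fd)


def flush_async(executor, fd, on_complete):
    """Asynchronous model: fsync runs on a worker thread, the caller returns at once.

    on_complete is a hypothetical callback that receives None on success
    or the exception raised by fsync.
    """
    future = executor.submit(os.fsync, fd)
    future.add_done_callback(lambda f: on_complete(f.exception()))
    return future


def demo():
    events = []
    completed = threading.Event()

    def on_complete(error):
        events.append("flush-complete" if error is None else "flush-error")
        completed.set()

    with tempfile.TemporaryFile() as f, ThreadPoolExecutor(max_workers=1) as pool:
        f.write(b"x" * 4096)
        f.flush()
        flush_async(pool, f.fileno(), on_complete)
        # The "VCPU" is free to keep running here; the relative order of this
        # event and the completion event is not deterministic.
        events.append("vcpu-continues")
        completed.wait(timeout=30)
    return events
```

With flush_sync in place of flush_async, a slow fsync would hold up the caller for its whole duration, which is the situation that trips the guest's watchdog.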

The infrastructure for asynchronous flushing has already been backported to RHEL5.6 and is in use by the SCSI device.  http://lists.xensource.com/archives/html/xen-devel/2008-03/msg00857.html includes a patch for this.

This should be reproducible by running iozone on dom0 while running a Windows guest.  The fix is the Xen backport of patch 4 in KVM bug 537646.
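
If iozone is not at hand, a rough stand-in for the dom0 load is a loop that writes and fsyncs repeatedly, which should make host flushes slow in a similar way. This is an assumption, not something tested in this bug: the function name fsync_heavy_load and its parameters are made up for illustration, and whether this load is heavy enough to trigger the BSOD is unverified.

```python
import os
import tempfile


def fsync_heavy_load(directory, blocks=64, block_size=1 << 20):
    """Write `blocks` chunks of `block_size` bytes, forcing each one to disk.

    Returns the total number of bytes written. The temporary file is
    removed when the function returns.
    """
    written = 0
    payload = os.urandom(block_size)
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(blocks):
            f.write(payload)
            f.flush()             # push Python's buffer to the kernel
            os.fsync(f.fileno())  # force the kernel to write it to disk
            written += block_size
    return written
```

Pointing directory at the filesystem that holds the guest's disk image, and calling the function in a loop, keeps the disk busy with flushes while the Windows guest runs.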

Comment 3 Miroslav Rezanina 2010-11-11 09:03:45 UTC
Fix built into xen-3.0.3-118.el5

Comment 5 Rita Wu 2010-11-22 03:06:31 UTC
BSOD 0x101 still occurs with xen-debuginfo-3.0.3-118.el5.

We can still reproduce it with the WHQL job 'Disk Verification', but only after
a long run (several hours this time, instead of several minutes).

Comment 6 Rita Wu 2010-11-22 03:07:11 UTC
Created attachment 461913 [details]
Screen Shot of BSOD

Comment 7 Rita Wu 2010-11-22 03:08:57 UTC
You can download the crashdump from ftp://10.66.93.232/Whql/tmp/

Comment 8 Rita Wu 2010-11-22 03:31:49 UTC
Update comment5, verify with xen-3.0.3-118.el5

Comment 9 Lei Wang 2010-11-22 07:04:40 UTC
(In reply to comment #8)
> Update comment5, verify with xen-3.0.3-118.el5

Please ignore comment 8, sorry for any inconvenience.

We tested with xen-3.0.3-118.el5, and the bug can still be reproduced. For details, see comments 5 to 7, reading xen-3.0.3-118.el5 in place of xen-debuginfo-3.0.3-118.el5 in comment 5.

Comment 13 Paolo Bonzini 2010-11-23 19:19:45 UTC
Thanks, this dump is quite helpful.  Can you please go to Device Manager -> IDE ATA/ATAPI controllers -> Primary IDE Channel, then right click and properties, then "Advanced Settings", and say what is written in there as "Current Transfer Mode"?

Thanks!

Comment 14 Paolo Bonzini 2010-11-23 19:38:37 UTC
Created attachment 462424 [details]
possible patch

The second dump shows that PIO is taking place when the guest crashes.  PIO has a special case to handle operation with disabled write cache, and that also uses bdrv_flush.  The dump shows that Windows is writing registry data to the disk, so it's plausible that the cache is disabled and that this occurrence of bdrv_flush is causing the crash.
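
A toy model of that special case, to make the reasoning concrete. This is not the QEMU source: the class ToyIdeDisk and its methods are hypothetical, flush stands in for bdrv_flush, and flushing once per written sector is an assumption about granularity.

```python
class ToyIdeDisk:
    """Toy model of an emulated IDE disk with a switchable write cache."""

    def __init__(self, write_cache_enabled=True):
        self.write_cache_enabled = write_cache_enabled
        self.sectors = {}
        self.flush_count = 0

    def flush(self):
        # Stand-in for bdrv_flush; in the real emulator this is where a
        # synchronous fsync would stall the VCPU.
        self.flush_count += 1

    def pio_write(self, lba, data):
        self.sectors[lba] = data
        # Special case: with the write cache disabled, data must reach
        # stable storage before the write is reported as complete.
        if not self.write_cache_enabled:
            self.flush()


cached = ToyIdeDisk(write_cache_enabled=True)
uncached = ToyIdeDisk(write_cache_enabled=False)
for lba in range(3):
    cached.pio_write(lba, b"\0" * 512)
    uncached.pio_write(lba, b"\0" * 512)
```

In this model the cached disk never flushes on a PIO write, while the uncached disk flushes on every one, so a guest that disables the write cache reaches the flush path even if it never issues an explicit flush command.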

The bug has been delayed to 5.7; however, we may still want to give this a shot after the WHQL test finishes.

Comment 15 Rita Wu 2010-11-24 08:50:46 UTC
(In reply to comment #13)
> Thanks, this dump is quite helpful.  Can you please go to Device Manager -> IDE
> ATA/ATAPI controllers -> Primary IDE Channel, then right click and properties,
> then "Advanced Settings", and say what is written in there as "Current Transfer
> Mode"?
> 
> Thanks!

I'll check that value as soon as possible. We had added 'EnumerateDevicesOverride=3'
to avoid the BSOD during the new round of WHQL testing, so we are now running this
job without 'EnumerateDevicesOverride=3' and waiting for the BSOD.

Comment 16 cshao 2010-11-24 10:46:44 UTC
(In reply to comment #13)
> Thanks, this dump is quite helpful.  Can you please go to Device Manager -> IDE
> ATA/ATAPI controllers -> Primary IDE Channel, then right click and properties,
> then "Advanced Settings", and say what is written in there as "Current Transfer
> Mode"?
> 
> Thanks!

Hi, Paolo

The BSOD occurred, and I checked the "Current Mode" in the "Advanced Settings" tab for "ATA Channel 0" and "ATA Channel 1". Both show "Multi-Word DMA Mode2".

Comment 22 Huang Wenlong 2011-04-18 09:52:43 UTC
We will test it with the latest xenpv-win driver and then update the result in this BZ.

Comment 23 Paolo Bonzini 2011-04-18 16:13:57 UTC
This run of xenpv-win testing will not use IDE for disk stress/verification, so it wouldn't add any new information.

Comment 24 Rita Wu 2011-04-19 03:08:13 UTC
(In reply to comment #23)
> This run of xenpv-win testing will not use IDE for disk stress/verification, so
> it wouldn't add any new information.

We can set EDO=3 and run the Disk Verification jobs in another environment, separate from the WHQL testing. Should we test with the xenpv-win build from brewroot, or do we need to wait until Bug 697368 is fixed before we start testing?

Comment 25 Paolo Bonzini 2011-04-19 06:41:50 UTC
Since this is a xen bug (not xenpv-win), you can do any of the following:

1) consider the testing with -7 valid and VERIFY :)

2) use the same build you're using now, with EDO=3

3) wait for the next build (signed, with bug 697368 fixed---it's just a packaging hiccup)

Comment 26 Huang Wenlong 2011-04-19 07:20:53 UTC
(In reply to comment #25)
> Since this is a xen bug (not xenpv-win), you can do any of the following:
> 
> 1) consider the testing with -7 valid and VERIFY :)
> 
> 2) use the same build you're using now, with EDO=3
> 
> 3) wait for the next build (signed, with bug 697368 fixed---it's just a
> packaging hiccup)

I am running disk-verification with xenpv-win-1.3.4-8 and xen-3.0.3-128 in 2k8-64,
with EDO=3 set.
I will update the result after it is done.

Comment 27 Huang Wenlong 2011-04-22 06:39:34 UTC
Verified this bug on RHEL 5.7 with:
xenpv-win-1.3.4-8 (not from brewweb; provided by Paolo)
xen-3.0.3-128
kernel-xen-2.6.18-257.el5xen

The Disk Verification job passed on 2k8-64, with no BSOD.

Comment 30 Tomas Capek 2011-07-13 13:27:20 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the cache flush for IDE devices emulated via the qemu emulator was performed synchronously. When the flush process took too much time, the virtual CPU was stuck while the fsync utility was running. This behavior sometimes caused guests to terminate unexpectedly on the Windows operating system. With this update, cache flushes for IDE devices are done asynchronously, and a crash no longer occurs in the described scenario.

Comment 31 errata-xmlrpc 2011-07-21 09:17:55 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1070.html

Comment 32 errata-xmlrpc 2011-07-21 11:59:17 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1070.html

