Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 651912

Summary: BSOD 0x101 or hang for Windows guests running with emulated IDE devices
Product: Red Hat Enterprise Linux 5
Reporter: Paolo Bonzini <pbonzini>
Component: xen
Assignee: Paolo Bonzini <pbonzini>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium
Docs Contact:
Priority: low
Version: 5.5
CC: cshao, jwest, leiwang, mrezanin, mshao, pbonzini, rwu, whuang, xen-maint
Target Milestone:
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: xen-3.0.3-121.el5
Doc Type: Bug Fix

Doc Text:

Previously, the cache flush for IDE devices emulated by the qemu emulator was performed synchronously. When the flush took too long, the virtual CPU was stuck while the fsync() call was running. This behavior sometimes caused Windows guests to terminate unexpectedly. With this update, cache flushes for IDE devices are done asynchronously, and a crash no longer occurs in the described scenario.

Story Points: ---
Clone Of: 642903
Environment:
Last Closed: 2011-07-21 09:17:55 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 514500, 518435, 642903, 712412

Attachments:

Description	Flags
Screen Shot of BSOD	none
possible patch	none

Description Paolo Bonzini 2010-11-10 15:42:40 UTC

--- Additional comment from rwu on 2010-11-09 05:46:17 EST ---

(In reply to comment #12)
> It shouldn't matter because the job only touches the single disk that you
> specify on the command line---anyway, yes, I'm using 3 for everything except
> boot test and now crashdump too.

According to our latest test result:
When it is set to '3', I get a BSOD within a short time.
When it is set to '1', the job runs for 3 days but fails in the end.

Any idea?

--- Additional comment from pbonzini on 2010-11-09 14:37:58 EST ---

Can I get the BSOD screenshot and memory dump?

--- Additional comment from rwu on 2010-11-10 01:55:44 EST ---

Created attachment 459331 [details]
screen shot of BSOD when EnumerateDevicesOverride=3 is set

You can get MEMORY.DMP from ftp://10.66.93.232/tmp/ (I cannot access file.nay.redhat.com today)

---

The bug happens when an IDE flush takes too much time.  Unlike real hardware, QEMU's IDE emulation flushes synchronously and the VCPU is stuck while fsync runs.  This triggers Windows's watchdog and a BSOD.
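
As a rough illustration of the difference (a Python sketch, not QEMU code; flush_sync, flush_async, and demo are hypothetical names made up for this example): in the synchronous case the caller, standing in for the VCPU, cannot make progress until fsync returns, while in the asynchronous case fsync runs on a worker thread and completion is reported through a callback.

```python
import os
import tempfile
import threading


def flush_sync(fd):
    # The caller (standing in for the VCPU) is blocked until fsync returns.
    os.fsync(fd)


def flush_async(fd, on_complete):
    # fsync runs on a worker thread; the caller returns at once and is
    # notified through the callback after fsync has returned.
    def worker():
        os.fsync(fd)
        on_complete()

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def demo():
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * 4096)
        f.flush()  # push Python's own buffer to the OS before fsync
        flush_sync(f.fileno())  # blocking variant

        done = threading.Event()
        thread = flush_async(f.fileno(), done.set)
        # The caller is free to do other work here.
        thread.join()  # wait only so the file is still open during fsync
        return done.is_set()
```

With a slow backing store, only the time spent inside flush_sync is time during which the guest would see a stalled CPU.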

The infrastructure for asynchronous flushing has already been backported to RHEL5.6 and is in use by the SCSI device.  http://lists.xensource.com/archives/html/xen-devel/2008-03/msg00857.html includes a patch for this.

This should be reproducible by running iozone on dom0 while a Windows guest is running.  It is the Xen backport of patch 4 in KVM bug 537646.

Comment 3 Miroslav Rezanina 2010-11-11 09:03:45 UTC

Fix built into xen-3.0.3-118.el5

Comment 5 Rita Wu 2010-11-22 03:06:31 UTC

BSOD 0x101 still occurs with xen-debuginfo-3.0.3-118.el5.

We can still reproduce it with the WHQL job 'Disk Verification' after it has run
for a long time (several hours this time, instead of several minutes).

Comment 6 Rita Wu 2010-11-22 03:07:11 UTC

Created attachment 461913 [details]
Screen Shot of BSOD

Comment 7 Rita Wu 2010-11-22 03:08:57 UTC

You can download the crashdump from ftp://10.66.93.232/Whql/tmp/

Comment 8 Rita Wu 2010-11-22 03:31:49 UTC

Update comment5, verify with xen-3.0.3-118.el5

Comment 9 Lei Wang 2010-11-22 07:04:40 UTC

(In reply to comment #8)
> Update comment5, verify with xen-3.0.3-118.el5

Please ignore comment 8, sorry for any inconvenience.

We tested with xen-3.0.3-118.el5, and the bug can still be reproduced. For details, see comments 5 through 7, reading xen-debuginfo-3.0.3-118.el5 in comment 5 as xen-3.0.3-118.el5.

Comment 13 Paolo Bonzini 2010-11-23 19:19:45 UTC

Thanks, this dump is quite helpful.  Can you please go to Device Manager -> IDE ATA/ATAPI controllers -> Primary IDE Channel, then right click and properties, then "Advanced Settings", and say what is written in there as "Current Transfer Mode"?

Thanks!

Comment 14 Paolo Bonzini 2010-11-23 19:38:37 UTC

Created attachment 462424 [details]
possible patch

The second dump shows that PIO is taking place when the guest crashes.  PIO has a special case to handle operation with disabled write cache, and that also uses bdrv_flush.  The dump shows that Windows is writing registry data to the disk, so it's plausible that the cache is disabled and that this occurrence of bdrv_flush is causing the crash.

The bug has been delayed to 5.7; however, we may still want to give this a shot after the WHQL test finishes.
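
To make the write-cache special case concrete, here is a toy Python model (not QEMU code and not the attached patch; ToyIdeDisk and all of its members are hypothetical). It assumes that with the write cache disabled every PIO write is followed by a flush, and it shows one way the completion interrupt could be deferred until a flush running off the VCPU path has finished.

```python
import threading


class ToyIdeDisk:
    """Toy model only; the names are hypothetical and not QEMU's."""

    def __init__(self, write_cache_enabled):
        self.write_cache_enabled = write_cache_enabled
        self.sectors = {}
        self.flush_count = 0
        self.irq_raised = threading.Event()

    def _flush(self):
        # Stand-in for the real flush (bdrv_flush/fsync in the report).
        self.flush_count += 1

    def pio_write(self, lba, data):
        self.irq_raised.clear()
        self.sectors[lba] = data
        if self.write_cache_enabled:
            # Cache enabled: the write completes without a flush.
            self.irq_raised.set()
            return None

        # Cache disabled: flush after the write, but on a worker thread,
        # and signal completion to the guest only when the flush is done.
        def worker():
            self._flush()
            self.irq_raised.set()

        thread = threading.Thread(target=worker)
        thread.start()
        return thread
```

For example, ToyIdeDisk(False).pio_write(0, b"data") returns a thread that the caller can join, whereas with the cache enabled the call returns None and no flush is issued.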

Comment 15 Rita Wu 2010-11-24 08:50:46 UTC

(In reply to comment #13)
> Thanks, this dump is quite helpful.  Can you please go to Device Manager -> IDE
> ATA/ATAPI controllers -> Primary IDE Channel, then right click and properties,
> then "Advanced Settings", and say what is written in there as "Current Transfer
> Mode"?
> 
> Thanks!

I'll check that value ASAP. We had added 'EnumerateDevicesOverride=3' to
avoid the BSOD so that we could do a new round of WHQL testing. We are now running
this job without 'EnumerateDevicesOverride=3' and waiting for the BSOD.

Comment 16 cshao 2010-11-24 10:46:44 UTC

(In reply to comment #13)
> Thanks, this dump is quite helpful.  Can you please go to Device Manager -> IDE
> ATA/ATAPI controllers -> Primary IDE Channel, then right click and properties,
> then "Advanced Settings", and say what is written in there as "Current Transfer
> Mode"?
> 
> Thanks!

Hi, Paolo

The BSOD occurred, and I checked "Current Mode" in the "Advanced Settings" tab for "ATA Channel 0" and "ATA Channel 1"; both are "Multi-Word DMA Mode2".

Comment 22 Huang Wenlong 2011-04-18 09:52:43 UTC

We will test it with the latest xenpv-win driver and then update the result on BZ.

Comment 23 Paolo Bonzini 2011-04-18 16:13:57 UTC

This run of xenpv-win testing will not use IDE for disk stress/verification, so it wouldn't add any new information.

Comment 24 Rita Wu 2011-04-19 03:08:13 UTC

(In reply to comment #23)
> This run of xenpv-win testing will not use IDE for disk stress/verification, so
> it wouldn't add any new information.

We can set EDO=3 to run Disk Verification jobs in another environment, separate from WHQL testing. Should we test it with the xenpv-win build from brewroot, or should we wait to begin testing until Bug 697368 is fixed?

Comment 25 Paolo Bonzini 2011-04-19 06:41:50 UTC

Since this is a xen bug (not xenpv-win), you can do any of the following:

1) consider the testing with -7 valid and VERIFY :)

2) use the same build you're using now, with EDO=3

3) wait for the next build (signed, with bug 697368 fixed---it's just a packaging hiccup)

Comment 26 Huang Wenlong 2011-04-19 07:20:53 UTC

(In reply to comment #25)
> Since this is a xen bug (not xenpv-win), you can do any of the following:
> 
> 1) consider the testing with -7 valid and VERIFY :)
> 
> 2) use the same build you're using now, with EDO=3
> 
> 3) wait for the next build (signed, with bug 697368 fixed---it's just a
> packaging hiccup)

I am running disk-verification with xenpv-win-1.3.4-8 and xen-3.0.3-128 in 2k8-64, with EDO=3 set.
I will update the result after it is done.

Comment 27 Huang Wenlong 2011-04-22 06:39:34 UTC

Verified this bug in RHEL 5.7 with:
xenpv-win-1.3.4-8 (not from brewweb; Paolo gave it to us)
xen-3.0.3-128
kernel-xen-2.6.18-257.el5xen

The 2k8 64-bit Disk Verification job passed with no BSOD.

Comment 30 Tomas Capek 2011-07-13 13:27:20 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the cache flush for IDE devices emulated by the qemu emulator was performed synchronously. When the flush took too long, the virtual CPU was stuck while the fsync() call was running. This behavior sometimes caused Windows guests to terminate unexpectedly. With this update, cache flushes for IDE devices are done asynchronously, and a crash no longer occurs in the described scenario.

Comment 31 errata-xmlrpc 2011-07-21 09:17:55 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1070.html

Comment 32 errata-xmlrpc 2011-07-21 11:59:17 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1070.html