--- Additional comment from rwu on 2010-11-09 05:46:17 EST ---
(In reply to comment #12) > It shouldn't matter because the job only touches the single disk that you > specify on the command line---anyway, yes, I'm using 3 for everything except > boot test and now crashdump too. According to our latest test result: when set it as '3', I'll get BSOD in a short time. when set it as '1', the job will run 3 days but fails at last. Any idea?
--- Additional comment from pbonzini on 2010-11-09 14:37:58 EST ---
Can I get the BSOD screenshot and memory dump?
--- Additional comment from rwu on 2010-11-10 01:55:44 EST ---
Created attachment 459331 [details] screen shot of BSOD when set EnumateDevice=3 You can get MEMORY.DMP from ftp://10.66.93.232/tmp/ (I cannot access file.nay.redhat.com with my today)
---
The bug happens when an IDE flush takes too much time. Unlike real hardware, QEMU's IDE emulation flushes synchronously and the VCPU is stuck while fsync runs. This triggers Windows's watchdog and a BSOD. The infrastructure for asynchronous flushing has been backported to RHEL5.6 already and is already in use by the SCSI device. http://lists.xensource.com/archives/html/xen-devel/2008-03/msg00857.html includes a patch for this. This should be reproducible running iozone on dom0 while running a Windows guest. It is the Xen backport of patch 4 in KVM bug 537646.
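To illustrate the difference between a synchronous and an asynchronous flush, here is a minimal Python sketch. It is not QEMU code: `slow_flush`, `run_sync` and `run_async` are hypothetical names, the slow fsync is simulated with a sleep, and the "ticks" counter only stands in for the guest VCPU making progress.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_flush(delay=0.2):
    """Stand-in for a slow fsync on the backing file (hypothetical)."""
    time.sleep(delay)


def run_sync(flush):
    """Flush inline: the caller (the 'VCPU') cannot tick until flush returns."""
    ticks = 0
    flush()  # the caller is blocked here for the whole flush
    return ticks


def run_async(flush, pool):
    """Hand the flush to a worker and keep ticking until it completes."""
    ticks = 0
    pending = pool.submit(flush)
    while not pending.done():
        ticks += 1          # the 'VCPU' keeps making progress
        time.sleep(0.01)
    pending.result()        # re-raise any error from the flush
    return ticks


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        print("ticks during sync flush: ", run_sync(slow_flush))
        print("ticks during async flush:", run_async(slow_flush, pool))
```

In the synchronous case no ticks happen while the flush runs, which is the situation that the guest watchdog is expected to notice when the flush is slow enough.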
Fix built into xen-3.0.3-118.el5
BSOD 0x101 still occurs with xen-debuginfo-3.0.3-118.el5. We can still reproduce it with the WHQL job 'Disk Verification', but only after a long run (several hours this time, instead of several minutes).
Created attachment 461913 [details] Screen Shot of BSOD
You can download the crashdump from ftp://10.66.93.232/Whql/tmp/
Update comment5, verify with xen-3.0.3-118.el5
(In reply to comment #8) > Update comment5, verify with xen-3.0.3-118.el5 Please ignore comment 8, sorry for any inconvenience. We used xen-3.0.3-118.el5 for the test, and the bug can still be reproduced. For details, see comments 5~7, reading xen-3.0.3-118.el5 in place of xen-debuginfo-3.0.3-118.el5 in comment 5.
Thanks, this dump is quite helpful. Can you please go to Device Manager -> IDE ATA/ATAPI controllers -> Primary IDE Channel, then right click and properties, then "Advanced Settings", and say what is written in there as "Current Transfer Mode"? Thanks!
Created attachment 462424 [details] possible patch The second dump shows that PIO is taking place when the guest crashes. PIO has a special case to handle operation with disabled write cache, and that also uses bdrv_flush. The dump shows that Windows is writing registry data to the disk, so it's plausible that the cache is disabled and that this occurrence of bdrv_flush is causing the crash. The bug has been delayed to 5.7, however, we may still want to give this a shot after the WHQL test finishes.
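As a rough illustration of why the PIO path matters, the following Python sketch models a drive that flushes after a write whenever the write cache is disabled. It is a toy model, not the QEMU implementation: the `Drive` class, `pio_write` and `write_many` are hypothetical names, and flushing once per written sector is an assumption made for the example.

```python
class Drive:
    """Toy model of an emulated IDE drive (all names are hypothetical)."""

    def __init__(self, write_cache=True):
        self.write_cache = write_cache
        self.sectors = {}
        self.flushes = 0

    def flush(self):
        # In a real emulator this is where a slow fsync-like call would run.
        self.flushes += 1

    def pio_write(self, lba, data):
        self.sectors[lba] = data
        if not self.write_cache:
            # Cache disabled: data must be stable before the write completes.
            self.flush()


def write_many(drive, count):
    """Write `count` sectors and report how many flushes were triggered."""
    for lba in range(count):
        drive.pio_write(lba, b"\0" * 512)
    return drive.flushes


if __name__ == "__main__":
    print("flushes with cache enabled: ", write_many(Drive(write_cache=True), 8))
    print("flushes with cache disabled:", write_many(Drive(write_cache=False), 8))
```

With the cache disabled, every write in this model pays for a flush, so a single slow flush on the host can stall the guest even when no explicit flush command was issued.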
(In reply to comment #13) > Thanks, this dump is quite helpful. Can you please go to Device Manager -> IDE > ATA/ATAPI controllers -> Primary IDE Channel, then right click and properties, > then "Advanced Settings", and say what is written in there as "Current Transfer > Mode"? > > Thanks! I'll check that value ASAP. We had added 'EnumerateDevicesOverride=3' to avoid the BSOD for the new round of WHQL testing. We are now running this job without 'EnumerateDevicesOverride=3' and waiting for the BSOD.
(In reply to comment #13) > Thanks, this dump is quite helpful. Can you please go to Device Manager -> IDE > ATA/ATAPI controllers -> Primary IDE Channel, then right click and properties, > then "Advanced Settings", and say what is written in there as "Current Transfer > Mode"? > > Thanks! Hi Paolo, the BSOD occurred and I checked the "Current Mode" in the "Advanced Settings" tab for "ATA Channel 0" and "ATA Channel 1". Both are "Multi-Word DMA Mode2".
We will test it with the latest xenpv-win driver and then update the result in BZ.
This run of xenpv-win testing will not use IDE for disk stress/verification, so it wouldn't add any new information.
(In reply to comment #23) > This run of xenpv-win testing will not use IDE for disk stress/verification, so > it wouldn't add any new information. We can set EDO=3 and run the Disk Verification jobs in another environment, separate from the WHQL testing. Should we test with the xenpv-win build from brewroot, or do we need to wait until Bug 697368 is fixed before starting the test?
Since this is a xen bug (not xenpv-win), you can do any of the following: 1) consider the testing with -7 valid and VERIFY :) 2) use the same build you're using now, with EDO=3 3) wait for the next build (signed, with bug 697368 fixed---it's just a packaging hiccup)
(In reply to comment #25) > Since this is a xen bug (not xenpv-win), you can do any of the following: > > 1) consider the testing with -7 valid and VERIFY :) > > 2) use the same build you're using now, with EDO=3 > > 3) wait for the next build (signed, with bug 697368 fixed---it's just a > packaging hiccup) I am running Disk Verification with xenpv-win-1.3.4-8 and xen-3.0.3-128 on 2k8-64 with EDO=3. I will update the result after it is done.
Verified this bug on RHEL 5.7 with: xenpv-win-1.3.4-8 (not from brewweb; provided by Paolo), xen-3.0.3-128, kernel-xen-2.6.18-257.el5xen, 2k8-64 bit guest. The Disk Verification job passed with no BSOD.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously, the cache flush for IDE devices emulated via the qemu emulator was performed synchronously. When the flush process took too much time, the virtual CPU was stuck while the fsync call was running. This behavior sometimes caused Windows guests to terminate unexpectedly. With this update, cache flushes for IDE devices are done asynchronously, and a crash no longer occurs in the described scenario.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-1070.html