Bug 552573 - Formatting a SCSI disk on a Windows Vista 64-bit guest causes Windows to hang
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
5.5
x86_64 Linux
medium Severity medium
: 5.4
Assigned To: Michal Novotny
Virtualization Bugs
Depends On: 515757 516177
Blocks: 514500
Reported: 2010-01-05 09:56 EST by Paolo Bonzini
Modified: 2014-02-02 17:37 EST
Fixed In Version: xen-3.0.3-112.el5
Doc Type: Bug Fix
Clone Of: 465116
Last Closed: 2011-01-13 17:19:53 EST


Attachments
Patch to fix Win64 SCSI operations (11.00 KB, patch)
2010-05-20 09:46 EDT, Michal Novotny
Patch to fix Win64 SCSI operations v2 (9.05 KB, patch)
2010-06-14 13:18 EDT, Michal Novotny

Description Paolo Bonzini 2010-01-05 09:56:53 EST
+++ This bug was initially created as a clone of Bug #465116 +++

Created an attachment (id=319123)
this is a screen shot of the crash

Description of problem: I installed Windows 2008 64-bit Standard Edition on a RHEL 5.3 Dom0. Twice now, the guest has crashed while formatting a disk. One time I got a memory dump; the other time I did not.


Version-Release number of selected component (if applicable):
kernel: 2.6.18-116.el5xen #1 SMP
RHEL: RHEL5.3-Server-20080922.0
Windows guest: 64 bit Windows 2008 Server Standard Edition.

How reproducible:
I was able to reproduce this twice in about 10 tries.

Steps to Reproduce:
1. Install the guest.
2. Shut down the guest, add the attached storage devices, then xm create the guest.
3. In Disk Management, format a 1 GB partition as FAT32.
  
Actual results:
Sometimes the guest crashes.

Expected results:
The disk should format and then come into service.

Additional info:
Attached is a screen shot of the blue screen. Also attached is the xen configuration file of the guest. I also have a memory dump file if anyone can read Windows dumps.

--- Additional comment from bburns@redhat.com on 2009-02-27 15:05:58 EDT ---

Asking Rik to have a look. Multi-threaded QEMU has a fix for this issue, but our QEMU does not. Please investigate for a possible solution.

--- Additional comment from riel@redhat.com on 2009-03-04 14:03:54 EDT ---

If we have correct code in KVM/RHEV qemu, it may be backportable to RHEL 5 after a few months of RHEV testing.

The upstream qemu code had a few known bugs with the AIO code and the SCSI emulation, last I heard.

--- Additional comment from riel@redhat.com on 2009-03-26 13:32:26 EDT ---

I have backported AIO to qemu's IDE and SCSI emulation.  Could you try out the
test RPMs on http://people.redhat.com/riel/.xen-aio/ to see if the timer irq still gets locked out from one virtual CPU?

--- Additional comment from bdonahue@redhat.com on 2009-03-26 14:54:40 EDT ---

I just formatted 3 drives on a 2-VCPU 64-bit Windows 2008 guest with no problems. It looks like this fixed the problem.

--- Additional comment from riel@redhat.com on 2009-03-26 16:56:51 EDT ---

Created an attachment (id=336893)
QEMU AIO infrastructure backport

--- Additional comment from riel@redhat.com on 2009-03-26 16:57:17 EDT ---

Created an attachment (id=336894)
QEMU AIO IDE backport

--- Additional comment from riel@redhat.com on 2009-03-26 16:57:42 EDT ---

Created an attachment (id=336895)
QEMU AIO SCSI backport

--- Additional comment from llim@redhat.com on 2009-08-04 15:00:45 EDT ---

Moving bug to assigned based on the following comment.

<https://bugzilla.redhat.com/show_bug.cgi?id=479339#c58>

--- Additional comment from riel@redhat.com on 2009-08-04 15:40:31 EDT ---

The bug appears to only be present in the emulated SCSI disks.  This means we can get away with reverting just the SCSI part of the AIO backport (xen-qemu-aio-scsi.patch):

diff -u -d -u -r1.287 xen.spec
--- xen.spec    3 Aug 2009 05:25:12 -0000       1.287
+++ xen.spec    4 Aug 2009 19:39:14 -0000
@@ -863,7 +863,7 @@
 # AIO backport for qemu
 %patch873 -p1
 %patch874 -p1
-%patch875 -p1
+# %patch875 -p1
 %patch876 -p1
 # Fix HVM time skew problems
 %patch877 -p1

--- Additional comment from lsmid@redhat.com on 2009-08-05 09:36:58 EDT ---

Blocker approved for RHEL 5.4 Release Candidate, see comment #29.

--- Additional comment from armbru@redhat.com on 2009-08-05 09:58:29 EDT ---

This bug is about IDE disks.  We do not know whether a similar bug exists for SCSI.  It would be useful if QA could test SCSI in addition to IDE, and if it fails for SCSI, file a separate bug for 5.5.
Comment 2 Paolo Bonzini 2010-01-06 13:55:46 EST
Yes, the SCSI part of AIO was reverted due to bugs, but the original bug 465116 was for IDE so it is closed now.
Comment 3 Michal Novotny 2010-03-17 05:26:18 EDT
(In reply to comment #2)
> Yes, the SCSI part of AIO was reverted due to bugs, but the original bug 465116
> was for IDE so it is closed now.    

Any update on this? Paolo? Rik?

Michal
Comment 4 Rik van Riel 2010-03-17 08:53:40 EDT
Michal, what is the question?
Comment 5 Michal Novotny 2010-03-17 08:57:59 EDT
(In reply to comment #4)
> Michal, what is the question?    

I was just asking whether you have any suggestions about this bug, since it *may* be caused by your AIO patches, but it may not be. Is it possible that one of your AIO patches causes this?

Michal
Comment 6 Rik van Riel 2010-03-17 10:28:02 EDT
Before the AIO patches were merged, the bug also happened with IDE disks.  Merging the AIO patches fixed the bug for IDE disks.

If you can make the AIO patches work for SCSI, it may also be fixed for SCSI disks. However, the SCSI code in qemu has all kinds of bugs that make it difficult to get the AIO code to work correctly.
Comment 7 Michal Novotny 2010-04-20 11:17:32 EDT
Version-Release number of selected component (if applicable):
xen version: xen-3.0.3-105.el5virttest24
kernel: 2.6.18-194.el5xen #1 SMP
Guest: Windows 2008 Server x64 Datacenter Edition.

Steps:
1. Install the guest.
2. dd if=/dev/zero of=/some/path/to/new/image bs=1G count=1
3. Shut down the guest, add the attached storage devices, then xm create the guest.

In the guest:
 4. Open Computer Management in MMC.
 5. Click Disk Management and select the disk; it should now show as Unknown, 1.00 GB and Offline.
 6. Right-click the disk and select Online; it should go online and the Offline string should change to Not Initialized.
 7. Right-click again, click Initialize Disk, and select Disk 1 with the MBR partition style.
 8. The disk should change to Basic, 1023 MB and Online.
 9. Right-click the bar to the right of the description and select Create simple volume.
 10. Click Next, select 1021 MB, assign a drive letter to it and format it as FAT32.
 11. Once the partition is created, open it and copy some data onto it.
 12. Reboot the guest.
 13. Start the guest, log in and try to access the data on the disk.

I see no problem here. The data is present too.

Paolo, could you please check whether this is still a problem?

Michal
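
Steps 2 and 3 above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `create_zero_image` and `xen_disk_entry` are made up for this sketch, and it assumes the guest config uses the `file:` backend with an `sdX` device name to place the disk on the emulated SCSI controller. The demo writes a small image so it runs quickly; the report used 1 GB.

```python
import os
import tempfile


def create_zero_image(path, size_bytes, chunk=1024 * 1024):
    """Write size_bytes of zeros to path, like dd if=/dev/zero (not sparse)."""
    zeros = bytes(chunk)
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(zeros[:n])
            remaining -= n
    return os.path.getsize(path)


def xen_disk_entry(path, device="sda", mode="w"):
    """Format one entry for the disk = [...] list of an xm guest config.

    Assumption: an sdX device name selects the emulated SCSI controller.
    """
    return "file:%s,%s,%s" % (path, device, mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        img = os.path.join(d, "scsi-test.img")
        size = create_zero_image(img, 4 * 1024 * 1024)  # 4 MiB demo image
        print(size)
        print(xen_disk_entry(img))
```

The returned string would be appended to the existing `disk` list in the guest configuration before running xm create.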
Comment 8 Paolo Bonzini 2010-04-21 08:04:47 EDT
The bug doesn't always reproduce.  There's nothing that could have fixed it.
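
To put a rough number on this, here is a small sketch of how many clean runs are needed before a pass means much. It assumes the original report's rate of about 2 failures in 10 tries still applies and that attempts are independent; both are assumptions, and the function names are made up for this example.

```python
import math


def p_clean_run(p_fail, attempts):
    """Probability of seeing no failure at all in `attempts` independent tries."""
    return (1.0 - p_fail) ** attempts


def attempts_needed(p_fail, confidence=0.95):
    """Smallest number of clean tries after which a still-present bug
    would have shown up with probability >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    p = 0.2  # assumed failure rate: 2 crashes in about 10 tries
    print(p_clean_run(p, 1))
    print(attempts_needed(p))
```

Under these assumptions a single successful format passes about 80% of the time even with the bug present, and roughly 14 consecutive clean formats are needed for 95% confidence.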
Comment 9 Michal Novotny 2010-04-21 08:07:43 EDT
(In reply to comment #8)
> The bug doesn't reproduce always.  There's nothing that could have fixed it.    

Oh, OK. I'm now looking at bug 516177, which is closely connected, so I am investigating the SCSI code in Xen-iommu right now. I hope to hit the same error so that I can debug the problem.

Thanks,
Michal
Comment 10 Michal Novotny 2010-05-05 06:03:38 EDT
(In reply to comment #9)

Well, investigation and testing showed that even *without* the AIO patches (as in BZ #516177) a Windows guest could not format the disk; see bug 516177 comment #11. Normal Windows write operations work fine, so the problem is with the low-level formatting in Windows.

Michal
Comment 11 Michal Novotny 2010-05-19 08:09:10 EDT
(In reply to comment #10)

Well, the problem here is the missing 64-bit DMA Block Move support in the LSI SCSI controller. But even after implementing (backporting) 64-bit DMA Block Move, it still does not work without the patch I found on the kerneltrap kvm list [1], which skips the phase mismatch. As I read in the comment from Ryan Harper at 2009-01-06 07:08 (which includes the patch), it is more a workaround than a real fix, so some work on this is still needed (although in my testing the workaround works fine and causes no other issues).

Michal

[1] http://kerneltrap.com/mailarchive/linux-kvm/2009/1/6/4610304
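
As a rough illustration of why the upper address dword matters for a 64-bit block move, the sketch below combines two 32-bit halves into one guest-physical address and shows what is lost if only the low half is used. It does not reproduce the LSI register layout or the actual patch; `addr64` and `truncated_addr32` are names invented for this example, and the sample address is arbitrary.

```python
MASK32 = 0xFFFFFFFF


def addr64(high, low):
    """Combine two 32-bit dwords into one 64-bit address."""
    return ((high & MASK32) << 32) | (low & MASK32)


def truncated_addr32(addr):
    """What a model that only honours the low dword would end up using."""
    return addr & MASK32


if __name__ == "__main__":
    # Arbitrary example: a guest buffer placed above the 4 GiB boundary.
    full = addr64(0x1, 0x20000000)
    print(hex(full))
    print(hex(truncated_addr32(full)))
```

A 64-bit guest is free to place DMA buffers above 4 GiB, so a controller model that cannot carry the high dword would read or write the wrong guest memory, or has to reject the instruction.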
Comment 12 Michal Novotny 2010-05-20 09:46:38 EDT
Created attachment 415418 [details]
Patch to fix Win64 SCSI operations

Hi,
these are the patches for BZ #552573 (Formatting a SCSI disk on a
Windows Vista 64-bit guest causes Windows to hang). They require my
AIO patches to be applied first. The main issue here was the missing
implementation of 64-bit Block Move support in the LSI SCSI
controller code. That was not the only change required to make it
work (not just for Win64 formatting but also for booting); a hack
for Win64 was necessary as well. I wrote the hack part (part 2: Make
a Win64 LSI drivers hack for LSI SCSI controller), and it was tested
and works fine for all the guests tested (see the next paragraph for
more info). Without this patch applied, all SCSI operations took a
very long time in my testing, which left the guest almost unusable.
The hack is based on a thread on the kerneltrap linux-kvm mailing
list ([1], comment by Ryan Harper on 06 Jan 2009 at 07:08), modified
so that it is used only when a Windows 64-bit guest is running (this
is detected from the offset of the byte write operation in the
lsi_reg_writeb() function).

It has been tested with both 32-bit and 64-bit Windows guests (XP/2003,
Vista/2008) and with both 32-bit and 64-bit RHEL-{5|6} guests, and
it worked fine. According to the log files, the only guests that
used the hack were 64-bit versions of Windows, so this
really does seem to be specific to the Win64 LSI driver implementation.

Brew: https://brewweb.devel.redhat.com/taskinfo?taskID=2460313

Upstream relationship: noted per part; one is a direct backport,
                       the second is a patch I wrote myself, based on
                       the kerneltrap linux-kvm mailing list ([1])

So please review this one.

Thanks,
Michal

[1] http://kerneltrap.com/mailarchive/linux-kvm/2009/1/6/4610304
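
The gating described above (apply the workaround only after a register write at a particular offset has been seen) can be modelled in a few lines. This is a toy Python model, not the C code from the attached patch: the class, the method names and the offset value are all hypothetical, and the real offset is not stated in this report.

```python
class LsiModel:
    """Toy model of an offset-gated workaround in a register write handler."""

    # HYPOTHETICAL placeholder; the real offset is only in the attached patch.
    HYPOTHETICAL_WIN64_MARKER_OFFSET = 0xFF

    def __init__(self):
        self.regs = bytearray(256)
        self.win64_quirk = False

    def reg_writeb(self, offset, val):
        """Store a byte write and remember whether the marker offset was hit."""
        self.regs[offset & 0xFF] = val & 0xFF
        if offset == self.HYPOTHETICAL_WIN64_MARKER_OFFSET:
            self.win64_quirk = True

    def on_phase_mismatch(self):
        """Skip the phase mismatch only for guests that triggered the quirk."""
        return "skip" if self.win64_quirk else "raise"


if __name__ == "__main__":
    m = LsiModel()
    m.reg_writeb(0x10, 0x01)
    print(m.on_phase_mismatch())
    m.reg_writeb(LsiModel.HYPOTHETICAL_WIN64_MARKER_OFFSET, 0x01)
    print(m.on_phase_mismatch())
```

The point of gating on guest behaviour is that guests which never write to that offset keep the unmodified code path, which matches the log observation that only 64-bit Windows guests used the hack.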
Comment 16 Michal Novotny 2010-06-14 13:18:43 EDT
Created attachment 423914 [details]
Patch to fix Win64 SCSI operations v2

This is the second version of the patch.

Michal
Comment 33 Pengzhen Cao 2010-08-25 03:08:04 EDT
Changing to ASSIGNED as it failed on win2k3.
Comment 34 Paolo Bonzini 2010-08-25 04:23:27 EDT
Since the bug is about the _hang_, I think this can be moved to verified and a new bug opened for the format failure in Windows 2003.

BTW Michal, you should _never_ assume it is a bug in the Windows drivers, period. :)  As you already witnessed once, it is much more likely that the bug is in QEMU's emulation code.
Comment 35 Pengzhen Cao 2010-08-25 05:46:37 EDT
For Windows 2k3, I tried again to format the disk with a drive letter assigned to it; this time everything worked and the format was successful.
I think this is why it worked for Michal before but failed in his recent try.

I also tried a qemu IDE disk, and it too failed immediately without a drive letter assigned. So I think this is a Windows issue or a qemu issue.

So this bug can be considered verified.
Comment 36 Pengzhen Cao 2010-08-25 05:56:29 EDT
FYI.
Changing one value in the registry turns off the Windows LSI driver's error control:
"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\sym_hi\ErrorControl"
Changing it from 1 to 0 eliminates the several-second hang when the format begins on win2k8 or Vista.
I thought this could resolve the win2k3 issue too, but that turned out to be down to the drive letter, which is really strange.
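
For anyone who wants to apply the registry change above repeatably, here is a minimal sketch that builds the text of a .reg file for it. It treats sym_hi as the key and ErrorControl as a DWORD value under it, which is an assumption based on the path quoted above; the helper name `reg_dword_file` is made up, and the sketch only produces the text (saving it with the right encoding and importing it in the guest are left out).

```python
def reg_dword_file(key_path, value_name, value):
    """Return the text of a .reg file that sets one DWORD value."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[%s]" % key_path,
        '"%s"=dword:%08x' % (value_name, value),
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    key = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\sym_hi"
    print(reg_dword_file(key, "ErrorControl", 0))
```

Setting the value back to 1 with the same helper restores the original state described in the comment.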
Comment 38 errata-xmlrpc 2011-01-13 17:19:53 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html
