+++ This bug was initially created as a clone of Bug #465116 +++

Created an attachment (id=319123)
this is a screen shot of the crash

Description of problem:
I installed Windows 2008 64 bit Standard Edition on a RHEL 5.3 Dom 0. Twice now, while formatting a disk, the guest crashed. One time I got a memory dump; the other time I did not.

Version-Release number of selected component (if applicable):
kernel: 2.6.18-116.el5xen #1 SMP
RHEL: RHEL5.3-Server-20080922.0
Windows guest: 64 bit Windows 2008 Server Standard Edition.

How reproducible:
I was able to reproduce this twice in about 10 tries.

Steps to Reproduce:
1. Install the guest.
2. Shut down the guest, add attached storage devices, xm create guest.
3. Within disk manager, format a FAT32 partition of 1GB size.

Actual results:
Sometimes the guest crashes.

Expected results:
The disk should format and then come into service.

Additional info:
Attached is a screen shot of the blue screen. Also attached is the xen configuration file of the guest. I also have a memory dump file if anyone can read Windows dumps.

--- Additional comment from bburns on 2009-02-27 15:05:58 EDT ---

Asking Rik to have a look. Multi-threaded QEMU has a fix for this issue, but our QEMU does not. Please investigate for a possible solution.

--- Additional comment from riel on 2009-03-04 14:03:54 EDT ---

If we have correct code in KVM/RHEV qemu, it may be backportable to RHEL 5 after a few months of RHEV testing.

The upstream qemu code had a few known bugs with the AIO code and the SCSI emulation, last I heard.

--- Additional comment from riel on 2009-03-26 13:32:26 EDT ---

I have backported AIO to qemu's IDE and SCSI emulation.

Could you try out the test RPMs on http://people.redhat.com/riel/.xen-aio/ to see if the timer irq still gets locked out from one virtual CPU?

--- Additional comment from bdonahue on 2009-03-26 14:54:40 EDT ---

I just formatted 3 drives on a 2-VCPU 64 bit Windows 2008 guest with no problems. It looks like this fixed the problem.

--- Additional comment from riel on 2009-03-26 16:56:51 EDT ---

Created an attachment (id=336893)
QEMU AIO infrastructure backport

--- Additional comment from riel on 2009-03-26 16:57:17 EDT ---

Created an attachment (id=336894)
QEMU AIO IDE backport

--- Additional comment from riel on 2009-03-26 16:57:42 EDT ---

Created an attachment (id=336895)
QEMU AIO SCSI backport

--- Additional comment from llim on 2009-08-04 15:00:45 EDT ---

Moving bug to assigned based on the following comment.
<https://bugzilla.redhat.com/show_bug.cgi?id=479339#c58>

--- Additional comment from riel on 2009-08-04 15:40:31 EDT ---

The bug appears to only be present in the emulated SCSI disks.

This means we can get away with reverting just the SCSI part of the AIO backport (xen-qemu-aio-scsi.patch):

diff -u -d -u -r1.287 xen.spec
--- xen.spec 3 Aug 2009 05:25:12 -0000 1.287
+++ xen.spec 4 Aug 2009 19:39:14 -0000
@@ -863,7 +863,7 @@
 # AIO backport for qemu
 %patch873 -p1
 %patch874 -p1
-%patch875 -p1
+# %patch875 -p1
 %patch876 -p1
 # Fix HVM time skew problems
 %patch877 -p1

--- Additional comment from lsmid on 2009-08-05 09:36:58 EDT ---

Blocker approved for RHEL 5.4 Release Candidate, see comment #29.

--- Additional comment from armbru on 2009-08-05 09:58:29 EDT ---

This bug is about IDE disks. We do not know whether a similar bug exists for SCSI. It would be useful if QA could test SCSI in addition to IDE, and if it fails for SCSI, file a separate bug for 5.5.

Yes, the SCSI part of AIO was reverted due to bugs, but the original bug 465116 was for IDE so it is closed now.
(In reply to comment #2)
> Yes, the SCSI part of AIO was reverted due to bugs, but the original bug 465116
> was for IDE so it is closed now.

Any update on this? Paolo? Rik?

Michal

Michal, what is the question?
(In reply to comment #4)
> Michal, what is the question?

I was just asking whether you have any suggestions about this bug, since it *may* be caused by your AIO patches, though it may not be. Is it possible that one of your AIO patches causes this?

Michal

Before the AIO patches were merged, the bug also happened with IDE disks. Merging the AIO patches fixed the bug for IDE disks. If you can make the AIO patches work for SCSI, it may also be fixed for SCSI disks. However, the SCSI code in qemu has all kinds of bugs that make it difficult to get the AIO code to work correctly.
Version-Release number of selected component (if applicable):
xen version: xen-3.0.3-105.el5virttest24
kernel: 2.6.18-194.el5xen #1 SMP
Guest: Windows 2008 Server x64 Datacenter Edition.

Steps:
1. Install the guest.
2. dd if=/dev/zero of=/some/path/to/new/image bs=1G count=1
3. Shut down the guest, add attached storage devices, xm create guest.

In guest:
4. Open Computer Management in MMC.
5. Click Disk Management and select the disk; it should now show as Unknown, 1.00 GB and Offline.
6. Right-click the disk and select Online; it should go online and the Offline string should change to Not Initialized.
7. Right-click and click Initialize Disk, then select Disk 1 with the MBR partition style.
8. It should change to Basic, 1023 MB and Online.
9. Right-click the bar to the right of the description and select Create simple volume.
10. Click Next and select 1021 MB, assign a letter to it and format as FAT32.
11. The partition is created; open it and copy some data onto it.
12. Reboot the guest.
13. Start the guest, log in and try to access the data on the disk.

I see no problem here; the data is present too. Paolo, could you please check whether it is still a problem for you?

Michal

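Step 2 in the comment above creates the 1 GB test image with dd. For anyone scripting the reproduction, here is a minimal Python sketch of the same step. It is an assumption-laden stand-in, not what was used in the test: the function name create_blank_image is hypothetical, and it uses truncate(), which produces a sparse file that reads back as zeros, whereas dd from /dev/zero actually writes the zero bytes.

```python
import os
import tempfile


def create_blank_image(path, size_bytes=1 << 30):
    """Create a zero-filled image file of size_bytes and return its size.

    Hypothetical helper: truncate() extends the file with zeros and leaves
    it sparse where the filesystem supports that, unlike dd, which writes
    every byte.
    """
    with open(path, "wb") as f:
        f.truncate(size_bytes)
    return os.path.getsize(path)


if __name__ == "__main__":
    # Small demo size so the example is cheap to run; the default of
    # 1 << 30 bytes corresponds to bs=1G count=1 in the dd command.
    with tempfile.TemporaryDirectory() as tmp:
        image = os.path.join(tmp, "disk.img")
        print(create_blank_image(image, 1 << 20))
```

The resulting file can be added to the guest configuration the same way as the dd-created image; whether a sparse backing file changes the reproduction rate has not been tested here.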
The bug doesn't reproduce always. There's nothing that could have fixed it.
(In reply to comment #8)
> The bug doesn't reproduce always. There's nothing that could have fixed it.

Oh, ok, I'm now looking at bug 516177 which is closely connected so I am investigating the code for SCSI in Xen-iommu right now so I hope to get such an error to be able to debug the problem.

Thanks,
Michal

(In reply to comment #9)
> (In reply to comment #8)
> > The bug doesn't reproduce always. There's nothing that could have fixed it.
>
> Oh, ok, I'm now looking at bug 516177 which is closely connected so I am
> investigating the code for SCSI in Xen-iommu right now so I hope to get such an
> error to be able to debug the problem.
>
> Thanks,
> Michal

Well, the investigation and testing was showing that even *without* AIO patches (as in BZ #516177 ) it was not working for Windows guest to format it. see bug 516177 comment #11 but the normal Windows write operation is working fine. The problem is with the low-level formatting in Windows.

Michal

(In reply to comment #10)
> (In reply to comment #9)
> > (In reply to comment #8)
> > > The bug doesn't reproduce always. There's nothing that could have fixed it.
> >
> > Oh, ok, I'm now looking at bug 516177 which is closely connected so I am
> > investigating the code for SCSI in Xen-iommu right now so I hope to get such an
> > error to be able to debug the problem.
> >
> > Thanks,
> > Michal
>
> Well, the investigation and testing was showing that even *without* AIO patches
> (as in BZ #516177 ) it was not working for Windows guest to format it. see bug
> 516177 comment #11 but the normal Windows write operation is working fine. The
> problem is with the low-level formatting in Windows.
>
> Michal

Well, the problem here is the missing 64-bit DMA Block Move support in the LSI SCSI controller. But even with 64-bit DMA Block Move implemented (backported), it still does not work without the patch I found on the kerneltrap kvm list [1], which skips the phase mismatch. As Ryan Harper notes in his comment of 2009-01-06 07:08 (which includes the patch), it is more a workaround than a real fix, so some work on this is still needed (although the workaround seems to work fine and did not cause any other issues in my testing).

Michal

[1] http://kerneltrap.com/mailarchive/linux-kvm/2009/1/6/4610304

Created attachment 415418
Patch to fix Win64 SCSI operations

Hi,
these are the patches for BZ #552573 (Formatting a SCSI disk on Windows Vista 64-bit guest causes Windows to hang). They require my AIO patches to be applied first; otherwise they do not apply.

The main issue here was the missing implementation of 64-bit Block Move support in the LSI SCSI controller code. That was not the only patch required to make it work (not just for Win64 formatting but also to boot the guest); a hack for Win64 was necessary as well. I wrote the hack part (part 2: Make a Win64 LSI drivers hack for LSI SCSI controller), and it was tested and worked fine for all the guests tested (see the next paragraph for details). Without this patch applied, all SCSI operations took a very long time in my testing, which left the guest almost unusable. The hack is based on a thread on the kerneltrap linux-kvm mailing list (see [1], comment by Ryan Harper on 06 Jan 2009 at 07:08), modified so that it is used only when a Windows 64-bit guest is running (this is detected by the offset number of the byte write operation in the lsi_reg_writeb() function).

It has been tested with Windows 32-bit and 64-bit guests (XP/2003, Vista/2008) and with RHEL-{5|6} 32-bit and 64-bit guests, and it worked fine. According to the log files, the only guests that used the hack were 64-bit versions of Windows, so this really seems to be specific to the Win64 LSI driver implementation.

Brew: https://brewweb.devel.redhat.com/taskinfo?taskID=2460313

Upstream relationship: described per part; one is a direct backport, the second is a patch I wrote myself based on the kerneltrap linux-kvm mailing list ([1]).

So please review this one.

Thanks,
Michal

[1] http://kerneltrap.com/mailarchive/linux-kvm/2009/1/6/4610304

Created attachment 423914
Patch to fix Win64 SCSI operations v2

This is the second version of the patch.

Michal

Changing to ASSIGNED as it failed on win2k3.
Since the bug is about the _hang_, I think this can be moved to verified and a new bug opened for the format failure in Windows 2003. BTW Michal, you should _never_ assume it is a bug in Windows drivers, period. :) As you already witnessed once, it is much, much more likely that the bug is in QEMU's emulation code.
For Windows 2k3, I tried again to format the disk, this time with a drive letter assigned to it, and everything worked: the format was successful. I think this is why Michal saw it work before but fail in his recent try. I also tried with a QEMU IDE disk, and it too failed immediately when no drive letter was assigned. So I think this is a Windows issue or a QEMU issue, and this bug could be considered verified.
FYI: changing one key in the registry can turn off the Windows LSI driver's error control: "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\sym_hi\ErrorControl". Changing this key from 1 to 0 eliminates the several-second hang when the format begins on win2k8 or Vista. I thought this could resolve the win2k3 issue too, but that turned out to be the drive letter, which is really strange.
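For reference, the registry change described above could be scripted rather than made by hand in regedit. The sketch below only builds the argument list for the standard Windows reg add command; the function name build_reg_command is hypothetical, and it assumes ErrorControl is stored as a REG_DWORD, as it normally is for entries under Services. It does not run anything, so it can be checked on any machine before being used inside the guest.

```python
def build_reg_command(service="sym_hi", value=0):
    """Return the argv for a `reg add` call that sets ErrorControl.

    Hypothetical helper for illustration. The defaults mirror the comment
    above: service key sym_hi, ErrorControl changed from 1 to 0.
    """
    key = "HKLM\\SYSTEM\\CurrentControlSet\\Services\\" + service
    return [
        "reg", "add", key,
        "/v", "ErrorControl",   # value name under the service key
        "/t", "REG_DWORD",      # assumed value type
        "/d", str(value),       # new data
        "/f",                   # overwrite without prompting
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it in the guest.
    print(" ".join(build_reg_command()))
```

Inside the guest, the list could be passed to subprocess.run() from an administrator prompt. A reboot is probably needed before the driver picks up the new value, though that was not stated in the comment above.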
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0031.html