Bug 1411105
| Summary: | Windows Server 2008-32 crashes on startup with q35 if cdrom attached | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | ybendito |
| Component: | qemu-kvm-rhev | Assignee: | Ladi Prosek <lprosek> |
| Status: | CLOSED ERRATA | QA Contact: | jingzhao <jinzhao> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.4 | CC: | ailan, chayang, jinzhao, jsnow, juzhang, knoel, lijin, lprosek, michen, rbalakri, virt-maint, yvugenfi |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-rhev-2.8.0-5.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-01 23:42:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1418320 | | |
| Bug Blocks: | | | |
Description (ybendito, 2017-01-08 12:21:55 UTC)
(from https://bugzilla.redhat.com/show_bug.cgi?id=1408771) The corrupting pattern in several dump files looks a lot like SCSI sense buffer data, 18 bytes long:

```
f0 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00 00 00
```

Is there a known or in-progress problem related to this? Should it be considered a bug in the 2008-only cdrom/msahci path, or a problem in existing QEMU?

This is roughly what's happening: Windows issues the MODE SENSE (5a) ATAPI command with action = 0 and code = LUN Mapping (1b). Because QEMU does not support the LUN Mapping page, the command fails with ILLEGAL_REQUEST / ASC_INV_FIELD_IN_CMD_PACKET. Windows then issues the REQUEST SENSE (3) command to figure out why 5a failed, and because it is (very likely) not set up correctly, the DMA transfer of the sense data buffer corrupts guest memory, sometimes causing a BSOD later on.

It's hard to tell what exactly Windows is doing without extensive reverse engineering. But the story of a poorly tested code path (presumably real HW tends to support the LUN Mapping page?) is plausible.
This is where the DMA transfer is triggered, from the Windows point of view:

```
00 803d8530 803f0e64 nt!WRITE_REGISTER_ULONG+0xa
01 0x803f0e64
02 msahci!P_Running_WaitOnBSYDRQ+0x10a
03 msahci!P_Running_WaitOnFRE+0xcd
04 msahci!P_Running_WaitOnDET+0xef
05 msahci!P_Running+0x10b
06 msahci!P_Running_StartAttempt+0x36
07 msahci!AhciNonQueuedErrorRecovery+0x243
08 msahci!WorkerDispatch+0x60
09 ataport!IdeProcessMiniportDpcRequest+0x5e
0a ataport!IdePortCompletionDpc+0x6c
0b nt!KiRetireDpcList+0x147
0c nt!KiDispatchInterrupt+0x45
0d hal!HalpCheckForSoftwareInterrupt+0x64
0e hal!KfLowerIrql+0x64
0f ataport!IdeStartDeviceRequest+0x107
10 ataport!IssueCrbSync+0x30
11 ataport!IdeDiscoverDevice+0x131
12 ataport!IdeEnumerateDevices+0x8b
13 ataport!IdePortScanChannel+0x36
14 ataport!ChannelQueryBusRelation+0x3d
15 nt!IopProcessWorkItem+0x23
16 nt!ExpWorkerThread+0xfd
17 nt!PspSystemThreadStartup+0x9d
18 nt!KiThreadStartup+0x16
```

So, a more specific question for John: how hard would it be to support the MODE SENSE command with LUN Mapping (1b)? Thanks!

I tried the same Q35 VM config with Windows Server 2008 R2 64-bit and I see the same issue. It's just that nothing important happens to live in the corrupted memory region on 2008 R2. In my run the DMA physical address was 7ff40540. Then in windbg:

```
// get directory base
1: kd> !ptov 00187000
Amd64PtoV: pagedir 187000
...
7ff40000 fffff880`0299f000
...
1: kd> !address 0xfffff880`0299f000+0x540
...
Usage:
Base Address: fffff880`023a0000
End Address:  fffff8a0`00000000
Region Size:  0000001f`fdc60000
VA Type:      SystemPTEs   // <== doesn't look like a good DMA destination
```

(In reply to Ladi Prosek from comment #3)
> It's hard to tell what exactly Windows is doing without extensive reverse
> engineering. But the story with a poorly tested code path (presumably real
> HW tends to support the LUN Mapping page?) is plausible.

This was not a correct assessment. Also, comment 4 should be ignored.
Apparently the memory returned by AtaPortGetUnCachedExtension, where all the AHCI data structures live, really does identify as "SystemPTEs" in !address, as strange as it looks. The problem is somewhere else.

Here's a snippet from msahci.sys which sets up the DMA destination (data base addresses in the PRDT):

```
8cb2d14c call  msahci!AtaPortGetPhysicalAddress (8cb2d3ac)
8cb2d151 test  al,1
8cb2d153 jne   msahci!IRBtoPRDT+0x198 (8cb2d192)
8cb2d155 mov   dword ptr [esi+80h],eax
8cb2d15b test  dword ptr [edi+468h],80000000h
8cb2d165 je    msahci!IRBtoPRDT+0x173 (8cb2d16d)  [br=1]
8cb2d167 mov   dword ptr [esi+84h],edx
```

[esi+80h] is the lower 32 bits (DBA), [esi+84h] is the upper 32 bits (DBAU). The corruption we observe is caused by not setting the DBAU while the physical address is in fact >4GB and has a non-zero upper dword.

Side notes:
1) I couldn't originally repro this because I was running the VM with less than 4GB of RAM.
2) 4GB of RAM is enough to hit this because there will be pages >4GB due to all those memory gaps.

[edi+468h] holds the contents of the HBA Capabilities register, and bit 31 tested above is described in the spec like so:

"Supports 64-bit Addressing (S64A): Indicates whether the HBA can access 64-bit data structures. When set to '1', the HBA shall make the 32-bit upper bits of the port DMA Descriptor, the PRD Base, and each PRD entry read/write. When cleared to '0', these are read-only and treated as '0' by the HBA."

So whose fault is this? What Windows is doing is definitely odd: if the address has a non-zero upper 32 bits and the HBA claims not to support 64-bit DMA, it ignores the upper 32 bits and hopes that it will somehow work. Right. :) On the other hand, the QEMU HBA does support 64-bit DMA, so bit 31 in the HBA Capabilities register should be set. I have verified that setting it fixes the issue, i.e. it makes Windows correctly supply 64-bit physical addresses and it all works.
I have posted a QEMU patch to advertise the 64-bit capability:
https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg02596.html

And a SeaBIOS patch to correctly initialize the AHCI controller (found during testing; manifests as a hang or crash after reboot):
https://www.coreboot.org/pipermail/seabios/2017-January/011062.html

Bug 1418320 tracks the SeaBIOS fix in 7.4. This bug continues to track the QEMU fix.

Changing the component, as Windows guests are not supported in qemu-kvm.

Fix included in qemu-kvm-rhev-2.8.0-5.el7.

Reproduced the issue on qemu-kvm-rhev-2.6.0-29.el7; verified it on qemu-kvm-rhev-2.8.0-5.el7.

PS: the qemu command line:

```
/usr/libexec/qemu-kvm \
    -M q35 \
    -cpu SandyBridge \
    -nodefaults -rtc base=utc \
    -m 4G \
    -smp 2,sockets=2,cores=1,threads=1 \
    -enable-kvm \
    -name rhel7.4 \
    -uuid 990ea161-6b67-47b2-b803-19fb01d30d12 \
    -smbios type=1,manufacturer='Red Hat',product='RHEV Hypervisor',version=el6,serial=koTUXQrb,uuid=feebc8fd-f8b0-4e75-abc3-e63fcdb67170 \
    -k en-us \
    -serial unix:/tmp/console,server,nowait \
    -boot menu=on \
    -bios /usr/share/seabios/bios.bin \
    -chardev file,path=/home/test/seabios.log,id=seabios \
    -device isa-debugcon,chardev=seabios,iobase=0x402 \
    -qmp tcp::8887,server,nowait \
    -vga qxl \
    -spice port=5932,disable-ticketing \
    -device ioh3420,id=root.0,slot=1 \
    -drive file=/home/test/win8-32.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,werror=stop,rerror=stop \
    -device virtio-blk-pci,bus=root.0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
    -device ioh3420,id=root.1,slot=2 \
    -device ioh3420,id=root.2,slot=3 \
    -netdev tap,id=hostnet1 \
    -device virtio-net-pci,netdev=hostnet1,id=net1,mac=54:52:00:B6:40:22,bus=root.2 \
    -monitor stdio \
    -cdrom /home/en_windows_server_2008_datacenter_enterprise_standard_sp2_x64_dvd_342336.iso \
    -drive file=/usr/share/virtio-win/virtio-win-1.9.0.iso,if=none,media=cdrom,id=drive-ide1,format=raw \
    -device ide-drive,bus=ide.0,drive=drive-ide1,id=ide1
```

Thanks, Jing Zhao

Since the
problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392