Bug 1454641
Summary: | Windows 10 BSOD when using rhel6.4.0/rhel6.5.0/rhel6.6.0
---|---
Product: | Red Hat Enterprise Linux 7
Component: | qemu-kvm-rhev
Version: | 7.4
Hardware: | x86_64
OS: | Windows
Status: | CLOSED ERRATA
Severity: | high
Priority: | high
Keywords: | Regression
Target Milestone: | rc
Fixed In Version: | qemu-kvm-rhev-2.9.0-9.el7
Reporter: | Guo, Zhiyi <zhguo>
Assignee: | Eduardo Habkost <ehabkost>
QA Contact: | Guo, Zhiyi <zhguo>
CC: | chayang, dgilbert, juzhang, knoel, lijin, pbonzini, rkrcmar, virt-bugs, virt-maint, vrozenfe, wyu, zhguo
Type: | Bug
Last Closed: | 2017-08-02 04:41:00 UTC

Description
Guo, Zhiyi
2017-05-23 08:50:54 UTC
Created attachment 1284530
BSOD with qemu-kvm-rhev-2.6.0-28.el7_3.10 + -machine rhel6.6.0
I could reproduce it on a similar host.
Command line used:

    /usr/libexec/qemu-kvm \
    -S \
    -cpu Nehalem,+erms,enforce \
    -m 2048 \
    -realtime mlock=off \
    -smp 1,sockets=1,cores=1,threads=1 \
    -uuid 0ebfba3a-83a1-144f-4e5d-9f7731677231 \
    -no-user-config \
    -nodefaults \
    -rtc base=utc \
    -no-shutdown \
    -boot menu=off,strict=on \
    -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
    -device ide-drive,drive=drive-virtio-disk0,id=virtio-disk0 \
    -vnc 0.0.0.0:1 \
    -vga std \
    -boot order=d \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
    -monitor stdio \
    -msg timestamp=on \
    -drive file=/var/lib/libvirt/images/windows_10_dbgenabled_test.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native \
    -machine rhel6.6.0,accel=kvm,usb=off \
    -device intel-hda,id=sound0 \
    -device hda-duplex,id=sound0-codec0 \
    -device usb-tablet

Results:
* qemu-kvm-rhev-2.9.0-7.el7 + -machine rhel6.6.0: boots normally
* qemu-kvm-rhev-2.6.0-28.el7_3.10 + -machine rhel6.6.0: crash (see screenshot)
* qemu-kvm-rhev-2.6.0-28.el7_3.10.x86_64 + -machine pc-i440fx-rhel7.4.0: boots normally
The BSOD screenshot was captured after setting HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl\DisplayParameters (DWORD) = 1
The exception info seems to indicate an illegal instruction. Is anybody more experienced in debugging Windows kernel code able to figure out which instruction is triggering the exception?

When a SYSTEM_THREAD_EXCEPTION_NOT_HANDLED bugcheck happens, the second parameter usually indicates the address of the instruction that caused KiTrap0D. The next time the problem happens, you can check this value and use it from the QEMU monitor to dump and disassemble Windows guest code (xp /4i physical_address_from_2nd_bugcheck_param - couple_of_bytes).

Disassembly of the address seems to indicate a clflushopt instruction. Example below:

gva: 0xfffff8038b7637a0, gpa: 0x25637a0

    (qemu) xp /256b 0x2563700
    [...]
    0000000002563788: 0xca 0x75 0xa5 0x0f 0xae 0xf8 0xc3 0xcc
    0000000002563790: 0xcc 0xcc 0xcc 0xcc 0xcc 0x66 0x66 0x66
    0000000002563798: 0x0f 0x1f 0x84 0x00 0x00 0x00 0x00 0x00
    00000000025637a0: 0x66 0x0f 0xae 0x39 0x49 0x03 0xc8 0x49
    00000000025637a8: 0x2b 0xd0 0x75 0xf4 0x0f 0xae 0xf8 0xc3
    00000000025637b0: 0xcc 0xcc 0xcc 0xcc 0xcc 0xcc 0x66 0x66
    [...]

    (qemu) xp /256i 0x2563700
    [...]
    0x0000000002563793: int3
    0x0000000002563794: int3
    0x0000000002563795: nopw 0x0(%rax,%rax,1)
    0x00000000025637a0: data16
    0x00000000025637a1: clflush (%rcx)
    0x00000000025637a4: add %r8,%rcx
    0x00000000025637a7: sub %r8,%rdx
    0x00000000025637aa: jne 0x25637a0
    0x00000000025637ac: sfence
    0x00000000025637af: retq
    0x00000000025637b0: int3
    0x00000000025637b1: int3

clflushopt is not enabled in CPUID, though. I will take a look at other CPUID data and compare with the rhel7.4 machine-type and RHEL-7.3 qemu-kvm.

Is anybody with access to a Windows kernel debugger able to get a stack trace and a memory dump, to find out what exactly is causing this function to be called?

Found the root cause: qemu-kvm-rhev-2.6.0-28.el7_3.10 sets CPUID[0].EAX=7. qemu-kvm-rhev-2.9.0-7.el7 sets CPUID[0].EAX=4. It can be worked around using "-cpu Nehalem,+erms,level=7,enforce".

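The monitor output above prints the bytes at 0x25637a0 as `data16` followed by `clflush (%rcx)`. The sequence `66 0f ae 39` is the encoding of clflushopt with a memory operand, which an older disassembler may show as a prefixed clflush. The following is a minimal sketch of decoding that one opcode group by hand; the function name `decode_0f_ae` is made up for this illustration, and it handles only the three forms that appear in the dump, not the full instruction set.

```python
# Hand-decode the "0F AE /7" encodings seen in the dump.
# Illustrative only: handles register-indirect operands without SIB or
# displacement bytes, which is all the dump above contains.

REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_0f_ae(code: bytes) -> str:
    """Return an AT&T-style mnemonic for a 0F AE /7 instruction, or 'unknown'."""
    has_66 = code[:1] == b"\x66"          # operand-size prefix ("data16")
    body = code[1:] if has_66 else code
    if len(body) < 3 or body[:2] != b"\x0f\xae":
        return "unknown"

    modrm = body[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if reg != 7:
        return "unknown"

    if mod == 3:
        # Register form: 0F AE F8 is sfence (no prefix).
        return "sfence" if (rm == 0 and not has_66) else "unknown"

    # Memory form: rm 4 (SIB) and rm 5 (RIP-relative) need extra bytes.
    if mod != 0 or rm in (4, 5):
        return "unknown"

    name = "clflushopt" if has_66 else "clflush"
    return f"{name} (%{REG64[rm]})"


print(decode_0f_ae(bytes([0x66, 0x0F, 0xAE, 0x39])))  # clflushopt (%rcx)
print(decode_0f_ae(bytes([0x0F, 0xAE, 0x39])))        # clflush (%rcx)
print(decode_0f_ae(bytes([0x0F, 0xAE, 0xF8])))        # sfence
```

The only difference between clflush and clflushopt in this encoding is the 0x66 prefix, which is why the disassembly shows two lines for a single four-byte instruction.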
It looks like Windows 10 assumes a non-zero value for CPUID[7] if it's unavailable. I am now reviewing the compat code to understand why we broke compatibility on CPUID[0].EAX.

Upstream fix submitted:

    From: Eduardo Habkost <ehabkost>
    Subject: [PATCH] pc: Use "min-[x]level" on compat_props
    Date: Mon, 5 Jun 2017 12:56:45 -0300
    Message-Id: <20170605155645.19226-1-ehabkost>

Fix included in qemu-kvm-rhev-2.9.0-9.el7

Verified this issue against qemu-kvm-rhev-2.9.0-9.el7.x86_64. The Windows 10 guest boots normally with this command line:

    /usr/libexec/qemu-kvm \
    -S \
    -cpu Nehalem,+erms,enforce \
    -m 2048 \
    -realtime mlock=off \
    -smp 1,sockets=1,cores=1,threads=1 \
    -uuid 0ebfba3a-83a1-144f-4e5d-9f7731677231 \
    -no-user-config \
    -nodefaults \
    -rtc base=utc \
    -no-shutdown \
    -boot menu=off,strict=on \
    -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
    -device ide-drive,drive=drive-virtio-disk0,id=virtio-disk0 \
    -vnc 0.0.0.0:28 \
    -vga std \
    -boot order=d \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
    -monitor stdio \
    -cdrom en_windows_10_enterprise_version_1607_updated_jul_2016_x64_dvd_9054264.iso \
    -msg timestamp=on \
    -drive file=/root/zhguo/rhel74-64-virtio-scsi.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native \
    -machine rhel6.6.0,accel=kvm,usb=off

The rhel6.5.0 and rhel6.4.0 machine types also work. With the same command line and a RHEL 7.4 guest, the guest also boots.

Verified per comment 26

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392
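
As background for the root-cause discussion in this report (CPUID[0].EAX of 4 versus 7, with the leaf-7 feature erms enabled), here is a small simulation of why the maximum basic leaf matters. It assumes the Intel-documented behaviour that a query for a basic leaf above the maximum returns the data of the highest basic leaf, and it assumes a guest that reads leaf 7 without checking the maximum first. Whether Windows 10 does exactly this is not confirmed in this report. All register values and helper names below are made up for illustration.

```python
# Simulate CPUID lookups to show how an out-of-range leaf-7 query can
# report a feature that was never enabled. Illustrative values only.

CLFLUSHOPT_EBX_BIT = 1 << 23   # CPUID.(EAX=7,ECX=0):EBX bit 23
ERMS_EBX_BIT = 1 << 9          # CPUID.(EAX=7,ECX=0):EBX bit 9


def make_cpuid(leaves):
    """Build cpuid(leaf) over a dict {leaf: (eax, ebx, ecx, edx)}."""
    max_basic_leaf = leaves[0][0]   # CPUID[0].EAX is the "level"

    def cpuid(leaf):
        # Assumed Intel behaviour: out-of-range basic leaves alias the
        # highest basic leaf instead of returning zeros.
        if leaf > max_basic_leaf:
            leaf = max_basic_leaf
        return leaves.get(leaf, (0, 0, 0, 0))

    return cpuid


def naive_has_clflushopt(cpuid):
    """Read leaf 7 without checking that it exists."""
    return bool(cpuid(7)[1] & CLFLUSHOPT_EBX_BIT)


def checked_has_clflushopt(cpuid):
    """Check the maximum basic leaf before trusting leaf 7."""
    if cpuid(0)[0] < 7:
        return False
    return bool(cpuid(7)[1] & CLFLUSHOPT_EBX_BIT)


# Hypothetical leaf 4 EBX for an 8-way cache with 64-byte lines:
# (ways - 1) << 22 | (line_size - 1) = 0x01C0003F, which has bit 23 set.
LEAF4_EXAMPLE = (0, 0x01C0003F, 0, 0)

level4 = make_cpuid({0: (4, 0, 0, 0), 4: LEAF4_EXAMPLE})
level7 = make_cpuid({0: (7, 0, 0, 0), 4: LEAF4_EXAMPLE,
                     7: (0, ERMS_EBX_BIT, 0, 0)})

print(naive_has_clflushopt(level4))    # True: leaf 4 data misread as leaf 7
print(checked_has_clflushopt(level4))  # False
print(naive_has_clflushopt(level7))    # False: real leaf 7, only erms set
```

Under these assumptions, raising the level to 7 (as the `level=7` workaround does) makes leaf 7 a valid leaf, so the query returns the real feature bits rather than aliased data.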