Bug 1454641

Summary: Windows 10 BSOD when using rhel6.4.0/rhel6.5.0/rhel6.6.0
Product: Red Hat Enterprise Linux 7 Reporter: Guo, Zhiyi <zhguo>
Component: qemu-kvm-rhev Assignee: Eduardo Habkost <ehabkost>
Status: CLOSED ERRATA QA Contact: Guo, Zhiyi <zhguo>
Severity: high Docs Contact:
Priority: high    
Version: 7.4 CC: chayang, dgilbert, juzhang, knoel, lijin, pbonzini, rkrcmar, virt-bugs, virt-maint, vrozenfe, wyu, zhguo
Target Milestone: rc Keywords: Regression
Target Release: ---   
Hardware: x86_64   
OS: Windows   
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.9.0-9.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-02 04:41:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
BSOD with qemu-kvm-rhev-2.6.0-28.el7_3.10 + -machine rhel6.6.0 none

Description Guo, Zhiyi 2017-05-23 08:50:54 UTC
Description of problem:
A Windows 10 guest hits a BSOD when using machine type rhel6.4.0/rhel6.5.0/rhel6.6.0

Version-Release number of selected component (if applicable):
3.10.0-668.el7.x86_64
qemu-kvm-rhev-2.9.0-5.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot a Windows 10 guest with the command line below:
/usr/libexec/qemu-kvm \
-S \
-cpu Nehalem \
-m 2048 \
-realtime mlock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid 0ebfba3a-83a1-144f-4e5d-9f7731677231 \
-no-user-config \
-nodefaults \
-rtc base=utc \
-no-shutdown \
-boot menu=off,strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-device ide-drive,drive=drive-virtio-disk0,id=virtio-disk0 \
-vnc 0.0.0.0:28 \
-vga std \
-boot order=d \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
-monitor stdio \
-cdrom en_windows_10_enterprise_version_1607_updated_jul_2016_x64_dvd_9054264.iso \
-msg timestamp=on \
-drive file=/root/zhguo/bwin10.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native \
-machine rhel6.6.0,accel=kvm,usb=off \
-device intel-hda,id=sound0 \
-device hda-duplex,id=sound0-codec0
2. Check the guest status.

Actual results:
The guest hits a BSOD with stop code SYSTEM_THREAD_EXCEPTION_NOT_HANDLED

Expected results:
The guest boots to the desktop

Additional info:
This issue does not occur with a rhel7.4 guest. It also does not occur when the machine type is changed to rhel6.3.0 or pc-i440fx-rhel7.3.0/pc-i440fx-rhel7.4.0.

Comment 13 Eduardo Habkost 2017-06-02 21:08:53 UTC
Created attachment 1284530 [details]
BSOD with qemu-kvm-rhev-2.6.0-28.el7_3.10 + -machine rhel6.6.0

I could reproduce it on a similar host.

Command-line used: /usr/libexec/qemu-kvm -S -cpu Nehalem,+erms,enforce -m 2048 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 0ebfba3a-83a1-144f-4e5d-9f7731677231 -no-user-config -nodefaults -rtc base=utc -no-shutdown -boot menu=off,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device ide-drive,drive=drive-virtio-disk0,id=virtio-disk0 -vnc 0.0.0.0:1 -vga std -boot order=d -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -monitor stdio -msg timestamp=on -drive file=/var/lib/libvirt/images/windows_10_dbgenabled_test.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native -machine rhel6.6.0,accel=kvm,usb=off -device intel-hda,id=sound0 -device hda-duplex,id=sound0-codec0 -device usb-tablet

Results:

* qemu-kvm-rhev-2.9.0-7.el7 + -machine rhel6.6.0: boots normally
* qemu-kvm-rhev-2.6.0-28.el7_3.10 + -machine rhel6.6.0: crash (see screenshot)
* qemu-kvm-rhev-2.6.0-28.el7_3.10.x86_64 + -machine pc-i440fx-rhel7.4.0: boots normally

The BSOD screenshot was captured after setting HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl\DisplayParameters (DWORD) = 1

Comment 14 Eduardo Habkost 2017-06-02 21:20:30 UTC
The exception info seems to indicate an illegal instruction. Is anybody more experienced in debugging Windows kernel code able to figure out which instruction is triggering the exception?

Comment 15 Vadim Rozenfeld 2017-06-03 23:59:03 UTC
When a SYSTEM_THREAD_EXCEPTION_NOT_HANDLED bugcheck happens, the second parameter usually indicates the address of the instruction that caused KiTrap0D. The next time the problem happens, you can check this value and use it from the QEMU monitor to dump and disassemble the Windows guest code:
(xp /4i physical_address_from_2nd_bugcheck_param - couple_of_bytes)
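
The procedure above can be sketched as a small helper that builds the monitor command. This is only an illustration: xp_command and its parameters are hypothetical names, and it assumes the second bugcheck parameter has already been translated to a guest-physical address (the bugcheck itself reports a virtual address). Because x86 instructions vary in length, starting a few bytes early can misalign the disassembly, so the offset may need adjusting.

```python
def xp_command(gpa, back=0, count=4):
    """Build a QEMU monitor 'xp' command that disassembles `count`
    instructions starting `back` bytes before guest-physical address `gpa`.

    Hypothetical helper for illustration; it is not part of QEMU.
    """
    if back < 0 or back > gpa:
        raise ValueError("back must be between 0 and gpa")
    start = gpa - back
    # 'i' is the instruction (disassembly) format of the xp command.
    return "xp /%di 0x%x" % (count, start)
```

For example, xp_command(0x25637a0, back=16, count=8) yields "xp /8i 0x2563790", which can be pasted at the (qemu) prompt.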

Comment 16 Eduardo Habkost 2017-06-05 11:27:30 UTC
Disassembling the address seems to indicate a clflushopt instruction. Example below:

gva: 0xfffff8038b7637a0, gpa: 0x25637a0

(qemu) xp /256b 0x2563700
[...]
0000000002563788: 0xca 0x75 0xa5 0x0f 0xae 0xf8 0xc3 0xcc
0000000002563790: 0xcc 0xcc 0xcc 0xcc 0xcc 0x66 0x66 0x66
0000000002563798: 0x0f 0x1f 0x84 0x00 0x00 0x00 0x00 0x00
00000000025637a0: 0x66 0x0f 0xae 0x39 0x49 0x03 0xc8 0x49
00000000025637a8: 0x2b 0xd0 0x75 0xf4 0x0f 0xae 0xf8 0xc3
00000000025637b0: 0xcc 0xcc 0xcc 0xcc 0xcc 0xcc 0x66 0x66
[...]
(qemu) xp /256i 0x2563700
[...]
0x0000000002563793:  int3
0x0000000002563794:  int3
0x0000000002563795:  nopw   0x0(%rax,%rax,1)
0x00000000025637a0:  data16
0x00000000025637a1:  clflush (%rcx)
0x00000000025637a4:  add    %r8,%rcx
0x00000000025637a7:  sub    %r8,%rdx
0x00000000025637aa:  jne    0x25637a0
0x00000000025637ac:  sfence
0x00000000025637af:  retq
0x00000000025637b0:  int3
0x00000000025637b1:  int3


clflushopt is not enabled on CPUID, though.  I will take a look at other CPUID data and compare with the rhel7.4 machine-type and RHEL-7.3 qemu-kvm.

Is anybody with access to a Windows kernel debugger able to get a stack trace and a memory dump, to find out what exactly is causing this function to be called?
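
The bytes at 0x25637a0 in the dump are 66 0f ae 39. In the Intel encoding, 0F AE /7 with a memory operand is CLFLUSH, and the same opcode with a 0x66 prefix is CLFLUSHOPT, which would explain why a disassembler that does not know CLFLUSHOPT prints "data16" followed by "clflush". A minimal sketch of that decoding follows; decode_0f_ae is a hypothetical helper that handles only the three encodings seen here and is not a general decoder.

```python
def decode_0f_ae(code):
    """Classify a few 0F AE-group encodings from raw bytes.

    Covers only CLFLUSH, CLFLUSHOPT and SFENCE.
    Returns a mnemonic string, or None for anything else.
    """
    code = bytes(code)
    has_66 = code[:1] == b"\x66"
    rest = code[1:] if has_66 else code
    if len(rest) < 3 or rest[:2] != b"\x0f\xae":
        return None
    modrm = rest[2]
    mod = modrm >> 6          # 3 means register form, otherwise memory
    reg = (modrm >> 3) & 7    # opcode extension (the /digit)
    if reg != 7:
        return None
    if mod == 3:
        # 0F AE F8 without a prefix is SFENCE.
        return "sfence" if not has_66 and modrm == 0xF8 else None
    # Memory operand: the 0x66 prefix selects CLFLUSHOPT over CLFLUSH.
    return "clflushopt" if has_66 else "clflush"
```

Here the ModRM byte 0x39 has mod=0, reg=7, rm=1, that is a memory operand addressed through %rcx, matching the "(%rcx)" operand in the disassembly.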

Comment 17 Eduardo Habkost 2017-06-05 12:53:44 UTC
Found the root cause: qemu-kvm-rhev-2.6.0-28.el7_3.10 sets CPUID[0].EAX=7.  qemu-kvm-rhev-2.9.0-7.el7 sets CPUID[0].EAX=4.  Can work around it using "-cpu Nehalem,+erms,level=7,enforce".  It looks like Windows 10 assumes a non-zero value for CPUID[7] if it's unavailable.

I am now reviewing the compat code to understand why we broke compatibility on CPUID[0].EAX.
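
One possible way for a guest to see a non-zero CPUID[7] when the leaf is unavailable: Intel documents that a basic leaf above CPUID[0].EAX returns the data of the highest basic leaf. The sketch below is a toy model of that behaviour, not Windows or QEMU code. The table contents are made-up values, the function names are hypothetical, extended leaves are ignored, and whether this is exactly what Windows 10 hits here is an assumption.

```python
CLFLUSHOPT_BIT = 1 << 23  # CPUID.(EAX=7,ECX=0):EBX bit 23


def cpuid(table, leaf):
    """Return (eax, ebx, ecx, edx) for a basic leaf from a toy table.

    Models the documented fallback: a basic leaf above the maximum
    reported in CPUID[0].EAX returns the data of the highest basic leaf.
    """
    max_leaf = table[0][0]
    if leaf > max_leaf:
        leaf = max_leaf
    return table.get(leaf, (0, 0, 0, 0))


def has_clflushopt_unchecked(table):
    # Reads leaf 7 without first checking CPUID[0].EAX.
    return bool(cpuid(table, 7)[1] & CLFLUSHOPT_BIT)


def has_clflushopt_checked(table):
    # Only trusts leaf 7 when CPUID[0].EAX says the leaf exists.
    return table[0][0] >= 7 and has_clflushopt_unchecked(table)


# Made-up register values, chosen only to illustrate the effect.
# Leaf 4 EBX holds cache geometry, so unrelated high bits can be set.
LEVEL4 = {0: (4, 0, 0, 0), 4: (0x121, 0x01C0003F, 0x3F, 0)}
LEVEL7 = {0: (7, 0, 0, 0), 4: (0x121, 0x01C0003F, 0x3F, 0),
          7: (0, 1 << 9, 0, 0)}  # EBX bit 9 = ERMS, no CLFLUSHOPT
```

In this model the unchecked probe reports CLFLUSHOPT for LEVEL4 because it is really reading leaf 4 data, while with LEVEL7 both probes report it as absent, which is consistent with level=7 working as a workaround.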

Comment 18 Eduardo Habkost 2017-06-05 15:57:27 UTC
Upstream fix submitted:

From: Eduardo Habkost <ehabkost>
Subject: [PATCH] pc: Use "min-[x]level" on compat_props
Date: Mon,  5 Jun 2017 12:56:45 -0300
Message-Id: <20170605155645.19226-1-ehabkost>
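
Judging by the patch title, the compat properties move from "level"/"xlevel" to "min-level"/"min-xlevel". A rough model of the difference, assuming the usual QEMU semantics: "level" forces CPUID[0].EAX to a fixed value, while "min-level" is only a lower bound that enabled features (such as erms, which lives in leaf 7) can raise. This is a simplified sketch with hypothetical names, not QEMU's actual code.

```python
def effective_level(min_level, feature_level, forced_level=None):
    """Pick the CPUID[0].EAX value in a simplified model.

    forced_level:  value set through "level" (None when not set).
    min_level:     lower bound set through "min-level".
    feature_level: highest basic leaf needed by the enabled features.
    """
    if forced_level is not None:
        # A forced level wins even if a feature needs a higher leaf.
        return forced_level
    return max(min_level, feature_level)
```

With a compat entry that forces level 4, a CPU with erms still reports 4 (effective_level(4, 7, forced_level=4)); with min-level 4 it reports 7 (effective_level(4, 7)).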

Comment 24 Miroslav Rezanina 2017-06-08 16:28:00 UTC
Fix included in qemu-kvm-rhev-2.9.0-9.el7

Comment 26 Guo, Zhiyi 2017-06-13 06:56:10 UTC
Verified this issue against qemu-kvm-rhev-2.9.0-9.el7.x86_64. The Windows 10 guest boots normally with this command line:
/usr/libexec/qemu-kvm \
-S \
-cpu Nehalem,+erms,enforce \
-m 2048 \
-realtime mlock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid 0ebfba3a-83a1-144f-4e5d-9f7731677231 \
-no-user-config \
-nodefaults \
-rtc base=utc \
-no-shutdown \
-boot menu=off,strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-device ide-drive,drive=drive-virtio-disk0,id=virtio-disk0 \
-vnc 0.0.0.0:28 \
-vga std \
-boot order=d \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
-monitor stdio \
-cdrom en_windows_10_enterprise_version_1607_updated_jul_2016_x64_dvd_9054264.iso \
-msg timestamp=on \
-drive file=/root/zhguo/rhel74-64-virtio-scsi.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native \
-machine rhel6.6.0,accel=kvm,usb=off

rhel6.5.0/rhel6.4.0 also work.
Trying the same command line with a rhel7.4 guest, the guest also boots.

Comment 27 Guo, Zhiyi 2017-06-13 06:56:43 UTC
Verified per comment 26

Comment 29 errata-xmlrpc 2017-08-02 04:41:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392