Bug 720936
| Field | Value |
|---|---|
| Summary | Windows guests may hang/BSOD on some AMD processors. |
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel-xen |
| Version | 5.6 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Qixiang Wan <qwan> |
| Assignee | Paolo Bonzini <pbonzini> |
| QA Contact | Virtualization Bugs <virt-bugs> |
| CC | drjones, imammedo, jzheng, leiwang, mrezanin, mshao, pbonzini, pcao, qwan, xen-maint |
| Target Milestone | rc |
| Keywords | Regression |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | kernel-2.6.18-284.el5 |
| Doc Type | Bug Fix |
| Last Closed | 2012-02-21 03:44:50 UTC |
| Bug Blocks | 514489 |
Description
Qixiang Wan
2011-07-13 09:45:42 UTC
Created attachment 512617 [details]
Windows BSOD
The hypervisor log shows messages like these when a Windows guest hits the BSOD:
(XEN) traps.c:1910:d0 Domain attempted WRMSR 00000410 from 00000000:00043bff to 00000000:00000003.
(XEN) traps.c:1910:d0 Domain attempted WRMSR 00000410 from 00000000:00000000 to 00000000:00000003.
It can be reproduced with the RHEL 5.6 GA kernel-xen-2.6.18-238.el5 + xen-3.0.3-120.el5 on an AMD 1216 x86_64 host, so I am reducing the Priority/Severity to high/high and requesting rhel-5.8.0.

No issue with 5.5 GA kernel: kernel-xen-2.6.18-238.el5 + 5.6 xen-3.0.3-120.el5 on the same host as comment 2.

(In reply to comment #3)
> No issue with 5.5 GA kernel: kernel-xen-2.6.18-238.el5 + 5.6 xen-3.0.3-120.el5
> on the same host as comment 2.
Sorry, that should be kernel-xen-2.6.18-194.el5 (5.5 GA) + xen-3.0.3-120.el5 (5.6 GA).

Qixiang,
We could check whether it is an x86 emulator problem. Could you check if HAP is enabled on affected and not-affected hosts?

(In reply to comment #5)
> Could you check if HAP is enabled on affected and not-affected hosts?
I think it is not related to HAP, because HAP is not supported on the hosts (AMD 1216, 1220) where the issue was found.
I also confirmed that it can only be reproduced with multiple vcpus:
[1] no issue with 1 vcpu on AMD 1220
[2] guest hangs while rebooting with 2 vcpus on AMD 1220
So the statements in the report that said there is no issue with AMD 1220 are wrong.
hypervisor log:
---------------------
(XEN) HVM6: int13_harddisk: function 15, unmapped device for ELDL=81
(XEN) HVM6: *** int 15h function AX=E980, BX=0063 not yet supported!
(XEN) hvm.c:1359:d6 AP 1 bringup suceeded.
(XEN) irq.c:222: Dom6 PCI link 0 changed 5 -> 0
(XEN) irq.c:222: Dom6 PCI link 1 changed 7 -> 0
(XEN) irq.c:222: Dom6 PCI link 2 changed 10 -> 0
(XEN) irq.c:222: Dom6 PCI link 3 changed 11 -> 0
(XEN) irq.c:285: Dom6 callback via changed to GSI 28
(XEN) hvm.c:524:d6 DOM6/VCPU1: going offline.
--------------------
I also confirmed that the same hypervisor log as in comment 1 appears when rebooting 32-bit winxp with 1 vcpu on an AMD i386 host, where there is no issue, so there seems to be nothing interesting in the hypervisor log.

(In reply to comment #6)
We actually suspect that the emulation triggered by shadow paging goes wrong, which is why I asked whether the failing box is a HAP-less box.
Could you try a couple of brew builds that have emulation fixes?
https://brewweb.devel.redhat.com/taskinfo?taskID=3384309 - imul fix
https://brewweb.devel.redhat.com/taskinfo?taskID=3471412 - emulator resync with upstream

Created attachment 512663 [details]
xen hypervisor log
It's a regression introduced in kernel-xen-2.6.18-222.el5, although I haven't yet figured out which patch is the root cause.
There is no issue with kernel-xen-2.6.18-221.el5 on the same host.
The hypervisor log is attached (rebooting i386 winxp with 2 vcpus on -221 and -222).

Created attachment 512670 [details]
xen-imul-shaf hypervisor log
(In reply to comment #7)
> Could you try a couple brew builds that has emulation fixes?
> https://brewweb.devel.redhat.com/taskinfo?taskID=3384309 - imul fix
> https://brewweb.devel.redhat.com/taskinfo?taskID=3471412 - emulator resync with
> upstream
The hypervisor you provided (http://scratch.englab.brq.redhat.com/imammedo/xen-imul-shaf.gz) doesn't fix the problem on the same host (AMD Dual-Core Opteron(tm) 1220); the guest still hangs on reboot.
$ cat grub.conf
title xen-imul-shaf
    root (hd0,0)
    kernel /xen-imul-shaf.gz loglvl=all guest_loglvl=all
    module /vmlinuz-2.6.18-274.el5xen ro root=/dev/VolGroup00/LogVol00
    module /initrd-2.6.18-274.el5xen.img
No luck with xen-emul_sync.gz either.

This is the commit that introduced the regression:
c308e27 [xen] emulate injection of guest NMI
[1] 'git reset f90bbc0 --hard', build the hypervisor and boot it: there is no hang.
[2] 'git reset c308e27 --hard', build the hypervisor and boot it: the guest hangs on reboot with multiple vcpus.

Probably a duplicate of bug 643295.

... which is in turn a duplicate of bug 701608, even though at the time it was reported only on Intel.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

*** Bug 730221 has been marked as a duplicate of this bug. ***

We can try upstream changesets 15897, 16398 and especially this hunk of 17655:
@@ -1266,6 +1278,15 @@ asmlinkage void svm_vmexit_handler(struc
             reason = TSW_call_or_int;
         if ( (vmcb->exitinfo2 >> 44) & 1 )
             errcode = (uint32_t)vmcb->exitinfo2;
+
+        /*
+         * Some processors set the EXITINTINFO field when the task switch
+         * is caused by a task gate in the IDT. In this case we will be
+         * emulating the event injection, so we do not want the processor
+         * to re-inject the original event!
+         */
+        vmcb->eventinj.bytes = 0;
+
         hvm_task_switch((uint16_t)vmcb->exitinfo1, reason, errcode);
         break;
     }
Other changesets relevant for these bugs are 15984, 16618, 17100, 17104/17105, but these are definitely too big to backport; the backport would amount to a rewrite of large parts of the code.
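
For readers unfamiliar with the task-switch exit path, the logic around the quoted hunk can be sketched as a standalone C function. This is only an illustration, not the Xen code: `struct fake_vmcb`, `struct tsw_info`, `enum tsw_reason` and `decode_task_switch` are hypothetical names, the real `eventinj` field is a union accessed as `eventinj.bytes`, and the bit positions 36 (IRET) and 38 (far jump) are an assumption based on AMD's SVM documentation. Only bit 44 (error code valid) and the clearing of the pending event come from the hunk itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few VMCB fields the hunk touches.
 * The real Xen structure is much larger, and its eventinj field is a
 * union accessed as eventinj.bytes. */
struct fake_vmcb {
    uint64_t exitinfo1;  /* low 16 bits: target TSS selector */
    uint64_t exitinfo2;  /* flags plus optional error code */
    uint64_t eventinj;   /* event the processor would inject on VMRUN */
};

enum tsw_reason { TSW_IRET, TSW_JMP, TSW_CALL_OR_INT };

struct tsw_info {
    uint16_t selector;
    enum tsw_reason reason;
    bool has_errcode;
    uint32_t errcode;
};

/* Decode a task-switch exit and drop any pending event injection. */
static struct tsw_info decode_task_switch(struct fake_vmcb *vmcb)
{
    struct tsw_info info = { 0 };

    assert(vmcb != NULL);

    info.selector = (uint16_t)vmcb->exitinfo1;

    /* Assumed bit positions: 36 = caused by IRET, 38 = caused by far jump. */
    if ((vmcb->exitinfo2 >> 36) & 1)
        info.reason = TSW_IRET;
    else if ((vmcb->exitinfo2 >> 38) & 1)
        info.reason = TSW_JMP;
    else
        info.reason = TSW_CALL_OR_INT;

    /* Bit 44, as in the quoted hunk: the low 32 bits hold an error code. */
    if ((vmcb->exitinfo2 >> 44) & 1) {
        info.has_errcode = true;
        info.errcode = (uint32_t)vmcb->exitinfo2;
    }

    /* The step the hunk adds: the task switch is emulated in software,
     * so the processor must not re-inject the original event as well. */
    vmcb->eventinj = 0;

    return info;
}
```

A caller would fill a `struct fake_vmcb` from the exit information, call `decode_task_switch(&vmcb)`, and pass the selector, reason and error code on to whatever emulates the task switch. The point of the sketch is the last assignment: without it, an event left pending by the processor could be delivered a second time after the emulated switch.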
Created attachment 519480 [details]
test hypervisor
Please test with the attached hypervisor binary. If it still fails, please capture a memory dump and place it on some FTP server so that I can analyze the failure. Thanks!

(In reply to comment #19)
> Please test with the attached hypervisor binary. If it still fails, please
> capture a memory dump and place it on some FTP server so that I can analyze the
> failure. Thanks!
This hypervisor works for me. With this hypervisor + the -274 Dom0 kernel, the Windows XP i386 guest (without PV drivers) reboots successfully with 2 vcpus on an AMD 1216 processor (without this hypervisor, the same configuration hangs while rebooting on this processor).

Patch(es) available in kernel-2.6.18-284.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

I reproduced this bug on AMD 1216 using kernel-xen -274.
Guest: Windows XP, 2003, Win7, 2008 (all 32 bit).
Host: kernel-xen-x86_64.
-274: All the guests hang on reboot. The mouse pointer stops moving after a while, and xm top shows 100% CPU usage for the guest. The guest stops responding and does not reboot.
-300: XP, Win7 and Win2008 are confirmed fixed. The reboot no longer hangs.
A note on Win2003: at first it still hung on reboot, but after a host reboot the problem could no longer be reproduced, and the guest now reboots fine. If I can reproduce it, I'll check whether this is a separate problem.
I tested on some other AMD processors, but the 1216 is the only one where this bug could be reproduced. With the -300 kernel, guest reboot works fine on them.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2012-0150.html