Description of problem:
Windows i386 guests cannot reboot on some AMD x86_64 hosts; they hang at the end of installation. i386 guests also crash on i386 hosts at the end of installation. The behavior differs by Windows version and AMD CPU model, for example:

WinXP 32bit + AMD 1220: no issue with the -274 kernel
WinXP 32bit + AMD 5200 i386 host: guest crashes at the end of installation (-274)
Win2003 32bit + AMD 1220 x86_64 host: no issue with the -274 kernel
Win2003 32bit + AMD 1216 x86_64 host: hang during reboot with -274, BSOD while rebooting with -271/-272/-273
Win2003 32bit + AMD 9600B x86_64 host: no issue with -274
WinXP/Win2003/Win2008/Win7 32bit + AMD B95 i386 host: no issue with -274

Still investigating with different Windows versions and AMD models, and bisecting on the affected hosts, to find the root cause.

Version-Release number of selected component (if applicable):
xen-3.0.3-132.el5.x86_64.rpm

How reproducible:
On some AMD processor models.

Steps to Reproduce:
1. Boot a Windows 32-bit guest.
2. Reboot the guest.

Actual results:
The guest may hang or BSOD.

Expected results:
Windows guests run and reboot without issue.

Additional info:
These messages appear in xm dmesg when Windows gets a BSOD:
(XEN) traps.c:1910:d0 Domain attempted WRMSR 00000410 from 00000000:00043bff to 00000000:00000003.
(XEN) traps.c:1910:d0 Domain attempted WRMSR 00000410 from 00000000:00000000 to 00000000:00000003.
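For readers triaging similar logs, here is a minimal Python sketch that pulls the MSR index and the old and new values out of the WRMSR messages quoted above. It assumes the standard x86 machine-check register layout (bank i at 0x400 + 4*i, in the order CTL, STATUS, ADDR, MISC); the helper names and the 0x400..0x47F range limit are illustrative choices, not part of Xen. Under that assumption, MSR 0x410 decodes to MC4_CTL. This only names the register; it does not show that the write is related to the hang.

```python
import re

# Matches the hypervisor message format quoted in this report.
WRMSR_RE = re.compile(
    r"Domain attempted WRMSR (?P<msr>[0-9a-fA-F]{8}) "
    r"from (?P<old_hi>[0-9a-fA-F]{8}):(?P<old_lo>[0-9a-fA-F]{8}) "
    r"to (?P<new_hi>[0-9a-fA-F]{8}):(?P<new_lo>[0-9a-fA-F]{8})"
)

# Assumption: architectural machine-check bank layout.
MC_BANK_BASE = 0x400
MC_BANK_STRIDE = 4
MC_REG_NAMES = ("CTL", "STATUS", "ADDR", "MISC")


def parse_wrmsr(line):
    """Return (msr, old_value, new_value) or None if the line does not match."""
    m = WRMSR_RE.search(line)
    if m is None:
        return None
    msr = int(m.group("msr"), 16)
    old = (int(m.group("old_hi"), 16) << 32) | int(m.group("old_lo"), 16)
    new = (int(m.group("new_hi"), 16) << 32) | int(m.group("new_lo"), 16)
    return msr, old, new


def describe_msr(msr):
    """Name a machine-check bank register, or return None outside 0x400..0x47F."""
    if not (MC_BANK_BASE <= msr < 0x480):
        return None
    bank, reg = divmod(msr - MC_BANK_BASE, MC_BANK_STRIDE)
    return "MC%d_%s" % (bank, MC_REG_NAMES[reg])


if __name__ == "__main__":
    line = ("(XEN) traps.c:1910:d0 Domain attempted WRMSR 00000410 "
            "from 00000000:00043bff to 00000000:00000003.")
    msr, old, new = parse_wrmsr(line)
    print(hex(msr), describe_msr(msr), hex(old), "->", hex(new))
```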
Created attachment 512617 [details]
Windows BSOD

These messages appear in the hypervisor log when the Windows guest gets a BSOD:
(XEN) traps.c:1910:d0 Domain attempted WRMSR 00000410 from 00000000:00043bff to 00000000:00000003.
(XEN) traps.c:1910:d0 Domain attempted WRMSR 00000410 from 00000000:00000000 to 00000000:00000003.
It can be reproduced with the RHEL5.6 GA kernel-xen-2.6.18-238.el5 + xen-3.0.3-120.el5 on an AMD 1216 x86_64 host, so I am reducing the Priority/Severity to high/high and requesting this for rhel-5.8.0.
No issue with 5.5 GA kernel: kernel-xen-2.6.18-238.el5 + 5.6 xen-3.0.3-120.el5 on the same host as comment 2.
(In reply to comment #3)
> No issue with 5.5 GA kernel: kernel-xen-2.6.18-238.el5 + 5.6 xen-3.0.3-120.el5
> on the same host as comment 2.

Sorry, that should be kernel-xen-2.6.18-194.el5 (5.5GA) + xen-3.0.3-120.el5 (5.6GA).
Qixiang,

We could check whether this is an x86 emulator problem. Could you check if HAP is enabled on the affected and unaffected hosts?
(In reply to comment #5)
> Could you check if HAP is enabled on affected and not-affected hosts?

I don't think it is related to HAP, because HAP is not supported on the hosts (AMD 1216, 1220) where the issue was found.

I also confirmed that it can only be reproduced with multiple vcpus:
[1] no issue with 1 vcpu on AMD 1220
[2] guest hangs while rebooting with 2 vcpus on AMD 1220
So the statements in the report that said there was no issue with AMD 1220 are wrong.

hypervisor log:
---------------------
(XEN) HVM6: int13_harddisk: function 15, unmapped device for ELDL=81
(XEN) HVM6: *** int 15h function AX=E980, BX=0063 not yet supported!
(XEN) hvm.c:1359:d6 AP 1 bringup suceeded.
(XEN) irq.c:222: Dom6 PCI link 0 changed 5 -> 0
(XEN) irq.c:222: Dom6 PCI link 1 changed 7 -> 0
(XEN) irq.c:222: Dom6 PCI link 2 changed 10 -> 0
(XEN) irq.c:222: Dom6 PCI link 3 changed 11 -> 0
(XEN) irq.c:285: Dom6 callback via changed to GSI 28
(XEN) hvm.c:524:d6 DOM6/VCPU1: going offline.
--------------------

I also confirmed that the same hypervisor log as in comment 1 appears when rebooting 32-bit WinXP with 1 vcpu on an AMD i386 host without any issue, so there seems to be nothing interesting in the hypervisor log.
(In reply to comment #6)

We actually suspect that the emulation triggered by shadow paging goes wrong, hence the question of whether the failing box lacks HAP.

Could you try a couple of brew builds that have emulation fixes?
https://brewweb.devel.redhat.com/taskinfo?taskID=3384309 - imul fix
https://brewweb.devel.redhat.com/taskinfo?taskID=3471412 - emulator resync with upstream
Created attachment 512663 [details]
xen hypervisor log

It is a regression introduced in kernel-xen-2.6.18-222.el5, although I haven't yet figured out which patch is the root cause. There is no issue with kernel-xen-2.6.18-221.el5 on the same host.

The hypervisor log is attached (rebooting i386 WinXP with 2 vcpus on -221 and -222).
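Narrowing a regression to one build, as done here between the known-good -194 and the known-bad -238, can be organized as a binary search over build numbers. The sketch below is only an illustration: `is_bad` is a hypothetical callback standing in for the manual step of installing a build and rebooting a 2-vcpu Windows guest, and the thread does not say which search order was actually used.

```python
def first_bad(good, bad, is_bad):
    """Return the lowest build number for which is_bad() is true.

    Assumes `good` is known good, `bad` is known bad, and every build
    after the first bad one is also bad (a single regression point).
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid    # regression is at mid or earlier
        else:
            good = mid   # regression is after mid
    return bad


if __name__ == "__main__":
    # Stand-in for manual testing: pretend builds from -222 onward hang.
    tested = []

    def is_bad(build):
        tested.append(build)
        return build >= 222

    print(first_bad(194, 238, is_bad), "after testing", tested)
```

With 44 builds between the two endpoints, this needs at most six test cycles instead of one per build.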
Created attachment 512670 [details]
xen-imul-shaf hypervisor log

(In reply to comment #7)
> Could you try a couple brew builds that has emulation fixes?
> https://brewweb.devel.redhat.com/taskinfo?taskID=3384309 - imul fix
> https://brewweb.devel.redhat.com/taskinfo?taskID=3471412 - emulator resync with
> upstream

The hypervisor kernel you provided (http://scratch.englab.brq.redhat.com/imammedo/xen-imul-shaf.gz) doesn't work on the same host (AMD Dual-Core Opteron(tm) 1220); the guest still hangs on reboot.

$ cat grub.conf
title xen-imul-shaf
    root (hd0,0)
    kernel /xen-imul-shaf.gz loglvl=all guest_loglvl=all
    module /vmlinuz-2.6.18-274.el5xen ro root=/dev/VolGroup00/LogVol00
    module /initrd-2.6.18-274.el5xen.img
No luck with xen-emul_sync.gz either.

This is the commit that introduced the regression:
c308e27 [xen] emulate injection of guest NMI

[1] 'git reset f90bbc0 --hard', build the hypervisor and boot it: there is no hang.
[2] 'git reset c308e27 --hard', build the hypervisor and boot it: the guest hangs on reboot with multiple vcpus.
Probably a duplicate of bug 643295.
... which is in turn a duplicate of bug 701608, even though at the time it was reported only on Intel.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
*** Bug 730221 has been marked as a duplicate of this bug. ***
We can try upstream changesets 15897, 16398 and especially this hunk of 17655:

@@ -1266,6 +1278,15 @@ asmlinkage void svm_vmexit_handler(struc
             reason = TSW_call_or_int;
         if ( (vmcb->exitinfo2 >> 44) & 1 )
             errcode = (uint32_t)vmcb->exitinfo2;
+
+        /*
+         * Some processors set the EXITINTINFO field when the task switch
+         * is caused by a task gate in the IDT. In this case we will be
+         * emulating the event injection, so we do not want the processor
+         * to re-inject the original event!
+         */
+        vmcb->eventinj.bytes = 0;
+
         hvm_task_switch((uint16_t)vmcb->exitinfo1, reason, errcode);
         break;
     }

Other changesets relevant for these bugs are 15984, 16618, 17100, 17104/17105, but these are definitely too big to be backported; the backport would amount to a rewrite of large parts of the code.
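To make the effect of the 17655 hunk quoted above easier to follow, here is a toy Python model of the task-switch exit path. It is a sketch, not Xen code: the `Vmcb` class and `decode_task_switch` are hypothetical names, only bit 44 (error code valid) and the 16-bit selector cast come from the hunk, and the IRET and far-jump bit positions (36 and 38) are assumptions about the surrounding handler that the hunk does not show.

```python
ERRCODE_VALID_BIT = 44  # from the quoted hunk
IRET_BIT = 36           # assumption: not visible in the hunk
JMP_BIT = 38            # assumption: not visible in the hunk


class Vmcb:
    """Toy stand-in for the three VMCB fields the hunk touches."""

    def __init__(self, exitinfo1=0, exitinfo2=0, eventinj=0):
        self.exitinfo1 = exitinfo1
        self.exitinfo2 = exitinfo2
        self.eventinj = eventinj


def decode_task_switch(vmcb):
    """Return (selector, reason, errcode) and clear any pending injection."""
    if (vmcb.exitinfo2 >> IRET_BIT) & 1:
        reason = "iret"
    elif (vmcb.exitinfo2 >> JMP_BIT) & 1:
        reason = "jmp"
    else:
        reason = "call_or_int"

    errcode = None  # no error code unless the valid bit is set
    if (vmcb.exitinfo2 >> ERRCODE_VALID_BIT) & 1:
        errcode = vmcb.exitinfo2 & 0xFFFFFFFF

    # The line the hunk adds: the task switch is emulated in software,
    # so the original event must not be re-injected by the processor.
    vmcb.eventinj = 0

    selector = vmcb.exitinfo1 & 0xFFFF
    return selector, reason, errcode


if __name__ == "__main__":
    # Illustrative values only: a stale pending event plus a valid error code.
    vmcb = Vmcb(exitinfo1=0x28, exitinfo2=(1 << 44) | 0x1234, eventinj=0x80000302)
    print(decode_task_switch(vmcb), hex(vmcb.eventinj))
```

In this model, leaving out the `vmcb.eventinj = 0` line would keep the stale event pending after the emulated task switch, which is the double delivery the hunk's comment warns about.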
Created attachment 519480 [details]
test hypervisor

Please test with the attached hypervisor binary. If it still fails, please capture a memory dump and place it on some FTP server so that I can analyze the failure. Thanks!
(In reply to comment #19)
> Please test with the attached hypervisor binary. If it still fails, please
> capture a memory dump and place it on some FTP server so that I can analyze the
> failure. Thanks!

This hypervisor works for me. With this hypervisor + the -274 Dom0 kernel, the Windows XP i386 guest (without PV drivers) reboots successfully with 2 vcpus on an AMD 1216 processor. (Without this hypervisor, the same configuration hangs while rebooting on this processor.)
Patch(es) available in kernel-2.6.18-284.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.
I reproduced this bug on AMD 1216 using kernel-xen -274.

Guest: Windows XP, 2003, Win7, 2008 (all 32 bit).
Host: kernel-xen-x86_64.

-274: All the guests hang on reboot. The mouse pointer stops moving after a while, and xm top shows 100% CPU usage for the guest. The guest stops responding and does not reboot.

-300: XP, Win7 and Win2008 are confirmed fixed. The reboot no longer hangs.

A note on Win2003: at first it still hung on reboot, but after a host reboot the problem could no longer be reproduced, and the guest now reboots fine. If I can reproduce it again, I'll check whether it is a separate problem.

I tested on some other AMD processors, but the 1216 is the only one where this bug could be reproduced. With the -300 kernel, guest reboot works fine on all of them.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0150.html