Bug 720936 - Windows guests may hang/BSOD on some AMD processors.
Summary: Windows guests may hang/BSOD on some AMD processors.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Paolo Bonzini
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Duplicates: 730221
Depends On:
Blocks: 514489
Reported: 2011-07-13 09:45 UTC by Qixiang Wan
Modified: 2012-02-21 03:44 UTC (History)
CC List: 10 users

Fixed In Version: kernel-2.6.18-284.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-02-21 03:44:50 UTC
Target Upstream Version:
Embargoed:


Attachments
Windows BSOD (17.53 KB, image/png)
2011-07-13 09:47 UTC, Qixiang Wan
xen hypervisor log (25.24 KB, text/plain)
2011-07-13 13:32 UTC, Qixiang Wan
xen-imul-shaf hypervisor log (9.80 KB, text/plain)
2011-07-13 13:51 UTC, Qixiang Wan
test hypervisor (418.67 KB, application/gzip)
2011-08-23 15:24 UTC, Paolo Bonzini


Links
System: Red Hat Product Errata
ID: RHSA-2012:0150
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Enterprise Linux 5.8 kernel update
Last Updated: 2012-02-21 07:35:24 UTC

Description Qixiang Wan 2011-07-13 09:45:42 UTC
Description of problem:
Windows i386 guests cannot reboot on some AMD x86_64 hosts, and they hang at
the end of installation. i386 guests have also been seen to crash on i386 hosts
at the end of installation.

The behavior differs between Windows guests and AMD CPU models, for
example:

WinXP 32bit + AMD 1220: no issue with -274 kernel
WinXP 32bit + AMD 5200 i386 host: guest crashes at the end of installation (-274).
Win2003 32bit + AMD 1220 x86_64 host: no issue with -274 kernel
Win2003 32bit + AMD 1216 x86_64 host: hangs during reboot with -274, BSOD while rebooting with -271/-272/-273.
Win2003 32bit + AMD 9600B x86_64 host: no issue with -274.
WinXP/Win2003/Win2008/Win7 32bit + AMD B95 i386 host: no issue with -274

Still investigating with different Windows versions and AMD models, and bisecting on the hosts that show the issue, to figure out the root cause.

Version-Release number of selected component (if applicable):

xen-3.0.3-132.el5.x86_64.rpm

How reproducible:
On some AMD processor models

Steps to Reproduce:
1. boot up a windows 32bit guest
2. reboot the guest

  
Actual results:
guest may hang or BSOD

Expected results:
no issue with running windows guests.

Additional info:

The following messages appear in xm dmesg when Windows hits a BSOD:

(XEN) traps.c:1910:d0 Domain attempted WRMSR 00000410 from 00000000:00043bff to 00000000:00000003.
(XEN) traps.c:1910:d0 Domain attempted WRMSR 00000410 from 00000000:00000000 to 00000000:00000003.
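
For anyone triaging similar logs, the WRMSR lines above can be decoded mechanically. The following is a minimal sketch in Python, not part of any Xen tooling. It assumes the message format stays exactly as quoted and that each value is printed as high:low 32-bit halves. The function names are hypothetical. MSR 0x410 falls in the architectural machine-check bank range, where bank i starts at 0x400 + 4*i, so it would be the control register of bank 4.

```python
import re

# Format assumed from the two log lines quoted above; values are taken to be
# printed as high:low 32-bit halves (an assumption, not confirmed in the thread).
WRMSR_RE = re.compile(
    r"Domain attempted WRMSR (?P<msr>[0-9a-f]{8}) "
    r"from (?P<old_hi>[0-9a-f]{8}):(?P<old_lo>[0-9a-f]{8}) "
    r"to (?P<new_hi>[0-9a-f]{8}):(?P<new_lo>[0-9a-f]{8})"
)

MCA_BASE = 0x400                      # MC0_CTL, first machine-check bank register
MCA_REGS = ("CTL", "STATUS", "ADDR", "MISC")
MAX_BANKS = 32                        # arbitrary cap for this sketch


def parse_wrmsr(line):
    """Return the MSR index, old value and attempted value, or None if no match."""
    m = WRMSR_RE.search(line)
    if m is None:
        return None
    g = {k: int(v, 16) for k, v in m.groupdict().items()}
    return {
        "msr": g["msr"],
        "old": (g["old_hi"] << 32) | g["old_lo"],
        "new": (g["new_hi"] << 32) | g["new_lo"],
    }


def describe_mca_msr(msr):
    """Name a machine-check bank MSR such as 'MC4_CTL', or None if out of range."""
    offset = msr - MCA_BASE
    if not 0 <= offset < 4 * MAX_BANKS:
        return None
    bank, reg = divmod(offset, 4)
    return f"MC{bank}_{MCA_REGS[reg]}"
```

Feeding the first quoted line to `parse_wrmsr` and passing the resulting `msr` to `describe_mca_msr` names the register being written; it says nothing about whether the write is related to the hang.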

Comment 1 Qixiang Wan 2011-07-13 09:47:26 UTC
Created attachment 512617 [details]
Windows BSOD

The following messages appear in the hypervisor log when the Windows guest hits a BSOD:

(XEN) traps.c:1910:d0 Domain attempted WRMSR 00000410 from 00000000:00043bff to 00000000:00000003.
(XEN) traps.c:1910:d0 Domain attempted WRMSR 00000410 from 00000000:00000000 to 00000000:00000003.

Comment 2 Qixiang Wan 2011-07-13 10:01:29 UTC
It can be reproduced with the RHEL5.6 GA kernel-xen-2.6.18-238.el5 + xen-3.0.3-120.el5 on an AMD 1216 x86_64 host, so I am reducing the Priority/Severity to high/high and requesting this for rhel-5.8.0.

Comment 3 Qixiang Wan 2011-07-13 10:15:40 UTC
No issue with 5.5 GA kernel: kernel-xen-2.6.18-238.el5 + 5.6 xen-3.0.3-120.el5 on the same host as comment 2.

Comment 4 Qixiang Wan 2011-07-13 10:51:19 UTC
(In reply to comment #3)
> No issue with 5.5 GA kernel: kernel-xen-2.6.18-238.el5 + 5.6 xen-3.0.3-120.el5
> on the same host as comment 2.
sorry, should be kernel-xen-2.6.18-194.el5 (5.5GA) + xen-3.0.3-120.el5 (5.6GA).

Comment 5 Igor Mammedov 2011-07-13 11:48:09 UTC
Qixiang,

We could check whether it is an x86 emulator problem.
Could you check if HAP is enabled on affected and not-affected hosts?

Comment 6 Qixiang Wan 2011-07-13 12:16:33 UTC
(In reply to comment #5)
> Could you check if HAP is enabled on affected and not-affected hosts?

I think it's not related to HAP, because HAP is not supported on the hosts (AMD 1216, 1220) where the issue was found.

I also confirmed that it can only be reproduced with multiple vcpus:

[1] no issue with 1 vcpu on AMD 1220
[2] guest hang while rebooting with 2 vcpus on AMD 1220.

So the statements in the report that said there is no issue with AMD 1220 are wrong.

hypervisor log:
---------------------
(XEN) HVM6: int13_harddisk: function 15, unmapped device for ELDL=81
(XEN) HVM6: *** int 15h function AX=E980, BX=0063 not yet supported!
(XEN) hvm.c:1359:d6 AP 1 bringup suceeded.
(XEN) irq.c:222: Dom6 PCI link 0 changed 5 -> 0
(XEN) irq.c:222: Dom6 PCI link 1 changed 7 -> 0
(XEN) irq.c:222: Dom6 PCI link 2 changed 10 -> 0
(XEN) irq.c:222: Dom6 PCI link 3 changed 11 -> 0
(XEN) irq.c:285: Dom6 callback via changed to GSI 28
(XEN) hvm.c:524:d6 DOM6/VCPU1: going offline.

--------------------

I also confirmed that the same hypervisor log as in comment 1 appears when rebooting 32bit WinXP with 1 vcpu on an AMD i386 host, where the reboot works without issue, so there seems to be nothing interesting in the hypervisor log.

Comment 7 Igor Mammedov 2011-07-13 12:51:33 UTC
(In reply to comment #6)

We actually suspect that the emulation caused by shadow paging goes wrong, which is why I asked whether the failing box is a HAP-less box.

Could you try a couple of brew builds that have emulation fixes?
https://brewweb.devel.redhat.com/taskinfo?taskID=3384309 - imul fix
https://brewweb.devel.redhat.com/taskinfo?taskID=3471412 - emulator resync with upstream

Comment 8 Qixiang Wan 2011-07-13 13:32:05 UTC
Created attachment 512663 [details]
xen hypervisor log

It's a regression introduced in kernel-xen-2.6.18-222.el5, although I haven't yet figured out which patch is the root cause.

No issue with kernel-xen-2.6.18-221.el5 on the same host.

The hypervisor log is attached (rebooting i386 WinXP with 2 vcpus on -221 and -222).
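
The narrowing down to -222 can be pictured as a binary search over build numbers. The thread does not say how the builds were actually tested, so this is only an illustrative sketch in Python. `find_first_bad` and the `is_bad` predicate are hypothetical, and the sketch assumes every build before the regression is good and every build from it onward is bad. The endpoints come from the thread: -194 was reported good in comment 4 and -238 bad in comment 2.

```python
def find_first_bad(good, bad, is_bad):
    """Return the first bad build number in the range (good, bad].

    Assumes `good` is known good, `bad` is known bad, and results are
    monotonic: once a build is bad, every later build is bad too.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid      # regression is at mid or earlier
        else:
            good = mid     # regression is after mid
    return bad


if __name__ == "__main__":
    # Stand-in for the manual test: boot build N, reboot a 2-vcpu Windows
    # guest, and report whether it hangs. Hard-coded here for illustration.
    def guest_hangs_on_reboot(build):
        return build >= 222

    print(find_first_bad(194, 238, guest_hangs_on_reboot))
```

With 44 builds between the endpoints, this needs about six test boots instead of one per build.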

Comment 9 Qixiang Wan 2011-07-13 13:51:16 UTC
Created attachment 512670 [details]
xen-imul-shaf hypervisor log

(In reply to comment #7)
> Could you try a couple brew builds that has emulation fixes?
> https://brewweb.devel.redhat.com/taskinfo?taskID=3384309 - imul fix
> https://brewweb.devel.redhat.com/taskinfo?taskID=3471412 - emulator resync with
> upstream

The hypervisor you provided (http://scratch.englab.brq.redhat.com/imammedo/xen-imul-shaf.gz) doesn't fix the issue on the same host (AMD Dual-Core Opteron(tm) 1220); the guest still hangs on reboot.

$ cat grub.conf
title xen-imul-shaf
	root (hd0,0)
	kernel /xen-imul-shaf.gz loglvl=all guest_loglvl=all
	module /vmlinuz-2.6.18-274.el5xen ro root=/dev/VolGroup00/LogVol00
	module /initrd-2.6.18-274.el5xen.img

Comment 10 Qixiang Wan 2011-07-13 14:56:17 UTC
no luck with xen-emul_sync.gz either.

This is the commit that introduced the regression:

c308e27 [xen] emulate injection of guest NMI

[1] 'git reset f90bbc0 --hard', build the hypervisor and boot up, there is no hang issue.
[2] 'git reset c308e27 --hard', build the hypervisor and boot up, guest hang when reboot with multiple vcpus.

Comment 11 Paolo Bonzini 2011-07-13 16:20:26 UTC
Probably a duplicate of bug 643295.

Comment 12 Paolo Bonzini 2011-07-13 16:21:13 UTC
... which is in turn a duplicate of bug 701608, even though at the time it was reported only on Intel.

Comment 13 RHEL Program Management 2011-08-04 04:20:39 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 17 Paolo Bonzini 2011-08-23 11:57:22 UTC
*** Bug 730221 has been marked as a duplicate of this bug. ***

Comment 18 Paolo Bonzini 2011-08-23 12:15:48 UTC
We can try upstream changesets 15897, 16398 and especially this hunk of 17655:

@@ -1266,6 +1278,15 @@ asmlinkage void svm_vmexit_handler(struc
             reason = TSW_call_or_int;
         if ( (vmcb->exitinfo2 >> 44) & 1 )
             errcode = (uint32_t)vmcb->exitinfo2;
+
+        /*
+         * Some processors set the EXITINTINFO field when the task switch
+         * is caused by a task gate in the IDT. In this case we will be
+         * emulating the event injection, so we do not want the processor
+         * to re-inject the original event!
+         */
+        vmcb->eventinj.bytes = 0;
+
         hvm_task_switch((uint16_t)vmcb->exitinfo1, reason, errcode);
         break;
     }

Other changesets relevant for these bugs are 15984, 16618, 17100, 17104/17105, but these are definitely too big to be backported; the backport would amount to a rewrite of large parts of the code.
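
To make the intent of the 17655 hunk concrete, here is a toy model in Python. It is not the hypervisor code. The `Vmcb` class and `handle_task_switch_exit` are hypothetical names, and only the fields the hunk touches are modeled. Bit 44 of EXITINFO2 (error code valid) appears in the hunk itself; bits 36 (IRET) and 38 (far jump) are assumptions taken from AMD's architecture manual.

```python
from dataclasses import dataclass

# EXITINFO2 flags for a task-switch intercept. Bit 44 is used in the hunk
# above; bits 36 and 38 are assumptions based on AMD's manual.
TS_IRET = 1 << 36
TS_JMP = 1 << 38
TS_HAS_ERRCODE = 1 << 44


@dataclass
class Vmcb:
    """Hypothetical stand-in for the few VMCB fields the hunk touches."""
    exitinfo1: int = 0
    exitinfo2: int = 0
    eventinj: int = 0   # nonzero models an event queued for re-injection


def handle_task_switch_exit(vmcb):
    """Decode a task-switch exit and return (selector, reason, errcode)."""
    if vmcb.exitinfo2 & TS_IRET:
        reason = "iret"
    elif vmcb.exitinfo2 & TS_JMP:
        reason = "jmp"
    else:
        reason = "call_or_int"

    errcode = None
    if vmcb.exitinfo2 & TS_HAS_ERRCODE:
        errcode = vmcb.exitinfo2 & 0xFFFFFFFF

    # The fix: the task switch is emulated in software, so any event still
    # queued for re-injection is dropped rather than delivered again.
    vmcb.eventinj = 0

    selector = vmcb.exitinfo1 & 0xFFFF
    return selector, reason, errcode
```

The point of the model is the single assignment to `eventinj`: without it, an event left in that field would still be pending when the guest resumes, which is the re-injection the hunk's comment warns against.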

Comment 19 Paolo Bonzini 2011-08-23 15:24:56 UTC
Created attachment 519480 [details]
test hypervisor

Please test with the attached hypervisor binary.  If it still fails, please capture a memory dump and place it on some FTP server so that I can analyze the failure.  Thanks!

Comment 20 Qixiang Wan 2011-08-25 13:33:49 UTC
(In reply to comment #19)
> Please test with the attached hypervisor binary.  If it still fails, please
> capture a memory dump and place it on some FTP server so that I can analyze the
> failure.  Thanks!

This hypervisor works for me. With this hypervisor + the -274 Dom0 kernel, the Windows XP i386 guest (without PV drivers) reboots successfully with 2 vcpus on an AMD 1216 processor (without the test hypervisor, the same configuration hangs while rebooting on this processor).

Comment 23 Jarod Wilson 2011-09-02 15:40:39 UTC
Patch(es) available in kernel-2.6.18-284.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 24 Jarod Wilson 2011-09-02 17:42:19 UTC
Patch(es) available in kernel-2.6.18-284.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 26 Jinxin Zheng 2011-12-13 05:57:49 UTC
I reproduced this bug on AMD 1216 using kernel-xen -274.

Guest: Windows XP, 2003, Win7, 2008 (all 32 bit).
Host: kernel-xen-x86_64.

-274: All the guests hang on reboot. The mouse pointer stops moving after a while, and xm top shows 100% CPU usage for the guest. The guest stops responding and does not reboot.

-300: XP, Win7 and Win2008 are confirmed fixed. The reboot no longer hangs.
A note on Win2003: at first it still hung on reboot, but after a host reboot the problem could no longer be reproduced, and the guest now reboots fine. I'll check whether this is a separate problem if I can reproduce it.

I tested on some other AMD processors, but the 1216 is the only one where this bug could be reproduced. With the -300 kernel, guest reboot works fine on them.

Comment 27 errata-xmlrpc 2012-02-21 03:44:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html

