Bug 529857

Summary: Booting Windows 2008 R2 64-bit domU results in BSOD (STOP 0x0000007E)
Product: Red Hat Enterprise Linux 5 Reporter: Flavio Leitner <fleitner>
Component: kernel-xen Assignee: Xen Maintainance List <xen-maint>
Status: CLOSED DUPLICATE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: urgent    
Version: 5.4 CC: clalance, james.brown, pbonzini, raud, tao, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-10-23 10:28:59 UTC Type: ---

Description Flavio Leitner 2009-10-20 13:40:23 UTC
Description:
On some dom0 hosts, booting a Windows 2008 R2 64-bit domU results in a nearly
immediate Windows blue-screen crash (STOP 0x0000007E).  This occurs with
disk images that work normally on many different hosts.

This also occurs when installing Windows 2008 R2 64-bit inside a domU from a
virtual CDROM, using basic QEMU emulation: the initial installation goes
well, but upon first reboot the system crashes in the same manner.

Thus far, the only difference found between the systems that work and those
that don't is that the systems which *don't* work have a specific revision of
Intel Xeon processor (Xeon Family 6, Model 23, Stepping 10) and a similar BIOS.
This may be unrelated, though.

How reproducible:
100% on certain hosts.

Steps to Reproduce:
Install or boot a Windows 2008 R2 domU.

Additional info:
Here are some details from an affected test host:
Kernel:  2.6.18-92.1.13.el5.a3finerscheduler_msi_backport.38025xen #1 SMP x86_64 x86_64 x86_64 GNU/Linux

xm info:
release                : 2.6.18-92.1.13.el5.a3finerscheduler_msi_backport.38025xen
version                : #1 SMP Wed Sep 2 06:42:46 SAST 2009
machine                : x86_64
nr_cpus                : 4
nr_nodes               : 1
sockets_per_node       : 1
cores_per_socket       : 4
threads_per_core       : 1
cpu_mhz                : 2666
hw_caps                : bfebfbff:20100800:00000000:00000140:040ce3bd:00000000:00000001
total_memory           : 16382
free_memory            : 12535
node_to_cpu            : node0:0-3
xen_major              : 3
xen_minor              : 1
xen_extra              : .2-92.1.13.el5.
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.1 20070105 (Red Hat 4.1.1-52)
cc_compile_by          : builder
cc_compile_domain      : ec2.internal
cc_compile_date        : Wed Sep  2 06:41:30 SAST 2009
xend_config_format     : 2

cpuinfo:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
stepping        : 10
cpu MHz         : 2666.760
cache size      : 6144 KB
physical id     : 3
siblings        : 1
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 6671.26
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:


# lspci | sort

00:00.0 Host bridge: Intel Corporation 5100 Chipset Memory Controller Hub (rev 90)
00:04.0 PCI bridge: Intel Corporation 5100 Chipset PCI Express x16 Port 4-7 (rev 90)
00:05.0 PCI bridge: Intel Corporation 5100 Chipset PCI Express x4 Port 5 (rev 90)
00:06.0 PCI bridge: Intel Corporation 5100 Chipset PCI Express x4 Port 6 (rev 90)
00:07.0 PCI bridge: Intel Corporation 5100 Chipset PCI Express x4 Port 7 (rev 90)
00:10.0 Host bridge: Intel Corporation 5100 Chipset FSB Registers (rev 90)
00:10.1 Host bridge: Intel Corporation 5100 Chipset FSB Registers (rev 90)
00:10.2 Host bridge: Intel Corporation 5100 Chipset FSB Registers (rev 90)
00:11.0 Host bridge: Intel Corporation 5100 Chipset Reserved Registers (rev 90)
00:13.0 Host bridge: Intel Corporation 5100 Chipset Reserved Registers (rev 90)
00:15.0 Host bridge: Intel Corporation 5100 Chipset DDR Channel 0 Registers (rev 90)
00:16.0 Host bridge: Intel Corporation 5100 Chipset DDR Channel 1 Registers (rev 90)
00:19.0 Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network Connection (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 5 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IR (ICH9R) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
06:00.0 Ethernet controller: Intel Corporation 82573V Gigabit Ethernet Controller (Copper) (rev 03)
07:03.0 VGA compatible controller: ASPEED Technology, Inc. AST2000

BIOS Information
       Vendor: Dell Computer Corporation
       Version: S45_3A20

System Information
       Manufacturer: Dell     
       Product Name: DCS CS24-SC           

Base Board Information
       Manufacturer: Dell     
       Product Name: S45                  


From http://aumha.org/a/stop.htm:
0x0000007E: SYSTEM_THREAD_EXCEPTION_NOT_HANDLED
A system thread generated an exception which the error handler did not catch. There are numerous individual causes for this problem, including hardware incompatibility, a faulty device driver or system service, or some software issues.

and then:
http://msdn.microsoft.com/en-gb/library/ms795746.aspx
Cause

The SYSTEM_THREAD_EXCEPTION_NOT_HANDLED bug check is a very common bug check. To interpret it, you must identify which exception was generated.

Common exception codes include the following:

   * 0x80000002: STATUS_DATATYPE_MISALIGNMENT indicates an unaligned data reference was encountered.
   * 0x80000003: STATUS_BREAKPOINT indicates a breakpoint or ASSERT was encountered when no kernel debugger was attached to the system.
   * 0xC0000005: STATUS_ACCESS_VIOLATION indicates a memory access violation occurred.



RHEL 5.4 has been tried with no success on this hardware.

Comment 5 Paolo Bonzini 2009-10-21 09:54:13 UTC
0xC0000096 is a privileged instruction exception, so there is some hope of fixing the issue.  Can you get a crash dump of the system after it gets the BSOD?

Testing the hypervisor should not be necessary; I was informed that the bug indeed affects only the 32-bit PAE kernel.

Comment 6 Flavio Leitner 2009-10-21 12:07:21 UTC
There is a Microsoft KB article about this for Windows Server 2008 R2-based systems:
http://support.microsoft.com/kb/974598

 -- Copy&Paste below for completeness: --

Assume that you enable the Hyper-V role on a computer that is running Windows Server 2008 R2. You restart the computer after you enable the Hyper-V role. However, you receive the following Stop error message during the restart operation:

Stop 0x0000007E (ffffffffc0000096, parameter2, parameter3, parameter4)
SYSTEM_THREAD_EXCEPTION_NOT_HANDLED 

Notes:
    * The parameters in these Stop error messages may vary, depending on the actual configuration.
    * The symptoms of a Stop error may vary, depending on your computer's system failure options. For example, the computer may restart when a Stop error occurs.

Cause:
This problem occurs because the system uses a C-state that is supported by the processor. However, the C-state is not supported by Hyper-V.

Resolution:
To resolve this problem, follow these steps:

   1. Disable Processor Virtualization in the BIOS.
   2. Start the computer normally.
   3. Apply this hotfix and then restart the computer.

Status:
Microsoft has confirmed that this is a problem in the Microsoft products that are listed in the "Applies to" section.

Workaround:
Important: This section, method, or task contains steps that tell you how to modify the registry. However, serious problems might occur if you modify the registry incorrectly. Therefore, make sure that you follow these steps carefully. For added protection, back up the registry before you modify it. Then, you can restore the registry if a problem occurs. For more information about how to back up and restore the registry, click the following article number to view the article in the Microsoft Knowledge Base:
322756  (http://support.microsoft.com/kb/322756/ ) How to back up and restore the registry in Windows
To work around this problem, follow these steps:

   1. Disable Processor Virtualization in the BIOS.
   2. Start the computer normally.
   3. Open an elevated command prompt, and then run the following command:
      reg add HKLM\System\CurrentControlSet\Control\Processor /v Capabilities /t REG_DWORD /d 0x0007E066
   4. Restart the computer.

This workaround adds a registry entry that disables the C2 state and the C3 state.

Comment 7 Issue Tracker 2009-10-22 09:04:42 UTC
Event posted on 10-22-2009 04:46am EDT by kentf

Here we go:

The XSAVE/XRSTOR feature is not supported in this Xen. The BIOS on some
of these boxes exposes the feature and on others does not - it appears
that all of the boxes with 5430 parts have it, and some of the 5410s. This
patch fixes HVM and PV, I think - it successfully boots Win 2008 Server R2
on a host that did not work before:

diff -Naur xen/arch/x86/hvm/hvm.c xen.new/arch/x86/hvm/hvm.c
--- xen/arch/x86/hvm/hvm.c      2009-10-22 01:08:55.000000000 -0700
+++ xen.new/arch/x86/hvm/hvm.c  2009-10-22 01:12:57.000000000 -0700
@@ -675,6 +675,7 @@
             struct vcpu *v = current;

             clear_bit(X86_FEATURE_MWAIT & 31, ecx);
+            clear_bit(X86_FEATURE_XSAVE & 31, ecx);

             if ( vlapic_hw_disabled(vcpu_vlapic(v)) )
                 clear_bit(X86_FEATURE_APIC & 31, edx);
diff -Naur xen/arch/x86/hvm/vmx/vmx.c xen.new/arch/x86/hvm/vmx/vmx.c
--- xen/arch/x86/hvm/vmx/vmx.c  2009-10-22 01:08:55.000000000 -0700
+++ xen.new/arch/x86/hvm/vmx/vmx.c      2009-10-22 01:16:48.000000000 -0700
@@ -1249,6 +1249,8 @@
      */
     boot_cpu_data.x86_capability[4] = cpuid_ecx(1);

+    clear_bit(X86_FEATURE_XSAVE, &boot_cpu_data.x86_capability);
+
     if ( !test_bit(X86_FEATURE_VMXE, &boot_cpu_data.x86_capability) )
         return 0;

diff -Naur xen/arch/x86/traps.c xen.new/arch/x86/traps.c
--- xen/arch/x86/traps.c        2009-10-22 01:08:55.000000000 -0700
+++ xen.new/arch/x86/traps.c    2009-10-22 01:14:52.000000000 -0700
@@ -615,6 +615,7 @@
             clear_bit(X86_FEATURE_SEP, &d);
         if ( !IS_PRIV(current->domain) )
             clear_bit(X86_FEATURE_MTRR, &d);
+        clear_bit(X86_FEATURE_XSAVE % 32, &c);
     }
     else if ( regs->eax == 0x80000001 )
     {
diff -Naur xen/include/asm-x86/cpufeature.h xen.new/include/asm-x86/cpufeature.h
--- xen/include/asm-x86/cpufeature.h    2007-12-06 09:48:39.000000000 -0800
+++ xen.new/include/asm-x86/cpufeature.h        2009-10-21 23:24:14.000000000 -0700
@@ -82,6 +82,7 @@
 #define X86_FEATURE_CID                (4*32+10) /* Context ID */
 #define X86_FEATURE_CX16        (4*32+13) /* CMPXCHG16B */
 #define X86_FEATURE_XTPR       (4*32+14) /* Send Task Priority Messages */
+#define X86_FEATURE_XSAVE      (4*32+26) /* XSAVE/XRESTOR feature set */

 /* VIA/Cyrix/Centaur-defined CPU features, CPUID level 0xC0000001, word 5 */
 #define X86_FEATURE_XSTORE     (5*32+ 2) /* on-CPU RNG present (xstore insn) */



This event sent from IssueTracker by jabrown 
 issue 354327

Comment 8 Chris Lalancette 2009-10-22 09:50:02 UTC
We've just put a similar patch into the RHEL-5.5 kernel, which should also fix the issue.  Would it be possible to try out the kernel here:

http://people.redhat.com/dzickus/el5/170.el5/

Or at least try out the patch entitled "xen-mask-out-xsave-for-hvm-guest"?

Thanks,
Chris Lalancette