Bug 1391707

Summary: Hyper-V on KVM nested virtualization does not work
Product: Fedora             Reporter: Ladi Prosek <lprosek>
Component: kernel           Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED NEXTRELEASE  QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 25
Hardware: Unspecified
OS: Unspecified
Type: Bug
CC: bdas, berrange, clalancette, crobinso, cz172638, ehabkost, extras-orphan, gansalmon, ichavero, itamar, jonathan, kernel-maint, madhu.chinakonda, markmc, mchehab, pbonzini, quintela, rkrcmar, virt-maint
Last Closed: 2016-12-22 20:24:05 UTC
Attachments:
  VMCS dump
  Hyper-V on KVM

Description Ladi Prosek 2016-11-03 20:10:50 UTC
This BZ tracks the Hyper-V on KVM upstream enablement work.

Hyper-V running on KVM with the two patches applied:
http://www.spinics.net/lists/kvm/msg139187.html
http://www.spinics.net/lists/kvm/msg139377.html

can start a nested VM, but it quickly gets into a page fault loop and doesn't make any progress. In fact, the trace suggests that it does not execute a single instruction. From the user's point of view, the VM is stuck at the 'Hyper-V' splash screen, consuming one full core.

Hyper-V sets up the nested VM with protected mode and paging enabled (Guest CR0 = 0x80000031 has both PE and PG set):

vmwrite(CR0 read shadow, 60000010)
vmwrite(Guest CR0, 80000031)

and CS:IP is set to F000:FFF0 as expected.

What's weird is that CR3 is (explicitly!) set to 0.

vmwrite(Guest CR3, 0)

and it really is 0 that we end up setting in kvm_set_cr3 when running the L2 guest.

When accessed as a guest physical address (in terms of kvm_vcpu_read_guest_page), F000:FFF0 contains what seems to be reasonable bootstrap code:

ea5be000f0      jmp     F000:E05B

followed by an ASCII representation of "06/23/99".
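
For reference, the probe amounts to a guest-physical read at the real-mode linear address. A minimal sketch, assuming a vcpu pointer at the debug site (not the exact debug code used here):

  /* Real mode translates segment:offset linearly, so F000:FFF0 is
   * guest-physical (0xF000 << 4) + 0xFFF0 = 0xFFFF0. */
  u8 buf[16];
  gpa_t gpa = (0xF000UL << 4) + 0xFFF0;
  int r = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, buf,
                                   offset_in_page(gpa), sizeof(buf));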

But because of the zero CR3, there's a page fault VM exit right at the jmp instruction. This gets correctly forwarded to Hyper-V as a nested VM exit and the handler, among other things, sets CS:IP to 0:0

vmwrite(Guest CS, 0)
vmwrite(Guest RIP, 0)

Then it loops page faulting (and nested-VM-exiting) forever, although at some point it switches to IP=2 for some reason.

I'm going to disassemble the Hyper-V code that does the Guest CR3 vmwrite because that's where things get weird, as far as I can tell. Questions to answer:

* What is the canonical way of virtualizing real mode?
* Should I expect paging data structures for the lower 1 MiB fully set up before the guest is first entered?
* What kind of address translation is involved in kvm_vcpu_read_guest_page, i.e. why does it appear to work? If Hyper-V uses EPT then why does it need to run its guest with paging on?

Comment 1 Ladi Prosek 2016-11-04 13:15:52 UTC
(In reply to Ladi Prosek from comment #0)
> * What kind of address translation is involved in kvm_vcpu_read_guest_page,
> i.e. why does it appear to work? If Hyper-V uses EPT then why does it need
> to run its guest with paging on?

"06/23/99" is the SeaBIOS date, kvm_vcpu_read_guest_page is reading the L1 physical memory. Doh! Also confirming that we're running in a nested EPT configuration. Why paging then? Does CR3=0 have a special meaning?

Comment 2 Ladi Prosek 2016-11-04 15:40:48 UTC
Hacking the nested mmu code to treat CR3=0 as paging off (g_context->gva_to_gpa = nonpaging_gva_to_gpa_nested) shows the true contents of guest physical F000:FFF0 as:

ea48ff00f0      jmp     F000:FF48

followed by "04/28/16" which is the Hyper-V BIOS date as confirmed by grepping \Windows\System32\vmchipset.dll

So that's a decent indication that Hyper-V either didn't want to enable paging or wanted to set up identity page tables for real mode memory.
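
The hack itself is tiny; roughly (a sketch against init_kvm_nested_mmu() in arch/x86/kvm/mmu.c, a debugging aid only, with the surrounding code elided):

  /* Debug hack: treat a zero CR3 as "paging off" so the nested L2
   * GVA->GPA translation degenerates to the identity map. */
  if (!is_paging(vcpu) || kvm_read_cr3(vcpu) == 0)
          g_context->gva_to_gpa = nonpaging_gva_to_gpa_nested;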

Comment 3 Paolo Bonzini 2016-11-04 23:19:36 UTC
> * What is the canonical way of virtualizing real mode?

The canonical way is to do one of these:

* set CR0.PE=CR0.PG=0 (only if the host has "unrestricted guest" support)

* set CR0.PE=CR0.PG=EFLAGS.VM=1 (vm86 mode), place an identity page table somewhere in guest memory, set CR3 to point to that identity page table.

Can you check if unrestricted guest support is enabled (bit 7 in secondary execution controls---try looking in traces for a vmwrite to field 0x401e, I sent a patch to kvm.org that adds a vmwrite tracepoint)?

The latter is needed because processors until Westmere didn't support VMX with CR0.PE=0 or CR0.PG=0.
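
As an illustration of the second option, an identity mapping of the low 4 MiB could be built like this (a hypothetical sketch; the table locations and helper are made up, this is not Hyper-V's or KVM's code):

  #include <stdint.h>
  #include <string.h>

  #define PTE_P   0x001  /* present */
  #define PTE_RW  0x002  /* writable */
  #define PTE_US  0x004  /* user; vm86 code runs at CPL 3 */

  /* Hypothetical guest-physical locations for the two tables. */
  #define PGDIR_GPA  0x9000
  #define PGTBL_GPA  0xA000

  /* 32-bit, non-PAE identity map of the low 4 MiB: page directory
   * entry 0 points to a page table whose 1024 PTEs map page i to
   * frame i. The guest's CR3 is then pointed at PGDIR_GPA. */
  static void build_identity_map(uint32_t *pgdir, uint32_t *pgtbl)
  {
          int i;

          memset(pgdir, 0, 4096);
          for (i = 0; i < 1024; i++)
                  pgtbl[i] = (i << 12) | PTE_P | PTE_RW | PTE_US;
          pgdir[0] = PGTBL_GPA | PTE_P | PTE_RW | PTE_US;
  }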

CR3=0 doesn't have any meaning.

Comment 4 Ladi Prosek 2016-11-07 15:33:17 UTC
(In reply to Paolo Bonzini from comment #3)
> > * What is the canonical way of virtualizing real mode?
> 
> The canonical way is to do one of these:
> 
> * set CR0.PE=CR0.PG=0 (only if the host has "unrestricted guest" support)
> 
> * set CR0.PE=CR0.PG=EFLAGS.VM=1 (vm86 mode), place an identity page table
> somewhere in guest memory, set CR3 to point to that identity page table.

Thanks!

> Can you check if unrestricted guest support is enabled (bit 7 in secondary
> execution controls---try looking in traces for a vmwrite to field 0x401e, I
> sent a patch to kvm.org that adds a vmwrite tracepoint)?

Unrestricted guest support is *not* enabled.

> The latter is needed because processors until Westmere didn't support VMX
> with CR0.PE=0 or CR0.PG=0.

So it looks like they are after the latter. It all checks out except for CR3.
 
kvm_set_cr3 when entering L2 guest: CR3=0, CR0=80000031, RFLAGS=30002

> CR3=0 doesn't have any meaning.

I'll see if I can figure out where the value is coming from.

Comment 5 Ladi Prosek 2016-11-11 07:57:44 UTC
Quick update:

* Hyper-V code is inaccessible using the standard remote Windows kernel debugging techniques; the memory region cannot even be read from windbg

* My only tool so far has been QEMU remote debugging

* The "0" in vmwrite(0x6802, 0) comes from a couple of stack frames above, it is hard-coded, and there's no obvious conditional explaining the intent behind it

* Bin-grepping for the code pattern being executed confirmed the module to be hvix64.exe, located under \Windows\WinSxS

* Microsoft doesn't provide public symbols for hvix64.exe


Two avenues going forward:

* Static-analyze the heck out of hvix64.exe, maybe leverage prior (security?) research, use powerful reverse engineering tools

* Find another hypervisor or emulator under which Hyper-V works and can be debugged/single-stepped, then compare the code paths and see where they diverge

Comment 6 Ladi Prosek 2016-11-16 14:05:10 UTC
Created attachment 1221159 [details]
VMCS dump

(In reply to Ladi Prosek from comment #5)
> * Find another hypervisor or emulator under which Hyper-V works and can be
> debugged/single-stepped, then compare the code paths and see where they
> diverge

I have installed VMware, verified that nested Hyper-V works, and confirmed that the guest can be debugged via a gdb stub, much like with KVM.

I was able to dump the L2 VMCS as seen by the L1 guest at the point when it's about to execute VMLAUNCH, and at the first non-external-interrupt VMEXIT following the VMLAUNCH. The VMCS was dumped by injecting a series of VMREAD instructions into an unused code region of L1 and executing them by manipulating the instruction pointer and setting breakpoints in gdb. Hyper-V funnels VMEXITs from all VMs to a single entry point, so to distinguish the nested VM of interest from root partition exits, I hijacked the VMEXIT vector to point to another piece of injected code, which checks the exit reason and jumps to the Hyper-V handler if it's an external interrupt:

  push rcx
  push rdx
  mov rdx, 0x4402                  ; VMCS field: VM-exit reason
  vmread rcx, rdx                  ; rcx = exit reason of this VMEXIT
  cmp rcx, 1                       ; 1 = external interrupt
  pop rdx
  pop rcx
  jne <where the breakpoint is>    ; anything else: stop in gdb
  jmp <Hyper-V VMEXIT entrypoint>  ; external interrupt: back to Hyper-V

The full table is attached.

A few observations:

1) Guest CR3 (field 0x6802) is seriously 0 everywhere. Really!

2) The notable difference at VMLAUNCH is that under KVM, Hyper-V sets the APIC-access address (field 0x2014) and bit 0 (Virtualize APIC accesses) in the Secondary processor-based VM-execution controls (field 0x401e). So it looks like VMware doesn't support this feature. Benign? The rest is identical, modulo pointer values.

3) The first VMEXIT is very different. While under KVM we exit right at IP=0xFFF0 with exception 0xE - Page Fault (this matches the KVM-side analysis above), under VMware the first exit happens later and it's exception 0xD - General Protection Fault at IP=0xC0A0, see field 0x681e.

4) Under KVM, all guest PDPTEs are zeroed on exit, see fields 0x280a-0x2811. This looks extremely relevant. If I'm reading the manual right, the conditions under which PDPTEs are loaded from the memory pointed to by CR3 are well specified and one can legitimately get away with leaving CR3 uninitialized. See "26.3.2.4 Loading Page-Directory-Pointer-Table Entries". The working theory is that KVM doesn't correctly set up the PDPTE fields when building VMCS02.
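
In code form, the SDM rule reduces to roughly this predicate (my paraphrase of 26.3.2.4, not kernel code); when it holds on a VM entry with EPT enabled, the PDPTEs come from the GUEST_PDPTR0..3 VMCS fields and memory at CR3 is never read, so CR3 can legitimately stay 0:

  /* SDM 26.3.2.4 paraphrased: with "enable EPT" set, VM entry loads the
   * guest PDPTEs from the VMCS fields iff the guest enters PAE paging
   * (paging on, PAE on, not in long mode). */
  static bool entry_uses_vmcs_pdptes(u64 cr0, u64 cr4, u64 efer, bool ept)
  {
          return ept && (cr0 & X86_CR0_PG) && (cr4 & X86_CR4_PAE) &&
                 !(efer & EFER_LMA);
  }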

Comment 7 Ladi Prosek 2016-11-16 15:20:09 UTC
Created attachment 1221251 [details]
Hyper-V on KVM

(In reply to Ladi Prosek from comment #6)
> 4) Under KVM, all guest PDPTEs are zeroed on exit, see fields 0x280a-0x2811.
> This looks extremely relevant. If I'm reading the manual right, the
> conditions under which PDPTEs are loaded from the memory pointed to by CR3
> are well specified and one can legitimately get away with leaving CR3
> uninitialized. See "26.3.2.4 Loading Page-Directory-Pointer-Table Entries".
> The working theory is that KVM doesn't correctly set up the PDPTE fields
> when building VMCS02.

Indeed, guest PDPTEs are incorrectly overwritten in ept_load_pdptrs. This fixes it for me but I'll spend some more time on it before posting the patch.

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3876,7 +3876,7 @@ static void ept_load_pdptrs(struct kvm_vcpu *vcpu)
                      (unsigned long *)&vcpu->arch.regs_dirty))
                return;
 
-       if (is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu)) {
+       if (is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu) && !is_guest_mode(vcpu)) {
                vmcs_write64(GUEST_PDPTR0, mmu->pdptrs[0]);
                vmcs_write64(GUEST_PDPTR1, mmu->pdptrs[1]);
                vmcs_write64(GUEST_PDPTR2, mmu->pdptrs[2]);
@@ -3888,7 +3888,7 @@ static void ept_save_pdptrs(struct kvm_vcpu *vcpu)
 {
        struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
 
-       if (is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu)) {
+       if (is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu) && !is_guest_mode(vcpu)) {
                mmu->pdptrs[0] = vmcs_read64(GUEST_PDPTR0);
                mmu->pdptrs[1] = vmcs_read64(GUEST_PDPTR1);
                mmu->pdptrs[2] = vmcs_read64(GUEST_PDPTR2);

Comment 8 Ladi Prosek 2016-11-22 16:31:31 UTC
The fix in comment #7 works great as long as the nested guest stays in real mode. Attempting to run an OS installer or boot a Hyper-V Gen2 VM fails with a triple fault.

Here's what's happening when booting a Gen2 VM, from the L2 point of view (pseudocode):
1. mov cr4, 0x640 (not interesting)
2. mov cr3, 0xFFFFE000
3. mov cr4, 0x660 (+PAE)
4. mov cr0, 0xc0000033 (+PE,+PG)

From the L0 point of view, there is a bunch of CR_ACCESS VMEXITs which are forwarded to L1. This seems expected.

In particular, step 2 will result in L1 executing vmwrite(Guest CR3, 0xFFFFE000). This vmwrite is intercepted (shadow VMCS is disabled for simplicity) and written to vmcs12. KVM attempts to use the value on re-entry, but it doesn't stick:

  kvm_set_cr3(vcpu, vmcs12->guest_cr3); // 0xFFFFE000
  printk( .. vm_read_cr3(vcpu) .. ); // 0x0

because kvm_set_cr3 returns early:

	} else if (is_pae(vcpu) && is_paging(vcpu) &&
		   !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
		return 1;

Hmmm.. paging is *not* enabled in L2 yet at this point so it should be legal to load anything into CR3, or not?

Comment 9 Ladi Prosek 2016-11-25 08:50:53 UTC
This is now fully understood and a KVM patch has been posted:
http://www.spinics.net/lists/kvm/msg141478.html

Comment 10 Cole Robinson 2016-12-22 20:24:05 UTC
Looks like the last referenced patch is upstream now:

commit 7ca29de21362de242025fbc1c22436e19e39dddc
Author: Ladi Prosek <lprosek>
Date:   Wed Nov 30 16:03:08 2016 +0100

    KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT


Meaning it will be in Fedora 25 after the next kernel release. Closing as NEXTRELEASE; if this is needed sooner, feel free to reopen and maybe the kernel devs will backport it.