| Summary: | Hyper-V on KVM nested virtualization does not work |
|---|---|
| Product: | Fedora |
| Component: | kernel |
| Version: | 25 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Severity: | unspecified |
| Priority: | unspecified |
| Status: | CLOSED NEXTRELEASE |
| Type: | Bug |
| Reporter: | Ladi Prosek <lprosek> |
| Assignee: | Kernel Maintainer List <kernel-maint> |
| QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| CC: | bdas, berrange, clalancette, crobinso, cz172638, ehabkost, extras-orphan, gansalmon, ichavero, itamar, jonathan, kernel-maint, madhu.chinakonda, markmc, mchehab, pbonzini, quintela, rkrcmar, virt-maint |
| Last Closed: | 2016-12-22 20:24:05 UTC |
Description
Ladi Prosek
2016-11-03 20:10:50 UTC
(In reply to Ladi Prosek from comment #0)
> * What kind of address translation is involved in kvm_vcpu_read_guest_page,
>   i.e. why does it appear to work? If Hyper-V uses EPT then why does it
>   need to run its guest with paging on?

"06/23/99" is the SeaBIOS date; kvm_vcpu_read_guest_page is reading L1 physical memory. Doh! Also confirming that we're running in a nested EPT configuration.

Why paging, then? Does CR3=0 have a special meaning? Hacking the nested MMU code to treat CR3=0 as paging off (g_context->gva_to_gpa = nonpaging_gva_to_gpa_nested) shows the true contents of guest physical F000:FFF0 as:

    ea48ff00f0    jmp F000:FF48

followed by "04/28/16", which is the Hyper-V BIOS date, as confirmed by grepping \Windows\System32\vmchipset.dll. So that's a decent indication that Hyper-V either didn't want to enable paging, or wanted to set up identity page tables for real-mode memory.
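Those five bytes are a real-mode far jump. A small standalone check (illustration only, not from the bug) of how the opcode decodes and which linear address it lands on:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Bytes observed at guest-physical F000:FFF0, the x86 reset vector. */
        const uint8_t code[5] = { 0xEA, 0x48, 0xFF, 0x00, 0xF0 };

        if (code[0] == 0xEA) { /* JMP ptr16:16, a real-mode far jump */
            uint16_t offset  = (uint16_t)(code[1] | (code[2] << 8)); /* 0xFF48 */
            uint16_t segment = (uint16_t)(code[3] | (code[4] << 8)); /* 0xF000 */
            /* Real-mode linear address = segment * 16 + offset. */
            uint32_t linear = ((uint32_t)segment << 4) + offset;
            printf("jmp %04X:%04X -> linear 0x%05X\n", segment, offset, linear);
        }
        return 0;
    }

It prints "jmp F000:FF48 -> linear 0xFFF48", matching the disassembly above; 0xFFF48 sits in the top 64 KiB of the first megabyte, where the BIOS lives.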
Paolo Bonzini (comment #3):

> * What is the canonical way of virtualizing real mode?

The canonical way is to do one of these:
* set CR0.PE=CR0.PG=0 (only if the host has "unrestricted guest" support)
* set CR0.PE=CR0.PG=EFLAGS.VM=1 (vm86 mode), place an identity page table somewhere in guest memory, and set CR3 to point to that identity page table (see the sketch after this comment).
Can you check whether unrestricted guest support is enabled (bit 7 in the secondary processor-based execution controls)? Try looking in traces for a vmwrite to field 0x401e; I sent a patch to kvm.org that adds a vmwrite tracepoint.
The latter is needed because processors before Westmere didn't support VMX with CR0.PE=0 or CR0.PG=0.
CR3=0 doesn't have any meaning.
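To make the second option concrete, here is a minimal sketch of the identity mapping involved (names such as build_identity_map_1mb are invented for illustration; pd and pt are assumed to be two spare page-aligned 4 KiB guest frames):

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical sketch: a 32-bit, non-PAE identity mapping of the first
     * 1 MiB, the kind of page table a hypervisor could drop into guest
     * memory before entering vm86 mode with CR0.PE=CR0.PG=EFLAGS.VM=1.
     * CR3 would then hold the guest-physical address of 'pd'. */
    #define PTE_P   0x001u  /* present */
    #define PTE_RW  0x002u  /* writable */
    #define PTE_US  0x004u  /* user; required, since vm86 code runs at CPL 3 */

    static void build_identity_map_1mb(uint32_t *pd, uint32_t *pt,
                                       uint32_t pt_guest_phys)
    {
        memset(pd, 0, 4096);
        memset(pt, 0, 4096);

        /* One page table covers 4 MiB; filling the first 256 entries maps
         * the 1 MiB of real-mode memory with virtual == physical. */
        for (uint32_t i = 0; i < 256; i++)
            pt[i] = (i << 12) | PTE_P | PTE_RW | PTE_US;

        /* Page-directory entry 0 covers virtual addresses 0..4 MiB. */
        pd[0] = pt_guest_phys | PTE_P | PTE_RW | PTE_US;
    }

With this layout, real-mode code running under vm86 sees virtual == physical for the whole first megabyte; the U/S bit on each entry is what keeps those CPL 3 accesses from faulting.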
(In reply to Paolo Bonzini from comment #3)
> > * What is the canonical way of virtualizing real mode?
>
> The canonical way is to do one of these:
> * set CR0.PE=CR0.PG=0 (only if the host has "unrestricted guest" support)
> * set CR0.PE=CR0.PG=EFLAGS.VM=1 (vm86 mode), place an identity page table
>   somewhere in guest memory, and set CR3 to point to that identity page
>   table.

Thanks!

> Can you check whether unrestricted guest support is enabled (bit 7 in the
> secondary processor-based execution controls)? Try looking in traces for a
> vmwrite to field 0x401e; I sent a patch to kvm.org that adds a vmwrite
> tracepoint.

Unrestricted guest support is *not* enabled.

> The latter is needed because processors before Westmere didn't support VMX
> with CR0.PE=0 or CR0.PG=0.

So it looks like they are after the latter. It all checks out except for CR3. kvm_set_cr3 when entering the L2 guest: CR3=0, CR0=80000031, RFLAGS=30002.

> CR3=0 doesn't have any meaning.

I'll see if I can figure out where the value is coming from.

Quick update:

* Hyper-V code is inaccessible using the standard remote Windows kernel debugging techniques; the memory region cannot even be read from windbg.
* My only tool so far has been QEMU remote debugging.
* The "0" in vmwrite(0x6802, 0) comes from a couple of stack frames above; it is hard-coded, and there's no obvious conditional explaining the intent behind it.
* Bin-grepping for the code pattern being executed confirmed the module to be hvix64.exe, located under \Windows\WinSxS.
* Microsoft doesn't provide public symbols for hvix64.exe.

Two avenues going forward:

* Static-analyze the heck out of hvix64.exe, maybe leveraging prior (security?) research and powerful reverse engineering tools.
* Find another hypervisor or emulator under which Hyper-V works and can be debugged/single-stepped, then compare the code paths and see where they diverge.

Created attachment 1221159 [details]
VMCS dump

(In reply to Ladi Prosek from comment #5)
> * Find another hypervisor or emulator under which Hyper-V works and can be
>   debugged/single-stepped, then compare the code paths and see where they
>   diverge

I have installed VMware, verified that nested Hyper-V works there, and the guest can be debugged via a gdb stub very similar to KVM's. I was able to dump the L2 VMCS as seen by the L1 guest at the point when it is about to execute VMLAUNCH, and again at the first non-external-interrupt VMEXIT following the VMLAUNCH. The VMCS was dumped by injecting a series of VMREAD instructions into an unused code region of L1 and executing them by manipulating the instruction pointer and setting breakpoints in gdb.

Hyper-V funnels VMEXITs from all VMs to a single entry point, so to distinguish the nested VM of interest from root-partition exits, I hijacked the VMEXIT vector to point to another piece of injected code, which checks the exit reason and jumps back to the Hyper-V handler if it is an external interrupt:

    push rcx
    push rdx
    mov rdx, 0x4402        ; field encoding of VM_EXIT_REASON
    vmread rcx, rdx        ; rcx = basic exit reason
    cmp rcx, 1             ; reason 1 == external interrupt
    pop rdx
    pop rcx
    jne <where the breakpoint is>
    jmp <Hyper-V VMEXIT entrypoint>

The full table is attached.
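As a reading aid for the dump table and the observations below, these are the VMCS field encodings this thread refers to, under the names the kernel gives them in arch/x86/include/asm/vmx.h:

    /* VMCS field encodings (Intel SDM, VMCS field encoding appendix),
     * restricted to the fields referenced in this bug. */
    enum {
        APIC_ACCESS_ADDR          = 0x2014, /* 64-bit control: APIC-access address */
        GUEST_PDPTR0              = 0x280a, /* guest PDPTEs span 0x280a-0x2811     */
        GUEST_PDPTR1              = 0x280c, /* (odd encodings are the high halves) */
        GUEST_PDPTR2              = 0x280e,
        GUEST_PDPTR3              = 0x2810,
        SECONDARY_VM_EXEC_CONTROL = 0x401e, /* bit 0: virtualize APIC accesses,
                                             * bit 7: unrestricted guest          */
        VM_EXIT_REASON            = 0x4402, /* basic reason 1 == external interrupt */
        GUEST_CR3                 = 0x6802,
        GUEST_RIP                 = 0x681e,
    };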
A few observations:

1) Guest CR3 (field 0x6802) is seriously 0 everywhere. Really!

2) The notable difference at VMLAUNCH is that under KVM, Hyper-V sets the APIC-access address (field 0x2014) and bit 0 (Virtualize APIC accesses) in the secondary processor-based VM-execution controls (field 0x401e), so it looks like VMware doesn't support this feature. Benign? The rest is identical, modulo pointer values.

3) The first VMEXIT is very different. Under KVM we exit right at IP=0xFFF0 with exception 0xE (Page Fault), which matches the KVM-side analysis above; under VMware the first exit happens later, with exception 0xD (General Protection Fault) at IP=0xC0A0, see field 0x681e.

4) Under KVM, all guest PDPTEs are zeroed on exit, see fields 0x280a-0x2811. This looks extremely relevant. If I'm reading the manual right, the conditions under which PDPTEs are loaded from the memory pointed to by CR3 are well specified, and one can legitimately get away with leaving CR3 uninitialized. See "26.3.2.4 Loading Page-Directory-Pointer-Table Entries".

The working theory is that KVM doesn't correctly set up the PDPTE fields when building vmcs02.

Created attachment 1221251 [details]
Hyper-V on KVM

(In reply to Ladi Prosek from comment #6)
> 4) Under KVM, all guest PDPTEs are zeroed on exit, see fields 0x280a-0x2811.
>    This looks extremely relevant. If I'm reading the manual right, the
>    conditions under which PDPTEs are loaded from the memory pointed to by
>    CR3 are well specified, and one can legitimately get away with leaving
>    CR3 uninitialized. See "26.3.2.4 Loading Page-Directory-Pointer-Table
>    Entries". The working theory is that KVM doesn't correctly set up the
>    PDPTE fields when building vmcs02.

Indeed, guest PDPTEs are incorrectly overwritten in ept_load_pdptrs. This fixes it for me, but I'll spend some more time on it before posting the patch.

    --- a/arch/x86/kvm/vmx.c
    +++ b/arch/x86/kvm/vmx.c
    @@ -3876,7 +3876,7 @@ static void ept_load_pdptrs(struct kvm_vcpu *vcpu)
     		      (unsigned long *)&vcpu->arch.regs_dirty))
     		return;
     
    -	if (is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu)) {
    +	if (is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu) && !is_guest_mode(vcpu)) {
     		vmcs_write64(GUEST_PDPTR0, mmu->pdptrs[0]);
     		vmcs_write64(GUEST_PDPTR1, mmu->pdptrs[1]);
     		vmcs_write64(GUEST_PDPTR2, mmu->pdptrs[2]);
     		vmcs_write64(GUEST_PDPTR3, mmu->pdptrs[3]);
    @@ -3888,7 +3888,7 @@ static void ept_save_pdptrs(struct kvm_vcpu *vcpu)
     {
     	struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
     
    -	if (is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu)) {
    +	if (is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu) && !is_guest_mode(vcpu)) {
     		mmu->pdptrs[0] = vmcs_read64(GUEST_PDPTR0);
     		mmu->pdptrs[1] = vmcs_read64(GUEST_PDPTR1);
     		mmu->pdptrs[2] = vmcs_read64(GUEST_PDPTR2);
     		mmu->pdptrs[3] = vmcs_read64(GUEST_PDPTR3);

The fix in comment #7 works great as long as the nested guest stays in real mode. Attempting to run an OS installer or boot a Hyper-V Gen2 VM fails with a triple fault. Here's what's happening when booting a Gen2 VM, from the L2 point of view (pseudocode):

    1. mov cr4, 0x640        (not interesting)
    2. mov cr3, 0xFFFFE000
    3. mov cr4, 0x660        (+PAE)
    4. mov cr0, 0xc0000033   (+PE, +PG)

From the L0 point of view, there is a bunch of CR_ACCESS VMEXITs which are forwarded to L1. This seems expected. In particular, step 2 results in L1 executing vmwrite(Guest CR3, 0xFFFFE000). This vmwrite is intercepted (shadow VMCS is disabled for simplicity) and the value is written to vmcs12. The value is then supposed to be applied on re-entry, but it doesn't stick:

    kvm_set_cr3(vcpu, vmcs12->guest_cr3);   // 0xFFFFE000
    printk( .. kvm_read_cr3(vcpu) .. );     // 0x0

because kvm_set_cr3 returns early:

    } else if (is_pae(vcpu) && is_paging(vcpu) &&
    	   !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
    	return 1;

Hmm, paging is *not* enabled in L2 yet at this point, so it should be legal to load anything into CR3, or not?

This is now fully understood and a KVM patch has been posted: http://www.spinics.net/lists/kvm/msg141478.html
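For reference, the VM-entry rule from SDM section 26.3.2.4 that the PDPTE analysis above relies on can be paraphrased in C (illustration only, not KVM code; the bit positions are architectural):

    #include <stdint.h>

    /* Paraphrase of Intel SDM 26.3.2.4: on VM entry, the guest PDPTEs are
     * taken from the GUEST_PDPTR0-3 VMCS fields rather than loaded from
     * the memory CR3 points at, exactly when EPT is enabled and the guest
     * will be in PAE paging mode after the entry. */
    static int pdptes_come_from_vmcs(int ept_enabled, uint64_t cr0,
                                     uint64_t cr4, uint64_t efer)
    {
        int pg  = (cr0  >> 31) & 1;  /* CR0.PG:   paging on        */
        int pae = (cr4  >>  5) & 1;  /* CR4.PAE:  PAE enabled      */
        int lma = (efer >> 10) & 1;  /* EFER.LMA: long mode active */

        /* PAE paging mode == PG && PAE && !LMA */
        return ept_enabled && pg && pae && !lma;
    }

In that case CR3 is not dereferenced on entry at all, which is why leaving it at 0 is architecturally legal, and why zeroed GUEST_PDPTR fields in vmcs02 are fatal.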
Looks like the last referenced patch is upstream now:

    commit 7ca29de21362de242025fbc1c22436e19e39dddc
    Author: Ladi Prosek <lprosek>
    Date:   Wed Nov 30 16:03:08 2016 +0100

        KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
Meaning it will be in Fedora 25 after the next kernel release. Closing as NEXTRELEASE; if this is needed sooner, feel free to reopen and maybe the kernel devs will backport it.