Red Hat Bugzilla – Bug 459067
32 bit 2.6.27-rc3 Xen guest crashing on 32 bit 3.1.2 HV
Last modified: 2008-11-17 03:25:05 EST
Created attachment 314288 [details]
Output from the domU console during the crash
Description of problem:
I'm running an i386 RHEL-5 dom0. On top of that, I successfully installed a F-9 i386 PV domU. After I upgraded the kernel in the F-9 guest to 2.6.27-0.256.rc3.git1.fc10.i686.PAE, I'm getting a kernel crash on boot (I'll attach a full console output). I also get exactly the same crash in a hand-built kernel from Linus' upstream tree (pulled as of 13-Aug-2008). During the crash, there are also some messages on the serial console of the dom0; I'll also include those as an attachment.
Created attachment 314289 [details]
Output from the dom0 serial console during the F-10 kernel bootup/crash
32 bit DomU crashing on 32 bit 3.1.2 HV
Also reported at:
1 multicall(s) failed: cpu 0 Pid: 0, comm: swapper Tainted: G W 2.6.27-rc1-xenU #4
call 1/1: op=1 arg=[c1203860] result=-22
this could be the relevant error:
(XEN) mm.c:516:d8 Could not get page ref for pfn 7fffffff
(XEN) mm.c:2372:d8 Could not get page for normal update
Might be worth seeing if it's something fixed with 3.1.4 HV
Yes, this is the same Xen bug as the other set_pud crash, I think. Though the Xen console messages aren't familiar.
(In reply to comment #3)
> Yes, this is the same Xen bug as the other set_pud crash, I think. Though the
> Xen console messages aren't familiar.
That was a 32-on-64 issue only, right?
Works fine with 3.2.0 HV
Hm, right, the other bug should only be 32-on-64. But you're saying that a new Xen fixes it anyway?
OK. It also seems to work fine with a 3.1.4 HV. I haven't had time to bisect it yet, but it seems as though we should get a fix for this issue in, so that we can boot F-10 kernels on RHEL-5. I'm changing the component to RHEL-5 as well, since it seems to be specific to the HV.
Ugh. I've just found out that this bug happens because of a patch we are carrying in RHEL-5 that is not in the 3.1.x stream; namely, the "mprotect" performance-enhancement patch. Interestingly enough, I believe these enhancements are also in the upstream (3.2.0) HV, so we must be missing a fix from there. I'll have to dig into it more.
Also interesting is that the mprotect() patch and the fix for bug #457879 (the 32-on-64 issue) conflict with each other. Think that just might be a coincidence, though.
Heh. I just realized 3.2 is fine because it doesn't have the mprotect batching fixes. We'll have to test with 3.3 to see how it fares (I assume it is OK, but we'll have to see).
OK. I just tested with a recently released 3.3.0 hypervisor, and I got the exact same crash I got with the RHEL-5 hypervisor. I'm still not sure whether the bug is in the guest kernel or the HV code, but it needs to be looked at and fixed either way.
Just to be clear: does this happen with any unmodified kernel/Xen combination, or only with some local RH patches in place?
Yes, this happens with basically any combination I've tried (except for the 3.2.0 hypervisor, which doesn't have the batched mprotect patches). Here are the combinations I've tried (all i386):
RHEL-5 HV (w/ mprotect patch) + RHEL-5 PV guest - good
RHEL-5 HV (w/ mprotect patch) + F-9 2.6.25 pv-ops guest - good
RHEL-5 HV (w/ mprotect patch) + F-10 2.6.27 pv-ops guest - crash
RHEL-5 HV (w/ mprotect patch) + 2.6.27 pv-ops (Linus' tree) guest - crash
Xensource 3.1.2 HV + F-10 2.6.27 pv-ops guest - good
Xensource 3.1.4 HV + F-10 2.6.27 pv-ops guest - good
Xensource 3.2.0 HV + F-10 2.6.27 pv-ops guest - good (according to markmc)
Xensource 3.3.0 HV + F-9 2.6.25 pv-ops guest - good
Xensource 3.3.0 HV + F-10 2.6.27 pv-ops guest - crash
Xensource 3.3.0 HV + 2.6.27 pv-ops (Linus' tree) guest - crash
So that last combination has no Red Hat patches at all; only the upstream Xensource HV and a 2.6.27-rc3 pv-ops kernel from Linus' tree.
Just as a quick update on this:
The hypercall is failing because the MFN that the guest passes down to the hypervisor is completely bogus. After adding some debugging, I found that the guest is asking to change protections on page_nr 7fffffff, whereas this machine only has a max_page of 0x400000 (16GB). What I don't quite understand, however, is why 3.2 would work; maybe it is missing a check that both 3.3 and the RHEL-5 HV have. I'm still investigating.
Ah, now I'm getting somewhere. This bug is memory dependent. I'm now trying to boot the 2.6.27 kernel with various amounts of memory, and I get different behaviour.
400MB = boot
512MB = boot
602MB = boot
768MB = crash
2000MB = boot
4000MB = boot
OK, interesting. The report you referred to was on a 64-bit hypervisor (I think), so this isn't a 32-on-32 specific problem.
The funny thing about the report is that the backtrace is to zap_low_mappings, which simply plugs empty_zero_page into the unused pgd slots. The only way that could have a bad mfn is if the pfn->mfn table is corrupted or incorrectly updated.
Is this bug still an issue?
I believe it is. Chris is out this week, so there won't be an answer until
As far as I know, yes. I've been pulled away to various other things for the time being, but I can reproduce this 100% with the combinations in comment #15 and a 768MB PV guest.
*** Bug 449566 has been marked as a duplicate of this bug. ***
OK. I've made a little progress here, but I still haven't found the real cause. I've added a bunch of debugging in the guest, and what is basically happening is that mm/mprotect.c:change_pte_range() calls ptep_modify_prot_commit(), which resolves to arch/x86/xen/mmu.c:xen_ptep_modify_prot_commit(). In there, we call virt_to_machine(ptep), which is where our woes start. The PTE address that is passed in is something like 0xf57a8500, which is run through __pa() to get 0x357a8500. Then we shift it by PAGE_SHIFT to get pfn 0x357a8, and call pfn_to_mfn() on that pfn. But since this machine only has 768MB of memory, it only has 0x30000 pages, which means that pfn_to_mfn() looks in the p2m_top array, returns INVALID_MFN, and it's all downhill from there.
So, the question becomes, why are we getting this bogus 0xf57a8500 address to begin with? What I believe is happening is that in mm/mprotect.c:change_pte_range(), pte_offset_map_lock() basically resolves down to a kmap_atomic(). So this page frame is mapped up in the kmap fixmap area, meaning that the linear mapping doesn't cover it and __pa() on it is meaningless. What I don't quite understand is why this doesn't happen with 512MB of memory, for instance. In any case, I'll keep digging on this.
Oh, and the reason that earlier kernels (say 2.6.25) don't exhibit this behavior is because they weren't using the "lazy" MMU updates at all, so xen_ptep_modify_prot_commit basically became "set_pte_at", and we never looked through the p2m table at all. Indeed, replacing the whole body of xen_ptep_modify_prot_commit with just a set_pte_at() seems to make it work just fine, but obviously doesn't take advantage of the batching.
Oh. So if it's kmap_atomic, and you have HIGHPTE, then a pagetable page will be kmapped. That means that the pfn->mfn lookup will need to do a full pagetable walk (arbitrary_virt_to_machine()).
Unfortunately that's relatively expensive. But arbitrary_virt_to_machine could special-case vaddrs in the linear mapping.
Bingo. Using arbitrary_virt_to_machine() in xen_ptep_modify_prot_commit() fixed it. So the next thing to do is to special case the vaddrs in the linear mapping like you said so that we can get the performance back. I'll look at that next.
Created attachment 320396 [details]
Patch to fix booting problem with 768MB of memory
This is the patch I've tested out, which seems to fix the bug for me. Is this the kind of thing you had in mind? If so, I'll do a little further testing on it and then submit it for you upstream.
Worried about the test against max_pfn because it precludes the use of sparsemem. I think using __virt_addr_valid(vaddr) is the right test.
Created attachment 320408 [details]
Patch to fix booting problem using virt_addr_valid
I sent the latest patch to Jeremy and CC'ed LKML and xen-devel, so switching to POST.
Now fixed in upstream, so closing out this bug.