Bug 441543
Summary: | F-9 pv_ops xen: x86_64 pagetable pinning oops | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Mark McLoughlin <markmc> | ||||||||||
Component: | kernel-xen-2.6 | Assignee: | Eduardo Habkost <ehabkost> | ||||||||||
Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||||||
Severity: | high | Docs Contact: | |||||||||||
Priority: | high | ||||||||||||
Version: | rawhide | CC: | berrange, ehabkost, xen-maint | ||||||||||
Target Milestone: | --- | ||||||||||||
Target Release: | --- | ||||||||||||
Hardware: | All | ||||||||||||
OS: | Linux | ||||||||||||
Whiteboard: | |||||||||||||
Fixed In Version: | kernel-xen-2.6-2.6.25-0.24.rc9.fc9 | Doc Type: | Bug Fix | ||||||||||
Doc Text: | Story Points: | --- | |||||||||||
Clone Of: | Environment: | ||||||||||||
Last Closed: | 2008-04-16 08:01:22 UTC | Type: | --- | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Bug Depends On: | |||||||||||||
Bug Blocks: | 434756 | ||||||||||||
Attachments: |
|
Description
Mark McLoughlin
2008-04-08 17:36:05 UTC
Do you have the 'xm dmesg' output? No, sorry. It's gone. What are you looking for in particular? The pinning hypercall failed, so the hypervisor should dump a message explaining why pinning failed. Created attachment 301790 [details]
oops.log
Sorry, should have included this in the first place
Looks to me like we're trying to setup a process's pagetables, but failing to
pin the PTE pages because they're writable
Strange ... in what cases would making the PTE pages readonly fail?
(In reply to comment #4) > > Strange ... in what cases would making the PTE pages readonly fail? That's the question. The update_va_mapping() hypercall doesn't seem to be failing and I don't see any obvious problem on make_lowmem_page_readonly(). AFAICS, it may fail to change the page type to readonly if the same mfn is being mapped rw at another virtual address, or maybe mapped by another domain. Created attachment 301840 [details]
oops.log
Reproduced a similarish oops by just running a heap of processes concurrently.
Failed to reproduce it again, though
FYI: (gdb) l *0xffffffff80283d57 0xffffffff80283d57 is in do_wp_page (include/asm/pgtable_64.h:73). 70 static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr, 71 pte_t *ptep) 72 { 73 *ptep = native_make_pte(0); 74 } (gdb) l *0xffffffff8028584f 0xffffffff8028584f is in handle_mm_fault (mm/memory.c:2484). 2482 if (write_access) { 2483 if (!pte_write(entry)) 2484 return do_wp_page(mm, vma, address, 2485 pte, pmd, ptl, entry); Also reproduced this during a guest install with kernel-xen-2.6.25-0.19.rc8.fc9 mid-way during package install Okay, I've got a very reliable reproducer ... just allocate a pile of memory: while true; do (python -c 'import time; s="s"*(1<<28); time.sleep(1)' &); done Btw, this: Bad type (saw 00000000e8000001 != exp 0000000020000000) for mfn 4904c (pfn 61e) can be decoded as: 0xe8000001 == PGT_writable_page | PGT_validated | _PAGE_PRESENT 0x20000000 == PGT_l1_page_table Created attachment 302107 [details]
batch-update-va-mapping-and-pin.patch
Okay, this patch batches together the pte update and pinning, which moves the
bug from alloc_ptpage() to the likes of exit_mm()/alloc_pages()
Clearly a concurrency issue, then ... still looking
Nevermind that theory - all the patch really does is remove the error checking so we don't hit the BUG() I still see the likes of: Bad type (saw 00000000e8000001 != exp 0000000020000000) for mfn 4902e (pfn 63c) Error while pinning mfn 4902e Bad type (saw 00000000e8000001 != exp 0000000020000000) for mfn 4902e (pfn 63c) Attempt to create linear p.t. with write perms before hitting the eventual oops So, what the batching shows us that this *isn't* a concurrency issue - the pte update and pinning happen in the same hypercall, yet the pte update seems to have failed My current theory is that the page was mapped rw by userspace but either it was freed but we didn't tell the hypervisor to drop the userspace pagetable and/or the rw reference to the page, or we the page is being allocated twice. I have hacked the hypervisor to keep track of what pagetables are keeping the rw reference count up and dump this at the pinning error. Next step will be checking why the kernel is keeping those references, using this information. Just confirmed it: when the bug occurs, there is a still-pinned pagetable pointing to the page we are trying to pin. Now I am checking when it was pinned and why it was not unpinned. Or maybe, why it was allocated while in use (I think it is unlikely, I have not discarded this possibility). I was able to narrow down where the stale PTE entry for this page was created and then in xen/arch/x86/mmu.c:do_mmu_update(), I added: case PGT_l1_page_table: { l1_pgentry_t l1e = l1e_from_intpte(req.val); + MEM_LOG("mmu update of L1 page table; mfn = %lx, val = %lx", mfn, req.val); okay = mod_l1_entry(va, l1e, mfn, ... and saw: (XEN) mm.c:2576:d0 mmu update of L1 page table; mfn = 31486, val = 8000000030b46027 then later when triggering the oops: (XEN) mm.c:1948:d1 Bad type (saw 00000000e8000001 != exp 0000000020000000) for mfn 30b46 (pfn 608) (XEN) mm.c:2258:d1 Error while pinning mfn 30b46 So, it seems the stale PTE entry is in a page table created by the domain builder in dom0 and we're not properly freeing that in the guest. (this was supposed to be posted a few hours ago, but bugzilla was waiting on the login form. duh) I have noticed two problems: - pgd_free() needs to unpin the PGD and remap it as read-write - I could dump all pages that kept the read-write reference count of the page that failed to be pinned, and noticed that the page was mapped as read-write on address 0xFFFFFFFF80600. This is before &_end, so it seems to be kernel code or data that was freed on free_init_pages(). We need to clear the __START_KERNEL_map mapping of the page on free_init_pages() However I am not being able to reproduce it in the last few hours (and I haven't fixed pgd_free() and free_init_pages() yet). Probably the moon phase have changed today. [some time later] I have found that the bug is not reproduced here when CONFIG_DEBUG_PAGEALLOC is set. Declaring a large (100 MB) __initdata array makes the bug reproducible very quickly when running a fork bomb. I am fixing free_init_pages() to behave like in the old linux-2.6.22 xen patch: set the pages as not present if they are after __START_KERNEL_map. Created attachment 302525 [details]
Patch to avoid keeping a alias mapping for freed init memory pages
The attached patch should fix the problem I have been reproducing here.
pgd_free() fix is still pending, however.
kernel-xen-2.6-2.6.25-0.24.rc9.fc9 was submitted to build, with the free_init_pages() and pgd_free() fixes. Mark, are you able to reproduce the bug on your side with kernel-xen-2.6-2.6.25-0.24.rc9.fc9? (http://koji.fedoraproject.org/koji/taskinfo?taskID=568088) I plan to test the RPM before asking for tagging, but my ISP isn't helping me, so I will be able test it just tomorrow (maybe after you already tested it). Looking good so far: - Can't reproduce the oops with the reproducer that worked before - Boots fine on 32 and 64 bit - Decent sized yum updates completed on 32 and 64 bit - Running test installs of both 32 and 64 bit atm |