0.19 was being remarkably stable while I was doing yum updates and such, but eventually got:

kernel BUG at arch/x86/xen/enlighten.c:807!
invalid opcode: 0000 [1]
CPU 0
Modules linked in: rfcomm l2cap bluetooth autofs4 sunrpc nf_conntrack_ipv4 ipt_REJECT iptable_filter ip_tables nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables x_tables ipv6 loop dm_multipath xen_netfront pcspkr dm_snapshot dm_zero dm_mirror dm_mod xen_blkfront ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 14718, comm: sh Not tainted 2.6.25-0.19.rc8.fc9.x86_64.xen #1
RIP: e030:[<ffffffff8020b7dd>]  [<ffffffff8020b7dd>] xen_alloc_ptpage+0x101/0x110
RSP: e02b:ffff880000085a78  EFLAGS: 00010282
RAX: 00000000ffffffea RBX: 000000000003005a RCX: 000000000000062e
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880000085a78
RBP: ffff880000085ab8 R08: 0000000000000002 R09: ffff880000b5b010
R10: 0000000000007ff0 R11: 0000000000000001 R12: ffff8800014c4000
R13: 0000000000000003 R14: ffff880000b5b010 R15: ffff88001f1ac3c0
FS:  00007fce05ff26f0(0000) GS:ffffffff805fc000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000043ad8d CR3: 0000000000186000 CR4: 0000000000002620
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sh (pid: 14718, threadinfo ffff880000084000, task ffff880000dd0000)
Stack:  ffff880000000000 000000000003005a ffff88000fa56af0 ffff88000fa56ad8
 ffff880000b5b010 ffff8800014d9a10 ffff88000fa56a00 ffff88000fa56ad8
 ffff880000085ac8 ffffffff8020b80a ffff880000085af8 ffffffff80284fdd
Call Trace:
 [<ffffffff8020b80a>] xen_alloc_pt+0xe/0x10
 [<ffffffff80284fdd>] __pte_alloc+0x105/0x1ad
 [<ffffffff802851dd>] handle_mm_fault+0x158/0x899
 [<ffffffff802500f6>] ? lock_acquired+0x273/0x283
 [<ffffffff8048982e>] ? do_page_fault+0x649/0xb4e
 [<ffffffff8024fd57>] ? lock_acquire+0xb3/0xc3
 [<ffffffff8048982e>] ? do_page_fault+0x649/0xb4e
 [<ffffffff804898e5>] do_page_fault+0x700/0xb4e
 [<ffffffff8048673d>] ? trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff80211591>] ? restore_args+0x0/0x30
 [<ffffffff804867ae>] ? lockdep_sys_exit_thunk+0x35/0x67
 [<ffffffff8048741b>] error_exit+0x0/0xb9
Code: 7f ff ff ff ff 48 8b 1c c8 48 21 d3 48 89 5d c8 48 8d 7d c0 be 01 00 00 00 31 d2 49 c7 c2 f0 7f 00 00 e8 67 eb 43 00 85 c0 74 04 <0f> 0b eb fe 48 83 c4 28 5b 41 5c 41 5d c9 c3 55 ba 02 00 00 00
RIP  [<ffffffff8020b7dd>] xen_alloc_ptpage+0x101/0x110
 RSP <ffff880000085a78>
Do you have the 'xm dmesg' output?
No, sorry. It's gone. What are you looking for in particular?
The pinning hypercall failed, so the hypervisor should dump a message explaining why pinning failed.
Created attachment 301790 [details]
oops.log

Sorry, should have included this in the first place.

Looks to me like we're trying to set up a process's pagetables, but failing to pin the PTE pages because they're writable.

Strange ... in what cases would making the PTE pages readonly fail?
(In reply to comment #4)
> Strange ... in what cases would making the PTE pages readonly fail?

That's the question. The update_va_mapping() hypercall doesn't seem to be failing, and I don't see any obvious problem in make_lowmem_page_readonly(). AFAICS, it may fail to change the page type to readonly if the same mfn is being mapped rw at another virtual address, or is maybe mapped by another domain.
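To illustrate that failure mode, here is a toy Python model (not Xen code; all names and the refcounting rule's shape are simplifications) of the constraint that a frame cannot be pinned as a page table while any read-write mapping of it remains:

```python
# Toy model (hypothetical, not Xen source): the hypervisor refuses to pin a
# frame as an L1 page table while any read-write mapping of the frame exists.
writable_refs = {}                      # mfn -> count of rw mappings

def map_rw(mfn):
    writable_refs[mfn] = writable_refs.get(mfn, 0) + 1

def unmap_rw(mfn):
    writable_refs[mfn] -= 1

def pin_as_l1(mfn):
    """Pinning succeeds only once no rw mapping of the frame remains."""
    return writable_refs.get(mfn, 0) == 0

map_rw(0x4904c)             # a stale rw mapping at some other virtual address
print(pin_as_l1(0x4904c))   # False - this is the "Bad type" / pinning failure
unmap_rw(0x4904c)           # drop the alias
print(pin_as_l1(0x4904c))   # True - pinning can now succeed
```

In this model, the oops corresponds to the first call returning False: some mapping that should have been torn down is still holding the rw reference count up.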
Created attachment 301840 [details]
oops.log

Reproduced a similar oops just by running a heap of processes concurrently. Failed to reproduce it again, though.
FYI:

(gdb) l *0xffffffff80283d57
0xffffffff80283d57 is in do_wp_page (include/asm/pgtable_64.h:73).
70      static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
71                                          pte_t *ptep)
72      {
73              *ptep = native_make_pte(0);
74      }
(gdb) l *0xffffffff8028584f
0xffffffff8028584f is in handle_mm_fault (mm/memory.c:2484).
2482            if (write_access) {
2483                    if (!pte_write(entry))
2484                            return do_wp_page(mm, vma, address,
2485                                              pte, pmd, ptl, entry);
Also reproduced this during a guest install with kernel-xen-2.6.25-0.19.rc8.fc9, midway through the package install.
Okay, I've got a very reliable reproducer ... just allocate a pile of memory:

  while true; do (python -c 'import time; s="s"*(1<<28); time.sleep(1)' &); done
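For reference, the size each child allocates before sleeping (a quick sanity check, nothing more):

```python
# The reproducer builds the string "s" * (1 << 28) in each spawned python
# process, i.e. one byte per character.
size = 1 << 28
print(size)               # 268435456 bytes
print(size // (1 << 20))  # 256 MiB per process
```

So looping this quickly exhausts guest memory with many short-lived 256 MiB processes, which is what makes it such an effective trigger for pagetable setup/teardown races.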
Btw, this:

  Bad type (saw 00000000e8000001 != exp 0000000020000000) for mfn 4904c (pfn 61e)

can be decoded as:

  0xe8000001 == PGT_writable_page | PGT_validated | _PAGE_PRESENT
  0x20000000 == PGT_l1_page_table
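The mismatch can be checked mechanically with a small Python sketch. The mask below is an assumption about that era's Xen layout (the high bits of the type word holding the page type, the low bits a type reference count); it is for illustration, not copied from Xen headers:

```python
# Split the type word from the "Bad type" message into type bits and the
# remaining low bits. TYPE_MASK is an assumption: the top three bits are
# treated as the page type.
TYPE_MASK = 0xE0000000
saw, expected = 0xE8000001, 0x20000000
print(hex(saw & TYPE_MASK))        # type bits of what the hypervisor saw
print(hex(expected & TYPE_MASK))   # type bits it expected (PGT_l1_page_table)
print((saw & TYPE_MASK) == (expected & TYPE_MASK))  # False: wrong page type
```

The point is simply that the frame still carries the writable-page type when the kernel asks Xen to treat it as an L1 page table.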
Created attachment 302107 [details]
batch-update-va-mapping-and-pin.patch

Okay, this patch batches together the pte update and pinning, which moves the bug from alloc_ptpage() to the likes of exit_mm()/alloc_pages(). Clearly a concurrency issue, then ... still looking.
Nevermind that theory - all the patch really does is remove the error checking so we don't hit the BUG(). I still see the likes of:

  Bad type (saw 00000000e8000001 != exp 0000000020000000) for mfn 4902e (pfn 63c)
  Error while pinning mfn 4902e
  Bad type (saw 00000000e8000001 != exp 0000000020000000) for mfn 4902e (pfn 63c)
  Attempt to create linear p.t. with write perms

before hitting the eventual oops.

So, what the batching shows us is that this *isn't* a concurrency issue - the pte update and pinning happen in the same hypercall, yet the pte update seems to have failed.
My current theory is that the page was mapped rw by userspace, but either it was freed without telling the hypervisor to drop the userspace pagetable and/or the rw reference to the page, or the page is being allocated twice. I have hacked the hypervisor to keep track of which pagetables are keeping the rw reference count up and to dump this at the pinning error. The next step will be checking why the kernel is keeping those references, using this information.
Just confirmed it: when the bug occurs, there is a still-pinned pagetable pointing to the page we are trying to pin. Now I am checking when it was pinned and why it was not unpinned. Or maybe why it was allocated while in use (I think that is unlikely, but I have not discarded the possibility).
I was able to narrow down where the stale PTE entry for this page was created. In xen/arch/x86/mm.c:do_mmu_update(), I added:

        case PGT_l1_page_table:
        {
            l1_pgentry_t l1e = l1e_from_intpte(req.val);
+           MEM_LOG("mmu update of L1 page table; mfn = %lx, val = %lx", mfn, req.val);
            okay = mod_l1_entry(va, l1e, mfn,
    ...

and saw:

  (XEN) mm.c:2576:d0 mmu update of L1 page table; mfn = 31486, val = 8000000030b46027

then later, when triggering the oops:

  (XEN) mm.c:1948:d1 Bad type (saw 00000000e8000001 != exp 0000000020000000) for mfn 30b46 (pfn 608)
  (XEN) mm.c:2258:d1 Error while pinning mfn 30b46

So, it seems the stale PTE entry is in a page table created by the domain builder in dom0, and we're not properly freeing that in the guest.
(this was supposed to be posted a few hours ago, but bugzilla was waiting on the login form. duh)

I have noticed two problems:

- pgd_free() needs to unpin the PGD and remap it as read-write
- I could dump all the pages that kept the read-write reference count of the page that failed to be pinned, and noticed that the page was mapped as read-write at address 0xFFFFFFFF80600. This is before &_end, so it seems to be kernel code or data that was freed in free_init_pages(). We need to clear the __START_KERNEL_map mapping of the page in free_init_pages()

However, I have not been able to reproduce it in the last few hours (and I haven't fixed pgd_free() and free_init_pages() yet). Probably the moon phase has changed today.

[some time later]

I have found that the bug is not reproduced here when CONFIG_DEBUG_PAGEALLOC is set.
Declaring a large (100 MB) __initdata array makes the bug reproducible very quickly when running a fork bomb. I am fixing free_init_pages() to behave as it did in the old linux-2.6.22 xen patch: mark the pages not-present if they are after __START_KERNEL_map.
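A sketch of the condition involved, assuming x86-64's __START_KERNEL_map value from that era (0xffffffff80000000, an assumption here); init pages whose kernel-text alias lies at or above it must have that alias mapping cleared when freed:

```python
# Hypothetical helper mirroring the fix's condition: freed init pages whose
# virtual address falls in the __START_KERNEL_map alias region must have the
# alias mapping marked not-present, or Xen keeps counting an rw reference.
START_KERNEL_MAP = 0xffffffff80000000   # assumed x86-64 value for 2.6.25

def alias_needs_clearing(vaddr):
    return vaddr >= START_KERNEL_MAP

print(alias_needs_clearing(0xffffffff80600000))  # True: kernel-text alias
print(alias_needs_clearing(0xffff880000085a78))  # False: direct mapping
```

This is why the large __initdata array is such a good trigger: it guarantees many pages in exactly that region get freed and later recycled as pagetable pages.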
Created attachment 302525 [details]
Patch to avoid keeping an alias mapping for freed init memory pages

The attached patch should fix the problem I have been reproducing here. The pgd_free() fix is still pending, however.
kernel-xen-2.6-2.6.25-0.24.rc9.fc9 was submitted to build, with the free_init_pages() and pgd_free() fixes.
Mark, are you able to reproduce the bug on your side with kernel-xen-2.6-2.6.25-0.24.rc9.fc9? (http://koji.fedoraproject.org/koji/taskinfo?taskID=568088)

I plan to test the RPM before asking for tagging, but my ISP isn't helping me, so I will only be able to test it tomorrow (maybe after you have already tested it).
Looking good so far:

- Can't reproduce the oops with the reproducer that worked before
- Boots fine on 32 and 64 bit
- Decent sized yum updates completed on 32 and 64 bit
- Running test installs of both 32 and 64 bit atm