Bug 441543

Summary: F-9 pv_ops xen: x86_64 pagetable pinning oops
Product: [Fedora] Fedora Reporter: Mark McLoughlin <markmc>
Component: kernel-xen-2.6Assignee: Eduardo Habkost <ehabkost>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: high    
Version: rawhideCC: berrange, ehabkost, xen-maint
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-xen-2.6-2.6.25-0.24.rc9.fc9 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-04-16 04:01:22 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Bug Depends On:    
Bug Blocks: 434756    
Attachments:
Description Flags
oops.log
none
oops.log
none
batch-update-va-mapping-and-pin.patch
none
Patch to avoid keeping a alias mapping for freed init memory pages none

Description Mark McLoughlin 2008-04-08 13:36:05 EDT
0.19 was being remarkably stable while I was doing yum updates and such, but
eventually got:

kernel BUG at arch/x86/xen/enlighten.c:807!
invalid opcode: 0000 [1] 
CPU 0 
Modules linked in: rfcomm l2cap bluetooth autofs4 sunrpc nf_conntrack_ipv4
ipt_REJECT iptable_filter ip_tables nf_conntrack_ipv6 xt_state nf_conntrack
xt_tcpudp ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables x_tables ipv6
loop dm_multipath xen_netfront pcspkr dm_snapshot dm_zero dm_mirror dm_mod
xen_blkfront ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 14718, comm: sh Not tainted 2.6.25-0.19.rc8.fc9.x86_64.xen #1
RIP: e030:[<ffffffff8020b7dd>]  [<ffffffff8020b7dd>] xen_alloc_ptpage+0x101/0x110
RSP: e02b:ffff880000085a78  EFLAGS: 00010282
RAX: 00000000ffffffea RBX: 000000000003005a RCX: 000000000000062e
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880000085a78
RBP: ffff880000085ab8 R08: 0000000000000002 R09: ffff880000b5b010
R10: 0000000000007ff0 R11: 0000000000000001 R12: ffff8800014c4000
R13: 0000000000000003 R14: ffff880000b5b010 R15: ffff88001f1ac3c0
FS:  00007fce05ff26f0(0000) GS:ffffffff805fc000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000043ad8d CR3: 0000000000186000 CR4: 0000000000002620
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sh (pid: 14718, threadinfo ffff880000084000, task ffff880000dd0000)
Stack:  ffff880000000000 000000000003005a ffff88000fa56af0 ffff88000fa56ad8
 ffff880000b5b010 ffff8800014d9a10 ffff88000fa56a00 ffff88000fa56ad8
 ffff880000085ac8 ffffffff8020b80a ffff880000085af8 ffffffff80284fdd
Call Trace:
 [<ffffffff8020b80a>] xen_alloc_pt+0xe/0x10
 [<ffffffff80284fdd>] __pte_alloc+0x105/0x1ad
 [<ffffffff802851dd>] handle_mm_fault+0x158/0x899
 [<ffffffff802500f6>] ? lock_acquired+0x273/0x283
 [<ffffffff8048982e>] ? do_page_fault+0x649/0xb4e
 [<ffffffff8024fd57>] ? lock_acquire+0xb3/0xc3
 [<ffffffff8048982e>] ? do_page_fault+0x649/0xb4e
 [<ffffffff804898e5>] do_page_fault+0x700/0xb4e
 [<ffffffff8048673d>] ? trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff80211591>] ? restore_args+0x0/0x30
 [<ffffffff804867ae>] ? lockdep_sys_exit_thunk+0x35/0x67
 [<ffffffff8048741b>] error_exit+0x0/0xb9


Code: 7f ff ff ff ff 48 8b 1c c8 48 21 d3 48 89 5d c8 48 8d 7d c0 be 01 00 00 00
31 d2 49 c7 c2 f0 7f 00 00 e8 67 eb 43 00 85 c0 74 04 <0f> 0b eb fe 48 83 c4 28
5b 41 5c 41 5d c9 c3 55 ba 02 00 00 00 
RIP  [<ffffffff8020b7dd>] xen_alloc_ptpage+0x101/0x110
 RSP <ffff880000085a78>
Comment 1 Eduardo Habkost 2008-04-08 13:55:23 EDT
Do you have the 'xm dmesg' output?
Comment 2 Mark McLoughlin 2008-04-08 14:10:43 EDT
No, sorry. It's gone. What are you looking for in particular?
Comment 3 Eduardo Habkost 2008-04-08 14:38:21 EDT
The pinning hypercall failed, so the hypervisor should dump a message explaining
why pinning failed.
Comment 4 Mark McLoughlin 2008-04-09 08:16:31 EDT
Created attachment 301790 [details]
oops.log

Sorry, should have included this in the first place

Looks to me like we're trying to setup a process's pagetables, but failing to
pin the PTE pages because they're writable

Strange ... in what cases would making the PTE pages readonly fail?
Comment 5 Eduardo Habkost 2008-04-09 10:36:45 EDT
(In reply to comment #4)
> 
> Strange ... in what cases would making the PTE pages readonly fail?

That's the question. The update_va_mapping() hypercall doesn't seem to be
failing and I don't see any obvious problem on make_lowmem_page_readonly().
AFAICS, it may fail to change the page type to readonly if the same mfn is being
mapped rw at another virtual address, or maybe mapped by another domain.
Comment 6 Mark McLoughlin 2008-04-09 12:16:41 EDT
Created attachment 301840 [details]
oops.log

Reproduced a similarish oops by just running a heap of processes concurrently.
Failed to reproduce it again, though
Comment 7 Mark McLoughlin 2008-04-09 12:32:19 EDT
FYI:

(gdb) l *0xffffffff80283d57
0xffffffff80283d57 is in do_wp_page (include/asm/pgtable_64.h:73).
70      static inline void native_pte_clear(struct mm_struct *mm, unsigned long
addr,
71                                          pte_t *ptep)
72      {
73              *ptep = native_make_pte(0);
74      }
(gdb) l *0xffffffff8028584f
0xffffffff8028584f is in handle_mm_fault (mm/memory.c:2484).
2482            if (write_access) {
2483                    if (!pte_write(entry))
2484                            return do_wp_page(mm, vma, address,
2485                                            pte, pmd, ptl, entry);
Comment 8 Mark McLoughlin 2008-04-10 09:25:50 EDT
Also reproduced this during a guest install with kernel-xen-2.6.25-0.19.rc8.fc9
mid-way during package install
Comment 9 Mark McLoughlin 2008-04-10 15:39:24 EDT
Okay, I've got a very reliable reproducer ... just allocate a pile of memory:

while true; do (python -c 'import time; s="s"*(1<<28); time.sleep(1)' &); done
Comment 10 Mark McLoughlin 2008-04-11 07:41:47 EDT
Btw, this:

  Bad type (saw 00000000e8000001 != exp 0000000020000000) for mfn 4904c (pfn 61e)

can be decoded as:

  0xe8000001 == PGT_writable_page | PGT_validated | _PAGE_PRESENT
  0x20000000 == PGT_l1_page_table

  
Comment 11 Mark McLoughlin 2008-04-11 08:14:08 EDT
Created attachment 302107 [details]
batch-update-va-mapping-and-pin.patch

Okay, this patch batches together the pte update and pinning, which moves the
bug from alloc_ptpage() to the likes of exit_mm()/alloc_pages()

Clearly a concurrency issue, then ... still looking
Comment 12 Mark McLoughlin 2008-04-11 10:04:52 EDT
Nevermind that theory - all the patch really does is remove the error checking
so we don't hit the BUG()

I still see the likes of:

  Bad type (saw 00000000e8000001 != exp 0000000020000000) for mfn 4902e (pfn 63c)
  Error while pinning mfn 4902e
  Bad type (saw 00000000e8000001 != exp 0000000020000000) for mfn 4902e (pfn 63c)
  Attempt to create linear p.t. with write perms

before hitting the eventual oops

So, what the batching shows us that this *isn't* a concurrency issue - the pte
update and pinning happen in the same hypercall, yet the pte update seems to
have failed
Comment 13 Eduardo Habkost 2008-04-11 10:19:34 EDT
My current theory is that the page was mapped rw by userspace but either it was
freed but we didn't tell the hypervisor to drop the userspace pagetable and/or
the rw reference to the page, or we the page is being allocated twice.

I have hacked the hypervisor to keep track of what pagetables are keeping the rw
reference count up and dump this at the pinning error. Next step will be
checking why the kernel is keeping those references, using this information.
Comment 14 Eduardo Habkost 2008-04-11 12:57:29 EDT
Just confirmed it: when the bug occurs, there is a still-pinned pagetable
pointing to the page we are trying to pin.

Now I am checking when it was pinned and why it was not unpinned. Or maybe, why
it was allocated while in use (I think it is unlikely, I have not discarded this
possibility).
Comment 15 Mark McLoughlin 2008-04-15 11:17:45 EDT
I was able to narrow down where the stale PTE entry for this page was created
and then in xen/arch/x86/mmu.c:do_mmu_update(), I added:

    case PGT_l1_page_table:
    {
        l1_pgentry_t l1e = l1e_from_intpte(req.val);
+       MEM_LOG("mmu update of L1 page table; mfn = %lx, val = %lx", mfn, req.val);
        okay = mod_l1_entry(va, l1e, mfn,
        ...

and saw:

(XEN) mm.c:2576:d0 mmu update of L1 page table; mfn = 31486, val = 8000000030b46027

then later when triggering the oops:

(XEN) mm.c:1948:d1 Bad type (saw 00000000e8000001 != exp 0000000020000000) for
mfn 30b46 (pfn 608)
(XEN) mm.c:2258:d1 Error while pinning mfn 30b46

So, it seems the stale PTE entry is in a page table created by the domain
builder in dom0 and we're not properly freeing that in the guest.
Comment 16 Eduardo Habkost 2008-04-15 15:22:30 EDT
(this was supposed to be posted a few hours ago, but bugzilla was waiting on 
the login form. duh)

I have noticed two problems:

- pgd_free() needs to unpin the PGD and remap it as read-write
- I could dump all pages that kept the read-write reference count of the page 
that failed to be pinned, and noticed that the page was mapped as read-write 
on address 0xFFFFFFFF80600. This is before &_end, so it seems to be kernel 
code or data that was freed on free_init_pages(). We need to clear the 
__START_KERNEL_map mapping of the page on free_init_pages()


However I am not being able to reproduce it in the last few hours (and I 
haven't fixed pgd_free() and free_init_pages() yet). Probably the moon phase 
have changed today.

[some time later] I have found that the bug is not reproduced here when 
CONFIG_DEBUG_PAGEALLOC is set.
Comment 17 Eduardo Habkost 2008-04-15 15:33:35 EDT
Declaring a large (100 MB) __initdata array makes the bug reproducible very 
quickly when running a fork bomb.

I am fixing free_init_pages() to behave like in the old linux-2.6.22 xen 
patch: set the pages as not present if they are after __START_KERNEL_map.
Comment 18 Eduardo Habkost 2008-04-15 17:17:28 EDT
Created attachment 302525 [details]
Patch to avoid keeping a alias mapping for freed init memory pages

The attached patch should fix the problem I have been reproducing here.
pgd_free() fix is still pending, however.
Comment 19 Eduardo Habkost 2008-04-15 18:57:43 EDT
kernel-xen-2.6-2.6.25-0.24.rc9.fc9 was submitted to build, with the 
free_init_pages() and pgd_free() fixes.
Comment 20 Eduardo Habkost 2008-04-15 20:22:20 EDT
Mark, are you able to reproduce the bug on your side with 
kernel-xen-2.6-2.6.25-0.24.rc9.fc9? 
(http://koji.fedoraproject.org/koji/taskinfo?taskID=568088)

I plan to test the RPM before asking for tagging, but my ISP isn't helping me, 
so I will be able test it just tomorrow (maybe after you already tested it).
Comment 21 Mark McLoughlin 2008-04-16 04:01:22 EDT
Looking good so far:

  - Can't reproduce the oops with the reproducer that worked before
  - Boots fine on 32 and 64 bit
  - Decent sized yum updates completed on 32 and 64 bit
  - Running test installs of both 32 and 64 bit atm