Bug 730503
| Summary: | RHEL 6.1 xen guest crashes with kernel BUG at arch/x86/xen/mmu.c:1457! |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Component: | kernel |
| Version: | 6.1 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Status: | CLOSED ERRATA |
| Severity: | medium |
| Priority: | medium |
| Target Milestone: | rc |
| Keywords: | Regression |
| Reporter: | Michael Young <m.a.young> |
| Assignee: | Igor Mammedov <imammedo> |
| QA Contact: | Virtualization Bugs <virt-bugs> |
| CC: | anshockm, drjones, imammedo, leiwang, pbonzini, qguan, qwan |
| Fixed In Version: | kernel-2.6.32-193.el6 |
| Doc Type: | Bug Fix |
| Bug Blocks: | 653816 |
| Attachments: | vmalloc: eagerly clear ptes on vunmap (attachment 519022) |
| Last Closed: | 2011-12-06 14:04:14 UTC |
Description (Michael Young, 2011-08-13 20:37:06 UTC):

The guest crashes with "kernel BUG at arch/x86/xen/mmu.c:1457!". The failing check is in pin_pagetable_pfn():

    1451 static void pin_pagetable_pfn(unsigned cmd, unsigned long pfn)
    1452 {
    1453         struct mmuext_op op;
    1454         op.cmd = cmd;
    1455         op.arg1.mfn = pfn_to_mfn(pfn);
    1456         if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
    1457                 BUG();
    1458 }
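For orientation, here is a minimal annotated sketch of the same pin request, assuming the standard Xen interface headers of a 2.6.32-era kernel (`struct mmuext_op`, `HYPERVISOR_mmuext_op()`, `MMUEXT_PIN_L1_TABLE`, `pfn_to_mfn()`). The wrapper name `pin_l1_table` is hypothetical; this illustrates what the hypercall asks Xen to do and is not the exact RHEL source:

```c
#include <xen/interface/xen.h>   /* struct mmuext_op, MMUEXT_PIN_L1_TABLE */
#include <asm/xen/hypercall.h>   /* HYPERVISOR_mmuext_op() */
#include <asm/xen/page.h>        /* pfn_to_mfn() */

/* Ask the hypervisor to validate and pin one guest page as an L1 page
 * table.  Xen walks the page's contents, checks every entry, and refuses
 * (nonzero return) if the frame still holds a conflicting type, for
 * example a leftover writable mapping.  pin_pagetable_pfn() turns any
 * such refusal into BUG(), which is the crash seen in this report. */
static int pin_l1_table(unsigned long pfn)   /* hypothetical helper */
{
	struct mmuext_op op = {
		.cmd      = MMUEXT_PIN_L1_TABLE,
		.arg1.mfn = pfn_to_mfn(pfn),  /* guest pfn -> machine frame */
	};

	/* one op, no success counter, act on our own domain */
	return HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF);
}
```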
---

We need to figure out why XenServer 5.6sp2 is returning an error on this hypercall. Please also file a bug with Citrix.
---

Please attach the output of "xm dmesg" after a crash. Thanks!

---

It isn't xm in XenServer; however, `xe host-dmesg host=servername` gives the following for one guest session:

    (XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) mm.c:2319:d46 Bad type (saw e800000000000004 != exp 2000000000000000) for mfn 833d7e (pfn 1eb5f5)
    (XEN) mm.c:2708:d46 Error while pinning mfn 833d7e
    (XEN) grant_table.c:1408:d0 dest domain 46 dying
    (XEN) grant_table.c:1408:d0 dest domain 46 dying

---

This is interesting:

> (XEN) mm.c:2319:d46 Bad type (saw e800000000000004 != exp 2000000000000000) for
> mfn 833d7e (pfn 1eb5f5)
> (XEN) mm.c:2708:d46 Error while pinning mfn 833d7e

This thread looks like it could be discussing the same issue:

http://lists.xensource.com/archives/html/xen-devel/2011-03/msg01367.html

---

(In reply to comment #6)
> This is interesting
>
> > (XEN) mm.c:2319:d46 Bad type (saw e800000000000004 != exp 2000000000000000) for
> > mfn 833d7e (pfn 1eb5f5)
> > (XEN) mm.c:2708:d46 Error while pinning mfn 833d7e
>
> This thread looks like it could be discussing the same issue
>
> http://lists.xensource.com/archives/html/xen-devel/2011-03/msg01367.html

Except that is a crash on boot, whereas mine can stay up for a few days.

---

There is also the long thread containing http://lists.xensource.com/archives/html/xen-devel/2011-01/msg00171.html, though I think you might already have the patch for that.

---

(In reply to comment #7)
> There is also the long thread containing
> http://lists.xensource.com/archives/html/xen-devel/2011-01/msg00171.html though
> I think you might already have the patch for that.

Actually, I think I am wrong about you having the patch. http://lists.xensource.com/archives/html/xen-devel/2011-02/msg01293.html, later in the thread, mentions two patches, so I think I was checking the wrong patch in my earlier comment.

---

Michael,

Would you care to test a new kernel with the patches mentioned in http://lists.xensource.com/archives/html/xen-devel/2011-02/msg01293.html if I built it for you?

---

Yes, I will test it. I have just been creating a copy of the virtual server that was crashing for testing purposes.

---

A new kernel build with the "vmalloc: eagerly clear ptes on vunmap" patch:

http://people.redhat.com/imammedo/kernel-2.6.32-131.0.1.el6.test.x86_64.rpm

Please test.
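As background on what that patch changes: with lazy vmap teardown, a vunmap'd range keeps its page-table entries until a deferred purge, so the freed pages briefly retain writable aliases, and under Xen such an alias blocks pinning the page as a page table. Below is a conceptual paraphrase of the upstream change, written as it would sit inside mm/vmalloc.c of this kernel generation (`vunmap_page_range()` and `free_vmap_area_noflush()` are file-local helpers there; the function name here is illustrative, not the literal diff):

```c
/* Conceptual paraphrase of "vmalloc: eagerly clear ptes on vunmap". */
static void free_unmap_vmap_area_sketch(struct vmap_area *va)
{
	/* After the patch: tear down the ptes of the range immediately,
	 * at vunmap time.  Before the patch this was deferred to the
	 * lazy purge, leaving the freed pages with live writable
	 * mappings; if such a page was recycled as a page table, Xen's
	 * pin hypercall saw the extra writable reference and failed
	 * ("Bad type ... Error while pinning"), which the guest turned
	 * into the BUG() at mmu.c:1457. */
	vunmap_page_range(va->va_start, va->va_end);

	/* Only the TLB flush remains batched and lazy: deferring the
	 * flush is safe, deferring the pte clear under Xen is not. */
	free_vmap_area_noflush(va);
}
```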
---

(In reply to comment #13)
> A new kernel build with "vmalloc: eagerly clear ptes on vunmap" patch:
>
> http://people.redhat.com/imammedo/kernel-2.6.32-131.0.1.el6.test.x86_64.rpm
>
> Please test.

Perfect! I could trigger the bug easily within 2 minutes yesterday, and with this new kernel the system has now been running fine for 30 minutes.

My environment before: kernel 2.6.32-131.6.1.el6.x86_64 xen (pv) guests on Citrix XenServer (5.6sp2) running dovecot-2.0.9-2.el6.x86_64. Using this system as the back end for imapproxy on a SquirrelMail webmail host triggered the bug within 2 minutes.

---

(In reply to comment #14)
> Perfect!
> I could trigger the bug easily within 2 minutes yesterday and with this new
> kernel the system is running fine for 30 min now.

Thanks for verifying.

Did you have any NFS-related I/O activity when the bug was reproduced?

---

(In reply to comment #15)
> Do you have any NFS related io activity when bug was reproduced?

Yes, the dovecot mail directories are on Netapp file servers connected by NFS:

    fs5.serv.uni-osnabrueck.de:/vol/MailStaff /mnt/fs5/MailStaff nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.17.17.216,mountvers=3,mountport=4046,mountproto=udp,local_lock=none,addr=172.17.17.216 0 0
    fs4.serv.uni-osnabrueck.de:/vol/MailStudent /mnt/fs4/MailStudent nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.17.17.211,mountvers=3,mountport=4046,mountproto=udp,local_lock=none,addr=172.17.17.211 0 0

Here is my backtrace from yesterday:

    kernel BUG at arch/x86/xen/mmu.c:1457!
    invalid opcode: 0000 [#1] SMP
    last sysfs file: /sys/module/sunrpc/initstate
    CPU 0
    Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc xenfs ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log microcode xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
    Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc xenfs ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log microcode xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
    Pid: 3937, comm: auth Tainted: G ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
    RIP: e030:[<ffffffff81005b2f>]  [<ffffffff81005b2f>] pin_pagetable_pfn+0x4f/0x60
    RSP: e02b:ffff8800f919fd08  EFLAGS: 00010282
    RAX: 00000000ffffffea RBX: 00000000000f9f57 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800f919fd08
    RBP: ffff8800f919fd28 R08: 00003ffffffff000 R09: ffff880000000000
    R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000003
    R13: ffff880003b187e0 R14: ffff8800f91d5788 R15: ffff8800f9f255f8
    FS:  00007fc4205c3700(0000) GS:ffff88000d136000(0000) knlGS:0000000000000000
    CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007fc41f9061b8 CR3: 0000000002c3a000 CR4: 0000000000002660
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process auth (pid: 3937, threadinfo ffff8800f919e000, task ffff880003bac040)
    Stack:
     0000000000000000 000000000033a69d ffff8800f91d5788 00000000000f9f57
     ffff8800f919fd48 ffffffff81005cb9 ffff8800f91d5700 00000000000f9f57
     ffff8800f919fd58 ffffffff81005d13 ffff8800f919fda8 ffffffff811334ec
    Call Trace:
     [<ffffffff81005cb9>] xen_alloc_ptpage+0x99/0xa0
     [<ffffffff81005d13>] xen_alloc_pte+0x13/0x20
     [<ffffffff811334ec>] __pte_alloc+0x8c/0x160
     [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
     [<ffffffff811382a9>] handle_mm_fault+0x149/0x2c0
     [<ffffffff810414e9>] __do_page_fault+0x139/0x480
     [<ffffffff8113e1da>] ? do_mmap_pgoff+0x33a/0x380
     [<ffffffff814e054e>] do_page_fault+0x3e/0xa0
     [<ffffffff814dd8d5>] page_fault+0x25/0x30
    Code: 48 ba ff ff ff 7f ff ff ff ff 48 21 d0 48 89 45 e8 48 8d 7d e0 be 01 00 00 00 31 d2 41 ba f0 7f 00 00 e8 15 b8 ff ff 85 c0 74 04 <0f> 0b eb fe c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
    RIP  [<ffffffff81005b2f>] pin_pagetable_pfn+0x4f/0x60
     RSP <ffff8800f919fd08>
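The call chain above (page fault -> `__pte_alloc` -> `xen_alloc_pte` -> `xen_alloc_ptpage` -> `pin_pagetable_pfn`) is the normal path for handing a freshly allocated pte page over to Xen. The following is a simplified paraphrase of `xen_alloc_ptpage()` from 2.6.32-era arch/x86/xen/mmu.c, with details such as split-ptlock handling elided; it is a sketch for orientation, not the exact RHEL source:

```c
/* Simplified paraphrase of xen_alloc_ptpage() (arch/x86/xen/mmu.c). */
static void xen_alloc_ptpage(struct mm_struct *mm, unsigned long pfn,
			     unsigned level)
{
	struct page *page = pfn_to_page(pfn);

	/* New page-table pages only need to be handed to Xen if the
	 * address space they belong to is itself already pinned. */
	if (PagePinned(virt_to_page(mm->pgd))) {
		SetPagePinned(page);

		if (!PageHighMem(page)) {
			/* Revoke the guest's own write access first... */
			make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));

			/* ...then ask the hypervisor to validate and pin
			 * the frame as an L1 table.  Any stale writable
			 * alias (such as one left by a lazily unmapped
			 * vmalloc area) makes the pin fail, and
			 * pin_pagetable_pfn() responds with BUG():
			 * exactly the crash captured above. */
			if (level == PT_PTE)
				pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn);
		}
	}
}
```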
---

I was trying to get my test machine to crash with the standard 2.6.32-131.6.1.el6.x86_64 kernel before testing the new kernel. It did crash, but in a different way; the backtrace is:

    BUG: unable to handle kernel paging request at ffff8801ec2f2010
    IP: [<ffffffff81006db8>] xen_set_pmd+0x38/0xb0
    PGD 1a26067 PUD 5dcd067 PMD 5f2f067 PTE 80100001ec2f2065
    Oops: 0003 [#1] SMP
    last sysfs file: /sys/devices/virtual/block/dm-2/dm/name
    CPU 0
    Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
    Modules linked in: nfs lockd f2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
    Pid: 30736, comm: xe-update-guest Tainted: G ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
    RIP: e030:[<ffffffff81006db8>]  [<ffffffff81006db8>] xen_set_pmd+0x38/0xb0
    RSP: e02b:ffff8801d7a41a78  EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff8801ec2f2010 RCX: ffff880000000000
    RDX: ffffea0000000000 RSI: 0000000000000000 RDI: ffff8801ec2f2010
    RBP: ffff8801d7a41a88 R08: 00000000018b2000 R09: 0000000000000000
    R10: 0000000000000010 R11: 0000000000000000 R12: 0000000000000000
    R13: 00000000008e5000 R14: ffff8801ec2f2010 R15: ffff88002805d520
    FS:  00007f21a9817700(0000) GS:ffff88002804f000(0000) knlGS:0000000000000000
    CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffff8801ec2f2010 CR3: 000000014b233000 CR4: 0000000000002660
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process xe-update-guest (pid: 30736, threadinfo ffff8801d7a40000, task ffff8801f2f90b40)
    Stack:
     0000000000600000 000000019baf4067 ffff8801d7a41b58 ffffffff81133d3f
    <0> ffffffff81007c4f ffffffff8115b764 ffff8801d7a41bc8 ffffffffffffffff
    <0> ffffffffffffffff 0000000000000000 0000000000000000 00000000008e4fff
    Call Trace:
     [<ffffffff81133d3f>] free_pgd_range+0x25f/0x4b0
     [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
     [<ffffffff8115b764>] ? kmem_cache_free+0xc4/0x2b0
     [<ffffffff8113405e>] free_pgtables+0xce/0x120
     [<ffffffff8113af90>] exit_mmap+0xb0/0x170
     [<ffffffff8106449c>] mmput+0x6c/0x120
     [<ffffffff81178ba9>] flush_old_exec+0x449/0x600
     [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
     [<ffffffff811ca095>] load_elf_binary+0x2b5/0x1bc0
     [<ffffffff81133261>] ? follow_page+0x321/0x460
     [<ffffffff8113852f>] ? __get_user_pages+0x10f/0x420
     [<ffffffff811c3aac>] ? load_misc_binary+0xac/0x3e0
     [<ffffffff8117a0fb>] search_binary_handler+0x10b/0x350
     [<ffffffff8117b289>] do_execve+0x239/0x310
     [<ffffffff8126e8ba>] ? strncpy_from_user+0x4a/0x90
     [<ffffffff810095ca>] sys_execve+0x4a/0x80
     [<ffffffff8100b5ca>] stub_execve+0x6a/0xc0
    Code: 89 64 24 08 0f 1f 44 00 00 80 3d 2f 94 d0 00 00 48 89 fb 49 89 f4 75 51 48 89 df 83 05 69 93 d0 00 01 e8 7c e4 ff ff 84 c0 75 18 <4c> 89 23 48 8b 1c 24 4c 8b 64 24 08 c9 c3 66 2e 0f 1f 84 00 00
    RIP  [<ffffffff81006db8>] xen_set_pmd+0x38/0xb0
     RSP <ffff8801d7a41a78>
    CR2: ffff8801ec2f2010
    ---[ end trace cb20c8b5bdd26af7 ]---
    Kernel panic - not syncing: Fatal exception
    Pid: 30736, comm: xe-update-guest Tainted: G D ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
    Call Trace:
     [<ffffffff814da518>] ? panic+0x78/0x143
     [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
     [<ffffffff814dd41c>] ? _spin_unlock_irqrestore+0x1c/0x20
     [<ffffffff814de564>] ? oops_end+0xe4/0x100
     [<ffffffff81040c9b>] ? no_context+0xfb/0x260
     [<ffffffff81040f25>] ? __bad_area_nosemaphore+0x125/0x1e0
     [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
     [<ffffffff81040ff3>] ? bad_area_nosemaphore+0x13/0x20
     [<ffffffff810416cd>] ? __do_page_fault+0x31d/0x480
     [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
     [<ffffffff81007c62>] ? check_events+0x12/0x20
     [<ffffffff8100c2fb>] ? xen_hypervisor_callback+0x1b/0x20
     [<ffffffff814ddb0a>] ? error_exit+0x2a/0x60
     [<ffffffff8100bb1d>] ? retint_res
     [<ffffffff8100140a>] ? hypercall_page+0x40a/0x1010
     [<ffffffff814e054e>] ? do_page_fault+0x3e/0xa0
     [<ffffffff814dd8d5>] ? page_fault+0x25/0x30
     [<ffffffff81006db8>] ? xen_set_pmd+0x38/0xb0
     [<ffffffff81006db4>] ? xen_set_pmd+0x34/0xb0
     [<ffffffff81133d3f>] ? free_pgd_range+0x25f/0x4b0
     [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
     [<ffffffff8115b764>] ? kmem_cache_free+0xc4/0x2b0
     [<ffffffff8113405e>] ? free_pgtables+0xce/0x120
     [<ffffffff8113af90>] ? exit_mmap+0xb0/0x170
     [<ffffffff8106449c>] ? mmput+0x6c/0x120
     [<ffffffff81178ba9>] ? flush_old_exec+0x449/0x600
     [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
     [<ffffffff811ca095>] ? load_elf_binary+0x2b5/0x1bc0
     [<ffffffff81133261>] ? follow_page+0x321/0x460
     [<ffffffff8113852f>] ? __get_user_pages+0x10f/0x420
     [<ffffffff811c3aac>] ? load_misc_binary+0xac/0x3e0
     [<ffffffff8117a0fb>] ? search_binary_handler+0x10b/0x350
     [<ffffffff8117b289>] ? do_execve+0x239/0x310
     [<ffffffff8126e8ba>] ? strncpy_from_user+0x4a/0x90
     [<ffffffff810095ca>] ? sys_execve+0x4a/0x80
     [<ffffffff8100b5ca>] ? stub_execve+0x6a/0xc0

with xenserver dmesg logs:

    (XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) mm.c:2319:d50 Bad type (saw e800000000000016 != exp 2000000000000000) for mfn f662e7 (pfn 1f4872)
    (XEN) mm.c:2708:d50 Error while pinning mfn f662e7
    (XEN) mm.c:2319:d50 Bad type (saw e80000000000000e != exp 6000000000000000) for mfn f69522 (pfn 1f1637)
    (XEN) mm.c:896:d50 Attempt to create linear p.t. with write perms
    (XEN) mm.c:1441:d50 Failure in alloc_l4_table: entry 255
    (XEN) mm.c:2071:d50 Error while validating mfn d0f926 (pfn 14b233) for type 8000000000000000: caf=8000000000000003 taf=8000000000000001
    (XEN) mm.c:2708:d50 Error while pinning mfn d0f926
    (XEN) mm.c:2319:d50 Bad type (saw e80000000000000e != exp 6000000000000000) for mfn f69522 (pfn 1f1637)
    (XEN) mm.c:896:d50 Attempt to create linear p.t. with write perms
    (XEN) mm.c:1441:d50 Failure in alloc_l4_table: entry 255
    (XEN) printk: 52 messages suppressed.

This could be the same sort of thing, though, as it resembles the comments in http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=ae333e97552c81ab10395ad1ffc6d6daaadb144a which seems to be a later version of the "vmalloc: eagerly clear ptes on vunmap" patch and is mentioned in the branches of the same thread I referred to above (which is rather messy, but I think the messages

http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00233.html
http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00742.html
http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00415.html

tie the above backtrace to the patch).

---

(In reply to comment #17)
> This could be the same sort of thing though, as it resembles the comments in
> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=ae333e97552c81ab10395ad1ffc6d6daaadb144a
> which seems to be a later version of the "vmalloc: eagerly clear ptes on
> vunmap" patch and is mentioned in the branches of the same thread I referred to
> above

Yes, it looks like the errors described in the "vmalloc: eagerly clear ptes on vunmap" commit message.

Have you had any chance to test the patched kernel from comment 13?

---

(In reply to comment #18)
> Yes, It looks like errors in "vmalloc: eagerly clear ptes on vunmap" commit
> message.
> Have you any chance to test the patched kernel from comment 13?

I was trying your patched kernel on my test box yesterday, attempting to crash it, and didn't succeed, though I am not sure how reliable my method of reproducing the crash is, so I might just have been lucky. I am going to go back to the regular kernel and will do some more testing so I can get a better idea of how significant the lack of a crash yesterday is.
---

Managed to crash a 2.6.32-131.0.15.el6.i686 pv guest on a rhel 5 host. To reproduce it, I just did as the "vmalloc: eagerly clear ptes on vunmap" commit message suggested:

    mount testbox:/test /mnt/nfs_share
    find /mnt/nfs_share -type f -print0 | xargs -0 file > /dev/null

The kernel oopsed:

    kernel BUG at arch/x86/xen/mmu.c:1457!
    invalid opcode: 0000 [#1] SMP
    last sysfs file: /sys/module/sunrpc/initstate
    Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
    Pid: 17394, comm: file Tainted: G ---------------- T (2.6.32-131.0.15.el6.i686 #1)
    EIP: 0061:[<c040593b>] EFLAGS: 00010282 CPU: 1
    EIP is at pin_pagetable_pfn+0x3b/0x50
    EAX: ffffffea EBX: e7831eb8 ECX: 00000001 EDX: 00000000
    ESI: 00007ff0 EDI: e5960230 EBP: c20d3948 ESP: e7831eb8
     DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
    Process file (pid: 17394, ti=e7830000 task=e9174ab0 task.ti=e7830000)
    Stack:
     00000000 002e985b c0405ab3 c20d3900 00028198 e5960230 c04fdcf5 e5af3000
    <0> e5960230 08c0859c c204ce64 c20d3900 c050265d 08c0859c 00000000 00000000
    <0> 00000000 c20d3900 00000009 00000000 c204ce64 00000009 08c0859c c20d3900
    Call Trace:
     [<c0405ab3>] ? xen_alloc_ptpage+0xa3/0xf0
     [<c04fdcf5>] ? __pte_alloc+0x65/0xd0
     [<c050265d>] ? handle_mm_fault+0x19d/0x1d0
     [<c043293b>] ? __do_page_fault+0xfb/0x420
     [<c05073de>] ? do_brk+0x23e/0x330
     [<c082777a>] ? do_page_fault+0x2a/0x90
     [<c0827750>] ? do_page_fault+0x0/0x90
     [<c08251c7>] ? error_code+0x73/0x78
     [<c0820000>] ? rcu_cpu_notify+0x4a/0x75
    Code: 00 89 0c 24 75 0a e8 85 f9 ff ff 25 ff ff ff 7f 89 44 24 04 89 e3 b9 01 00 00 00 31 d2 be f0 7f 00 00 e8 09 ca ff ff 85 c0 74 04 <0f> 0b eb fe 83 c4 0c 5b 5e 5f c3 8d 76 00 8d bc 27 00 00 00 00
    EIP: [<c040593b>] pin_pagetable_pfn+0x3b/0x50 SS:ESP 0069:e7831eb8
    ---[ end trace 41a0c88bd81d9413 ]---
    Kernel panic - not syncing: Fatal exception
    Pid: 17394, comm: file Tainted: G D ---------------- T 2.6.32-131.0.15.el6.i686 #1
    Call Trace:
     [<c0821fde>] ? panic+0x42/0xf9
     [<c0825ddc>] ? oops_end+0xbc/0xd0
     [<c040aa80>] ? do_invalid_op+0x0/0x90
     [<c040aaff>] ? do_invalid_op+0x7f/0x90
     [<c040593b>] ? pin_pagetable_pfn+0x3b/0x50
     [<c0407328>] ? xen_vcpuop_set_next_event+0x48/0x80
     [<c04ed724>] ? __alloc_pages_nodemask+0xf4/0x800
     [<c08251c7>] ? error_code+0x73/0x78
     [<c04300d8>] ? cache_k8_northbridges+0x18/0x100
     [<c040593b>] ? pin_pagetable_pfn+0x3b/0x50
     [<c0405ab3>] ? xen_alloc_ptpage+0xa3/0xf0
     [<c04fdcf5>] ? __pte_alloc+0x65/0xd0
     [<c050265d>] ? handle_mm_fault+0x19d/0x1d0
     [<c043293b>] ? __do_page_fault+0xfb/0x420
     [<c05073de>] ? do_brk+0x23e/0x330
     [<c082777a>] ? do_page_fault+0x2a/0x90
     [<c0827750>] ? do_page_fault+0x0/0x90
     [<c08251c7>] ? error_code+0x73/0x78
     [<c0820000>] ? rcu_cpu_notify+0x4a/0x75

and xen complained on the console with messages:

    (XEN) mm.c:2042:d46 Bad type (saw 00000000e8000006 != exp 0000000020000000) for mfn 2e985b (pfn 28198)
    (XEN) mm.c:2375:d46 Error while pinning mfn 2e985b
---

(In reply to comment #19)
> I was trying your patched kernel on my test box yesterday, attempting to crash
> it and didn't succeed, though I am not sure how reliable my method of
> reproducing the crash is so I might just have been lucky.

Could you try to reproduce the bug as per comment 20? It crashed the guest in several minutes for me.

PS: The nfs share used for the test has several kernel trees on it.

---

I added that to my existing testing of the regular kernel, and it crashed quite quickly, though with a backtrace that was different again (see below); it looks to be the same underlying bug. I repeated the same test with the patched kernel and it hasn't crashed, so I think the bug is fixed in the test kernel.

The backtrace (with a bit missing from the start) I got from this crash was:

    ...andle kernel paging request at ffff8800040bc5e0
    IP: [<ffffffff81045845>] ptep_set_access_flags+0x55/0x70
    PGD 1a26067 PUD 1a2a067 PMD 57bc067 PTE 80100000040bc065
    Oops: 0003 [#1] SMP
    last sysfs file: /sys/devices/virtual/block/dm-2/range
    CPU 1
    Modules linked in: autofs4 nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
    Modules linked in: autofs4 nfs lockd fext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
    Pid: 11447, comm: sh Tainted: G ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
    RIP: e030:[<ffffffff81045845>]  [<ffffffff81045845>] ptep_set_access_flags+0x55/0x70
    RSP: e02b:ffff8800a1a47b38  EFLAGS: 00010202
    RAX: 80000001077fe145 RBX: ffff8801c716e148 RCX: 8000000f13359167
    RDX: ffff8800040bc5e0 RSI: 00007f7bd5abc9d0 RDI: ffff8801c716e148
    RBP: ffff8800a1a47b58 R08: 0000000000000001 R09: e400000000000000
    R10: 0000000000000000 R11: 0000000000000098 R12: 00007f7bd5abc9d0
    R13: 0000000000000001 R14: 0000000000000008 R15: ffffea00000e2930
    FS:  00007f7bd5abc700(0000) GS:ffff88002806d000(0000) knlGS:0000000000000000
    CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffff8800040bc5e0 CR3: 00000000043b6000 CR4: 0000000000002660
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process sh (pid: 11447, threadinfo ffff8800a1a46000, task ffff8801f4d70080)
    Stack:
     ffff8801c716e148 0000000000000000 ffffea00039a3f90 0000000000000008
    <0> ffff8800a1a47bf8 ffffffff81136c7b ffff8801ffffffff 0037f414ab99fe20
    <0> 0000000000000001 ffff8800fef99568 0000000000000000 ffff8800040bc5e0
    Call Trace:
     [<ffffffff81136c7b>] do_wp_page+0x44b/0x8d0
     [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
     [<ffffffff811378dd>] handle_pte_fault+0x2cd/0xb50
     [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
     [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
     [<ffffffff81138338>] handle_mm_fault+0x1d8/0x2c0
     [<ffffffff810414e9>] __do_page_fault+0x139/0x480
     [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
     [<ffffffff81007c62>] ? check_events+0x12/0x20
     [<ffffffff81006033>] ? __xen_write_cr3+0x123/0x170
     [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
     [<ffffffff8100644f>] ? xen_write_cr3+0x8f/0xc0
     [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
     [<ffffffff814e054e>] do_page_fault+0x3e/0xa0
     [<ffffffff814dd8d5>] page_fault+0x25/0x30
     [<ffffffff8126e3fd>] ? __put_user_4+0x1d/0x30
     [<ffffffff8105fc34>] ? schedule_tail+0x64/0xb0
     [<ffffffff8100b073>] ret_from_fork+0x13/0x80
    Code: 89 f4 41 0f 95 c5 45 85 c0 75 1b 44 89 e8 48 8b 1c 24 4c 8b 64 24 08 4c 8b 6c 24 10 4c 8b 74 24 18 c9 c3 0f 1f 00 45 85 ed 74 e0 <48> 89 0a 48 8b 3f 0f 1f 80 00 00 00 00 4c 89 e6 48 89 df e8 13
    RIP  [<ffffffff81045845>] ptep_set_access_flags+0x55/0x70
     RSP <ffff8800a1a47b38>
    CR2: ffff8800040bc5e0
    ---[ end trace 8823d2c63163302c ]---
    Kernel panic - not syncing: Fatal exception
    Pid: 11447, comm: sh Tainted: G D ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
    Call Trace:
     [<ffffffff814da518>] ? panic+0x78/0x143
     [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
     [<ffffffff814dd41c>] ? _spin_unlock_irqrestore+0x1c/0x20
     [<ffffffff814de564>] ? oops_end+0xe4/0x100
     [<ffffffff81040c9b>] ? no_context+0xfb/0x260
     [<ffffffff81040f25>] ? __bad_area_nosemaphore+0x125/0x1e0
     [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
     [<ffffffff81040ff3>] ? bad_area_nosemaphore+0x13/0x20
     [<ffffffff810416cd>] ? __do_page_fault+0x31d/0x480
     [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
     [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
     [<ffffffff8111e8e1>] ? get_page_from_freelist+0x3d1/0x820
     [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
     [<ffffffff8111e7ee>] ? get_page_from_freelist+0x2de/0x820
     [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
     [<ffffffff81007c62>] ? check_events+0x12/0x20
     [<ffffffff814e054e>] ? do_page_fault+0x3e/0xa0
     [<ffffffff814dd8d5>] ? page_fault+0x25/0x30
     [<ffffffff81045845>] ? ptep_set_access_flags+0x55/0x70
     [<ffffffff81136c7b>] ? do_wp_page+0x44b/0x8d0
     [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
     [<ffffffff811378dd>] ? handle_pte_fault+0x2cd/0xb50
     [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
     [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
     [<ffffffff81138338>] ? handle_mm_fault+0x1d8/0x2c0
     [<ffffffff810414e9>] ? __do_page_fault+0x139/0x480
     [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
     [<ffffffff81007c62>] ? check_events+0x12/0x20
     [<ffffffff81006033>] ? __xen_write_cr3+0x123/0x170
     [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
     [<ffffffff8100644f>] ? xen_write_cr3+0x8f/0xc0
     [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
     [<ffffffff814e054e>] ? do_page_fault+0x3e/0xa0
     [<ffffffff814dd8d5>] ? page_fault+0x25/0x30
     [<ffffffff8126e3fd>] ? __put_user_4+0x1d/0x30
     [<ffffffff8105fc34>] ? schedule_tail+0x64/0xb0
     [<ffffffff8100b073>] ? ret_from_fork+0x13/0x80

with xenserver dmesg logs:

    (XEN) traps.c:2282:d54 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d54 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) traps.c:2282:d54 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
    (XEN) printk: 5 messages suppressed.
    (XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn f1bc11 (pfn fef46)
    (XEN) mm.c:2708:d54 Error while pinning mfn f1bc11
    (XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn d146fd (pfn c645a)
    (XEN) mm.c:2708:d54 Error while pinning mfn d146fd
    (XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn d336fe (pfn a7459)
    (XEN) mm.c:2708:d54 Error while pinning mfn d336fe
    (XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn d14589 (pfn c65ce)
    (XEN) mm.c:2708:d54 Error while pinning mfn d14589
    (XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn f58a9b (pfn 40bc)
    (XEN) mm.c:2708:d54 Error while pinning mfn f58a9b
    (XEN) printk: 37 messages suppressed.

---

Created attachment 519022 [details]
vmalloc: eagerly clear ptes on vunmap
---

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

---

Patch(es) available on kernel-2.6.32-193.el6

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html