Bug 730503 - RHEL 6.1 xen guest crashes with kernel BUG at arch/x86/xen/mmu.c:1457!
Summary: RHEL 6.1 xen guest crashes with kernel BUG at arch/x86/xen/mmu.c:1457!
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Igor Mammedov
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 653816
 
Reported: 2011-08-13 20:37 UTC by Michael Young
Modified: 2017-02-06 14:52 UTC
CC List: 7 users

Fixed In Version: kernel-2.6.32-193.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-12-06 14:04:14 UTC


Attachments
vmalloc: eagerly clear ptes on vunmap (7.88 KB, patch)
2011-08-19 11:41 UTC, Igor Mammedov
no flags


Links
System ID: Red Hat Product Errata RHSA-2011:1530
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Enterprise Linux 6 kernel security, bug fix and enhancement update
Last Updated: 2011-12-06 01:45:35 UTC

Description Michael Young 2011-08-13 20:37:06 UTC
I am running some RHEL 6.1 xen (pv) guests on Citrix XenServer (5.6sp2) and they are crashing occasionally. I managed to get the backtrace below for the most recent crash with the 2.6.32-131.6.1.el6.x86_64 kernel. The server was running a couple of java processes, and doing some virus scanning of an NFS-mounted file system when it crashed.

kernel BUG at arch/x86/xen/mmu.c:1457!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/devices/virtual/block/dm-2/range
CPU 1 
Modules linked in: autofs4 nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]

Modules linked in: autofs4 nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
Pid: 14844, comm: java Tainted: G           ---------------- T 2.6.32-131.6.1.el6.x86_64 #1 
RIP: e030:[<ffffffff81005b2f>]  [<ffffffff81005b2f>] pin_pagetable_pfn+0x4f/0x60
RSP: e02b:ffff880129f5bd08  EFLAGS: 00010282
RAX: 00000000ffffffea RBX: 000000000012354c RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880129f5bd08
RBP: ffff880129f5bd28 R08: 00003ffffffff000 R09: ffff880000000000
R10: 0000000000007ff0 R11: 00007fab8fe09530 R12: 0000000000000003
R13: ffff88008fc63748 R14: ffff8801f4233a88 R15: ffff88010a471a20
FS:  00007fab8e29d700(0000) GS:ffff88002806d000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000dd202370 CR3: 0000000107d2b000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process java (pid: 14844, threadinfo ffff880129f5a000, task fff...)
Stack:
 0000000000000000 0000000000ebbe27 ffff8801f4233a88 000000000012354c
<0> ffff880129f5bd48 ffffffff81005cb9 ffff8801f4233a00 000000000012354c
<0> ffff880129f5bd58 ffffffff81005d13 ffff880129f5bda8 ffffffff811334ec
Call Trace:
 [<ffffffff81005cb9>] xen_alloc_ptpage+0x99/0xa0
 [<ffffffff81005d13>] xen_alloc_pte+0x13/0x20
 [<ffffffff811334ec>] __pte_alloc+0x8c/0x160
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff811382a9>] handle_mm_fault+0x149/0x2c0
 [<ffffffff810414e9>] __do_page_fault+0x139/0x480
 [<ffffffff8105f73c>] ? pick_next_task_fair+0xec/0x120
 [<ffffffff814e054e>] do_page_fault+0x3e/0xa0
 [<ffffffff814dd8d5>] page_fault+0x25/0x30
Code: 48 ba ff ff ff 7f ff ff ff ff 48 21 d0 48 89 45 e8 48 8d 7d e0 be 01 00 00 00 31 d2 41 ba f0 7f 00 00 e8 15 b8 ff ff 85 c0 74 04 <0f> 0b eb fe c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 
RIP  [<ffffffff81005b2f>] pin_pagetable_pfn+0x4f/0x60
 RSP <ffff880129f5bd08>
---[ end trace a05e463aa63b231d ]---
Kernel panic - not syncing: Fatal exception
Pid: 14844, comm: java Tainted: G      D    ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
Call Trace:
 [<ffffffff814da518>] ? panic+0x78/0x143
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff814dd41c>] ? _spin_unlock_irqrestore+0x1c/0x20
 [<ffffffff814de564>] ? oops_end+0xe4/0x100
 [<ffffffff8100f2eb>] ? die+0x5b/0x90
 [<ffffffff814dde34>] ? do_trap+0xc4/0x160
 [<ffffffff8100ceb5>] ? do_invalid_op+0x95/0xb0
 [<ffffffff81005b2f>] ? pin_pagetable_pfn+0x4f/0x60
 [<ffffffff8111fab1>] ? __alloc_pages_nodemask+0x111/0x8b0
 [<ffffffff8100bf5b>] ? invalid_op+0x1b/0x20
 [<ffffffff81005b2f>] ? pin_pagetable_pfn+0x4f/0x60
 [<ffffffff81005b2b>] ? pin_pagetable_pfn+0x4b/0x60
 [<ffffffff81005cb9>] ? xen_alloc_ptpage+0x99/0xa0
 [<ffffffff81005d13>] ? xen_alloc_pte+0x13/0x20
 [<ffffffff811334ec>] ? __pte_alloc+0x8c/0x160
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff811382a9>] ? handle_mm_fault+0x149/0x2c0
 [<ffffffff810414e9>] ? __do_page_fault+0x139/0x480
 [<ffffffff8105f73c>] ? pick_next_task_fair+0xec/0x120
 [<ffffffff814e054e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff814dd8d5>] ? page_fault+0x25/0x30

Comment 2 Andrew Jones 2011-08-14 16:34:59 UTC
1451 static void pin_pagetable_pfn(unsigned cmd, unsigned long pfn)
1452 {  
1453         struct mmuext_op op;
1454         op.cmd = cmd;
1455         op.arg1.mfn = pfn_to_mfn(pfn);
1456         if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
1457                 BUG();
1458 }

We need to try and figure out why XenServer 5.6sp2 is returning an error on this hypercall. Please also file a bug with Citrix.
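
In the meantime, a minimal diagnostic sketch (just an instrumentation idea, not a proposed fix) would be to log the failing pfn/mfn and the hypercall return code before hitting BUG(), so the guest-side failure can be correlated with whatever the hypervisor reports on its side:

static void pin_pagetable_pfn(unsigned cmd, unsigned long pfn)
{
        struct mmuext_op op;
        int rc;

        op.cmd = cmd;
        op.arg1.mfn = pfn_to_mfn(pfn);
        rc = HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF);
        if (rc) {
                /* report the pfn/mfn that the hypervisor refused to pin */
                printk(KERN_ERR "xen: mmuext_op cmd %u failed for pfn %lx (mfn %lx): rc=%d\n",
                       cmd, pfn, pfn_to_mfn(pfn), rc);
                BUG();
        }
}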

Comment 4 Paolo Bonzini 2011-08-14 16:41:08 UTC
Please attach the output of "xm dmesg" after a crash.  Thanks!

Comment 5 Michael Young 2011-08-14 17:31:11 UTC
It isn't xm in XenServer; however, xe host-dmesg host=servername gives the following for one guest session:

(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) mm.c:2319:d46 Bad type (saw e800000000000004 != exp 2000000000000000) for mfn 833d7e (pfn 1eb5f5)
(XEN) mm.c:2708:d46 Error while pinning mfn 833d7e
(XEN) grant_table.c:1408:d0 dest domain 46 dying
(XEN) grant_table.c:1408:d0 dest domain 46 dying

Comment 6 Andrew Jones 2011-08-14 21:02:10 UTC
This is interesting

> (XEN) mm.c:2319:d46 Bad type (saw e800000000000004 != exp 2000000000000000) for 
> mfn 833d7e (pfn 1eb5f5)
> (XEN) mm.c:2708:d46 Error while pinning mfn 833d7e

This thread looks like it could be discussing the same issue

http://lists.xensource.com/archives/html/xen-devel/2011-03/msg01367.html

Comment 7 Michael Young 2011-08-14 21:27:26 UTC
(In reply to comment #6)
> This is interesting
> 
> > (XEN) mm.c:2319:d46 Bad type (saw e800000000000004 != exp 2000000000000000) for 
> > mfn 833d7e (pfn 1eb5f5)
> > (XEN) mm.c:2708:d46 Error while pinning mfn 833d7e
> 
> This thread looks like it could be discussing the same issue
> 
> http://lists.xensource.com/archives/html/xen-devel/2011-03/msg01367.html

Except that is a crash on boot, whereas mine can stay up for a few days. There is also the long thread containing http://lists.xensource.com/archives/html/xen-devel/2011-01/msg00171.html though I think you might already have the patch for that.

Comment 8 Michael Young 2011-08-14 21:53:52 UTC
(In reply to comment #7)
> There is also the long thread containing
> http://lists.xensource.com/archives/html/xen-devel/2011-01/msg00171.html though
> I think you might already have the patch for that.

Actually I think I am wrong about you having the patch. http://lists.xensource.com/archives/html/xen-devel/2011-02/msg01293.html later in the thread mentions two patches so I think I was checking the wrong patch in my earlier comment.

Comment 11 Igor Mammedov 2011-08-15 14:34:02 UTC
Michael,

Would you care to test a new kernel with the patches mentioned in
http://lists.xensource.com/archives/html/xen-devel/2011-02/msg01293.html
if I built it for you?

Comment 12 Michael Young 2011-08-15 15:07:26 UTC
Yes, I will test it. I have just been creating a copy of the virtual server that was crashing for testing purposes.

Comment 13 Igor Mammedov 2011-08-16 05:44:24 UTC
A new kernel built with the "vmalloc: eagerly clear ptes on vunmap" patch:

http://people.redhat.com/imammedo/kernel-2.6.32-131.0.1.el6.test.x86_64.rpm

Please test.

Comment 14 anshockm 2011-08-16 06:47:24 UTC
(In reply to comment #13)
> A new kernel build with "vmalloc: eagerly clear ptes on vunmap" patch:
> 
> http://people.redhat.com/imammedo/kernel-2.6.32-131.0.1.el6.test.x86_64.rpm
> 
> Please test.

Perfect!
I could trigger the bug easily within 2 minutes yesterday, and with this new kernel the system has been running fine for 30 minutes now.

My environment before:
kernel 2.6.32-131.6.1.el6.x86_64 xen (pv) guests on Citrix XenServer (5.6sp2) running dovecot-2.0.9-2.el6.x86_64. Using this system as the back end for imapproxy on a SquirrelMail webmailer triggered the bug within 2 minutes.

Comment 15 Igor Mammedov 2011-08-16 07:26:08 UTC
(In reply to comment #14)
> 
> Perfect!
> I could trigger the bug easily within 2 minutes yesterday and with this new
> kernel the system is running fine for 30 min now.

Thanks for verifying.
 
Do you have any NFS-related I/O activity when the bug was reproduced?

Comment 16 anshockm 2011-08-16 07:42:00 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > 
> > Perfect!
> > I could trigger the bug easily within 2 minutes yesterday and with this new
> > kernel the system is running fine for 30 min now.
> 
> Thanks for verifying.
> 
> Do you have any NFS related io activity when bug was reproduced?

Yes, the dovecot mail directories are on NetApp file servers connected by NFS.

fs5.serv.uni-osnabrueck.de:/vol/MailStaff /mnt/fs5/MailStaff nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.17.17.216,mountvers=3,mountport=4046,mountproto=udp,local_lock=none,addr=172.17.17.216 0 0
fs4.serv.uni-osnabrueck.de:/vol/MailStudent /mnt/fs4/MailStudent nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.17.17.211,mountvers=3,mountport=4046,mountproto=udp,local_lock=none,addr=172.17.17.211 0 0


Here is my backtrace from yesterday:

kernel BUG at arch/x86/xen/mmu.c:1457!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/module/sunrpc/initstate
CPU 0 
Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc xenfs ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log microcode xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc xenfs ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log microcode xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
Pid: 3937, comm: auth Tainted: G           ---------------- T 2.6.32-131.6.1.el6.x86_64 #1 
RIP: e030:[<ffffffff81005b2f>]  [<ffffffff81005b2f>] pin_pagetable_pfn+0x4f/0x60
RSP: e02b:ffff8800f919fd08  EFLAGS: 00010282
RAX: 00000000ffffffea RBX: 00000000000f9f57 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800f919fd08
RBP: ffff8800f919fd28 R08: 00003ffffffff000 R09: ffff880000000000
R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000003
R13: ffff880003b187e0 R14: ffff8800f91d5788 R15: ffff8800f9f255f8
FS:  00007fc4205c3700(0000) GS:ffff88000d136000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fc41f9061b8 CR3: 0000000002c3a000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process auth (pid: 3937, threadinfo ffff8800f919e000, task ffff880003bac040)
Stack:
0000000000000000 000000000033a69d ffff8800f91d5788 00000000000f9f57
 ffff8800f919fd48 ffffffff81005cb9 ffff8800f91d5700 00000000000f9f57
 ffff8800f919fd58 ffffffff81005d13 ffff8800f919fda8 ffffffff811334ec
Call Trace:
[<ffffffff81005cb9>] xen_alloc_ptpage+0x99/0xa0
[<ffffffff81005d13>] xen_alloc_pte+0x13/0x20
[<ffffffff811334ec>] __pte_alloc+0x8c/0x160
[<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[<ffffffff811382a9>] handle_mm_fault+0x149/0x2c0
[<ffffffff810414e9>] __do_page_fault+0x139/0x480
[<ffffffff8113e1da>] ? do_mmap_pgoff+0x33a/0x380
[<ffffffff814e054e>] do_page_fault+0x3e/0xa0
[<ffffffff814dd8d5>] page_fault+0x25/0x30
Code: 48 ba ff ff ff 7f ff ff ff ff 48 21 d0 48 89 45 e8 48 8d 7d e0 be 01 00 00 00 31 d2 41 ba f0 7f 00 00 e8 15 b8 ff ff 85 c0 74 04 <0f> 0b eb fe c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 
RIP  [<ffffffff81005b2f>] pin_pagetable_pfn+0x4f/0x60
RSP <ffff8800f919fd08>

Comment 17 Michael Young 2011-08-16 09:44:11 UTC
I was trying to get my test machine to crash with the standard 2.6.32-131.6.1.el6.x86_64 kernel before testing the new kernel. It did crash, but in a different way; the backtrace is:

BUG: unable to handle kernel paging request at ffff8801ec2f2010
IP: [<ffffffff81006db8>] xen_set_pmd+0x38/0xb0
PGD 1a26067 PUD 5dcd067 PMD 5f2f067 PTE 80100001ec2f2065
Oops: 0003 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-2/dm/name
CPU 0
Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]

Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
Pid: 30736, comm: xe-update-guest Tainted: G           ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
RIP: e030:[<ffffffff81006db8>]  [<ffffffff81006db8>] xen_set_pmd+0x38/0xb0
RSP: e02b:ffff8801d7a41a78  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8801ec2f2010 RCX: ffff880000000000
RDX: ffffea0000000000 RSI: 0000000000000000 RDI: ffff8801ec2f2010
RBP: ffff8801d7a41a88 R08: 00000000018b2000 R09: 0000000000000000
R10: 0000000000000010 R11: 0000000000000000 R12: 0000000000000000
R13: 00000000008e5000 R14: ffff8801ec2f2010 R15: ffff88002805d520
FS:  00007f21a9817700(0000) GS:ffff88002804f000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff8801ec2f2010 CR3: 000000014b233000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process xe-update-guest (pid: 30736, threadinfo ffff8801d7a40000, task ffff8801f2f90b40)
Stack:
 0000000000600000 000000019baf4067 ffff8801d7a41b58 ffffffff81133d3f
<0> ffffffff81007c4f ffffffff8115b764 ffff8801d7a41bc8 ffffffffffffffff
<0> ffffffffffffffff 0000000000000000 0000000000000000 00000000008e4fff
Call Trace:
 [<ffffffff81133d3f>] free_pgd_range+0x25f/0x4b0
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8115b764>] ? kmem_cache_free+0xc4/0x2b0
 [<ffffffff8113405e>] free_pgtables+0xce/0x120
 [<ffffffff8113af90>] exit_mmap+0xb0/0x170
 [<ffffffff8106449c>] mmput+0x6c/0x120
 [<ffffffff81178ba9>] flush_old_exec+0x449/0x600
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff811ca095>] load_elf_binary+0x2b5/0x1bc0
 [<ffffffff81133261>] ? follow_page+0x321/0x460
 [<ffffffff8113852f>] ? __get_user_pages+0x10f/0x420
 [<ffffffff811c3aac>] ? load_misc_binary+0xac/0x3e0
 [<ffffffff8117a0fb>] search_binary_handler+0x10b/0x350
 [<ffffffff8117b289>] do_execve+0x239/0x310
 [<ffffffff8126e8ba>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff810095ca>] sys_execve+0x4a/0x80
 [<ffffffff8100b5ca>] stub_execve+0x6a/0xc0
Code: 89 64 24 08 0f 1f 44 00 00 80 3d 2f 94 d0 00 00 48 89 fb 49 89 f4 75 51 48 89 df 83 05 69 93 d0 00 01 e8 7c e4 ff ff 84 c0 75 18 <4c> 89 23 48 8b 1c 24 4c 8b 64 24 08 c9 c3 66 2e 0f 1f 84 00 00
RIP  [<ffffffff81006db8>] xen_set_pmd+0x38/0xb0
 RSP <ffff8801d7a41a78>
CR2: ffff8801ec2f2010
---[ end trace cb20c8b5bdd26af7 ]---
Kernel panic - not syncing: Fatal exception
Pid: 30736, comm: xe-update-guest Tainted: G      D    ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
Call Trace:
 [<ffffffff814da518>] ? panic+0x78/0x143
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff814dd41c>] ? _spin_unlock_irqrestore+0x1c/0x20
 [<ffffffff814de564>] ? oops_end+0xe4/0x100
 [<ffffffff81040c9b>] ? no_context+0xfb/0x260
 [<ffffffff81040f25>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81040ff3>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff810416cd>] ? __do_page_fault+0x31d/0x480
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81007c62>] ? check_events+0x12/0x20
 [<ffffffff8100c2fb>] ? xen_hypervisor_callback+0x1b/0x20
 [<ffffffff814ddb0a>] ? error_exit+0x2a/0x60
 [<ffffffff8100bb1d>] ? retint_res00140a>] ? hypercall_page+0x40a/0x1010
 [<ffffffff814e054e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff814dd8d5>] ? page_fault+0x25/0x30
 [<ffffffff81006db8>] ? xen_set_pmd+0x38/0xb0
 [<ffffffff81006db4>] ? xen_set_pmd+0x34/0xb0
 [<ffffffff81133d3f>] ? free_pgd_range+0x25f/0x4b0
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8115b764>] ? kmem_cache_free+0xc4/0x2b0
 [<ffffffff8113405e>] ? free_pgtables+0xce/0x120
 [<ffffffff8113af90>] ? exit_mmap+0xb0/0x170
 [<ffffffff8106449c>] ? mmput+0x6c/0x120
 [<ffffffff81178ba9>] ? flush_old_exec+0x449/0x600
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff811ca095>] ? load_elf_binary+0x2b5/0x1bc0
 [<ffffffff81133261>] ? follow_page+0x321/0x460
 [<ffffffff8113852f>] ? __get_user_pages+0x10f/0x420
 [<ffffffff811c3aac>] ? load_misc_binary+0xac/0x3e0
 [<ffffffff8117a0fb>] ? search_binary_handler+0x10b/0x350
 [<ffffffff8117b289>] ? do_execve+0x239/0x310
 [<ffffffff8126e8ba>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff810095ca>] ? sys_execve+0x4a/0x80
 [<ffffffff8100b5ca>] ? stub_execve+0x6a/0xc0

with the XenServer dmesg logs:
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) mm.c:2319:d50 Bad type (saw e800000000000016 != exp 2000000000000000) for mfn f662e7 (pfn 1f4872)
(XEN) mm.c:2708:d50 Error while pinning mfn f662e7
(XEN) mm.c:2319:d50 Bad type (saw e80000000000000e != exp 6000000000000000) for mfn f69522 (pfn 1f1637)
(XEN) mm.c:896:d50 Attempt to create linear p.t. with write perms
(XEN) mm.c:1441:d50 Failure in alloc_l4_table: entry 255
(XEN) mm.c:2071:d50 Error while validating mfn d0f926 (pfn 14b233) for type 8000000000000000: caf=8000000000000003 taf=8000000000000001
(XEN) mm.c:2708:d50 Error while pinning mfn d0f926
(XEN) mm.c:2319:d50 Bad type (saw e80000000000000e != exp 6000000000000000) for mfn f69522 (pfn 1f1637)
(XEN) mm.c:896:d50 Attempt to create linear p.t. with write perms
(XEN) mm.c:1441:d50 Failure in alloc_l4_table: entry 255
(XEN) printk: 52 messages suppressed.

This could be the same sort of thing, though, as it resembles the comments in
http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=ae333e97552c81ab10395ad1ffc6d6daaadb144a
which seems to be a later version of the "vmalloc: eagerly clear ptes on vunmap" patch and is mentioned in branches of the same thread I referred to above. That thread is rather messy, but I think the messages
http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00233.html
http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00742.html
http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00415.html
tie the above backtrace to the patch.
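
For context, the gist of that patch, as a very rough sketch only (the clear_kernel_ptes() helper below is made up; the real change lives in mm/vmalloc.c plus a Xen-specific hook): before the fix, vunmap left stale kernel PTEs in place until a lazy TLB flush, so a freed page could be handed back to Xen for pinning as a page table while it still had a writable kernel mapping, which the hypervisor rejects. The fix clears those PTEs eagerly at unmap time:

static void unmap_vmap_area_sketch(unsigned long addr, unsigned long size)
{
        /* clear_kernel_ptes() is a hypothetical stand-in for the pte
         * clearing the real patch does in mm/vmalloc.c; the point is
         * that it happens here, eagerly, instead of being deferred to
         * a lazy flush. */
        clear_kernel_ptes(addr, size);
        flush_tlb_kernel_range(addr, addr + size);
        /* only after this may the underlying pages be freed and reused,
         * e.g. as page tables that Xen will later be asked to pin */
}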

Comment 18 Igor Mammedov 2011-08-17 08:59:33 UTC
(In reply to comment #17)
> This could be the same sort of thing though, as it resembles the comments in
> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=ae333e97552c81ab10395ad1ffc6d6daaadb144a
> which seems to be a later version of the "vmalloc: eagerly clear ptes on
> vunmap" patch and is mentioned in the branches of the same thread I referred to
> above (which is rather messy but I think the messages
> http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00233.html
> http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00742.html
> http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00415.html
> tie the above backtrace to the patch).

Yes, it looks like the errors described in the "vmalloc: eagerly clear ptes on vunmap" commit message.
Have you had a chance to test the patched kernel from comment 13?

Comment 19 Michael Young 2011-08-17 09:24:39 UTC
(In reply to comment #18)
> Yes, It looks like errors in "vmalloc: eagerly clear ptes on vunmap" commit
> message.
> Have you any chance to test the patched kernel from comment 13?

I was trying your patched kernel on my test box yesterday, attempting to crash it, and didn't succeed, though I am not sure how reliable my method of reproducing the crash is, so I might just have been lucky.
I am going to go back to the regular kernel and do some more testing so I can get a better idea of how significant the lack of a crash yesterday is.

Comment 20 Igor Mammedov 2011-08-17 09:44:19 UTC
Managed to crash a 2.6.32-131.0.15.el6.i686 pv guest on a RHEL 5 host.

To reproduce it, I just did what the "vmalloc: eagerly clear ptes on vunmap" commit message suggested:

mount testbox:/test /mnt/nfs_share
find  /mnt/nfs_share -type f -print0 | xargs -0 file > /dev/null

Kernel OOPSed:
==========================================================
kernel BUG at arch/x86/xen/mmu.c:1457!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/module/sunrpc/initstate
Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mod [last unloaded: scsi_wait_scan]

Pid: 17394, comm: file Tainted: G           ---------------- T (2.6.32-131.0.15.el6.i686 #1) 
EIP: 0061:[<c040593b>] EFLAGS: 00010282 CPU: 1
EIP is at pin_pagetable_pfn+0x3b/0x50
EAX: ffffffea EBX: e7831eb8 ECX: 00000001 EDX: 00000000
ESI: 00007ff0 EDI: e5960230 EBP: c20d3948 ESP: e7831eb8
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Process file (pid: 17394, ti=e7830000 task=e9174ab0 task.ti=e7830000)
Stack:
 00000000 002e985b c0405ab3 c20d3900 00028198 e5960230 c04fdcf5 e5af3000
<0> e5960230 08c0859c c204ce64 c20d3900 c050265d 08c0859c 00000000 00000000
<0> 00000000 c20d3900 00000009 00000000 c204ce64 00000009 08c0859c c20d3900
Call Trace:
 [<c0405ab3>] ? xen_alloc_ptpage+0xa3/0xf0
 [<c04fdcf5>] ? __pte_alloc+0x65/0xd0
 [<c050265d>] ? handle_mm_fault+0x19d/0x1d0
 [<c043293b>] ? __do_page_fault+0xfb/0x420
 [<c05073de>] ? do_brk+0x23e/0x330
 [<c082777a>] ? do_page_fault+0x2a/0x90
 [<c0827750>] ? do_page_fault+0x0/0x90
 [<c08251c7>] ? error_code+0x73/0x78
 [<c0820000>] ? rcu_cpu_notify+0x4a/0x75
Code: 00 89 0c 24 75 0a e8 85 f9 ff ff 25 ff ff ff 7f 89 44 24 04 89 e3 b9 01 00 00 00 31 d2 be f0 7f 00 00 e8 09 ca ff ff 85 c0 74 04 <0f> 0b eb fe 83 c4 0c 5b 5e 5f c3 8d 76 00 8d bc 27 00 00 00 00 
EIP: [<c040593b>] pin_pagetable_pfn+0x3b/0x50 SS:ESP 0069:e7831eb8
---[ end trace 41a0c88bd81d9413 ]---
Kernel panic - not syncing: Fatal exception
Pid: 17394, comm: file Tainted: G      D    ---------------- T 2.6.32-131.0.15.el6.i686 #1
Call Trace:
 [<c0821fde>] ? panic+0x42/0xf9
 [<c0825ddc>] ? oops_end+0xbc/0xd0
 [<c040aa80>] ? do_invalid_op+0x0/0x90
 [<c040aaff>] ? do_invalid_op+0x7f/0x90
 [<c040593b>] ? pin_pagetable_pfn+0x3b/0x50
 [<c0407328>] ? xen_vcpuop_set_next_event+0x48/0x80
 [<c04ed724>] ? __alloc_pages_nodemask+0xf4/0x800
 [<c08251c7>] ? error_code+0x73/0x78
 [<c04300d8>] ? cache_k8_northbridges+0x18/0x100
 [<c040593b>] ? pin_pagetable_pfn+0x3b/0x50
 [<c0405ab3>] ? xen_alloc_ptpage+0xa3/0xf0
 [<c04fdcf5>] ? __pte_alloc+0x65/0xd0
 [<c050265d>] ? handle_mm_fault+0x19d/0x1d0
 [<c043293b>] ? __do_page_fault+0xfb/0x420
 [<c05073de>] ? do_brk+0x23e/0x330
 [<c082777a>] ? do_page_fault+0x2a/0x90
 [<c0827750>] ? do_page_fault+0x0/0x90
 [<c08251c7>] ? error_code+0x73/0x78
 [<c0820000>] ? rcu_cpu_notify+0x4a/0x75
==========================================================

and Xen complained on the console with these messages:
========================
(XEN) mm.c:2042:d46 Bad type (saw 00000000e8000006 != exp 0000000020000000) for mfn 2e985b (pfn 28198)
(XEN) mm.c:2375:d46 Error while pinning mfn 2e985b
========================

Comment 21 Igor Mammedov 2011-08-17 09:48:18 UTC
(In reply to comment #19)
> (In reply to comment #18)
> > Yes, It looks like errors in "vmalloc: eagerly clear ptes on vunmap" commit
> > message.
> > Have you any chance to test the patched kernel from comment 13?
> 
> I was trying your patched kernel on my test box yesterday, attempting to crash
> it and didn't succeed, though I am not sure how reliable my method of
> reproducing the crash is so I might just have been lucky.
> I am going to go back to the regular kernel and will do some more testing so I
> can get a better idea of how significant the lack of crash yesterday is.

Could you try to reproduce the bug as per comment 20?
It crashed the guest within several minutes for me.

PS:
The NFS share used for the test has several kernel trees on it.

Comment 22 Michael Young 2011-08-17 14:48:58 UTC
I added that to my existing testing of the regular kernel and it crashed quite quickly, though with a backtrace that was different again (see below); it looks to be the same underlying bug.

I repeated the same test with the patched kernel and it hasn't crashed, so I think the bug is fixed in the test kernel.

The backtrace (with a bit missing from the start) I got from this crash was

andle kernel paging request at ffff8800040bc5e0
IP: [<ffffffff81045845>] ptep_set_access_flags+0x55/0x70
PGD 1a26067 PUD 1a2a067 PMD 57bc067 PTE 80100000040bc065
Oops: 0003 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-2/range
CPU 1
Modules linked in: autofs4 nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]

Modules linked in: autofs4 nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
Pid: 11447, comm: sh Tainted: G           ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
RIP: e030:[<ffffffff81045845>]  [<ffffffff81045845>] ptep_set_access_flags+0x55/0x70
RSP: e02b:ffff8800a1a47b38  EFLAGS: 00010202
RAX: 80000001077fe145 RBX: ffff8801c716e148 RCX: 8000000f13359167
RDX: ffff8800040bc5e0 RSI: 00007f7bd5abc9d0 RDI: ffff8801c716e148
RBP: ffff8800a1a47b58 R08: 0000000000000001 R09: e400000000000000
R10: 0000000000000000 R11: 0000000000000098 R12: 00007f7bd5abc9d0
R13: 0000000000000001 R14: 0000000000000008 R15: ffffea00000e2930
FS:  00007f7bd5abc700(0000) GS:ffff88002806d000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff8800040bc5e0 CR3: 00000000043b6000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sh (pid: 11447, threadinfo ffff8800a1a46000, task ffff8801f4d70080)
Stack:
 ffff8801c716e148 0000000000000000 ffffea00039a3f90 0000000000000008
<0> ffff8800a1a47bf8 ffffffff81136c7b ffff8801ffffffff 0037f414ab99fe20
<0> 0000000000000001 ffff8800fef99568 0000000000000000 ffff8800040bc5e0
Call Trace:
 [<ffffffff81136c7b>] do_wp_page+0x44b/0x8d0
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff811378dd>] handle_pte_fault+0x2cd/0xb50
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81138338>] handle_mm_fault+0x1d8/0x2c0
 [<ffffffff810414e9>] __do_page_fault+0x139/0x480
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81007c62>] ? check_events+0x12/0x20
 [<ffffffff81006033>] ? __xen_write_cr3+0x123/0x170
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8100644f>] ? xen_write_cr3+0x8f/0xc0
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff814e054e>] do_page_fault+0x3e/0xa0
 [<ffffffff814dd8d5>] page_fault+0x25/0x30
 [<ffffffff8126e3fd>] ? __put_user_4+0x1d/0x30
 [<ffffffff8105fc34>] ? schedule_tail+0x64/0xb0
 [<ffffffff8100b073>] ret_from_fork+0x13/0x80
Code: 89 f4 41 0f 95 c5 45 85 c0 75 1b 44 89 e8 48 8b 1c 24 4c 8b 64 24 08 4c 8b 6c 24 10 4c 8b 74 24 18 c9 c3 0f 1f 00 45 85 ed 74 e0 <48> 89 0a 48 8b 3f 0f 1f 80 00 00 00 00 4c 89 e6 48 89 df e8 13 ...
RIP  [<ffffffff81045845>] ptep_set_access_flags+0x55/0x70
 RSP <ffff8800a1a47b38>
CR2: ffff8800040bc5e0
---[ end trace 8823d2c63163302c ]---
Kernel panic - not syncing: Fatal exception
Pid: 11447, comm: sh Tainted: G      D    ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
Call Trace:
 [<ffffffff814da518>] ? panic+0x78/0x143
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff814dd41c>] ? _spin_unlock_irqrestore+0x1c/0x20
 [<ffffffff814de564>] ? oops_end+0xe4/0x100
 [<ffffffff81040c9b>] ? no_context+0xfb/0x260
 [<ffffffff81040f25>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81040ff3>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff810416cd>] ? __do_page_fault+0x31d/0x480
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8111e8e1>] ? get_page_from_freelist+0x3d1/0x820
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8111e7ee>] ? get_page_from_freelist+0x2de/0x820
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81007c62>] ? check_events+0x12/0x20
 [<ffffffff814e054e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff814dd8d5>] ? page_fault+0x25/0x30
 [<ffffffff81045845>] ? ptep_set_access_flags+0x55/0x70
 [<ffffffff81136c7b>] ? do_wp_page+0x44b/0x8d0
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff811378dd>] ? handle_pte_fault+0x2cd/0xb50
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81138338>] ? handle_mm_fault+0x1d8/0x2c0
 [<ffffffff810414e9>] ? __do_page_fault+0x139/0x480
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81007c62>] ? check_events+0x12/0x20
 [<ffffffff81006033>] ? __xen_write_cr3+0x123/0x170
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8100644f>] ? xen_write_cr3+0x8f/0xc0
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff814e054e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff814dd8d5>] ? page_fault+0x25/0x30
 [<ffffffff8126e3fd>] ? __put_user_4+0x1d/0x30
 [<ffffffff8105fc34>] ? schedule_tail+0x64/0xb0
 [<ffffffff8100b073>] ? ret_from_fork+0x13/0x80




(XEN) traps.c:2282:d54 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d54 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d54 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) printk: 5 messages suppressed.
(XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn f1bc11 (pfn fef46)
(XEN) mm.c:2708:d54 Error while pinning mfn f1bc11
(XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn d146fd (pfn c645a)
(XEN) mm.c:2708:d54 Error while pinning mfn d146fd
(XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn d336fe (pfn a7459)
(XEN) mm.c:2708:d54 Error while pinning mfn d336fe
(XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn d14589 (pfn c65ce)
(XEN) mm.c:2708:d54 Error while pinning mfn d14589
(XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn f58a9b (pfn 40bc)
(XEN) mm.c:2708:d54 Error while pinning mfn f58a9b
(XEN) printk: 37 messages suppressed.

Comment 23 Igor Mammedov 2011-08-19 11:41:53 UTC
Created attachment 519022 [details]
vmalloc: eagerly clear ptes on vunmap

Comment 25 RHEL Product and Program Management 2011-08-24 10:09:56 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 27 Aristeu Rozanski 2011-08-31 14:25:12 UTC
Patch(es) available on kernel-2.6.32-193.el6

Comment 31 errata-xmlrpc 2011-12-06 14:04:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html

