Bug 730503 - RHEL 6.1 xen guest crashes with kernel BUG at arch/x86/xen/mmu.c:1457!
Summary: RHEL 6.1 xen guest crashes with kernel BUG at arch/x86/xen/mmu.c:1457!
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Igor Mammedov
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 653816
 
Reported: 2011-08-13 20:37 UTC by Michael Young
Modified: 2017-02-06 14:52 UTC
CC List: 7 users

Fixed In Version: kernel-2.6.32-193.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-12-06 14:04:14 UTC


Attachments
vmalloc: eagerly clear ptes on vunmap (7.88 KB, patch)
2011-08-19 11:41 UTC, Igor Mammedov
no flags


Links
System ID: Red Hat Product Errata RHSA-2011:1530
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Enterprise Linux 6 kernel security, bug fix and enhancement update
Last Updated: 2011-12-06 01:45:35 UTC

Description Michael Young 2011-08-13 20:37:06 UTC
I am running some RHEL 6.1 xen (pv) guests on Citrix XenServer (5.6sp2) and they are crashing occasionally. I managed to get the backtrace below for the most recent crash with the 2.6.32-131.6.1.el6.x86_64 kernel. The server was running a couple of java processes, and doing some virus scanning of an NFS-mounted file system when it crashed.

kernel BUG at arch/x86/xen/mmu.c:1457!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/devices/virtual/block/dm-2/range
CPU 1 
Modules linked in: autofs4 nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]

Modules linked in: autofs4 nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
Pid: 14844, comm: java Tainted: G           ---------------- T 2.6.32-131.6.1.el6.x86_64 #1 
RIP: e030:[<ffffffff81005b2f>]  [<ffffffff81005b2f>] pin_pagetable_pfn+0x4f/0x60
RSP: e02b:ffff880129f5bd08  EFLAGS: 00010282
RAX: 00000000ffffffea RBX: 000000000012354c RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880129f5bd08
RBP: ffff880129f5bd28 R08: 00003ffffffff000 R09: ffff880000000000
R10: 0000000000007ff0 R11: 00007fab8fe09530 R12: 0000000000000003
R13: ffff88008fc63748 R14: ffff8801f4233a88 R15: ffff88010a471a20
FS:  00007fab8e29d700(0000) GS:ffff88002806d000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000dd202370 CR3: 0000000107d2b000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process java (pid: 14844, threadinfo ffff880129f5a000, task fff...)
Stack:
 0000000000000000 0000000000ebbe27 ffff8801f4233a88 000000000012354c
<0> ffff880129f5bd48 ffffffff81005cb9 ffff8801f4233a00 000000000012354c
<0> ffff880129f5bd58 ffffffff81005d13 ffff880129f5bda8 ffffffff811334ec
Call Trace:
 [<ffffffff81005cb9>] xen_alloc_ptpage+0x99/0xa0
 [<ffffffff81005d13>] xen_alloc_pte+0x13/0x20
 [<ffffffff811334ec>] __pte_alloc+0x8c/0x160
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff811382a9>] handle_mm_fault+0x149/0x2c0
 [<ffffffff810414e9>] __do_page_fault+0x139/0x480
 [<ffffffff8105f73c>] ? pick_next_task_fair+0xec/0x120
 [<ffffffff814e054e>] do_page_fault+0x3e/0xa0
 [<ffffffff814dd8d5>] page_fault+0x25/0x30
Code: 48 ba ff ff ff 7f ff ff ff ff 48 21 d0 48 89 45 e8 48 8d 7d e0 be 01 00 00 00 31 d2 41 ba f0 7f 00 00 e8 15 b8 ff ff 85 c0 74 04 <0f> 0b eb fe c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 
RIP  [<ffffffff81005b2f>] pin_pagetable_pfn+0x4f/0x60
 RSP <ffff880129f5bd08>
---[ end trace a05e463aa63b231d ]---
Kernel panic - not syncing: Fatal exception
Pid: 14844, comm: java Tainted: G      D    ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
Call Trace:
 [<ffffffff814da518>] ? panic+0x78/0x143
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff814dd41c>] ? _spin_unlock_irqrestore+0x1c/0x20
 [<ffffffff814de564>] ? oops_end+0xe4/0x100
 [<ffffffff8100f2eb>] ? die+0x5b/0x90
 [<ffffffff814dde34>] ? do_trap+0xc4/0x160
 [<ffffffff8100ceb5>] ? do_invalid_op+0x95/0xb0
 [<ffffffff81005b2f>] ? pin_pagetable_pfn+0x4f/0x60
 [<ffffffff8111fab1>] ? __alloc_pages_nodemask+0x111/0x8b0
 [<ffffffff8100bf5b>] ? invalid_op+0x1b/0x20
 [<ffffffff81005b2f>] ? pin_pagetable_pfn+0x4f/0x60
 [<ffffffff81005b2b>] ? pin_pagetable_pfn+0x4b/0x60
 [<ffffffff81005cb9>] ? xen_alloc_ptpage+0x99/0xa0
 [<ffffffff81005d13>] ? xen_alloc_pte+0x13/0x20
 [<ffffffff811334ec>] ? __pte_alloc+0x8c/0x160
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff811382a9>] ? handle_mm_fault+0x149/0x2c0
 [<ffffffff810414e9>] ? __do_page_fault+0x139/0x480
 [<ffffffff8105f73c>] ? pick_next_task_fair+0xec/0x120
 [<ffffffff814e054e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff814dd8d5>] ? page_fault+0x25/0x30

Comment 2 Andrew Jones 2011-08-14 16:34:59 UTC
1451 static void pin_pagetable_pfn(unsigned cmd, unsigned long pfn)
1452 {  
1453         struct mmuext_op op;
1454         op.cmd = cmd;
1455         op.arg1.mfn = pfn_to_mfn(pfn);
1456         if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
1457                 BUG();
1458 }

We need to try and figure out why XenServer 5.6sp2 is returning an error on this hypercall. Please also file a bug with Citrix.
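
In the meantime, a minimal diagnostic sketch (just an instrumentation idea, not a proposed fix) would be to log the failing pfn/mfn and the hypercall return code before hitting BUG(), so the guest-side failure can be correlated with whatever the hypervisor reports on its side:

static void pin_pagetable_pfn(unsigned cmd, unsigned long pfn)
{
        struct mmuext_op op;
        int rc;

        op.cmd = cmd;
        op.arg1.mfn = pfn_to_mfn(pfn);
        rc = HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF);
        if (rc) {
                /* report the pfn/mfn that the hypervisor refused to pin */
                printk(KERN_ERR "xen: mmuext_op cmd %u failed for pfn %lx (mfn %lx): rc=%d\n",
                       cmd, pfn, pfn_to_mfn(pfn), rc);
                BUG();
        }
}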

Comment 4 Paolo Bonzini 2011-08-14 16:41:08 UTC
Please attach the output of "xm dmesg" after a crash.  Thanks!

Comment 5 Michael Young 2011-08-14 17:31:11 UTC
It isn't xm in XenServer; however, xe host-dmesg host=servername gives the following for one guest session:

(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d46 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) mm.c:2319:d46 Bad type (saw e800000000000004 != exp 2000000000000000) for mfn 833d7e (pfn 1eb5f5)
(XEN) mm.c:2708:d46 Error while pinning mfn 833d7e
(XEN) grant_table.c:1408:d0 dest domain 46 dying
(XEN) grant_table.c:1408:d0 dest domain 46 dying

Comment 6 Andrew Jones 2011-08-14 21:02:10 UTC
This is interesting

> (XEN) mm.c:2319:d46 Bad type (saw e800000000000004 != exp 2000000000000000) for 
> mfn 833d7e (pfn 1eb5f5)
> (XEN) mm.c:2708:d46 Error while pinning mfn 833d7e

This thread looks like it could be discussing the same issue

http://lists.xensource.com/archives/html/xen-devel/2011-03/msg01367.html

Comment 7 Michael Young 2011-08-14 21:27:26 UTC
(In reply to comment #6)
> This is interesting
> 
> > (XEN) mm.c:2319:d46 Bad type (saw e800000000000004 != exp 2000000000000000) for 
> > mfn 833d7e (pfn 1eb5f5)
> > (XEN) mm.c:2708:d46 Error while pinning mfn 833d7e
> 
> This thread looks like it could be discussing the same issue
> 
> http://lists.xensource.com/archives/html/xen-devel/2011-03/msg01367.html

Except that is a crash on boot, whereas mine can stay up for a few days. There is also the long thread containing http://lists.xensource.com/archives/html/xen-devel/2011-01/msg00171.html though I think you might already have the patch for that.

Comment 8 Michael Young 2011-08-14 21:53:52 UTC
(In reply to comment #7)
> There is also the long thread containing
> http://lists.xensource.com/archives/html/xen-devel/2011-01/msg00171.html though
> I think you might already have the patch for that.

Actually I think I am wrong about you having the patch. http://lists.xensource.com/archives/html/xen-devel/2011-02/msg01293.html later in the thread mentions two patches so I think I was checking the wrong patch in my earlier comment.

Comment 11 Igor Mammedov 2011-08-15 14:34:02 UTC
Michael,

Would you care to test a new kernel with the patches mentioned in
http://lists.xensource.com/archives/html/xen-devel/2011-02/msg01293.html
if I built it for you?

Comment 12 Michael Young 2011-08-15 15:07:26 UTC
Yes, I will test it. I have just been creating a copy of the virtual server that was crashing for testing purposes.

Comment 13 Igor Mammedov 2011-08-16 05:44:24 UTC
A new kernel built with the "vmalloc: eagerly clear ptes on vunmap" patch:

http://people.redhat.com/imammedo/kernel-2.6.32-131.0.1.el6.test.x86_64.rpm

Please test.

Comment 14 anshockm 2011-08-16 06:47:24 UTC
(In reply to comment #13)
> A new kernel build with "vmalloc: eagerly clear ptes on vunmap" patch:
> 
> http://people.redhat.com/imammedo/kernel-2.6.32-131.0.1.el6.test.x86_64.rpm
> 
> Please test.

Perfect!
I could trigger the bug easily within 2 minutes yesterday, and with this new kernel the system has been running fine for 30 minutes now.

My environment before:
kernel 2.6.32-131.6.1.el6.x86_64 xen (pv) guests on Citrix XenServer (5.6sp2) running dovecot-2.0.9-2.el6.x86_64. Using this system as the back end for imapproxy on a SquirrelMail webmailer triggered the bug within 2 minutes.

Comment 15 Igor Mammedov 2011-08-16 07:26:08 UTC
(In reply to comment #14)
> 
> Perfect!
> I could trigger the bug easily within 2 minutes yesterday and with this new
> kernel the system is running fine for 30 min now.

Thanks for verifying.
 
Do you have any NFS-related I/O activity when the bug was reproduced?

Comment 16 anshockm 2011-08-16 07:42:00 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > 
> > Perfect!
> > I could trigger the bug easily within 2 minutes yesterday and with this new
> > kernel the system is running fine for 30 min now.
> 
> Thanks for verifying.
> 
> Do you have any NFS related io activity when bug was reproduced?

Yes, the dovecot mail directories are on NetApp file servers connected by NFS.

fs5.serv.uni-osnabrueck.de:/vol/MailStaff /mnt/fs5/MailStaff nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.17.17.216,mountvers=3,mountport=4046,mountproto=udp,local_lock=none,addr=172.17.17.216 0 0
fs4.serv.uni-osnabrueck.de:/vol/MailStudent /mnt/fs4/MailStudent nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.17.17.211,mountvers=3,mountport=4046,mountproto=udp,local_lock=none,addr=172.17.17.211 0 0


Here is my backtrace from yesterday:

kernel BUG at arch/x86/xen/mmu.c:1457!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/module/sunrpc/initstate
CPU 0 
Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc xenfs ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log microcode xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc xenfs ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log microcode xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
Pid: 3937, comm: auth Tainted: G           ---------------- T 2.6.32-131.6.1.el6.x86_64 #1 
RIP: e030:[<ffffffff81005b2f>]  [<ffffffff81005b2f>] pin_pagetable_pfn+0x4f/0x60
RSP: e02b:ffff8800f919fd08  EFLAGS: 00010282
RAX: 00000000ffffffea RBX: 00000000000f9f57 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800f919fd08
RBP: ffff8800f919fd28 R08: 00003ffffffff000 R09: ffff880000000000
R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000003
R13: ffff880003b187e0 R14: ffff8800f91d5788 R15: ffff8800f9f255f8
FS:  00007fc4205c3700(0000) GS:ffff88000d136000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fc41f9061b8 CR3: 0000000002c3a000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process auth (pid: 3937, threadinfo ffff8800f919e000, task ffff880003bac040)
Stack:
0000000000000000 000000000033a69d ffff8800f91d5788 00000000000f9f57
 ffff8800f919fd48 ffffffff81005cb9 ffff8800f91d5700 00000000000f9f57
 ffff8800f919fd58 ffffffff81005d13 ffff8800f919fda8 ffffffff811334ec
Call Trace:
[<ffffffff81005cb9>] xen_alloc_ptpage+0x99/0xa0
[<ffffffff81005d13>] xen_alloc_pte+0x13/0x20
[<ffffffff811334ec>] __pte_alloc+0x8c/0x160
[<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[<ffffffff811382a9>] handle_mm_fault+0x149/0x2c0
[<ffffffff810414e9>] __do_page_fault+0x139/0x480
[<ffffffff8113e1da>] ? do_mmap_pgoff+0x33a/0x380
[<ffffffff814e054e>] do_page_fault+0x3e/0xa0
[<ffffffff814dd8d5>] page_fault+0x25/0x30
Code: 48 ba ff ff ff 7f ff ff ff ff 48 21 d0 48 89 45 e8 48 8d 7d e0 be 01 00 00 00 31 d2 41 ba f0 7f 00 00 e8 15 b8 ff ff 85 c0 74 04 <0f> 0b eb fe c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 
RIP  [<ffffffff81005b2f>] pin_pagetable_pfn+0x4f/0x60
RSP <ffff8800f919fd08>

Comment 17 Michael Young 2011-08-16 09:44:11 UTC
I was trying to get my test machine to crash with the standard 2.6.32-131.6.1.el6.x86_64 kernel before testing the new kernel. It did crash, but in a different way; the backtrace is:

BUG: unable to handle kernel paging request at ffff8801ec2f2010
IP: [<ffffffff81006db8>] xen_set_pmd+0x38/0xb0
PGD 1a26067 PUD 5dcd067 PMD 5f2f067 PTE 80100001ec2f2065
Oops: 0003 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-2/dm/name
CPU 0
Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]

Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
Pid: 30736, comm: xe-update-guest Tainted: G           ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
RIP: e030:[<ffffffff81006db8>]  [<ffffffff81006db8>] xen_set_pmd+0x38/0xb0
RSP: e02b:ffff8801d7a41a78  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8801ec2f2010 RCX: ffff880000000000
RDX: ffffea0000000000 RSI: 0000000000000000 RDI: ffff8801ec2f2010
RBP: ffff8801d7a41a88 R08: 00000000018b2000 R09: 0000000000000000
R10: 0000000000000010 R11: 0000000000000000 R12: 0000000000000000
R13: 00000000008e5000 R14: ffff8801ec2f2010 R15: ffff88002805d520
FS:  00007f21a9817700(0000) GS:ffff88002804f000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff8801ec2f2010 CR3: 000000014b233000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process xe-update-guest (pid: 30736, threadinfo ffff8801d7a40000, task ffff8801f2f90b40)
Stack:
 0000000000600000 000000019baf4067 ffff8801d7a41b58 ffffffff81133d3f
<0> ffffffff81007c4f ffffffff8115b764 ffff8801d7a41bc8 ffffffffffffffff
<0> ffffffffffffffff 0000000000000000 0000000000000000 00000000008e4fff
Call Trace:
 [<ffffffff81133d3f>] free_pgd_range+0x25f/0x4b0
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8115b764>] ? kmem_cache_free+0xc4/0x2b0
 [<ffffffff8113405e>] free_pgtables+0xce/0x120
 [<ffffffff8113af90>] exit_mmap+0xb0/0x170
 [<ffffffff8106449c>] mmput+0x6c/0x120
 [<ffffffff81178ba9>] flush_old_exec+0x449/0x600
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff811ca095>] load_elf_binary+0x2b5/0x1bc0
 [<ffffffff81133261>] ? follow_page+0x321/0x460
 [<ffffffff8113852f>] ? __get_user_pages+0x10f/0x420
 [<ffffffff811c3aac>] ? load_misc_binary+0xac/0x3e0
 [<ffffffff8117a0fb>] search_binary_handler+0x10b/0x350
 [<ffffffff8117b289>] do_execve+0x239/0x310
 [<ffffffff8126e8ba>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff810095ca>] sys_execve+0x4a/0x80
 [<ffffffff8100b5ca>] stub_execve+0x6a/0xc0
Code: 89 64 24 08 0f 1f 44 00 00 80 3d 2f 94 d0 00 00 48 89 fb 49 89 f4 75 51 48 89 df 83 05 69 93 d0 00 01 e8 7c e4 ff ff 84 c0 75 18 <4c> 89 23 48 8b 1c 24 4c 8b 64 24 08 c9 c3 66 2e 0f 1f 84 00 00
RIP  [<ffffffff81006db8>] xen_set_pmd+0x38/0xb0
 RSP <ffff8801d7a41a78>
CR2: ffff8801ec2f2010
---[ end trace cb20c8b5bdd26af7 ]---
Kernel panic - not syncing: Fatal exception
Pid: 30736, comm: xe-update-guest Tainted: G      D    ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
Call Trace:
 [<ffffffff814da518>] ? panic+0x78/0x143
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff814dd41c>] ? _spin_unlock_irqrestore+0x1c/0x20
 [<ffffffff814de564>] ? oops_end+0xe4/0x100
 [<ffffffff81040c9b>] ? no_context+0xfb/0x260
 [<ffffffff81040f25>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81040ff3>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff810416cd>] ? __do_page_fault+0x31d/0x480
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81007c62>] ? check_events+0x12/0x20
 [<ffffffff8100c2fb>] ? xen_hypervisor_callback+0x1b/0x20
 [<ffffffff814ddb0a>] ? error_exit+0x2a/0x60
 [<ffffffff8100bb1d>] ? retint_res00140a>] ? hypercall_page+0x40a/0x1010
 [<ffffffff814e054e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff814dd8d5>] ? page_fault+0x25/0x30
 [<ffffffff81006db8>] ? xen_set_pmd+0x38/0xb0
 [<ffffffff81006db4>] ? xen_set_pmd+0x34/0xb0
 [<ffffffff81133d3f>] ? free_pgd_range+0x25f/0x4b0
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8115b764>] ? kmem_cache_free+0xc4/0x2b0
 [<ffffffff8113405e>] ? free_pgtables+0xce/0x120
 [<ffffffff8113af90>] ? exit_mmap+0xb0/0x170
 [<ffffffff8106449c>] ? mmput+0x6c/0x120
 [<ffffffff81178ba9>] ? flush_old_exec+0x449/0x600
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff811ca095>] ? load_elf_binary+0x2b5/0x1bc0
 [<ffffffff81133261>] ? follow_page+0x321/0x460
 [<ffffffff8113852f>] ? __get_user_pages+0x10f/0x420
 [<ffffffff811c3aac>] ? load_misc_binary+0xac/0x3e0
 [<ffffffff8117a0fb>] ? search_binary_handler+0x10b/0x350
 [<ffffffff8117b289>] ? do_execve+0x239/0x310
 [<ffffffff8126e8ba>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff810095ca>] ? sys_execve+0x4a/0x80
 [<ffffffff8100b5ca>] ? stub_execve+0x6a/0xc0

with the XenServer dmesg logs:
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d50 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) mm.c:2319:d50 Bad type (saw e800000000000016 != exp 2000000000000000) for mfn f662e7 (pfn 1f4872)
(XEN) mm.c:2708:d50 Error while pinning mfn f662e7
(XEN) mm.c:2319:d50 Bad type (saw e80000000000000e != exp 6000000000000000) for mfn f69522 (pfn 1f1637)
(XEN) mm.c:896:d50 Attempt to create linear p.t. with write perms
(XEN) mm.c:1441:d50 Failure in alloc_l4_table: entry 255
(XEN) mm.c:2071:d50 Error while validating mfn d0f926 (pfn 14b233) for type 8000000000000000: caf=8000000000000003 taf=8000000000000001
(XEN) mm.c:2708:d50 Error while pinning mfn d0f926
(XEN) mm.c:2319:d50 Bad type (saw e80000000000000e != exp 6000000000000000) for mfn f69522 (pfn 1f1637)
(XEN) mm.c:896:d50 Attempt to create linear p.t. with write perms
(XEN) mm.c:1441:d50 Failure in alloc_l4_table: entry 255
(XEN) printk: 52 messages suppressed.

This could be the same sort of thing, though, as it resembles the comments in
http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=ae333e97552c81ab10395ad1ffc6d6daaadb144a
which seems to be a later version of the "vmalloc: eagerly clear ptes on vunmap" patch and is mentioned in branches of the same thread I referred to above. That thread is rather messy, but I think the messages
http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00233.html
http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00742.html
http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00415.html
tie the above backtrace to the patch.
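
For context, the gist of that patch, as a very rough sketch only (the clear_kernel_ptes() helper below is made up; the real change lives in mm/vmalloc.c plus a Xen-specific hook): before the fix, vunmap left stale kernel PTEs in place until a lazy TLB flush, so a freed page could be handed back to Xen for pinning as a page table while it still had a writable kernel mapping, which the hypervisor rejects. The fix clears those PTEs eagerly at unmap time:

static void unmap_vmap_area_sketch(unsigned long addr, unsigned long size)
{
        /* clear_kernel_ptes() is a hypothetical stand-in for the pte
         * clearing the real patch does in mm/vmalloc.c; the point is
         * that it happens here, eagerly, instead of being deferred to
         * a lazy flush. */
        clear_kernel_ptes(addr, size);
        flush_tlb_kernel_range(addr, addr + size);
        /* only after this may the underlying pages be freed and reused,
         * e.g. as page tables that Xen will later be asked to pin */
}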

Comment 18 Igor Mammedov 2011-08-17 08:59:33 UTC
(In reply to comment #17)
> This could be the same sort of thing though, as it resembles the comments in
> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=ae333e97552c81ab10395ad1ffc6d6daaadb144a
> which seems to be a later version of the "vmalloc: eagerly clear ptes on
> vunmap" patch and is mentioned in the branches of the same thread I referred to
> above (which is rather messy but I think the messages
> http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00233.html
> http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00742.html
> http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00415.html
> tie the above backtrace to the patch).

Yes, it looks like the errors described in the "vmalloc: eagerly clear ptes on vunmap" commit message.
Have you had a chance to test the patched kernel from comment 13?

Comment 19 Michael Young 2011-08-17 09:24:39 UTC
(In reply to comment #18)
> Yes, It looks like errors in "vmalloc: eagerly clear ptes on vunmap" commit
> message.
> Have you any chance to test the patched kernel from comment 13?

I was trying your patched kernel on my test box yesterday, attempting to crash it, and didn't succeed, though I am not sure how reliable my method of reproducing the crash is, so I might just have been lucky.
I am going to go back to the regular kernel and do some more testing so I can get a better idea of how significant the lack of a crash yesterday is.

Comment 20 Igor Mammedov 2011-08-17 09:44:19 UTC
Managed to crash a 2.6.32-131.0.15.el6.i686 pv guest on a RHEL 5 host.

To reproduce it, I just did what the "vmalloc: eagerly clear ptes on vunmap" commit message suggested:

mount testbox:/test /mnt/nfs_share
find  /mnt/nfs_share -type f -print0 | xargs -0 file > /dev/null

Kernel OOPSed:
==========================================================
kernel BUG at arch/x86/xen/mmu.c:1457!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/module/sunrpc/initstate
Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mod [last unloaded: scsi_wait_scan]

Pid: 17394, comm: file Tainted: G           ---------------- T (2.6.32-131.0.15.el6.i686 #1) 
EIP: 0061:[<c040593b>] EFLAGS: 00010282 CPU: 1
EIP is at pin_pagetable_pfn+0x3b/0x50
EAX: ffffffea EBX: e7831eb8 ECX: 00000001 EDX: 00000000
ESI: 00007ff0 EDI: e5960230 EBP: c20d3948 ESP: e7831eb8
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Process file (pid: 17394, ti=e7830000 task=e9174ab0 task.ti=e7830000)
Stack:
 00000000 002e985b c0405ab3 c20d3900 00028198 e5960230 c04fdcf5 e5af3000
<0> e5960230 08c0859c c204ce64 c20d3900 c050265d 08c0859c 00000000 00000000
<0> 00000000 c20d3900 00000009 00000000 c204ce64 00000009 08c0859c c20d3900
Call Trace:
 [<c0405ab3>] ? xen_alloc_ptpage+0xa3/0xf0
 [<c04fdcf5>] ? __pte_alloc+0x65/0xd0
 [<c050265d>] ? handle_mm_fault+0x19d/0x1d0
 [<c043293b>] ? __do_page_fault+0xfb/0x420
 [<c05073de>] ? do_brk+0x23e/0x330
 [<c082777a>] ? do_page_fault+0x2a/0x90
 [<c0827750>] ? do_page_fault+0x0/0x90
 [<c08251c7>] ? error_code+0x73/0x78
 [<c0820000>] ? rcu_cpu_notify+0x4a/0x75
Code: 00 89 0c 24 75 0a e8 85 f9 ff ff 25 ff ff ff 7f 89 44 24 04 89 e3 b9 01 00 00 00 31 d2 be f0 7f 00 00 e8 09 ca ff ff 85 c0 74 04 <0f> 0b eb fe 83 c4 0c 5b 5e 5f c3 8d 76 00 8d bc 27 00 00 00 00 
EIP: [<c040593b>] pin_pagetable_pfn+0x3b/0x50 SS:ESP 0069:e7831eb8
---[ end trace 41a0c88bd81d9413 ]---
Kernel panic - not syncing: Fatal exception
Pid: 17394, comm: file Tainted: G      D    ---------------- T 2.6.32-131.0.15.el6.i686 #1
Call Trace:
 [<c0821fde>] ? panic+0x42/0xf9
 [<c0825ddc>] ? oops_end+0xbc/0xd0
 [<c040aa80>] ? do_invalid_op+0x0/0x90
 [<c040aaff>] ? do_invalid_op+0x7f/0x90
 [<c040593b>] ? pin_pagetable_pfn+0x3b/0x50
 [<c0407328>] ? xen_vcpuop_set_next_event+0x48/0x80
 [<c04ed724>] ? __alloc_pages_nodemask+0xf4/0x800
 [<c08251c7>] ? error_code+0x73/0x78
 [<c04300d8>] ? cache_k8_northbridges+0x18/0x100
 [<c040593b>] ? pin_pagetable_pfn+0x3b/0x50
 [<c0405ab3>] ? xen_alloc_ptpage+0xa3/0xf0
 [<c04fdcf5>] ? __pte_alloc+0x65/0xd0
 [<c050265d>] ? handle_mm_fault+0x19d/0x1d0
 [<c043293b>] ? __do_page_fault+0xfb/0x420
 [<c05073de>] ? do_brk+0x23e/0x330
 [<c082777a>] ? do_page_fault+0x2a/0x90
 [<c0827750>] ? do_page_fault+0x0/0x90
 [<c08251c7>] ? error_code+0x73/0x78
 [<c0820000>] ? rcu_cpu_notify+0x4a/0x75
==========================================================

and Xen complained on the console with these messages:
========================
(XEN) mm.c:2042:d46 Bad type (saw 00000000e8000006 != exp 0000000020000000) for mfn 2e985b (pfn 28198)
(XEN) mm.c:2375:d46 Error while pinning mfn 2e985b
========================

Comment 21 Igor Mammedov 2011-08-17 09:48:18 UTC
(In reply to comment #19)
> (In reply to comment #18)
> > Yes, It looks like errors in "vmalloc: eagerly clear ptes on vunmap" commit
> > message.
> > Have you any chance to test the patched kernel from comment 13?
> 
> I was trying your patched kernel on my test box yesterday, attempting to crash
> it and didn't succeed, though I am not sure how reliable my method of
> reproducing the crash is so I might just have been lucky.
> I am going to go back to the regular kernel and will do some more testing so I
> can get a better idea of how significant the lack of crash yesterday is.

Could you try to reproduce the bug as per comment 20?
It crashed the guest within several minutes for me.

PS:
The NFS share used for the test has several kernel trees on it.

Comment 22 Michael Young 2011-08-17 14:48:58 UTC
I added that to my existing testing of the regular kernel and it crashed quite quickly, though with a backtrace that was different again (see below); it looks to be the same underlying bug.

I repeated the same test with the patched kernel and it hasn't crashed, so I think the bug is fixed in the test kernel.

The backtrace (with a bit missing from the start) I got from this crash was

andle kernel paging request at ffff8800040bc5e0
IP: [<ffffffff81045845>] ptep_set_access_flags+0x55/0x70
PGD 1a26067 PUD 1a2a067 PMD 57bc067 PTE 80100000040bc065
Oops: 0003 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-2/range
CPU 1
Modules linked in: autofs4 nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]

Modules linked in: autofs4 nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc xenfs ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log microcode xen_netfront ext3 jbd mbcache xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
Pid: 11447, comm: sh Tainted: G           ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
RIP: e030:[<ffffffff81045845>]  [<ffffffff81045845>] ptep_set_access_flags+0x55/0x70
RSP: e02b:ffff8800a1a47b38  EFLAGS: 00010202
RAX: 80000001077fe145 RBX: ffff8801c716e148 RCX: 8000000f13359167
RDX: ffff8800040bc5e0 RSI: 00007f7bd5abc9d0 RDI: ffff8801c716e148
RBP: ffff8800a1a47b58 R08: 0000000000000001 R09: e400000000000000
R10: 0000000000000000 R11: 0000000000000098 R12: 00007f7bd5abc9d0
R13: 0000000000000001 R14: 0000000000000008 R15: ffffea00000e2930
FS:  00007f7bd5abc700(0000) GS:ffff88002806d000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff8800040bc5e0 CR3: 00000000043b6000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sh (pid: 11447, threadinfo ffff8800a1a46000, task ffff8801f4d70080)
Stack:
 ffff8801c716e148 0000000000000000 ffffea00039a3f90 0000000000000008
<0> ffff8800a1a47bf8 ffffffff81136c7b ffff8801ffffffff 0037f414ab99fe20
<0> 0000000000000001 ffff8800fef99568 0000000000000000 ffff8800040bc5e0
Call Trace:
 [<ffffffff81136c7b>] do_wp_page+0x44b/0x8d0
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff811378dd>] handle_pte_fault+0x2cd/0xb50
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81138338>] handle_mm_fault+0x1d8/0x2c0
 [<ffffffff810414e9>] __do_page_fault+0x139/0x480
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81007c62>] ? check_events+0x12/0x20
 [<ffffffff81006033>] ? __xen_write_cr3+0x123/0x170
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8100644f>] ? xen_write_cr3+0x8f/0xc0
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff814e054e>] do_page_fault+0x3e/0xa0
 [<ffffffff814dd8d5>] page_fault+0x25/0x30
 [<ffffffff8126e3fd>] ? __put_user_4+0x1d/0x30
 [<ffffffff8105fc34>] ? schedule_tail+0x64/0xb0
 [<ffffffff8100b073>] ret_from_fork+0x13/0x80
Code: 89 f4 41 0f 95 c5 45 85 c0 75 1b 44 89 e8 48 8b 1c 24 4c 8b 64 24 08 4c 8b 6c 24 10 4c 8b 74 24 18 c9 c3 0f 1f 00 45 85 ed 74 e0 <48> 89 0a 48 8b 3f 0f 1f 80 00 00 00 00 4c 89 e6 48 89 df e8 13 ...
RIP  [<ffffffff81045845>] ptep_set_access_flags+0x55/0x70
 RSP <ffff8800a1a47b38>
CR2: ffff8800040bc5e0
---[ end trace 8823d2c63163302c ]---
Kernel panic - not syncing: Fatal exception
Pid: 11447, comm: sh Tainted: G      D    ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
Call Trace:
 [<ffffffff814da518>] ? panic+0x78/0x143
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff814dd41c>] ? _spin_unlock_irqrestore+0x1c/0x20
 [<ffffffff814de564>] ? oops_end+0xe4/0x100
 [<ffffffff81040c9b>] ? no_context+0xfb/0x260
 [<ffffffff81040f25>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81040ff3>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff810416cd>] ? __do_page_fault+0x31d/0x480
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8111e8e1>] ? get_page_from_freelist+0x3d1/0x820
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8111e7ee>] ? get_page_from_freelist+0x2de/0x820
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81007c62>] ? check_events+0x12/0x20
 [<ffffffff814e054e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff814dd8d5>] ? page_fault+0x25/0x30
 [<ffffffff81045845>] ? ptep_set_access_flags+0x55/0x70
 [<ffffffff81136c7b>] ? do_wp_page+0x44b/0x8d0
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff811378dd>] ? handle_pte_fault+0x2cd/0xb50
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81004a39>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81138338>] ? handle_mm_fault+0x1d8/0x2c0
 [<ffffffff810414e9>] ? __do_page_fault+0x139/0x480
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81007c62>] ? check_events+0x12/0x20
 [<ffffffff81006033>] ? __xen_write_cr3+0x123/0x170
 [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8100644f>] ? xen_write_cr3+0x8f/0xc0
 [<ffffffff8100742d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff814e054e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff814dd8d5>] ? page_fault+0x25/0x30
 [<ffffffff8126e3fd>] ? __put_user_4+0x1d/0x30
 [<ffffffff8105fc34>] ? schedule_tail+0x64/0xb0
 [<ffffffff8100b073>] ? ret_from_fork+0x13/0x80




(XEN) traps.c:2282:d54 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d54 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) traps.c:2282:d54 Domain attempted WRMSR 000000000000008b from 00000013:00000000 to 00000000:00000000.
(XEN) printk: 5 messages suppressed.
(XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn f1bc11 (pfn fef46)
(XEN) mm.c:2708:d54 Error while pinning mfn f1bc11
(XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn d146fd (pfn c645a)
(XEN) mm.c:2708:d54 Error while pinning mfn d146fd
(XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn d336fe (pfn a7459)
(XEN) mm.c:2708:d54 Error while pinning mfn d336fe
(XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn d14589 (pfn c65ce)
(XEN) mm.c:2708:d54 Error while pinning mfn d14589
(XEN) mm.c:2319:d54 Bad type (saw e800000000000001 != exp 2000000000000000) for mfn f58a9b (pfn 40bc)
(XEN) mm.c:2708:d54 Error while pinning mfn f58a9b
(XEN) printk: 37 messages suppressed.

Comment 23 Igor Mammedov 2011-08-19 11:41:53 UTC
Created attachment 519022 [details]
vmalloc: eagerly clear ptes on vunmap

Comment 25 RHEL Product and Program Management 2011-08-24 10:09:56 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 27 Aristeu Rozanski 2011-08-31 14:25:12 UTC
Patch(es) available on kernel-2.6.32-193.el6

Comment 31 errata-xmlrpc 2011-12-06 14:04:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html

