Description of problem:

The 2.6.9 kernel does not handle spurious page faults. This can result in a kernel oops such as:

---- cut ----
Oops: 0003 [#1]
SMP
Modules linked in: dm_snapshot dm_mirror dm_zero dm_mod ext3 jbd msdos raid6 raid5 xor raid1 raid0 xenblk xennet \
  sr_mod sd_mod scsi_mod cdrom loop nfs nfs_acl lockd sunrpc vfat fat cramfs
CPU:    0
EIP:    0061:[<c01122f7>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-67.ELxenU)
EIP is at pgd_free+0x146/0x183
eax: 00000000   ebx: dd1f9000   ecx: 00000400   edx: 80000004
esi: 00000000   edi: dd1f9000   ebp: 00000003   esp: c6710f64
ds: 007b   es: 007b   ss: 0068
Process hardlink (pid: 3912, threadinfo=c6710000 task=ebae21b0)
Stack: c014ece5 1d1f9001 00000000 ec6ae840 ec6ae840 ebae21b0 00000000 c011a26e
       dd18d000 ebae2700 c011e0c9 ec6ae840 00000001 ec5fe140 00000000 c6710000
       c6710000 c011e3b5 00000000 00000000 00000000 401426dc c6710000 c010734f
Call Trace:
 [<c014ece5>] exit_mmap+0x151/0x15b
 [<c011a26e>] __mmdrop+0x1a/0x33
 [<c011e0c9>] do_exit+0x1f4/0x3ec
 [<c011e3b5>] sys_exit_group+0x0/0x11
 [<c010734f>] syscall_call+0x7/0xb
Code: f0 09 df 83 c8 01 89 44 24 04 89 7c 24 08 8b 5c 24 04 6a 00 81 eb 01 00 00 40 89 df 53 e8 57 01 00 00 59 31 c0 b9 00 04 00 00 5e <f3> \
  ab 53 ff 35 44 f1 35 c0 e8 4b 14 03 00 80 3d 04 37 2f c0 00
 <0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception
---- cut ----

A spurious page fault can occur when a page's permissions are expanded (i.e. RO->RW or NX->X). If the TLB contains a stale entry then the processor is allowed to fault on the next access without re-walking the page table.

These permission transitions are particularly common under Xen because pages are frequently changing between read-only and read-write for example when a page table page is reused. I think it is theoretically possible to cause a similar issue on native but I don't know offhand how (possibly one of the page debugging CONFIG options or messing with mprotect?).
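
To make the condition concrete, here is a minimal sketch of the check a fault handler would need. It is a simplified userspace model and not the code from either of the upstream changesets: `struct model_pte` and `fault_is_spurious` are hypothetical names made up for illustration, and a real handler would walk the page tables for the faulting address instead of being handed a flattened entry. Only the error-code bit positions are architectural (the "Oops: 0003" above decodes to a write to a present page).

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits (architectural values). */
#define PF_PROT   (1u << 0)  /* fault hit a present page (protection violation) */
#define PF_WRITE  (1u << 1)  /* faulting access was a write */
#define PF_RSVD   (1u << 3)  /* reserved bit set in a paging structure */
#define PF_INSTR  (1u << 4)  /* faulting access was an instruction fetch */

/*
 * Hypothetical, flattened view of the current page table entry for the
 * faulting address. Assumption for illustration only; the kernel uses
 * pte_t and accessor helpers instead.
 */
struct model_pte {
    bool present;
    bool writable;
    bool executable;
};

/*
 * Return true when the page table already grants the access that
 * faulted. In that case the fault can only have come from a stale TLB
 * entry holding the old, narrower permissions, so it is safe to return
 * and retry the access instead of treating it as a kernel bug.
 */
static bool fault_is_spurious(uint32_t error_code, const struct model_pte *pte)
{
    /* A reserved-bit violation is never caused by stale permissions. */
    if (error_code & PF_RSVD)
        return false;

    /* Only RO->RW and NX->X widenings matter: writes and fetches. */
    if (!(error_code & (PF_WRITE | PF_INSTR)))
        return false;

    /* A missing mapping is a genuine fault. */
    if (!pte->present)
        return false;

    /* The current entry must really allow the faulting access. */
    if ((error_code & PF_WRITE) && !pte->writable)
        return false;
    if ((error_code & PF_INSTR) && !pte->executable)
        return false;

    return true;
}
```

With the error code from the oops above, `fault_is_spurious(PF_PROT | PF_WRITE, &pte)` returns true whenever the modeled entry is present and writable, and false if the entry is still read-only, in which case the fault is real.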
Intel's Nehalem processors seem to expose this issue much more frequently than previous processors.

This issue was fixed in the upstream Xen kernel by
http://xenbits.xensource.com/xen-unstable.hg?rev/533bad7c0883
and in Linus upstream by
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5b727a3b0158a129827c21ce3bfb0ba997e8ddd0

Version-Release number of selected component (if applicable):
2.6.9-78.EL

How reproducible:
On Nehalem hardware the kernel runs the RHEL 4.7 installer for only a few minutes before crashing with the above Oops.

Created attachment 320313
xen-unstable.hg 10425:533bad7c0883 ported to 2.6.9-78.EL

Attaching a backport of the upstream patch to 2.6.9-78.EL. Tested on 32-bit, but only compile-tested on 64-bit.

Ian,
     Yep, we ran into exactly the same problem here. I have a patch in 465914 that is very similar to yours; I'm going to close this out as a dup.

Chris Lalancette

*** This bug has been marked as a duplicate of bug 465914 ***