Description of problem:
The 2.6.9 kernel does not handle spurious page faults. This can result in a kernel oops such as:
---- cut ----
Oops: 0003 [#1]
Modules linked in: dm_snapshot dm_mirror dm_zero dm_mod ext3 jbd msdos raid6 raid5 xor raid1 raid0 xenblk xennet \
sr_mod sd_mod scsi_mod cdrom loop nfs nfs_acl lockd sunrpc vfat fat cramfs
EIP: 0061:[<c01122f7>] Not tainted VLI
EFLAGS: 00010246 (2.6.9-67.ELxenU)
EIP is at pgd_free+0x146/0x183
eax: 00000000 ebx: dd1f9000 ecx: 00000400 edx: 80000004
esi: 00000000 edi: dd1f9000 ebp: 00000003 esp: c6710f64
ds: 007b es: 007b ss: 0068
Process hardlink (pid: 3912, threadinfo=c6710000 task=ebae21b0)
Stack: c014ece5 1d1f9001 00000000 ec6ae840 ec6ae840 ebae21b0 00000000 c011a26e
dd18d000 ebae2700 c011e0c9 ec6ae840 00000001 ec5fe140 00000000 c6710000
c6710000 c011e3b5 00000000 00000000 00000000 401426dc c6710000 c010734f
Code: f0 09 df 83 c8 01 89 44 24 04 89 7c 24 08 8b 5c 24 04 6a 00 81 eb 01 00 00 40 89 df 53 e8 57 01 00 00 59 31 c0 b9 00 04 00 00 5e <f3> \
ab 53 ff 35 44 f1 35 c0 e8 4b 14 03 00 80 3d 04 37 2f c0 00
<0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception
---- cut ----
A spurious page fault can occur when a page's permissions are expanded (e.g. RO->RW or NX->X). If the TLB still contains a stale entry with the old, more restrictive permissions, the processor is allowed to fault on the next access without re-walking the page table.
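The check a fault handler needs for this case can be sketched roughly as below. This is a simplified userspace model for illustration, not the code from the patch: the function name is_spurious_fault, the PTE_* flags and the PTE_NX bit position are assumptions made up for the sketch, and a real handler would walk the live page tables rather than take the PTE as an argument. Only the PF_* error code bits follow the architectural x86 layout. Note that the error code 0003 in the Oops above is PF_PROT | PF_WRITE, i.e. a write protection fault on a present page.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits (architectural layout). */
#define PF_PROT  (1u << 0)  /* 0: page not present, 1: protection violation */
#define PF_WRITE (1u << 1)  /* faulting access was a write */
#define PF_INSTR (1u << 4)  /* faulting access was an instruction fetch */

/* Toy PTE permission flags; hypothetical, NOT the real PTE layout. */
#define PTE_PRESENT (1u << 0)
#define PTE_RW      (1u << 1)
#define PTE_NX      (1u << 2)  /* illustrative bit position only */

/*
 * Return true if the current PTE already permits the access that faulted.
 * In that case the fault was caused by a stale TLB entry and the access
 * can simply be retried instead of being treated as a kernel bug.
 */
static bool is_spurious_fault(uint32_t error_code, uint32_t pte)
{
    /* A not-present fault is never spurious in this sense. */
    if (!(error_code & PF_PROT))
        return false;

    /* Permissions only expand RO->RW or NX->X, so only writes and
     * instruction fetches can hit a stale, more restrictive entry. */
    if (!(error_code & (PF_WRITE | PF_INSTR)))
        return false;

    if (!(pte & PTE_PRESENT))
        return false;

    /* The write is still forbidden: this is a genuine fault. */
    if ((error_code & PF_WRITE) && !(pte & PTE_RW))
        return false;

    /* The fetch is still forbidden: this is a genuine fault. */
    if ((error_code & PF_INSTR) && (pte & PTE_NX))
        return false;

    return true;
}
```

With this model, is_spurious_fault(0x3, PTE_PRESENT | PTE_RW) is true (the page is already writable, so the write can be retried), while is_spurious_fault(0x3, PTE_PRESENT) is false because the page really is read-only.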
These permission transitions are particularly common under Xen because pages frequently change between read-only and read-write, for example when a page table page is reused. I think it is theoretically possible to cause a similar issue on native hardware, but I don't know offhand how (possibly via one of the page debugging CONFIG options or by messing with mprotect?).
Intel's Nehalem processors seem to expose this issue much more frequently than previous processors.
This issue was fixed in the upstream Xen kernel by
http://xenbits.xensource.com/xen-unstable.hg?rev/533bad7c0883 and in Linus upstream by
Version-Release number of selected component (if applicable):
On Nehalem hardware the kernel runs the RHEL 4.7 installer for only a few minutes before crashing with the above Oops.
Created attachment 320313
xen-unstable.hg 10425:533bad7c0883 ported to 2.6.9-78.EL
Attaching a backport of the upstream patch to 2.6.9-78.EL. Tested on 32-bit, but only compile-tested on 64-bit.
Yep, we ran into exactly the same problem here. I have a patch in bug 465914 that is very similar to yours, so I'm going to close this one out as a dup.
*** This bug has been marked as a duplicate of bug 465914 ***