466932 – The 2.6.9 kernel does not handle spurious page faults.

Status:	CLOSED DUPLICATE of bug 465914
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel-xen
Version:	4.7
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Xen Maintainance List
QA Contact:	Martin Jenner

Reported:	2008-10-14 15:43 UTC by Ian Campbell
Modified:	2008-10-14 16:01 UTC
CC List:	1 user
Doc Type:	Bug Fix
Last Closed:	2008-10-14 16:01:20 UTC

Attachments
xen-unstable.hg 10425:533bad7c0883 ported to 2.6.9-78.EL (3.97 KB, patch) 2008-10-14 15:47 UTC, Ian Campbell

Description Ian Campbell 2008-10-14 15:43:54 UTC

Description of problem:

The 2.6.9 kernel does not handle spurious page faults. This can result in a kernel oops such as:
---- cut ----
Oops: 0003 [#1]
SMP 
Modules linked in: dm_snapshot dm_mirror dm_zero dm_mod ext3 jbd msdos raid6 raid5 xor raid1 raid0 xenblk xennet \
sr_mod sd_mod scsi_mod cdrom loop nfs nfs_acl lockd sunrpc vfat fat cramfs
CPU:    0
EIP:    0061:[<c01122f7>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-67.ELxenU) 
EIP is at pgd_free+0x146/0x183
eax: 00000000   ebx: dd1f9000   ecx: 00000400   edx: 80000004
esi: 00000000   edi: dd1f9000   ebp: 00000003   esp: c6710f64
ds: 007b   es: 007b   ss: 0068
Process hardlink (pid: 3912, threadinfo=c6710000 task=ebae21b0)
Stack: c014ece5 1d1f9001 00000000 ec6ae840 ec6ae840 ebae21b0 00000000 c011a26e 
       dd18d000 ebae2700 c011e0c9 ec6ae840 00000001 ec5fe140 00000000 c6710000 
       c6710000 c011e3b5 00000000 00000000 00000000 401426dc c6710000 c010734f 
Call Trace:
 [<c014ece5>] exit_mmap+0x151/0x15b
 [<c011a26e>] __mmdrop+0x1a/0x33
 [<c011e0c9>] do_exit+0x1f4/0x3ec
 [<c011e3b5>] sys_exit_group+0x0/0x11
 [<c010734f>] syscall_call+0x7/0xb
Code: f0 09 df 83 c8 01 89 44 24 04 89 7c 24 08 8b 5c 24 04 6a 00 81 eb 01 00 00 40 89 df 53 e8 57 01 00 00 59 31 c0 b9 00 04 00 00 5e <f3> \
ab 53 ff 35 44 f1 35 c0 e8 4b 14 03 00 80 3d 04 37 2f c0 00 
 <0>Fatal exception: panic in 5 seconds

Kernel panic - not syncing: Fatal exception

---- cut ----

A spurious page fault can occur when a page's permissions are expanded (i.e. RO->RW or NX->X). If the TLB still holds a stale entry with the old, narrower permissions, the processor is allowed to fault on the next access without re-walking the page table.
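To make the check concrete, here is a minimal user-space sketch in C of the kind of test a fault handler can apply. It is not the attached patch or either upstream change: the helper name fault_is_spurious() and the macro names are hypothetical, and it takes the final page table entry as a plain integer instead of walking real page tables. Only the bit positions (x86 page fault error code and PTE flags) are architectural. It also ignores the user/supervisor bit, so treat it as a simplified model of the kernel-mode case.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits (architectural positions). */
#define PF_PROT   (1u << 0)   /* set: page was present, protection violation */
#define PF_WRITE  (1u << 1)   /* set: faulting access was a write */
#define PF_RSVD   (1u << 3)   /* set: reserved bit found in a paging entry */
#define PF_INSTR  (1u << 4)   /* set: faulting access was an instruction fetch */

/* x86 page table entry flag bits (architectural positions). */
#define PTE_PRESENT (1ull << 0)
#define PTE_RW      (1ull << 1)
#define PTE_NX      (1ull << 63)

/*
 * Hypothetical helper: decide whether a fault looks spurious, i.e. the
 * current PTE already permits the access that faulted, so the fault can
 * only have come from a stale TLB entry. A real handler would obtain
 * `pte` by walking the page tables for the faulting address.
 */
static bool fault_is_spurious(uint32_t error_code, uint64_t pte)
{
    /* Reserved-bit faults indicate corrupt tables, never a stale TLB. */
    if (error_code & PF_RSVD)
        return false;

    /* Only protection faults on a present page are considered here. */
    if (!(error_code & PF_PROT))
        return false;

    /* The mapping must still be present in the current page table. */
    if (!(pte & PTE_PRESENT))
        return false;

    /* A write is a genuine fault if the PTE is still read-only. */
    if ((error_code & PF_WRITE) && !(pte & PTE_RW))
        return false;

    /* An instruction fetch is a genuine fault if the PTE is still NX. */
    if ((error_code & PF_INSTR) && (pte & PTE_NX))
        return false;

    /* The current PTE allows the access: retry instead of oopsing. */
    return true;
}
```

The error code 0003 in the oops above decodes as PF_PROT | PF_WRITE, a write to a present page. In this model, fault_is_spurious(0x0003, pte) returns true once the PTE has been made writable, in which case the handler could simply return and let the access be retried.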

These permission transitions are particularly common under Xen because pages frequently change between read-only and read-write, for example when a page-table page is reused. I think it is theoretically possible to trigger a similar issue on native hardware, but I don't know offhand how (possibly via one of the page-debugging CONFIG options or by messing with mprotect?).

Intel's Nehalem processors seem to expose this issue much more frequently than previous processors.

This issue was fixed in the upstream Xen kernel by 
http://xenbits.xensource.com/xen-unstable.hg?rev/533bad7c0883 and in Linus upstream by
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5b727a3b0158a129827c21ce3bfb0ba997e8ddd0

Version-Release number of selected component (if applicable):

2.6.9-78.EL

How reproducible:

On Nehalem hardware the kernel runs the RHEL 4.7 installer for only a few minutes before crashing with the above Oops.

Comment 1 Ian Campbell 2008-10-14 15:47:28 UTC

Created attachment 320313
xen-unstable.hg 10425:533bad7c0883 ported to 2.6.9-78.EL

Attaching a backport of the upstream patch to 2.6.9-78.EL. Tested on 32-bit but only compile-tested on 64-bit.

Comment 2 Chris Lalancette 2008-10-14 16:01:20 UTC

Ian,
    Yep, we ran into exactly the same problem here.  I have a patch in 465914 that is very similar to yours; I'm going to close this out as a dup.

Chris Lalancette

*** This bug has been marked as a duplicate of bug 465914 ***
