140070 – (IT_50129) Kernel panic while using big pages

Summary: Kernel panic while using big pages

Keywords:
Status:	CLOSED WONTFIX
Alias:	IT_50129
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	2.1
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Larry Woodman
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-11-19 16:27 UTC by Peter Martuccelli
Modified:	2007-11-30 22:06 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-10-19 19:21:17 UTC
Target Upstream Version:
Embargoed:

Description Peter Martuccelli 2004-11-19 16:27:29 UTC

Under heavy load, the e49 kernel panicked in the bigpage code.
Attached is a partial oops (just screen scrapings; part of
it scrolled off the console and nothing got logged).
System is a Compaq ML530G2 with two 2.4GHz Xeons and
8GB of memory, with bigpages set at 2600MB at boot.
----------
Action by: phansen
Partial oops

Status set to: Waiting on Tech
File uploaded: e49oops-bigpage0.txt

----------
Action by: phansen
We had a couple of panics on Compaq ML530G2s (one on each of two
machines) with two 2.4GHz Xeons and 9GB of memory.

Had to copy these traces by hand from the screen, so please excuse any
errors.

File uploaded: victor-romeo.doc

----------
Action by: phansen
I've set up netdump to one of the machines involved in the crashes,
but the other has only SysKonnect NICs in it and netconsole does not
appear to start on these, which I seem to recall from the distant past.

I'm using netdump 0.6.11-2, which is the latest release for LAS2.1
(hope it supports cores > 4GB).

Aside from that, both machines have been running normally since last
Monday.  They know we're watching.


----------
Action by: phansen
Had another panic on romeo today; it started from the usual point in
include/asm/pgalloc.h:31 ("page idx 1502139 cannot be PSE page!"), so
nothing new here.  I can send the traceback from the screen if it can
help, but I got talking to the QA guy who was using the machine when it
panicked.  He showed me the logs from his test, and about a minute or so
before the panic he got an out of memory error from his query.  In
Netezza parlance, this means that we've exhausted the 2600MB shared
area.  This got me thinking: could this problem be caused by an overrun
from shared memory into the regular page area?  We tried to do this
with a user task and got a SIGSEGV back, but could the kernel write
beyond shared memory and cause the page alloc to blow up when the page
was touched?  Just a thought; let me know if you want the traceback.


----------
Action by: bfox
Yes, the traceback would be good to have.  Just to give you an update
of where we are currently, our test script has been running for about
a week and a half and we still haven't been able to reproduce the
problem.  We are allocating some more engineers for this ticket to try
and get a handle on what's going on.

I'm particularly interested in the out of memory error.  I have seen a
few instances of other RHEL 3 machines triggering the OOM killer even
when there is plenty of swap left.  I don't know that that's what's
happening here, but it's worth looking into.


----------
Action by: phansen
Here's the oops; sorry for the delay.  It looks something like the
9/26 victor panic, ending in call_spurious_interrupt.

File uploaded: romeo1116.doc

Comment 1 Peter Martuccelli 2004-11-19 17:46:00 UTC

Netezza is currently an all AS2.1 shop with no current plans to
migrate to RHEL3.  It was my mistake to refer to RHEL3 in my post on
11-17.  

So when can we expect an AS 2.1 debugkernel?

Larry - an update from the IT entry.  RHEL3 is not involved, straight
2.1.

Comment 4 Larry Woodman 2004-12-03 20:22:12 UTC

I think I found the cause of this panic: a sys_munlockall will
inadvertently unlock bigpage VMAs, and that will lead to this type of
corruption.

This patch will fix this problem:
-------------------------------------------------------------------
--- linux/mm/mlock.c.orig
+++ linux/mm/mlock.c
@@ -256,7 +256,7 @@ static int do_mlockall(int flags)
                unsigned int newflags;
 
                newflags = vma->vm_flags | VM_LOCKED;
-               if (!(flags & MCL_CURRENT))
+               if (!(flags & MCL_CURRENT) && !(vma->vm_flags & VM_BIGPAGE))
                        newflags &= ~VM_LOCKED;
                error = mlock_fixup(vma, vma->vm_start, vma->vm_end, newflags);
                if (error)
--------------------------------------------------------------------

I'll build a test kernel with this patch and make it available for
test purposes.

Larry Woodman
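
For readers following the patch in Comment 4, the flag computation it changes can be modelled in user space. This is only a sketch, not kernel code: it assumes munlockall reaches do_mlockall with flags == 0 (as the hunk header suggests), and the names VM_LOCKED_DEMO, VM_BIGPAGE_DEMO and MCL_CURRENT_DEMO are stand-ins invented for this example. In particular, the bit value used for VM_BIGPAGE_DEMO is hypothetical; the real value lives in the RHEL 2.1 kernel tree.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in constants for illustration only. VM_BIGPAGE_DEMO is a
 * hypothetical bit value, not taken from the RHEL 2.1 kernel headers. */
#define VM_LOCKED_DEMO   0x00002000u
#define VM_BIGPAGE_DEMO  0x40000000u
#define MCL_CURRENT_DEMO 1

/* Flag computation as it appears on the "-" side of the patch. */
unsigned int newflags_before(unsigned int vm_flags, int flags)
{
    unsigned int newflags = vm_flags | VM_LOCKED_DEMO;
    if (!(flags & MCL_CURRENT_DEMO))
        newflags &= ~VM_LOCKED_DEMO;
    return newflags;
}

/* Flag computation as it appears on the "+" side of the patch:
 * a VMA carrying the bigpage bit keeps its locked bit. */
unsigned int newflags_after(unsigned int vm_flags, int flags)
{
    unsigned int newflags = vm_flags | VM_LOCKED_DEMO;
    if (!(flags & MCL_CURRENT_DEMO) && !(vm_flags & VM_BIGPAGE_DEMO))
        newflags &= ~VM_LOCKED_DEMO;
    return newflags;
}

/* Walks through the unlock-all case (flags == 0) for a bigpage VMA
 * and for an ordinary locked VMA, printing the resulting flags. */
void check_model(void)
{
    unsigned int bigpage_vma = VM_LOCKED_DEMO | VM_BIGPAGE_DEMO;
    unsigned int normal_vma  = VM_LOCKED_DEMO;

    /* Before the patch the bigpage VMA loses its locked bit. */
    assert(!(newflags_before(bigpage_vma, 0) & VM_LOCKED_DEMO));
    /* After the patch the bigpage VMA stays locked. */
    assert(newflags_after(bigpage_vma, 0) & VM_LOCKED_DEMO);
    /* Ordinary VMAs are still unlocked either way. */
    assert(!(newflags_after(normal_vma, 0) & VM_LOCKED_DEMO));

    printf("bigpage VMA before patch: 0x%08x\n", newflags_before(bigpage_vma, 0));
    printf("bigpage VMA after patch:  0x%08x\n", newflags_after(bigpage_vma, 0));
    printf("normal VMA after patch:   0x%08x\n", newflags_after(normal_vma, 0));
}
```

Calling check_model() from a main() exercises the three cases. The model covers only the flag arithmetic; it says nothing about whether this path is the sole source of the corruption reported above.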

Comment 5 Larry Woodman 2004-12-03 21:28:54 UTC

The kernel with the above patch included is available at this location:

>>>http://people.redhat.com/~lwoodman/AS2.1/

Please test it and let us know if it fixes the BUG() ASAP!


Larry Woodman

Comment 6 Paul Hansen 2004-12-03 21:57:50 UTC

Thanks- am downloading it now and will try and get it up 
on one of the test machines over the weekend; Monday at
the latest.

Comment 7 Brent Fox 2004-12-09 17:14:39 UTC

Paul, any luck with Larry's test kernel?

Comment 8 Paul Hansen 2004-12-09 18:21:44 UTC

So far, so good.  Unfortunately, we're only running it on a DL585
based machine used by development due to problems in QA cycles, and
their reluctance to accept a new kernel late in their test cycles.
Also, the machines romeo and victor, where this panic typically
occurred, are in transit to our new facility and have yet to be
brought up.  As soon as they are available, I'll put e57 on them.

Comment 9 Brent Fox 2005-01-12 21:04:35 UTC

Paul, have you had a chance to put the 2.4.9-e.57 kernel on victor or
romeo yet?  That kernel has now been officially released in the RHEL
2.1 Update 6 release.

Comment 10 Paul Hansen 2005-01-12 22:15:32 UTC

No, e57 hasn't been loaded on romeo/victor as yet due to QA test 
cycles.  We have been running it on a pair of DL585s since we
received the kernel, and all has been well.  Am hoping to get it
on to romeo/victor within the next week.

Comment 12 Paul Hansen 2005-02-16 18:29:04 UTC

Bad news- we just had a bigpage panic on the machine victor that is
almost identical to the romeo panic on 9/27/04 (in victor-romeo.doc) and
the romeo panic on 10/9/04 (romeo2.doc), with shmem_nopage and shmem_getbigpage
at the top of the stack.  This was using the e57-bigpagefix kernel.

Comment 13 RHEL Program Management 2007-10-19 19:21:17 UTC

This bug is filed against RHEL2.1, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products.  Since
this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your
support representative.  You may be asked to provide detailed
information on how this bug is affecting you.
