429205 – RHEL5-U2 panics when using hugepages

Summary: RHEL5-U2 panics when using hugepages

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Larry Woodman
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-01-17 22:07 UTC by Larry Woodman
Modified:	2008-05-21 15:07 UTC
CC List:	3 users
Fixed In Version:	RHBA-2008-0314
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-05-21 15:07:00 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
patch that fixes panic (406 bytes, application/octet-stream) 2008-01-17 22:07 UTC, Larry Woodman	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0314	0	normal	SHIPPED_LIVE	Updated kernel packages for Red Hat Enterprise Linux 5.2	2008-05-20 18:43:34 UTC

Description Larry Woodman 2008-01-17 22:07:07 UTC

Description of problem:

Panic in __alloc_pages() when using hugepages.

Version-Release number of selected component (if applicable):

kernel-2.6.18-63.el5

How reproducible:

All the time


Steps to Reproduce:
1. echo 1 > /proc/sys/vm/nr_hugepages
  
Actual results:

Panic

Expected results:

Reserve hugepages.

Additional info:

This problem was caused by the introduction of alloc_pages_thisnode() in
include/linux/gfp.h in kernel-2.6.18-63.el5 with
linux-2.6-ppc64-unequal-allocation-of-hugepages.patch
and linux-2.6-mm-fix-hugepage-allocation-with-memoryless-nodes.patch.
Either an "echo <any number> > /proc/sys/vm/nr_hugepages" or a
"vm.nr_hugepages = <any number>" line in /etc/sysctl.conf will panic the system.

During the evolution of alloc_pages_thisnode() we went from copying the entire
2056 byte zonelist to a private zonelist on the kernel stack (GULP!!!) to copying
only what we need before passing it to __alloc_pages(). Since the private copy of
the zonelist is not initialized on the kernel stack, the system panics in
__alloc_pages() if the 0th zonelist entry is junk (that's the only explanation
for not panicking .01% of the time).

The fix is to include the attached patch, which starts the zonelist copying at
the 0th entry rather than the 1st entry so it can never be junk.

Having said all that, none of this code seems to be upstream. A Google search
for "alloc_pages_thisnode" doesn't find anything; the only references I can find
are in rhkernel-list. So, if we want to keep this code we need the attached
patch, and if we want to remove it, eliminating both
linux-2.6-ppc64-unequal-allocation-of-hugepages.patch
and linux-2.6-mm-fix-hugepage-allocation-with-memoryless-nodes.patch does the trick.

Comment 1 Larry Woodman 2008-01-17 22:07:07 UTC

Created attachment 292077
patch that fixes panic

Comment 6 Larry Woodman 2008-01-21 18:45:53 UTC

The kernel in barstool.build:/mnt/brew/scratch/lwoodman/task_1117775 fixes this
issue. Can Mike give it a try?

Larry

BTW, what's "/kernel/vm/hugepage/173617"?

Comment 7 Mike Gahagan 2008-01-21 19:25:31 UTC

I can verify that 2.6.18-69.test2.el5 holds up just fine to setting
vm.nr_hugepages. 

/kernel/vm/hugepage/173617 is a regression test I wrote and added to RHTS a year
or so ago (see bz 173617 for more background). Essentially it runs 2 copies of
the script snippet below for 2 minutes while it compares the values of
HugePages_Total and HugePages_Free and fails if HugePages_Free exceeds
HugePages_Total.


	# Loop forever (/bin/true is always executable), repeatedly growing and
	# shrinking the hugepage pool.
	while [ -x /bin/true ]
	do
		echo 10 > /proc/sys/vm/nr_hugepages
		echo 0 > /proc/sys/vm/nr_hugepages
		echo 2 > /proc/sys/vm/nr_hugepages
		echo 0 > /proc/sys/vm/nr_hugepages
		echo 5 > /proc/sys/vm/nr_hugepages
		echo 5 > /proc/sys/vm/nr_hugepages
	done
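
The pass/fail condition described above (fail if HugePages_Free exceeds
HugePages_Total) could be checked with something like the following. This is
only a minimal sketch, not the actual RHTS test: the function name
check_hugepages and its optional file argument are hypothetical, and it assumes
the two counters appear as "Name: value" lines, as they do in /proc/meminfo.

```shell
# Hypothetical helper, not taken from the RHTS test itself.
# $1 = file to read; defaults to /proc/meminfo (the argument exists so the
#      check can be exercised against a saved copy).
# Returns 0 if HugePages_Free <= HugePages_Total,
#         1 if HugePages_Free exceeds HugePages_Total,
#         2 if either counter is missing.
check_hugepages() {
    # Read both counters in a single pass so they come from the same snapshot.
    awk '
        $1 == "HugePages_Total:" { total = $2; seen_total = 1 }
        $1 == "HugePages_Free:"  { free = $2;  seen_free = 1 }
        END {
            if (!seen_total || !seen_free) {
                print "hugepage counters not found"
                exit 2
            }
            if (free + 0 > total + 0) {
                printf "FAIL: HugePages_Free=%d exceeds HugePages_Total=%d\n", free, total
                exit 1
            }
            exit 0
        }
    ' "${1:-/proc/meminfo}"
}
```

Called with no argument it reads /proc/meminfo. A monitor would presumably call
it in a loop while the two writer loops run and stop at the first non-zero
return.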

Comment 8 Mike Gahagan 2008-01-22 17:32:04 UTC

looks like the -72 kernel has fixed this issue.

Comment 9 Don Zickus 2008-01-22 18:52:29 UTC

in 2.6.18-72.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 11 Mike Gahagan 2008-01-25 20:40:02 UTC

confirmed fixed with -72 and -75

Comment 12 Don Domingo 2008-02-06 02:40:14 UTC

added to RHEL5.2 release notes under "Kernel-Related Updates":

<quote>
The kernel no longer panics when using hugepages (i.e. echo 1 >
/proc/sys/vm/nr_hugepages).
</quote>

please advise if any further revisions are required. thanks!

Comment 13 Mike Gahagan 2008-02-06 15:42:05 UTC

Should we be saying that in the release notes? As far as I know, the bug was
introduced in the -63 kernel, which was never released to anyone (other than
possibly as an unsupported test kernel). From the testing I've done, it doesn't
look like the 5.1 kernel ever had this bug.

Comment 14 Don Zickus 2008-02-06 17:58:22 UTC

I agree with Mike, I think this bug was introduced in pre-beta and fixed shortly
after.  We probably don't need release notes on this.  Larry, you know best,
your opinion?

Comment 15 Larry Woodman 2008-02-06 18:23:09 UTC

The change that caused this panic was never released in an official RHEL5
kernel.  It was introduced in .63 and I fixed it in .69, so it should not
require release noting.

Larry

Comment 16 Don Domingo 2008-02-06 23:00:18 UTC

thanks for the heads-up, guys. removing this release note and all related flags.

Comment 18 errata-xmlrpc 2008-05-21 15:07:00 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html
