Bug 429205 - RHEL5-U2 panics when using hugepages
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Larry Woodman
QA Contact: Martin Jenner
Reported: 2008-01-17 17:07 EST by Larry Woodman
Modified: 2008-05-21 11:07 EDT (History)
Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Last Closed: 2008-05-21 11:07:00 EDT


Attachments
patch that fixes panic (406 bytes, application/octet-stream)
2008-01-17 17:07 EST, Larry Woodman

Description Larry Woodman 2008-01-17 17:07:07 EST
Description of problem:

Panic in __alloc_pages() when using hugepages.

Version-Release number of selected component (if applicable):

kernel-2.6.18-63.el5

How reproducible:

All the time


Steps to Reproduce:
1. echo 1 > /proc/sys/vm/nr_hugepages

Actual results:

Panic

Expected results:

Reserve hugepages.
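
For reference, the reproduction step and the expected result can be sketched in C as a write followed by a read-back. This is only an illustration: the helper names write_count() and read_count() are made up for this sketch, and the path is a parameter so the code can be tried against an ordinary file instead of the real sysctl.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write `count` to the file at `path`.
 * Returns 0 on success, -1 on failure. */
static int write_count(const char *path, long count)
{
    FILE *f;
    int ok;

    assert(path != NULL);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fprintf(f, "%ld\n", count) > 0;
    /* fclose() flushes the buffer, so a failed write can surface here. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: read the current value back from `path`.
 * Returns 0 on success, -1 on failure. */
static int read_count(const char *path, long *count)
{
    FILE *f;
    int ok;

    assert(path != NULL && count != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = fscanf(f, "%ld", count) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

On a real system the path would be /proc/sys/vm/nr_hugepages and the caller would need root. The value read back can be lower than the value written if the kernel cannot find enough contiguous memory, so a check should treat "no panic and a sane number" as success rather than insisting on an exact match.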

Additional info:

This problem was caused by the introduction of alloc_pages_thisnode() in
include/linux/gfp.h in kernel-2.6.18-63.el5 with
linux-2.6-ppc64-unequal-allocation-of-hugepages.patch
and linux-2.6-mm-fix-hugepage-allocation-with-memoryless-nodes.patch.
Either an "echo <any number> > /proc/sys/vm/nr_hugepages" or a
"vm.nr_hugepages = <any number>" entry in /etc/sysctl.conf will panic the system.

During the evolution of alloc_pages_thisnode() we went from copying the entire
2056-byte zonelist to a private zonelist on the kernel stack (GULP!!!) to copying
only what we need before passing it to __alloc_pages(). Since the private copy of
the zonelist on the kernel stack is not initialized, the system panics in
__alloc_pages() if the 0th zonelist entry is junk (that's the only explanation for
it not panicking .01% of the time).

The fix is to include the attached patch, which starts the zonelist copying at
the 0th entry rather than the 1st entry so it can never be junk.
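
To illustrate the off-by-one, here is a small userspace model in C. It is not the RHEL kernel code: struct zone, MAX_ENTRIES, STACK_JUNK, copy_zonelist() and first_entry_usable() are all invented for this sketch, and the destination is deliberately pre-filled with a junk marker to stand in for an uninitialized array on the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 4   /* hypothetical size; the real zonelist is much larger */

struct zone { const char *name; };   /* stand-in for the kernel's struct zone */

/* Marker for "whatever happened to be on the stack". */
#define STACK_JUNK ((struct zone *)(uintptr_t)0xdeadbeef)

/*
 * Copy the NULL-terminated list `src` into `dst`, starting at index `first`.
 * `dst` is pre-filled with STACK_JUNK so that any entry the copy loop skips
 * stays visibly invalid, the way an uninitialized stack array would.
 */
static void copy_zonelist(struct zone *dst[MAX_ENTRIES + 1],
                          struct zone *const src[MAX_ENTRIES + 1],
                          size_t first)
{
    size_t i;

    assert(first <= MAX_ENTRIES);
    for (i = 0; i <= MAX_ENTRIES; i++)
        dst[i] = STACK_JUNK;

    for (i = first; i <= MAX_ENTRIES; i++) {
        dst[i] = src[i];
        if (src[i] == NULL)
            break;              /* terminator copied, nothing more to do */
    }
}

/* Returns 1 if a consumer could safely dereference entry 0, else 0. */
static int first_entry_usable(struct zone *const list[MAX_ENTRIES + 1])
{
    return list[0] != STACK_JUNK;
}
```

With first set to 1, entry 0 of the private copy is never written and first_entry_usable() reports 0, which corresponds to the panic case. With first set to 0, the whole list including entry 0 comes from the source, which is what the attached patch is described as doing.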

Having said all that, none of this code seems to be upstream. A Google search
for "alloc_pages_thisnode" doesn't find anything; the only references I can find
are in rhkernel-list. So, if we want to keep this code we need the attached
patch, and if we want to remove it, eliminating both
linux-2.6-ppc64-unequal-allocation-of-hugepages.patch
and linux-2.6-mm-fix-hugepage-allocation-with-memoryless-nodes.patch does the trick.
Comment 1 Larry Woodman 2008-01-17 17:07:07 EST
Created attachment 292077
patch that fixes panic
Comment 6 Larry Woodman 2008-01-21 13:45:53 EST
The kernel in barstool.build:/mnt/brew/scratch/lwoodman/task_1117775 fixes this
issue. Can Mike give it a try?

Larry

BTW, what's "/kernel/vm/hugepage/173617"?

Comment 7 Mike Gahagan 2008-01-21 14:25:31 EST
I can verify that 2.6.18-69.test2.el5 holds up just fine to setting
vm.nr_hugepages. 

/kernel/vm/hugepage/173617 is a regression test I wrote and added to RHTS a year
or so ago (see bz 173617 for more background). Essentially it runs 2 copies of
the script snippet below for 2 minutes while it compares the values of
HugePages_Total and HugePages_Free and fails if HugePages_Free exceeds
HugePages_Total.


	# /bin/true is always executable, so this loops until the test stops it
	while [ -x /bin/true ]
	do
		echo 10 > /proc/sys/vm/nr_hugepages
		echo 0 > /proc/sys/vm/nr_hugepages
		echo 2 > /proc/sys/vm/nr_hugepages
		echo 0 > /proc/sys/vm/nr_hugepages
		echo 5 > /proc/sys/vm/nr_hugepages
		echo 5 > /proc/sys/vm/nr_hugepages
	done
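
The comparison half of the test (fail if HugePages_Free exceeds HugePages_Total) could look roughly like the C sketch below. This is not the RHTS test itself: meminfo_value() and hugepages_consistent() are hypothetical names, and the functions take the /proc/meminfo contents as a string so they can be exercised without touching a live system.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find `key` (for example "HugePages_Total:") in meminfo-style text and
 * store the number that follows it in *out.
 * Returns 0 on success, -1 if the key or its value is missing.
 */
static int meminfo_value(const char *text, const char *key, long *out)
{
    const char *p;

    assert(text != NULL && key != NULL && out != NULL);
    p = strstr(text, key);
    if (p == NULL)
        return -1;
    /* %ld skips the padding spaces between the colon and the number. */
    return sscanf(p + strlen(key), "%ld", out) == 1 ? 0 : -1;
}

/*
 * Returns 1 if HugePages_Free <= HugePages_Total, 0 if the invariant is
 * violated, and -1 if either field could not be parsed.
 */
static int hugepages_consistent(const char *meminfo)
{
    long total, free_pages;

    if (meminfo_value(meminfo, "HugePages_Total:", &total) != 0 ||
        meminfo_value(meminfo, "HugePages_Free:", &free_pages) != 0)
        return -1;
    return free_pages <= total;
}
```

A caller would read /proc/meminfo into a buffer on each pass and treat a return value of 0 as a test failure while the loops above keep resizing the pool.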
Comment 8 Mike Gahagan 2008-01-22 12:32:04 EST
Looks like the -72 kernel has fixed this issue.
Comment 9 Don Zickus 2008-01-22 13:52:29 EST
Fixed in 2.6.18-72.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 11 Mike Gahagan 2008-01-25 15:40:02 EST
Confirmed fixed with -72 and -75.
Comment 12 Don Domingo 2008-02-05 21:40:14 EST
Added to the RHEL 5.2 release notes under "Kernel-Related Updates":

<quote>
The kernel no longer panics when using hugepages
(i.e. echo 1 > /proc/sys/vm/nr_hugepages).
</quote>

Please advise if any further revisions are required. Thanks!
Comment 13 Mike Gahagan 2008-02-06 10:42:05 EST
Should we be saying that in the release notes? As far as I know, the bug was
introduced in the -63 kernel, which was never released to anyone (other than
possibly as an unsupported test kernel). From the testing I've done, it doesn't
look like the 5.1 kernel ever had this bug.
Comment 14 Don Zickus 2008-02-06 12:58:22 EST
I agree with Mike, I think this bug was introduced in pre-beta and fixed shortly
after.  We probably don't need release notes on this.  Larry, you know best,
your opinion?
Comment 15 Larry Woodman 2008-02-06 13:23:09 EST
The change that caused this panic was never released in an official RHEL5
kernel. It was introduced in .63 and I fixed it in .69, so it should not require
a release note.

Larry
Comment 16 Don Domingo 2008-02-06 18:00:18 EST
Thanks for the heads-up, guys. Removing this release note and all related flags.
Comment 18 errata-xmlrpc 2008-05-21 11:07:00 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html
