Description of problem:
Panic in __alloc_pages() when using hugepages.

Version-Release number of selected component (if applicable):
kernel-2.6.18-63.el5

How reproducible:
All the time

Steps to Reproduce:
1. echo 1 > /proc/sys/vm/nr_hugepages

Actual results:
Panic

Expected results:
Reserve hugepages.

Additional info:
This problem was caused by the introduction of alloc_pages_thisnode() in include/linux/gfp.h in kernel-2.6.18-63.el5 with linux-2.6-ppc64-unequal-allocation-of-hugepages.patch and linux-2.6-mm-fix-hugepage-allocation-with-memoryless-nodes.patch. Either an "echo <any number> > /proc/sys/vm/nr_hugepages" or "vm.nr_hugepages = <any number>" in /etc/sysctl.conf will panic the system.

During the evolution of alloc_pages_thisnode() we went from copying the entire 2056-byte zonelist to a private zonelist on the kernel stack (GULP!!!) to copying only what we need before passing it to __alloc_pages(). Since the private copy of the zonelist on the kernel stack is not initialized, the system panics in __alloc_pages() if the 0th zonelist entry is junk (that's the only explanation for it not panicking .01% of the time). The fix is to include the attached patch, which starts the zonelist copying at the 0th entry rather than the 1st entry so it can never be junk.

Having said all that, none of this code seems to be upstream. If you Google search for "alloc_pages_thisnode" it doesn't find anything; the only references I can find are in rhkernel-list. So, if we want to keep this code we need the attached patch, and if we want to remove it, eliminating both linux-2.6-ppc64-unequal-allocation-of-hugepages.patch and linux-2.6-mm-fix-hugepage-allocation-with-memoryless-nodes.patch does the trick.
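For readers without access to the attachment, the off-by-one described above can be modeled in user space. This is only a sketch under stated assumptions, not the code from alloc_pages_thisnode() or from the attached patch: struct zone, struct zonelist, MAX_ZONES, STACK_JUNK and the helper names are all hypothetical stand-ins, and the poison value simulates whatever happened to be on the kernel stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ZONES 8                     /* hypothetical; the real zonelist is larger */

struct zone { int id; };                /* stand-in for the kernel's struct zone */

struct zonelist {
    struct zone *zones[MAX_ZONES + 1];  /* NULL-terminated array of zone pointers */
};

/* Recognizable junk standing in for uninitialized stack contents. */
#define STACK_JUNK ((struct zone *)(uintptr_t)0xdeadbeef)

/* Fill a private zonelist with junk, as an uninitialized stack copy might be. */
static void poison(struct zonelist *zl)
{
    for (size_t i = 0; i <= MAX_ZONES; i++)
        zl->zones[i] = STACK_JUNK;
}

/*
 * Copy entries from src to dst beginning at index `start`, up to and
 * including the NULL terminator. start == 1 models the buggy copy,
 * start == 0 models the fixed copy.
 */
static void copy_zones_from(struct zonelist *dst, const struct zonelist *src,
                            size_t start)
{
    assert(start <= MAX_ZONES);
    for (size_t i = start; i <= MAX_ZONES; i++) {
        dst->zones[i] = src->zones[i];
        if (src->zones[i] == NULL)
            break;
    }
}

/* The allocator looks at entry 0 first; report whether it is still junk. */
static int first_entry_is_junk(const struct zonelist *zl)
{
    return zl->zones[0] == STACK_JUNK;
}
```

Poisoning a private list and copying with start set to 1 leaves entry 0 as junk, which is the entry an allocator would dereference first. Copying with start set to 0 overwrites it with a valid pointer.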
Created attachment 292077 [details] patch that fixes panic
The kernel in:

barstool.build:/mnt/brew/scratch/lwoodman/task_1117775

fixes this issue. Can Mike give it a try?

Larry

BTW, what's "/kernel/vm/hugepage/173617"???
I can verify that 2.6.18-69.test2.el5 holds up just fine to setting vm.nr_hugepages.

/kernel/vm/hugepage/173617 is a regression test I wrote and added to RHTS a year or so ago (see bz 173617 for more background). Essentially it runs 2 copies of the script snippet below for 2 minutes while it compares the values of HugePages_Total and HugePages_Free and fails if HugePages_Free exceeds HugePages_Total.

    while [ -x /bin/true ]
    do
        echo 10 > /proc/sys/vm/nr_hugepages
        echo 0 > /proc/sys/vm/nr_hugepages
        echo 2 > /proc/sys/vm/nr_hugepages
        echo 0 > /proc/sys/vm/nr_hugepages
        echo 5 > /proc/sys/vm/nr_hugepages
        echo 5 > /proc/sys/vm/nr_hugepages
    done
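The comparison half of that test is not shown in this report. A rough sketch of what such a check could look like follows, assuming the counters are read from text in /proc/meminfo format. The function names meminfo_value() and hugepages_consistent() are hypothetical and are not taken from the RHTS test itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find a line beginning with "<key>:" in meminfo-style text and return the
 * number that follows it, or -1 if the key is missing or has no number.
 */
static long meminfo_value(const char *text, const char *key)
{
    assert(text != NULL && key != NULL);
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long v;
            /* %ld skips the padding spaces after the colon */
            if (sscanf(line + klen + 1, "%ld", &v) == 1)
                return v;
            return -1;
        }
        line = strchr(line, '\n');      /* advance to the next line, if any */
        if (line)
            line++;
    }
    return -1;
}

/*
 * Returns 1 if HugePages_Free <= HugePages_Total, 0 if Free exceeds Total
 * (the failure condition described above), and -1 if either counter is absent.
 */
static int hugepages_consistent(const char *meminfo_text)
{
    long total = meminfo_value(meminfo_text, "HugePages_Total");
    long free_pages = meminfo_value(meminfo_text, "HugePages_Free");

    if (total < 0 || free_pages < 0)
        return -1;
    return free_pages <= total;
}
```

A caller would read /proc/meminfo into a buffer on each iteration and fail the test the first time hugepages_consistent() returns 0.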
looks like the -72 kernel has fixed this issue.
in 2.6.18-72.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
confirmed fixed with -72 and -75
added to RHEL5.2 release notes under "Kernel-Related Updates": <quote> The kernel no longer panics when using hugepages (i.e. echo 1 > /proc/sys/vm/nr_hugepages). </quote> please advise if any further revisions are required. thanks!
Should we be saying that in the release notes? As far as I know, the bug was introduced in the -63 kernel, which was never released to anyone (other than possibly as an unsupported test kernel). From the testing I've done, it doesn't look like the 5.1 kernel ever had this bug.
I agree with Mike, I think this bug was introduced in pre-beta and fixed shortly after. We probably don't need release notes on this. Larry, you know best, your opinion?
The change that caused this panic was never released in an official RHEL5 kernel. It was introduced in .63 and I fixed it in .69, so it should not require a release note.

Larry
thanks for the heads-up, guys. removing this release note and all related flags.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html