Under heavy load, the e49 kernel panicked in the bigpage code. Attached is a partial oops (just screen scrapings, and part of it scrolled off the console; nothing got logged). The system is a Compaq ML530G2 with two 2.4GHz Xeons and 8GB of memory, with bigpages set at 2600MB at boot.

---------- Action by: phansen
Partial oops
Status set to: Waiting on Tech
File uploaded: e49oops-bigpage0.txt

---------- Action by: phansen
We had a couple of panics on Compaq ML530G2s (one on each of two machines) with two 2.4GHz Xeons and 9GB of memory. I had to copy these traces by hand from the screen, so please excuse any errors.
File uploaded: victor-romeo.doc

---------- Action by: phansen
I've set up netdump on one of the machines involved in the crashes, but the other has only SysKonnect NICs in it, and netconsole does not appear to start on these, which I seem to recall from the distant past. I'm using netdump 0.6.11-2, which is the latest release for LAS2.1 (I hope it supports cores > 4GB). Aside from that, both machines have been running normally since last Monday. They know we're watching.

---------- Action by: phansen
Had another panic on romeo today; it started from the usual point in include/asm/pgalloc.h:31 ("page idx 1502139 cannot be PSE page!"), so nothing new here. I can send the traceback from the screen if it would help, but I got talking to the QA guy who was using the machine when it panicked. He showed me the logs from his test, and about a minute or so before the panic he got an out of memory error from his query. In Netezza parlance, this means that we've exhausted the 2600MB shared area. This got me thinking: could this problem be caused by an overrun from shared memory into the regular page area? We tried to do this with a user task and got a SIGSEGV back, but could the kernel write beyond shared memory and cause the page alloc to blow up when the page was touched? Just a thought. Let me know if you want the traceback.

---------- Action by: bfox
Yes, the traceback would be good to have.
Just to give you an update on where we are currently: our test script has been running for about a week and a half, and we still haven't been able to reproduce the problem. We are allocating some more engineers to this ticket to try to get a handle on what's going on. I'm particularly interested in the out of memory error. I have seen a few instances of other RHEL 3 machines triggering the OOM killer even when there is plenty of swap left. I don't know that that's what's happening here, but it's worth looking into.

---------- Action by: phansen
Here's the oops; sorry for the delay. It looks something like the 9/26 victor panic, ending in call_spurious_interrupt.
File uploaded: romeo1116.doc
Netezza is currently an all-AS2.1 shop with no current plans to migrate to RHEL3. It was my mistake to refer to RHEL3 in my post on 11-17. So when can we expect an AS 2.1 debug kernel?

Larry - an update from the IT entry. RHEL3 is not involved, straight 2.1.
I think I found the cause of this panic: a sys_munlockall will inadvertently unlock bigpage vmas, and that will lead to this type of corruption. This patch will fix the problem:

-------------------------------------------------------------------
--- linux/mm/mlock.c.orig
+++ linux/mm/mlock.c
@@ -256,7 +256,7 @@ static int do_mlockall(int flags)
 		unsigned int newflags;

 		newflags = vma->vm_flags | VM_LOCKED;
-		if (!(flags & MCL_CURRENT))
+		if (!(flags & MCL_CURRENT) && !(vma->vm_flags & VM_BIGPAGE))
 			newflags &= ~VM_LOCKED;
 		error = mlock_fixup(vma, vma->vm_start, vma->vm_end, newflags);
 		if (error)
--------------------------------------------------------------------

I'll build a test kernel with this patch and make it available for test purposes.

Larry Woodman
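For anyone following along, the flag arithmetic in the patch can be modelled in user space. This is only a sketch, not kernel code: it assumes, as in stock 2.4 kernels, that sys_munlockall reaches do_mlockall with flags of 0, and the numeric values of the three constants (VM_BIGPAGE in particular) are placeholders picked for illustration. The function names newflags_original, newflags_patched and check_model are made up for this example.

```c
#include <assert.h>

/* Illustrative values only. VM_BIGPAGE exists in the AS 2.1 kernel, but the
 * bit chosen here is a hypothetical placeholder. */
#define MCL_CURRENT 1
#define VM_LOCKED   0x00002000u
#define VM_BIGPAGE  0x01000000u

/* Flag computation from do_mlockall() before the patch. */
unsigned int newflags_original(unsigned int vm_flags, int flags)
{
    unsigned int newflags = vm_flags | VM_LOCKED;

    if (!(flags & MCL_CURRENT))
        newflags &= ~VM_LOCKED;   /* cleared for every vma, bigpage or not */
    return newflags;
}

/* Flag computation from do_mlockall() with the patch applied. */
unsigned int newflags_patched(unsigned int vm_flags, int flags)
{
    unsigned int newflags = vm_flags | VM_LOCKED;

    if (!(flags & MCL_CURRENT) && !(vm_flags & VM_BIGPAGE))
        newflags &= ~VM_LOCKED;   /* bigpage vmas are skipped */
    return newflags;
}

/* Walks through the munlockall case (flags == 0, assumed) for a bigpage vma. */
void check_model(void)
{
    unsigned int bigpage_vma = VM_LOCKED | VM_BIGPAGE;

    /* Before the patch the bigpage vma loses VM_LOCKED. */
    assert((newflags_original(bigpage_vma, 0) & VM_LOCKED) == 0);
    /* After the patch it stays locked. */
    assert((newflags_patched(bigpage_vma, 0) & VM_LOCKED) != 0);
    /* Ordinary vmas are still unlocked as before. */
    assert((newflags_patched(VM_LOCKED, 0) & VM_LOCKED) == 0);
}
```

Note that in this model a bigpage vma always comes out with VM_LOCKED set, whatever flags are passed, which matches the intent that bigpage areas are never unlocked by this path.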
The kernel with the above patch included is available at this location: >>>http://people.redhat.com/~lwoodman/AS2.1/ Please test it and let us know if it fixes the BUG() ASAP! Larry Woodman
Thanks- am downloading it now and will try and get it up on one of the test machines over the weekend; Monday at the latest.
Paul, any luck with Larry's test kernel?
So far, so good. Unfortunately, we're only running it on a DL585-based machine used by development, due to problems in QA cycles and their reluctance to accept a new kernel late in their test cycles. Also, the machines romeo and victor, where this panic typically occurred, are in transit to our new facility and have yet to be brought up. As soon as they are available, I'll put e57 on them.
Paul, have you had a chance to put the 2.4.9-e.57 kernel on victor or romeo yet? That kernel has now been officially released in the RHEL 2.1 Update 6 release.
No, e57 hasn't been loaded on romeo/victor as yet due to QA test cycles. We have been running it on a pair of DL585s since we received the kernel, and all has been well. Am hoping to get it on to romeo/victor within the next week.
Bad news- we just had a bigpage panic on the machine victor which is almost identical to the romeo panic on 9/27/04 (in victor-romeo.doc) and the romeo panic on 10/9/04 (romeo2.doc), with shmem_nopage and shmem_getbigpage at the top of the stack. This was using the e57-bigpagefix kernel.
This bug is filed against RHEL2.1, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed. For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.