Description of problem:
Under medium to heavy load, the system locks up. The console showed a kernel
BUG at page_alloc.c:242.
Version-Release number of selected component (if applicable):
Linux version 2.4.21-20.0.1.EL
System is running postfix + MailScanner + SpamAssassin. The error occurs
while SpamAssassin is loading, but only under an average to heavy load.
Steps to Reproduce:
1. Switch the MTA to postfix, configure for MailScanner integration
2. Install MailScanner - noarch rpm provided by them -- perl based
3. Install SpamAssassin
4. Push a bunch of email through the system
The hidden patch (to solve the LVS direct routing ARP problem) was applied
to the Red Hat kernel source, and a custom kernel was compiled using the
configs provided in the Red Hat kernel source for the ppc64 architecture.
Created attachment 108658 [details]
kernel BUG() message captured at console
Jonathan, Red Hat does not support custom-built kernels. If you can
reproduce this crash with a stock RHEL kernel, please post the full
console oops output. Otherwise, please set this to CLOSED/NOTABUG.
Thanks in advance. -ernie
Rebuilt the system from scratch using a stock kernel.
Created attachment 113648 [details]
This has grown a bit stale because I was attempting to get IBM to deal with
the issue, but they are pointing the finger at the tg3 driver.
They claim that there is a Red Hat Issue Tracker 64633.
I have been looking all over for this issue tracker and have had no success.
I've looked at issue 64633 and I don't immediately see its relevance, except
that it's updating the TG3 driver which might be the cause. The only reason I
can see that it might be the TG3 driver is that it's involved in the second
panic. However, given the initial BUG report and the subsequent first panic
whilst the kernel appears to be trying to recover from the BUG, I wouldn't
trust the second panic very far as being the cause of the problem.
The initial BUG is incurred whilst a page is being freed. The kernel checks
that the page has been correctly deinitialised before actually returning it to
the "free list", but in this case found that the page was still involved in an
RMAP chain somewhere.
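
To make the check described above concrete, here is a minimal user-space sketch of that kind of pre-free sanity test. It is not the RHEL 3 source: `struct page_model`, its field names, and `check_page_before_free` are invented for illustration, and the real kernel calls BUG() rather than returning a code.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's page descriptor. The field names are
 * illustrative assumptions; the real struct page has many more members. */
struct page_model {
    void *mapping;    /* owning address space, NULL once detached */
    void *pte_chain;  /* reverse-mapping (RMAP) chain, NULL when unmapped */
    void *buffers;    /* attached buffer heads, NULL when released */
    int   locked;     /* non-zero while the page is locked for I/O */
};

enum free_check {
    FREE_OK = 0,
    FREE_HAS_BUFFERS,
    FREE_HAS_MAPPING,
    FREE_HAS_RMAP,
    FREE_LOCKED
};

/* Return the first reason the page must not go back on the free list,
 * or FREE_OK if it has been fully deinitialised. */
enum free_check check_page_before_free(const struct page_model *pg)
{
    if (pg->buffers)
        return FREE_HAS_BUFFERS;
    if (pg->mapping)
        return FREE_HAS_MAPPING;
    if (pg->pte_chain)
        return FREE_HAS_RMAP;
    if (pg->locked)
        return FREE_LOCKED;
    return FREE_OK;
}
```

In this model, the situation in the initial BUG corresponds to `FREE_HAS_RMAP`: the page reached the free path with its reverse-mapping chain still populated.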
My guess would be that something mucked up a page structure or several of
them, possibly by getting the allocation functions mixed up and using the page
struct pointer as the pointer to the actual page, though I'd've expected
something like that to come to light a lot earlier.
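
The kind of mix-up guessed at above can be sketched in user space as follows. This is only an illustration under stated assumptions: `page_desc`, `mem_map_model`, `page_address_model` and the two fill functions are hypothetical names, and nothing here is taken from any driver implicated in this bug.

```c
#include <stddef.h>
#include <string.h>

#define NPAGES     4
#define PAGE_BYTES 64   /* tiny "page" so the example stays small */

/* Hypothetical page descriptor: one per page, kept apart from the data. */
struct page_desc {
    void *mapping;
    void *pte_chain;
};

static struct page_desc mem_map_model[NPAGES];
static unsigned char    page_data[NPAGES][PAGE_BYTES];

/* Translate a descriptor pointer into the memory it describes. */
void *page_address_model(struct page_desc *pg)
{
    return page_data[pg - mem_map_model];
}

/* Correct: writes to the page contents and leaves the descriptor alone. */
void fill_page_ok(struct page_desc *pg, int byte)
{
    memset(page_address_model(pg), byte, PAGE_BYTES);
}

/* Buggy: treats the descriptor itself as the data buffer. Only the one
 * descriptor is overwritten here to keep the example well defined; a real
 * bug of this kind would write a whole page and trample its neighbours. */
void fill_page_buggy(struct page_desc *pg, int byte)
{
    memset(pg, byte, sizeof(*pg));
}
```

After `fill_page_buggy`, the descriptor's `mapping` and `pte_chain` fields hold garbage rather than NULL, which is the sort of state a free-time sanity check would then trip over.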
Are you able to say who at IBM suggested it might be the TG3 driver?
I am pretty desperately searching for a solution to a very similar situation on
our Oracle server running kernel 2.4.21-40.ELsmp. This is on an HP ProLiant
DL380 with 8GB RAM and 2 x Xeon 3.2GHz processors.
I have replaced the memory and finally moved the disks to a new server box.
Same problem.
Traceback on crash:
Apr 12 13:14:50 db01-01 kernel: Page has mapping still set. This is a serious situation. However if you
Apr 12 13:14:50 db01-01 kernel: are using the NVidia binary only module please report this bug to
Apr 12 13:14:50 db01-01 kernel: NVidia and not to the linux kernel mailinglist.
Apr 12 13:14:50 db01-01 kernel: ------------[ cut here ]------------
Apr 12 13:14:50 db01-01 kernel: kernel BUG at page_alloc.c:225!
Apr 12 13:14:50 db01-01 kernel: invalid operand: 0000
Apr 12 13:14:50 db01-01 kernel: sg nfs lockd sunrpc tg3 microcode keybdev mousedev hid input ehci-hcd usb-uhci usbcore ext3 jbd cciss sd_mod scsi_mod
How often does this occur?
If I give you a test kernel that can print the whereabouts of the address
space operations table would you be willing to run it? That might at least
pinpoint the module that owned the bad page.
Also, wasn't there a stack trace attached to the BUG() report?
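
The idea of pinpointing the owning module from the address of the address space operations table amounts to a range lookup. A minimal sketch, assuming each loaded module can be described by a start address and a size; `struct module_range` and `owner_of_address` are hypothetical names for this example, not kernel APIs:

```c
#include <stddef.h>

/* Hypothetical description of where one loaded module lives in memory. */
struct module_range {
    const char   *name;
    unsigned long start;   /* first address occupied by the module */
    unsigned long size;    /* number of bytes occupied */
};

/* Return the name of the module whose range contains addr, or NULL if the
 * address is not inside any listed module (e.g. it is in the core kernel). */
const char *owner_of_address(const struct module_range *mods, size_t n,
                             unsigned long addr)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= mods[i].start && addr - mods[i].start < mods[i].size)
            return mods[i].name;
    }
    return NULL;
}
```

The ranges would have to come from the crashing system itself. A table pointer landing inside, say, the range of tg3 or cciss would point at that driver, while a NULL result would suggest the table belongs to the core kernel or is itself corrupt.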
Created attachment 127679 [details]
stack trace of crash
It is occurring anywhere from 2-3 times a day to once every couple of days.
I'm creating an attachment for the stack trace I was able to save.
I've added extra code to the kernel at the location below to print more
information about a bad page that's being freed:
If you would be willing to try running that, it should produce a crash dump
with more information about the page that was being freed incorrectly. This
information should appear in the kernel console log, just before the BUG()
Another traceback, all I can get off the console:
EIP: 0060:[<c0159560>] Not tainted
EIP is at __free_pages_ok [kernel] 0x3e0 (2.4.21-40.ELsmp/i686)
eax: 00000033 ebx: c56bf9e0 ecx: 00000001 edx: c0387e98
esi: f62d0a80 edi: 00000000 ebp: 00000000 esp: cd7d5ec8
ds: 0068 es: 0068 ss: 0068
Process keventd (pid: 6, stackpage=cd7d5000)
Stack: c02c1ea8 00000363 c000a308 ff061000 cd7d5ee4 f5ce9180 00000008 cd7d5ee4
cd7d5ee4 00000000 00000001 cd7d5f10 f5ce9180 00000001 f62d0a80 00000000
00000000 c014cf3e cd7d5f10 cd7d5f10 00000000 cd7d4000 00000000 00000e00
Call Trace: [<c014cf3e>] __iodesc_free [kernel] 0xde (0xcd7d5f0c)
[<c0161e9c>] kmap_high [kernel] 0x5c (0xcd7d5f28)
[<c014d87b>] __iodesc_read_finish [kernel] 0x22b (0xcd7d5f38)
[<c01302ca>] __run_task_queue [kernel] 0x6a (0xcd7d5f74)
[<c013c9ad>] context_thread [kernel] 0x13d (0xcd7d5f8c)
[<c013c870>] context_thread [kernel] 0x0 (0xcd7d5fe0)
[<c01095cd>] kernel_thread_helper [kernel] 0x5 (0xcd7d5ff0)
Code: 0f 0b e1 00 33 17 2c c0 e9 6c fc ff ff 9c 5a fa f0 fe 0d 70
Kernel panic: Fatal exception
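
The `Code:` bytes in the trace above can be cross-checked against the reported BUG location. The sketch below assumes this i386 kernel was built with verbose BUG reporting, where BUG() emits a `ud2` instruction (0f 0b) followed by a 16-bit line number and a 32-bit pointer to the file name string, both little-endian. That layout is an assumption about this build, and `struct bug_info` and `decode_bug_bytes` are names made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Fields assumed to follow the ud2 opcode in a verbose i386 BUG(). */
struct bug_info {
    unsigned line;       /* source line number of the BUG() call */
    uint32_t file_ptr;   /* kernel address of the file name string */
};

/* Decode the first 8 bytes of an oops "Code:" dump.
 * Returns 0 on success, -1 if the bytes are too short or do not start
 * with the ud2 opcode. */
int decode_bug_bytes(const uint8_t *code, size_t len, struct bug_info *out)
{
    if (len < 8 || code[0] != 0x0f || code[1] != 0x0b)
        return -1;

    out->line = (unsigned)code[2] | ((unsigned)code[3] << 8);
    out->file_ptr = (uint32_t)code[4]
                  | ((uint32_t)code[5] << 8)
                  | ((uint32_t)code[6] << 16)
                  | ((uint32_t)code[7] << 24);
    return 0;
}
```

Applied to `0f 0b e1 00 33 17 2c c0`, this gives line 225 and a file name pointer of 0xc02c1733. If the assumed layout holds, that is consistent with the "kernel BUG at page_alloc.c:225!" message logged on the same 2.4.21-40.ELsmp kernel.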
Created attachment 127711 [details]
Debugging patch added to test kernel
We are having a similar problem here: HP DL380 G4, Red Hat AS kernel,
2 x Intel Xeon 3.6, 6GB RAM, Oracle database server.
The last message in /var/log/messages is:
Aug 29 12:43:41 oracle4 kernel: Page has mapping still set. This is a serious situation. However if you
Aug 29 12:43:41 oracle4 kernel: are using the NVidia binary only module please report this bug to
Aug 29 12:43:41 oracle4 kernel: NVidia and not to the linux kernel mailinglist.
Aug 29 12:43:41 oracle4 kernel: ------------[ cut here ]------------
There's a kernel panic on the screen at this point, but we're not set up to
capture this information right now. Was there any more news on this at all?
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
For more information on the RHEL errata support policy, please visit:
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.