Bug 143042

Summary: kernel BUG at page_alloc.c:242
Product: Red Hat Enterprise Linux 3
Reporter: jonathan higgins <jhiggins>
Component: kernel
Assignee: David Howells <dhowells>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 3.0
CC: asparks, ox23fgu02, petrides, riel
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-10-19 19:11:04 UTC
Attachments:
Description                                Flags
kernel BUG() message captured at console   none
kernel oops                                none
stack trace of crash                       none
Debugging patch added to test kernel       none

Description jonathan higgins 2004-12-15 22:04:13 UTC
Description of problem:
Under medium to heavy load, the system locks up. The console showed a
kernel BUG at page_alloc.c:242.

Version-Release number of selected component (if applicable):
Linux version 2.4.21-20.0.1.EL

How reproducible:
The system is running postfix + MailScanner + SpamAssassin.  The error
occurs while SpamAssassin is loading, but only under an average to
heavy load.

Steps to Reproduce:
1. Switch the MTA to postfix and configure it for MailScanner integration.
2. Install MailScanner (the noarch RPM provided by its developers; it is
a Perl-based application).
3. Install SpamAssassin.
4. Push a large volume of email through the system.
  
Actual results:


Expected results:


Additional info:

The hidden patch (to solve the LVS direct-routing ARP problem) was applied
to the Red Hat kernel source, and a custom kernel was compiled using the
configs provided in the Red Hat kernel source for the ppc64 architecture.

Comment 1 jonathan higgins 2004-12-15 22:06:14 UTC
Created attachment 108658 [details]
kernel BUG() message captured at console

Comment 2 Ernie Petrides 2004-12-15 23:39:28 UTC
Jonathan, Red Hat does not support custom-built kernels.  If you can
reproduce this crash with a stock RHEL kernel, please post the full
console oops output.  Otherwise, please set this to CLOSED/NOTABUG.

Thanks in advance.  -ernie


Comment 3 jonathan higgins 2005-04-25 19:51:17 UTC
Rebuilt the system from scratch using a stock kernel.

Comment 4 jonathan higgins 2005-04-25 19:53:37 UTC
Created attachment 113648 [details]
kernel oops

Comment 5 jonathan higgins 2005-04-27 14:57:59 UTC
This has grown a bit stale because I was attempting to get IBM to deal with
the issue, but they are pointing the finger at the tg3 driver.

They claim that there is a Red Hat Issue Tracker entry, 64633.

I have been looking all over for this issue tracker entry and have had no success.
 

Comment 6 David Howells 2005-05-17 09:49:39 UTC
I've looked at issue 64633 and I don't immediately see its relevance, except 
that it's updating the TG3 driver which might be the cause. The only reason I 
can see that it might be the TG3 driver is that it's involved in the second 
panic. However, given the initial BUG report and the subsequent first panic 
whilst the kernel appears to be trying to recover from the BUG, I wouldn't 
trust the second panic very far as being the cause of the problem. 
 
The initial BUG is incurred whilst a page is being freed. The kernel checks 
that the page has been correctly deinitialised before actually returning it to 
the "free list", but in this case found that the page was still involved in an 
RMAP chain somewhere. 
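
As a rough illustration of the kind of free-time check described above, here
is a minimal Python model. It is only a sketch: the names Page, mapping,
pte_chain, locked and BadPageState are assumptions made for the example, and
the real check is C code in page_alloc.c that calls BUG() rather than raising
an exception.

```python
class BadPageState(Exception):
    """Raised where the real kernel would hit BUG()."""


class Page:
    """Hypothetical stand-in for a kernel page descriptor."""

    def __init__(self):
        self.mapping = None    # owning address space, if any
        self.pte_chain = None  # reverse-mapping (RMAP) chain, if any
        self.locked = False


def free_page(page, free_list):
    """Return a page to the free list only if it is fully deinitialised."""
    if page.mapping is not None:
        raise BadPageState("page has mapping still set")
    if page.pte_chain is not None:
        raise BadPageState("page is still on an RMAP chain")
    if page.locked:
        raise BadPageState("page is still locked")
    free_list.append(page)


free_list = []
free_page(Page(), free_list)        # a clean page is accepted

stale = Page()
stale.pte_chain = ["pte"]           # simulate a leftover RMAP entry
try:
    free_page(stale, free_list)
except BadPageState as exc:
    print("BUG:", exc)              # the check fires instead of freeing
```

In this model the stale page never reaches the free list, which mirrors the
point of the check: catching the inconsistency at free time instead of
letting a half-torn-down page be handed out again.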
 
My guess would be that something mucked up a page structure or several of 
them, possibly by getting the allocation functions mixed up and using the page 
struct pointer as the pointer to the actual page, though I'd've expected 
something like that to come to light a lot earlier. 
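
The mix-up guessed at above can be modelled in a few lines of Python. This is
a toy sketch under stated assumptions: memory is one flat bytearray, each page
descriptor is a hypothetical pair of 32-bit fields (mapping, pte_chain), and
the helper names desc_addr, data_addr, read_desc and write_buffer are invented
for the example. It only shows how writing page data at a descriptor's address
would corrupt that descriptor and its neighbours.

```python
import struct

PAGE_SIZE = 4096
DESC_FMT = "<II"                      # hypothetical descriptor: mapping, pte_chain
DESC_SIZE = struct.calcsize(DESC_FMT)
NPAGES = 4
DATA_BASE = PAGE_SIZE                 # data pages start after the descriptor table

# Flat toy memory: descriptor table first, then the data pages themselves.
memory = bytearray(DATA_BASE + NPAGES * PAGE_SIZE)


def desc_addr(pfn):
    """Address of the descriptor ("struct page") for page number pfn."""
    return pfn * DESC_SIZE


def data_addr(pfn):
    """Address of the actual page contents for page number pfn."""
    return DATA_BASE + pfn * PAGE_SIZE


def read_desc(pfn):
    return struct.unpack_from(DESC_FMT, memory, desc_addr(pfn))


def write_buffer(addr, payload):
    memory[addr:addr + len(payload)] = payload


payload = b"\xff" * 16

# Correct use: the payload goes into the page's data area.
write_buffer(data_addr(1), payload)
after_correct = read_desc(1)          # descriptor is untouched

# Buggy use: the descriptor address is mistaken for the data address.
write_buffer(desc_addr(1), payload)
after_bug = (read_desc(1), read_desc(2))   # two descriptors are now garbage
```

With a 16-byte payload and 8-byte descriptors, the buggy write overruns the
descriptor for page 1 and also clobbers the one for page 2, which fits the
"a page structure or several of them" wording.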
 
Are you able to say who at IBM suggested it might be the TG3 driver? 

Comment 7 Alan Sparks 2006-04-12 20:20:42 UTC
I am pretty desperately searching for a solution to a very similar situation on
our Oracle server running kernel 2.4.21-40.ELsmp.  This is on an HP ProLiant DL380,
8GB RAM, 2 x Xeon 3.2GHz processors.

We have replaced the memory and finally moved the disks to a new server box.  Same problem.

Traceback on crash:
Apr 12 13:14:50 db01-01 kernel: Page has mapping still set. This is a serious situation. However if you
Apr 12 13:14:50 db01-01 kernel: are using the NVidia binary only module please report this bug to
Apr 12 13:14:50 db01-01 kernel: NVidia and not to the linux kernel mailinglist.
Apr 12 13:14:50 db01-01 kernel: ------------[ cut here ]------------
Apr 12 13:14:50 db01-01 kernel: kernel BUG at page_alloc.c:225!
Apr 12 13:14:50 db01-01 kernel: invalid operand: 0000
Apr 12 13:14:50 db01-01 kernel: sg nfs lockd sunrpc tg3 microcode keybdev mousedev hid input ehci-hcd usb-uhci usbcore ext3 jbd cciss sd_mod scsi_mod

Comment 8 David Howells 2006-04-12 20:45:59 UTC
How often does this occur?  
  
If I give you a test kernel that can print the whereabouts of the address  
space operations table would you be willing to run it? That might at least  
pinpoint the module that owned the bad page.  

Comment 9 David Howells 2006-04-12 20:50:37 UTC
Also, wasn't there a stack trace attached to the BUG() report? 

Comment 10 Alan Sparks 2006-04-12 23:57:36 UTC
Created attachment 127679 [details]
stack trace of crash

Comment 11 Alan Sparks 2006-04-13 00:00:17 UTC
It occurs anywhere from 2-3 times a day to once every couple of days.
I have attached the stack trace I was able to save.

Comment 12 David Howells 2006-04-13 14:06:49 UTC
I've built a test kernel with extra code that prints additional information
about a bad page that's being freed.  It is available at:
  
http://people.redhat.com/~dhowells/.pickup/asparks-143032/kernel-smp-2.4.21-40.EL.bz143042.1.i686.rpm 
 
If you would be willing to try running that, it should produce a crash dump 
with more information about the page that was being freed incorrectly.  This 
information should appear in the kernel console log, just before the BUG() 
report. 

Comment 13 Alan Sparks 2006-04-13 15:12:20 UTC
Another traceback, all I can get off the console:

CPU:    3
EIP:    0060:[<c0159560>]    Not tainted
EFLAGS: 00210286

EIP is at __free_pages_ok [kernel] 0x3e0 (2.4.21-40.ELsmp/i686)
eax: 00000033   ebx: c56bf9e0   ecx: 00000001   edx: c0387e98
esi: f62d0a80   edi: 00000000   ebp: 00000000   esp: cd7d5ec8
ds: 0068   es: 0068   ss: 0068
Process keventd (pid: 6, stackpage=cd7d5000)
Stack: c02c1ea8 00000363 c000a308 ff061000 cd7d5ee4 f5ce9180 00000008 cd7d5ee4
       cd7d5ee4 00000000 00000001 cd7d5f10 f5ce9180 00000001 f62d0a80 00000000
       00000000 c014cf3e cd7d5f10 cd7d5f10 00000000 cd7d4000 00000000 00000e00
Call Trace:   [<c014cf3e>] __iodesc_free [kernel] 0xde (0xcd7d5f0c)
[<c0161e9c>] kmap_high [kernel] 0x5c (0xcd7d5f28)
[<c014d87b>] __iodesc_read_finish [kernel] 0x22b (0xcd7d5f38)
[<c01302ca>] __run_task_queue [kernel] 0x6a (0xcd7d5f74)
[<c013c9ad>] context_thread [kernel] 0x13d (0xcd7d5f8c)
[<c013c870>] context_thread [kernel] 0x0 (0xcd7d5fe0)
[<c01095cd>] kernel_thread_helper [kernel] 0x5 (0xcd7d5ff0)

Code: 0f 0b e1 00 33 17 2c c0 e9 6c fc ff ff 9c 5a fa f0 fe 0d 70

Kernel panic: Fatal exception
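
The Code: line in this oops can be cross-checked against the BUG message. The
sketch below assumes the i386 2.4 BUG() layout, in which a ud2 opcode (0f 0b)
is followed by a 16-bit line number and a 32-bit pointer to the file name;
that layout is an assumption about how this kernel was built, and decode_bug
is a hypothetical helper name.

```python
import struct


def decode_bug(code_hex):
    """Decode 'ud2; 16-bit line; 32-bit file pointer' from oops code bytes.

    Returns (line, file_ptr), or None if the bytes do not start with ud2.
    """
    code = bytes.fromhex(code_hex)
    if code[:2] != b"\x0f\x0b":       # ud2 opcode
        return None
    # Little-endian: unsigned 16-bit line number, unsigned 32-bit pointer.
    line, file_ptr = struct.unpack_from("<HI", code, 2)
    return line, file_ptr


# Leading bytes of the Code: line from the oops above.
line, file_ptr = decode_bug("0f 0b e1 00 33 17 2c c0")
print(line, hex(file_ptr))            # prints: 225 0xc02c1733
```

Under that assumption the decoded line number is 225, which agrees with the
"kernel BUG at page_alloc.c:225!" message in comment 7.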


Comment 14 David Howells 2006-04-13 15:18:51 UTC
Created attachment 127711 [details]
Debugging patch added to test kernel

Comment 15 David Elliott 2007-08-29 13:55:36 UTC
We are having a similar problem here: HP DL380 G4, Red Hat AS, kernel
2.4.21-32.0.1.ELsmp,
2 x Intel Xeon 3.6, 6GB RAM, Oracle database server.

The last message in /var/log/messages is:

Aug 29 12:43:41 oracle4 kernel: Page has mapping still set. This is a serious situation. However if you
Aug 29 12:43:41 oracle4 kernel: are using the NVidia binary only module please report this bug to
Aug 29 12:43:41 oracle4 kernel: NVidia and not to the linux kernel mailinglist.
Aug 29 12:43:41 oracle4 kernel: ------------[ cut here ]------------

There's a kernel panic on the screen at this point, but we're not set up to
capture this information right now. Has there been any more news on this?




Comment 16 RHEL Program Management 2007-10-19 19:11:04 UTC
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.