Bug 143042
| Field | Value |
|---|---|
| Summary | kernel BUG at page_alloc.c:242 |
| Product | Red Hat Enterprise Linux 3 |
| Component | kernel |
| Version | 3.0 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | high |
| Priority | medium |
| Reporter | jonathan higgins <jhiggins> |
| Assignee | David Howells <dhowells> |
| QA Contact | Brian Brock <bbrock> |
| CC | asparks, ox23fgu02, petrides, riel |
| Doc Type | Bug Fix |
| Last Closed | 2007-10-19 19:11:04 UTC |

Description
jonathan higgins
2004-12-15 22:04:13 UTC
Created attachment 108658 [details]
kernel BUG() message captured at console
Jonathan, Red Hat does not support custom-built kernels. If you can reproduce this crash with a stock RHEL kernel, please post the full console oops output. Otherwise, please set this to CLOSED/NOTABUG. Thanks in advance. -ernie

Rebuilt the system from scratch using a stock kernel.

Created attachment 113648 [details]
kernel oops
This has grown a bit stale because I was attempting to get IBM to deal with this issue, but they are pointing the finger at the tg3 driver. They claim that there is a Red Hat Issue Tracker 64633. I have been looking all over for this issue tracker and have had no success.

I've looked at issue 64633 and I don't immediately see its relevance, except that it's updating the TG3 driver, which might be the cause. The only reason I can see that it might be the TG3 driver is that it's involved in the second panic. However, given the initial BUG report and the subsequent first panic whilst the kernel appears to be trying to recover from the BUG, I wouldn't trust the second panic very far as being the cause of the problem.

The initial BUG is incurred whilst a page is being freed. The kernel checks that the page has been correctly deinitialised before actually returning it to the "free list", but in this case found that the page was still involved in an RMAP chain somewhere. My guess would be that something mucked up a page structure or several of them, possibly by getting the allocation functions mixed up and using the page struct pointer as the pointer to the actual page, though I'd've expected something like that to come to light a lot earlier.

Are you able to say who at IBM suggested it might be the TG3 driver?

I am pretty desperately searching for a solution to a very similar situation on our Oracle server running kernel 2.4.21-40.ELsmp. This is on an HP ProLiant DL380 with 8GB RAM and 2 x Xeon 3.2GHz processors. We have replaced the memory and finally moved the disks to a new server box. Same problem. Traceback on crash:

Apr 12 13:14:50 db01-01 kernel: Page has mapping still set. This is a serious situation. However if you
Apr 12 13:14:50 db01-01 kernel: are using the NVidia binary only module please report this bug to
Apr 12 13:14:50 db01-01 kernel: NVidia and not to the linux kernel mailinglist.
Apr 12 13:14:50 db01-01 kernel: ------------[ cut here ]------------
Apr 12 13:14:50 db01-01 kernel: kernel BUG at page_alloc.c:225!
Apr 12 13:14:50 db01-01 kernel: invalid operand: 0000
Apr 12 13:14:50 db01-01 kernel: sg nfs lockd sunrpc tg3 microcode keybdev mousedev hid input ehci-hcd usb-uhci usbcore ext3 jbd cciss sd_mod scsi_mod

How often does this occur? If I give you a test kernel that can print the whereabouts of the address space operations table, would you be willing to run it? That might at least pinpoint the module that owned the bad page.

Also, wasn't there a stack trace attached to the BUG() report?

Created attachment 127679 [details]
stack trace of crash
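
To make the analysis earlier in this thread concrete (a page must be fully deinitialised before it goes back on the free list), here is a minimal user-space sketch of that kind of sanity check. It is an illustration only, not the RHEL 3 source: `struct model_page`, its field names, and `check_page_before_free` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's page descriptor. */
struct model_page {
    void *mapping;    /* owning address space; NULL once detached */
    void *pte_chain;  /* reverse-mapping (RMAP) chain; NULL when unmapped */
    int   count;      /* reference count; 0 when nobody holds the page */
};

enum free_check {
    FREE_OK = 0,
    FREE_BAD_MAPPING,  /* page still belongs to an address space */
    FREE_BAD_RMAP,     /* page is still on a reverse-mapping chain */
    FREE_BAD_COUNT     /* someone still holds a reference */
};

/* Return the first reason the page is not safe to free, or FREE_OK. */
static enum free_check check_page_before_free(const struct model_page *page)
{
    assert(page != NULL);

    if (page->mapping != NULL)
        return FREE_BAD_MAPPING;
    if (page->pte_chain != NULL)
        return FREE_BAD_RMAP;
    if (page->count != 0)
        return FREE_BAD_COUNT;
    return FREE_OK;
}
```

In this model, the "Page has mapping still set" message quoted above would correspond to `FREE_BAD_MAPPING`, and the RMAP case described for the original report to `FREE_BAD_RMAP`. The real kernel calls BUG() at that point rather than returning a code.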
It occurs anywhere from 2-3 times a day to once every couple of days. Creating an attachment for the stack trace I was able to save.

I've added extra code to the kernel to print extra information about a bad page that's being freed. The test kernel is at:

http://people.redhat.com/~dhowells/.pickup/asparks-143032/kernel-smp-2.4.21-40.EL.bz143042.1.i686.rpm

If you would be willing to try running that, it should produce a crash dump with more information about the page that was being freed incorrectly. This information should appear in the kernel console log, just before the BUG() report.

Another traceback, all I can get off the console:

CPU:    3
EIP:    0060:[<c0159560>]    Not tainted
EFLAGS: 00210286
EIP is at __free_pages_ok [kernel] 0x3e0 (2.4.21-40.ELsmp/i686)
eax: 00000033   ebx: c56bf9e0   ecx: 00000001   edx: c0387e98
esi: f62d0a80   edi: 00000000   ebp: 00000000   esp: cd7d5ec8
ds: 0068   es: 0068   ss: 0068
Process keventd (pid: 6, stackpage=cd7d5000)
Stack: c02c1ea8 00000363 c000a308 ff061000 cd7d5ee4 f5ce9180 00000008 cd7d5ee4
       cd7d5ee4 00000000 00000001 cd7d5f10 f5ce9180 00000001 f62d0a80 00000000
       00000000 c014cf3e cd7d5f10 cd7d5f10 00000000 cd7d4000 00000000 00000e00
Call Trace:
[<c014cf3e>] __iodesc_free [kernel] 0xde (0xcd7d5f0c)
[<c0161e9c>] kmap_high [kernel] 0x5c (0xcd7d5f28)
[<c014d87b>] __iodesc_read_finish [kernel] 0x22b (0xcd7d5f38)
[<c01302ca>] __run_task_queue [kernel] 0x6a (0xcd7d5f74)
[<c013c9ad>] context_thread [kernel] 0x13d (0xcd7d5f8c)
[<c013c870>] context_thread [kernel] 0x0 (0xcd7d5fe0)
[<c01095cd>] kernel_thread_helper [kernel] 0x5 (0xcd7d5ff0)
Code: 0f 0b e1 00 33 17 2c c0 e9 6c fc ff ff 9c 5a fa f0 fe 0d 70
Kernel panic: Fatal exception

Created attachment 127711 [details]
Debugging patch added to test kernel
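
Separately from the attached patch, the `Code:` line of the console traceback above can be read by hand. The sketch below assumes the verbose BUG() layout used on i386 by 2.4-era kernels: a `ud2` opcode (`0f 0b`), then a 16-bit little-endian line number, then a 32-bit little-endian address of the file-name string. The names `struct bug_info` and `decode_bug_bytes` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded BUG() trailer: source line and address of the file-name string. */
struct bug_info {
    uint16_t line;
    uint32_t file_addr;
};

/*
 * Assumed byte layout (i386, verbose BUG()):
 *   0f 0b        ud2 opcode
 *   LL LL        16-bit line number, little endian
 *   AA AA AA AA  32-bit address of the file-name string, little endian
 * Returns 0 on success, -1 if the buffer is too short or does not
 * begin with ud2.
 */
static int decode_bug_bytes(const uint8_t *code, size_t len,
                            struct bug_info *out)
{
    assert(code != NULL && out != NULL);

    if (len < 8 || code[0] != 0x0f || code[1] != 0x0b)
        return -1;

    out->line = (uint16_t)(code[2] | (code[3] << 8));
    out->file_addr = (uint32_t)code[4]
                   | ((uint32_t)code[5] << 8)
                   | ((uint32_t)code[6] << 16)
                   | ((uint32_t)code[7] << 24);
    return 0;
}
```

Under that assumption, feeding in the first eight bytes from the traceback (`0f 0b e1 00 33 17 2c c0`) yields line 225 and a file-name address of 0xc02c1733. Line 225 agrees with the "kernel BUG at page_alloc.c:225!" message logged on the same 2.4.21-40.ELsmp kernel earlier in this thread.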
We are having a similar problem here: HP DL380 G4, Red Hat AS, kernel 2.4.21-32.0.1.ELsmp, 2 x Intel Xeon 3.6, 6GB RAM, Oracle Database server. The last message in /var/log/messages is:

Aug 29 12:43:41 oracle4 kernel: Page has mapping still set. This is a serious situation. However if you
Aug 29 12:43:41 oracle4 kernel: are using the NVidia binary only module please report this bug to
Aug 29 12:43:41 oracle4 kernel: NVidia and not to the linux kernel mailinglist.
Aug 29 12:43:41 oracle4 kernel: ------------[ cut here ]------------

There's a kernel panic on the screen at this point, but we're not set up to capture this information right now. Was there any more news on this at all?

This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.