Bug 141905
| Summary: | kernel 2.4.21-25.ELsmp panic (kscand) | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 3 | Reporter: | John Caruso <jcaruso> |
| Component: | kernel | Assignee: | Larry Woodman <lwoodman> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 3.0 | CC: | anderson, jturner, peterm, petrides, riel |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | i686 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2005-07-22 02:06:47 UTC | Type: | --- |
| Bug Blocks: | 156321 | ||
John, if you can set up netdump and get us a kernel core dump file for
this problem it would be very useful. Is this possible?
Larry
BTW, this appears to be memory corruption in mem_map (the array of page
structs).
The crash in page_referenced was caused by a bad page->pte.chain value.
----------------------------------------------------------------
int page_referenced(struct page * page, int * rsslimit)
...
for (pc = page->pte.chain; pc; pc = pte_chain_next(pc)) {
...
chain_ptep_t pte_paddr = pc->ptes[i];
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The assembler code for this is:
0xc015ff38 <page_referenced+0x2f8>: mov 0x4(%esi,%ebx,4),%eax
where esi: -> bcb64118
----------------------------------------------------------------------
esi holds pc, a kernel pointer, so its value should never be less than 0xc0000000!
So are you saying this might be related to bug 141394? That'd be good news, since the server in this bug is a development server.

I've set up a netdump server...which was non-trivial, since the netdump startup script doesn't allow a client to use a netdump server that's on a different network, and I had to change it to accommodate that. So we'll see what happens the next time this system crashes (which happens about once a week--though I don't know if the crashes are always of the same type).

My guess is that this is the same as bug 141394, but I can't be 100% sure at this point. The dump will certainly help us debug this problem; please let us know as soon as you get one.
Larry

So: today we hit the memory corruption issue from bug 141394 on this server, in non-fatal fashion (the server didn't crash, but we did receive a bogus tripwire alert). So it looks as though this bug may indeed be a duplicate of bug 141394--or at least I'm fine with treating it that way until this bug is resolved. So feel free to mark it as a duplicate of that bug, and if this server continues to be unstable after that bug is resolved I'll just open a new case.

Closing as dup on advice in comment #4.

*** This bug has been marked as a duplicate of 141394 ***

We just experienced a kernel panic on the database server of this pair which was NOT running the database--in other words, it was sitting idle except for VCS and periodic tripwire runs. Since it's possible that this was caused by the memory corruption bug, I'll give you the info here--but if I'm wrong about that, just say so and I'll file yet another bug.
Here's the panic info (we don't have a memory dump for it):
----------------------------------------------------
Unable to handle kernel NULL pointer dereference at virtual address 0000002d
printing eip:
021491e4
*pde = 00003001
*pte = 00000000
Oops: 0000
nfs lockd sunrpc gab llt netconsole autofs4 audit tg3 e1000 sg sr_mod cdrom usb-storage keybdev mousedev hid input usb-ohci usbcore ext3 jbd mptscsih mptbase
CPU: 3
EIP: 0060:[<021491e4>] Tainted: PF
EFLAGS: 00010206
EIP is at do_generic_file_read [kernel] 0x174 (2.4.21-25.ELhugemem/i686)
eax: 0000001d ebx: 00000016 ecx: 1312b680 edx: 0000001d
esi: dfb4e1c4 edi: 12ed2c94 ebp: 000000de esp: cea33ef4
ds: 0068 es: 0068 ss: 0068
Process tripwire (pid: 25603, stackpage=cea33000)
Stack: dfb4e100 08208590 00000000 00001000 00000000 00001000 00000000 00000000
       00000000 dfb4e100 fffffff2 00001000 df368d80 ffffffea 00001000 02149e35
       df368d80 df368da0 cea33f5c 02149c80 00000000 02439680 00002710 cea32000
Call Trace:
[<02149e35>] generic_file_new_read [kernel] 0xc5 (0xcea33f30)
[<02149c80>] file_read_actor [kernel] 0x0 (0xcea33f40)
[<02149f5f>] generic_file_read [kernel] 0x2f (0xcea33f7c)
[<02164ea3>] sys_read [kernel] 0xa3 (0xcea33f94)
Code: Bad EIP value.
CPU#0 is frozen.
CPU#1 is frozen.
CPU#2 is frozen.
CPU#3 is executing netdump.
CPU#4 is frozen.
CPU#5 is frozen.
CPU#6 is frozen.
CPU#7 is frozen.
----------------------------------------------------

A fix for the /proc/kcore memory corruption bug, which we believe is the root cause of this problem, has just been committed to the RHEL3 U5 patch pool this evening (in kernel version 2.4.21-27.10.EL).

The fix (referred to in bug 141394) for a data corruption problem has also just been committed to the RHEL3 E6 patch pool (in kernel version 2.4.21-32.0.1.EL).

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-472.html
Description of problem:
Kernel panic

Version-Release number of selected component (if applicable):
kernel 2.4.21-25.ELsmp

Additional info:
We have a system that's been crashing periodically since it's been running the -25.EL kernel. I was finally able to get the stack trace off of the console:
----------------------------------------------------
nfs lockd sunrpc audit tg3 microcode sr_mod cdrom sg sd_mod usb-storage scsi_mo keybdev mousedev hid input usb-ohci usbcore ext3 jbd
CPU: 2
EIP: 0060:[<c015fcd8>] Not tainted
EFLAGS: 00010286
EIP is at page_referenced [kernel] 0x2f8 (2.4.21-25.ELsmp/i686)
eax: bcb64118 ebx: 0000001e ecx: c5b0ea14 edx: c5b0ea14
esi: bcb64118 edi: 00000007 ebp: c03a8178 esp: cd561f2c
ds: 0068 es: 0068 ss: 0068
Process kscand (pid: 12, stackpage=cd561000)
Stack: 00000000 00000074 cd560038 ffffffff 00000000 00000000 00000000 00000000
       00000363 00000a90 0000001e 00000000 ffffffff 00000000 00000001 00000000
       00000000 00000363 c0009b90 fff72000 00000da0 bcb64118 00000001 00000000
Call Trace:
[<c0155952>] scan_active_list [kernel] 0xa2 (0xcd561fa4)
[<c0134ef0>] process_timeout [kernel] 0x0 (0xcd561fb0)
[<c0157120>] kscand [kernel] 0xa0 (0xcd561fc8)
[<c0157080>] kscand [kernel] 0x0 (0xcd561fe0)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xcd561ff0)
Code: 8b 44 9e 04 85 c0 0f 84 01 01 00 00 89 c1 31 db c1 e1 03 0f
Kernel panic: Fatal exception
----------------------------------------------------