Bug 141905 - kernel 2.4.21-25.ELsmp panic (kscand)
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686 Linux
Priority: medium  Severity: medium
Assigned To: Larry Woodman
QA Contact: Brian Brock
Depends On:
Blocks: 156321
Reported: 2004-12-04 21:36 EST by John Caruso
Modified: 2007-11-30 17:07 EST (History)
Doc Type: Bug Fix
Last Closed: 2005-07-21 22:06:47 EDT
Description John Caruso 2004-12-04 21:36:21 EST
Description of problem:
Kernel panic

Version-Release number of selected component (if applicable):
kernel 2.4.21-25.ELsmp

Additional info:
We have a system that's been crashing periodically ever since it 
started running the -25.EL kernel.  I was finally able to capture the 
stack trace from the console:

nfs lockd sunrpc audit tg3 microcode sr_mod cdrom sg sd_mod usb-storage scsi_mo
 keybdev mousedev hid input usb-ohci usbcore ext3 jbd
CPU:    2
EIP:    0060:[<c015fcd8>]    Not tainted
EFLAGS: 00010286

EIP is at page_referenced [kernel] 0x2f8 (2.4.21-25.ELsmp/i686)
eax: bcb64118   ebx: 0000001e   ecx: c5b0ea14   edx: c5b0ea14
esi: bcb64118   edi: 00000007   ebp: c03a8178   esp: cd561f2c
ds: 0068   es: 0068  ss: 0068
Process kscand (pid: 12, stackpage=cd561000)
Stack: 00000000 00000074 cd560038 ffffffff 00000000 00000000 00000000 00000000
       00000363 00000a90 0000001e 00000000 ffffffff 00000000 00000001 00000000
       00000000 00000363 c0009b90 fff72000 00000da0 bcb64118 00000001 00000000
Call Trace:   [<c0155952>] scan_active_list [kernel] 0xa2 (0xcd561fa4)
[<c0134ef0>] process_timeout [kernel] 0x0 (0xcd561fb0)
[<c0157120>] kscand [kernel] 0xa0 (0xcd561fc8)
[<c0157080>] kscand [kernel] 0x0 (0xcd561fe0)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xcd561ff0)

Code: 8b 44 9e 04 85 c0 0f 84 01 01 00 00 89 c1 31 db c1 e1 03 0f

Kernel panic: Fatal exception
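
For readers unfamiliar with the trace format above: each Call Trace entry gives an address, the symbol it falls in, and the offset into that symbol, so subtracting the offset should give the start of the function. The following is a minimal sketch of that arithmetic, not part of the original report. The names `trace_entry`, `parse_trace_entry` and `function_start` are hypothetical, and treating the trailing parenthesized value as the stack location of the entry is an assumption based on its falling inside the reported stackpage.

```c
#include <stdio.h>
#include <string.h>

/* One entry of the form: [<c0157120>] kscand [kernel] 0xa0 (0xcd561fc8) */
struct trace_entry {
    unsigned long addr;        /* address inside the function */
    char symbol[64];           /* symbol name printed by the kernel */
    unsigned long offset;      /* offset of addr from the symbol start */
    unsigned long stack_slot;  /* assumed: where the entry sat on the stack */
};

/* Parse a trace line; any prefix such as "Call Trace:" is skipped.
 * Returns 0 on success, -1 if the line holds no complete entry. */
int parse_trace_entry(const char *line, struct trace_entry *e)
{
    const char *p = strstr(line, "[<");
    if (p == NULL)
        return -1;
    /* %lx accepts an optional 0x prefix, so "0xa0" and "c0157120" both parse. */
    int n = sscanf(p, "[<%lx>] %63s [kernel] %lx (%lx)",
                   &e->addr, e->symbol, &e->offset, &e->stack_slot);
    return n == 4 ? 0 : -1;
}

/* Start address of the function the entry points into. */
unsigned long function_start(const struct trace_entry *e)
{
    return e->addr - e->offset;
}
```

Applied to the trace above, the entry `[<c0157120>] kscand [kernel] 0xa0` yields a start address of c0157080, which matches the next entry, `[<c0157080>] kscand [kernel] 0x0`.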
Comment 1 Larry Woodman 2004-12-07 14:17:55 EST
John, if you can set up netdump and get us a kernel core dump file for
this problem it would be very useful.  Is this possible?

Larry


BTW, this appears to be memory corruption in mem_map (the array of page
structs).

The crash in page_referenced was caused by a bad page->pte.chain value.
----------------------------------------------------------------
int page_referenced(struct page * page, int * rsslimit)
...
   for (pc = page->pte.chain; pc; pc = pte_chain_next(pc)) {
   ...
      chain_ptep_t pte_paddr = pc->ptes[i];
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The assembler code for this is:
0xc015ff38 <page_referenced+0x2f8>:     mov    0x4(%esi,%ebx,4),%eax

where esi: -> bcb64118
----------------------------------------------------------------------

esi holds pc here, which must be a kernel virtual address, so this esi
value can never be less than 0xc0000000!
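
The reasoning above can be checked by hand from the oops itself. The sketch below is illustrative only and is not kernel code. It decodes the first four bytes of the "Code:" line (8b 44 9e 04) into the base/index/scale/displacement form of the mov shown above, computes the address the instruction tried to read from the reported esi and ebx values, and tests the pointer against the 0xc0000000 bound quoted in this comment. All names (`mem_operand`, `decode_mov_sib_disp8`, `effective_address`, `is_kernel_address`, `KERNEL_BASE`) are hypothetical, and the decoder handles only this one instruction encoding.

```c
#include <stdint.h>

/* Operands of "mov disp8(base,index,scale),reg" (x86 register numbers:
 * 0=eax 1=ecx 2=edx 3=ebx 4=esp 5=ebp 6=esi 7=edi). */
struct mem_operand {
    unsigned reg;    /* destination register */
    unsigned base;   /* base register */
    unsigned index;  /* index register */
    unsigned scale;  /* 1, 2, 4 or 8 */
    int8_t   disp;   /* signed 8-bit displacement */
};

/* Decode opcode 8b with ModRM mod=01, rm=100 (SIB byte plus disp8).
 * Returns 0 on success, -1 for any other encoding. */
int decode_mov_sib_disp8(const uint8_t b[4], struct mem_operand *out)
{
    if (b[0] != 0x8b)            /* mov r32, r/m32 */
        return -1;
    if ((b[1] >> 6) != 1)        /* mod = 01: 8-bit displacement follows */
        return -1;
    if ((b[1] & 7) != 4)         /* rm = 100: SIB byte follows */
        return -1;
    if (((b[2] >> 3) & 7) == 4)  /* index = 100 means "no index register" */
        return -1;
    out->reg   = (b[1] >> 3) & 7;
    out->scale = 1u << (b[2] >> 6);
    out->index = (b[2] >> 3) & 7;
    out->base  = b[2] & 7;
    out->disp  = (int8_t)b[3];
    return 0;
}

/* Address the instruction reads: base + index * scale + disp (32-bit wrap). */
uint32_t effective_address(uint32_t base, uint32_t index,
                           unsigned scale, int8_t disp)
{
    return base + index * scale + (uint32_t)(int32_t)disp;
}

/* Lower bound for kernel virtual addresses, as quoted in the comment above. */
#define KERNEL_BASE 0xc0000000u

int is_kernel_address(uint32_t addr)
{
    return addr >= KERNEL_BASE;
}
```

With the register values from the oops (esi = bcb64118, ebx = 0000001e), the decoded operand is 0x4(%esi,%ebx,4) and the load address works out to bcb64194. Both it and esi itself fall below 0xc0000000, which is why the pointer is considered corrupt.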


Comment 2 John Caruso 2004-12-10 14:19:35 EST
So are you saying this might be related to bug 141394?  That'd be 
good news, since the server in this bug is a development server.

I've set up a netdump server...which was non-trivial, since the 
netdump startup script doesn't allow a client to use a netdump server 
that's on a different network, and I had to change it to accommodate 
that.  So we'll see what happens the next time this system crashes 
(which happens about once a week--though I don't know if the crashes 
are always of the same type).
Comment 3 Larry Woodman 2004-12-10 14:24:50 EST
My guess is that this is the same as bug 141394, but I can't be 100%
sure at this point.  The dump will certainly help us debug this
problem, so please let us know as soon as you get one.

Larry
Comment 4 John Caruso 2004-12-13 13:18:55 EST
So: today we hit the memory corruption issue from bug 141394 on this 
server, in non-fatal fashion (the server didn't crash, but we did 
receive a bogus tripwire alert).  So it looks as though this bug may 
indeed be a duplicate of bug 141394--or at least I'm fine with 
treating it that way until this bug is resolved.  So feel free to 
mark it as a duplicate of that bug, and if this server continues to 
be unstable after that bug is resolved I'll just open a new case.
Comment 5 Ernie Petrides 2004-12-13 16:30:56 EST
Closing as dup on advice in comment #4.


*** This bug has been marked as a duplicate of 141394 ***
Comment 6 John Caruso 2004-12-14 18:46:04 EST
We just experienced a kernel panic on the database server of this 
pair which was NOT running the database--in other words, it was 
sitting idle except for VCS and periodic tripwire runs.  Since it's 
possible that this was caused by the memory corruption bug, I'll give 
you the info here--but if I'm wrong about that, just say so and I'll 
file yet another bug.  Here's the panic info (we don't have a memory 
dump for it):

----------------------------------------------------
Unable to handle kernel NULL pointer dereference at virtual address 0000002d
printing eip:
021491e4
*pde = 00003001
*pte = 00000000
Oops: 0000
nfs lockd sunrpc gab llt netconsole autofs4 audit tg3 e1000 sg sr_mod cdrom usb-storage keybdev mousedev hid input usb-ohci usbcore ext3 jbd mptscsih mptbase
CPU:    3
EIP:    0060:[<021491e4>]    Tainted: PF
EFLAGS: 00010206

EIP is at do_generic_file_read [kernel] 0x174 (2.4.21-25.ELhugemem/i686)
eax: 0000001d   ebx: 00000016   ecx: 1312b680   edx: 0000001d
esi: dfb4e1c4   edi: 12ed2c94   ebp: 000000de   esp: cea33ef4
ds: 0068   es: 0068   ss: 0068
Process tripwire (pid: 25603, stackpage=cea33000)
Stack: dfb4e100 08208590 00000000 00001000 00000000 00001000 00000000 00000000
       00000000 dfb4e100 fffffff2 00001000 df368d80 ffffffea 00001000 02149e35
       df368d80 df368da0 cea33f5c 02149c80 00000000 02439680 00002710 cea32000
Call Trace:   [<02149e35>] generic_file_new_read [kernel] 0xc5 (0xcea33f30)
[<02149c80>] file_read_actor [kernel] 0x0 (0xcea33f40)
[<02149f5f>] generic_file_read [kernel] 0x2f (0xcea33f7c)
[<02164ea3>] sys_read [kernel] 0xa3 (0xcea33f94)

Code: Bad EIP value.

CPU#0 is frozen.
CPU#1 is frozen.
CPU#2 is frozen.
CPU#3 is executing netdump.
CPU#4 is frozen.
CPU#5 is frozen.
CPU#6 is frozen.
CPU#7 is frozen.
Comment 7 Ernie Petrides 2005-01-29 01:17:02 EST
A fix for the /proc/kcore memory corruption bug, which we believe
is the root cause of this problem, has just been committed to the
RHEL3 U5 patch pool this evening (in kernel version 2.4.21-27.10.EL).
Comment 10 Ernie Petrides 2005-05-17 18:30:13 EDT
The fix (referred to in bug 141394) for a data corruption problem has
also just been committed to the RHEL3 E6 patch pool (in kernel version
2.4.21-32.0.1.EL).
Comment 11 Josh Bressers 2005-05-25 12:42:36 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-472.html
