Bug 88870

Summary:	Kernel panic in kmem_cache_reap
Product:	Red Hat Enterprise Linux 2.1	Reporter:	Mitchell Brandsma <mbrandsma>
Component:	kernel	Assignee:	Larry Woodman <lwoodman>
Status:	CLOSED NEXTRELEASE	QA Contact:
Severity:	medium	Docs Contact:
Priority:	medium
Version:	2.1	CC:	gina, hgarcia, hjp, tao, timw
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Last Closed:	2005-09-29 01:55:57 UTC	Type:	---

Description Mitchell Brandsma 2003-04-15 05:53:16 UTC

Description of problem:
X335 server with dual Xeon 2G processors, hyperthreading enabled.  We have 
Oracle RAC and Database software (9.2.0.3) installed and create a database on 
the cluster.  The private backbone for all RAC activities is connected to a 
100Mb switch.  I can fairly reliably induce the kernel panic attached.

Version-Release number of selected component (if applicable):
kernel-2.4.9-e.12smp

How reproducible:
3 installs, 1 or 2 nodes crash on each occasion.

Steps to Reproduce:
1. 9.2.0.3 Oracle RAC cluster running, using hangcheck-timer, certified ocfs 
for 2.4.9-e.12smp, etc.
2. Start ocfs/mount /oracle/oradata shared network storage.
3. Start oracm as root and ensure hangcheck-timer loaded.
4. Remove database and re-add database using dbca.
    
Actual results:
The primary cluster node is not affected, but the other nodes may or may not 
panic - they always seem to produce the same error.  Previous uptime 
irrelevant.  Primary node shows 100% completion of database install.

Expected results:
Cluster should keep on going.

Additional info:
System has 1 GB of RAM and 3 GB of swap, and barely uses any swap, if at all.

invalid operand: 0000
Kernel 2.4.9-e.12smp
CPU:    3
EIP:    0010:[<c01386a8>]    Not tainted
EFLAGS: 00010086
EIP is at kmem_cache_reap [kernel] 0x1f8
eax: 0000001e   ebx: f6386020   ecx: c02f50a4   edx: 00005531
esi: f4b02d04   edi: f4b02d14   ebp: 00000000   esp: c2171f68
ds: 0018   es: 0018   ss: 0018
Process kswapd (pid: 10, stackpage=c2171000)
Stack: c0265a80 00000721 00000be0 f4b02d14 f4b02d0c c0137913 f4afb400 00000009
       00000000 00000000 00000000 00000000 000000c0 00000000 0008e000 c013c72f
       000000c0 00000000 00000001 00000000 c013c843 000000c0 00000000 00010f00
Call Trace: [<c0265a80>] .rodata.str1.1 [kernel] 0x2b7b
[<c0137913>] kmem_cache_shrink_nr [kernel] 0x53
[<c013c72f>] do_try_to_free_pages [kernel] 0x7f
[<c013c843>] kswapd [kernel] 0x103
[<c013c740>] kswapd [kernel] 0x0
[<c0105000>] stext [kernel] 0x0
[<c02f0018>] __kallsyms [kernel] 0x73e80
[<c0105000>] stext [kernel] 0x0
[<c0105836>] kernel_thread [kernel] 0x26
[<c013c740>] kswapd [kernel] 0x0


Code: 0f 0b 58 5a 8b 03 45 39 f8 75 dd 8b 4e 2c 89 ea 8b 7e 4c d3
CPU#0 is frozen.
CPU#2 is frozen.
CPU#1 is frozen.
< netdump activated - performing handshake with the client. >
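
For anyone comparing traces like the one above, here is a minimal sketch that splits the "Call Trace" lines into address, symbol, module and offset. The regular expression and the function name parse_trace are my own for illustration and are not part of any kernel tooling. Subtracting the offset from the address gives the start of the symbol, which for kswapd matches the 0x0 entry in the trace (c013c740).

```python
import re

# Matches lines such as "[<c0137913>] kmem_cache_shrink_nr [kernel] 0x53".
# The pattern is an assumption based on the trace format shown in this report.
TRACE_RE = re.compile(r"\[<([0-9a-f]{8})>\]\s+(\S+)\s+\[(\w+)\]\s+(0x[0-9a-f]+)")


def parse_trace(text):
    """Return (address, symbol, module, offset) tuples from an oops dump."""
    return [
        (int(addr, 16), sym, mod, int(off, 16))
        for addr, sym, mod, off in TRACE_RE.findall(text)
    ]


# First four frames, copied from the report above.
oops = """\
Call Trace: [<c0265a80>] .rodata.str1.1 [kernel] 0x2b7b
[<c0137913>] kmem_cache_shrink_nr [kernel] 0x53
[<c013c72f>] do_try_to_free_pages [kernel] 0x7f
[<c013c843>] kswapd [kernel] 0x103
"""

frames = parse_trace(oops)
for addr, sym, mod, off in frames:
    # addr - off is the start address of the symbol the frame falls inside.
    print(f"{sym:<24} start={addr - off:#010x} offset={off:#x}")
```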

Comment 1 Mitchell Brandsma 2003-04-15 07:03:31 UTC

I have just managed to get an almost identical kernel panic while running a 
relatively large query - here's a diff between the error encountered at the end 
of database creation and this occurrence during the large query:
3c3
< CPU:    3
---
> CPU:    1
7,8c7,8
< eax: 0000001e   ebx: f6386020   ecx: c02f50a4   edx: 00005531
< esi: f4b02d04   edi: f4b02d14   ebp: 00000000   esp: c2171f68
---
> eax: 0000001e   ebx: f6f2b000   ecx: c02f50a4   edx: 000054b0
> esi: f5cbadf8   edi: f5cbae08   ebp: 00000000   esp: c2171f68
11,12c11,12
< Stack: c0265a80 00000721 00000be0 f4b02d14 f4b02d0c c0137913 f4afb400 00000009
<        00000000 00000000 00000000 00000000 000000c0 00000000 0008e000 c013c72f
---
> Stack: c0265a80 00000721 00000be0 f5cbae08 f5cbae00 c0137913 f5cab800 00000004
>        00000001 00000001 f5cbad04 00000000 000000c0 00000000 0008e000 c013c72f
29c29
< CPU#1 is frozen.
---
> CPU#3 is frozen.

Comment 2 Tim Wright 2004-02-18 18:19:47 UTC

I am seeing an identical panic on a Dell 2650 running 2.4.9-e.35. This
is running a stress test. I am curious... what SCSI (RAID) controller
were you using on the X335? We are very suspicious of the aacraid driver.

Tim

Comment 3 Mitchell Brandsma 2004-02-19 00:46:13 UTC

The SCSI controller for the local mirror is an LSI Logic 53c1030 (rev 
07) using the inbuilt RAID1 capability; it runs on the MPT Fusion 
drivers (mptscsih and mptbase).

The SAN access is through a QLA2312 HBA.

I must say, I haven't seen this error in a long time, not since we moved 
to the 2.4.9e20 or so kernel and upgraded ocfs (currently 1.0.9-4) 
on the previously affected boxes.

Comment 4 Peter J. Holzer 2004-11-18 16:55:11 UTC

We also had a very similar crash here:

kernel 2.4.9-e.49smp
Dual P4 Xeon with hyperthreading
local disks are on an Adaptec AIC-7899 Ultra 160/m SCSI host adapter
(software raid1 for all filesystems and swap).

SAN access is also through QLA2312 HBAs (+ EMC PowerPath). The kernel
modules were loaded, but no filesystem was mounted at the time of the
crash.

The stack trace I typed off the screen looks almost exactly like the
one posted by Mitchell (I didn't write down register values, etc.)

klogd managed to save the following fragment to syslog before the crash:

Nov 18 16:19:51 eldil kernel: ------------[ cut here ]------------
Nov 18 16:19:51 eldil kernel: kernel BUG at slab.c:1826!
Nov 18 16:19:51 eldil kernel: invalid operand: 0000
Nov 18 16:19:51 eldil kernel: Kernel 2.4.9-e.49smp
Nov 18 16:19:51 eldil kernel: CPU:    2
Nov 18 16:19:51 eldil kernel: EIP:    0010:[kmem_cache_reap+504/912]    Tainted: P
Nov 18 16:19:51 eldil kernel: EIP:    0010:[<c01389a8>]    Tainted: P
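
As a quick cross-check against the original description, the sketch below compares the two EIP values. It only uses numbers quoted in this report; the variable names are mine. The offset 0x1f8 from the 2.4.9-e.12smp oops equals the decimal 504 that klogd printed here, so both kernels stopped at the same offset inside kmem_cache_reap even though the absolute addresses differ between builds. The "Code:" line in the description also starts with 0f 0b, which is the x86 UD2 opcode; that is consistent with "invalid operand" and with a BUG() check firing, though nobody in this report confirmed the exact source line for the older kernel.

```python
# Offsets as printed in the two reports.
offset_e12 = 0x1f8                 # "kmem_cache_reap [kernel] 0x1f8" (2.4.9-e.12smp)
offset_e49, size_e49 = 504, 912    # "kmem_cache_reap+504/912" (2.4.9-e.49smp)

# Absolute EIP values from the two oopses.
eip_e12 = 0xc01386a8
eip_e49 = 0xc01389a8

# Same offset into the function on both kernels?
same_offset = offset_e12 == offset_e49

# Start of kmem_cache_reap in each build (EIP minus offset).
start_e12 = eip_e12 - offset_e12
start_e49 = eip_e49 - offset_e49

# First two bytes of the "Code:" line from the original description.
code_prefix = bytes.fromhex("0f0b")
is_ud2 = code_prefix == b"\x0f\x0b"    # UD2 is encoded as 0F 0B on x86

print(same_offset)        # True
print(hex(start_e12))     # 0xc01384b0
print(hex(start_e49))     # 0xc01387b0
print(is_ud2)             # True
```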

Comment 5 Mitchell Brandsma 2005-01-12 00:33:52 UTC

This bug report is getting a little old... I think whatever the 
problem was, it was fixed by going to the 2.4.9e20 kernel - it had to 
have been either a bug in the hangcheck timer, an SMP related bug in 
a SCSI driver (maybe the generic module?), or Oracle's 1.0.9-4 OCFS 
module.  We have seen no panics like these with AS2.1 Update 3 or above, 
though.

Note for the latest entry, from slab.c's comments:
* kmem_cache_destroy() CAN CRASH if you try to allocate from the 
cache during kmem_cache_destroy(). The caller must prevent concurrent 
allocs.

Possibly SMP confusion with one CPU trying to allocate while another 
is doing kmem_cache_destroy?  Possibly a missing lock in whatever 
driver module was active?
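
To make the slab.c warning concrete, here is a user-space analogy in Python, not kernel code. ToyCache, worker and everything else in it are hypothetical names invented for this sketch. In the kernel the caller has to provide the exclusion; the toy folds a lock into the cache itself only so that the effect is visible: an allocation can never run in the middle of destroy(), and an allocation after destroy() fails cleanly instead of touching freed state.

```python
import threading


class ToyCache:
    """User-space stand-in for a slab cache (hypothetical, for illustration)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._objects = []
        self._destroyed = False

    def alloc(self):
        # Holding the lock means alloc never overlaps with destroy.
        with self._lock:
            if self._destroyed:
                raise RuntimeError("alloc from destroyed cache")
            obj = object()
            self._objects.append(obj)
            return obj

    def destroy(self):
        with self._lock:
            self._objects.clear()
            self._destroyed = True


def worker(cache, results):
    try:
        for _ in range(1000):
            cache.alloc()
        results.append("done")
    except RuntimeError:
        results.append("refused")  # clean failure instead of corruption


cache = ToyCache()
results = []
threads = [threading.Thread(target=worker, args=(cache, results)) for _ in range(4)]
for t in threads:
    t.start()
cache.destroy()   # races with the workers on purpose
for t in threads:
    t.join()
print(results)    # mix of "done" and "refused"; the order and mix vary per run
```

Without the lock, the append in alloc() and the clear in destroy() could interleave, which is the toy version of the scenario speculated about above.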

Comment 6 Peter J. Holzer 2005-01-13 11:59:56 UTC

comment #5:
> it was fixed by going to the 2.4.9e20 kernel

I don't think so, as Tim Wright saw it with a 2.4.9-e.35 kernel and I
with a 2.4.9-e.49 kernel. Of course it is possible that these are
different bugs with similar symptoms.

Comment 8 Tim Wright 2005-08-05 13:41:32 UTC

We never got to the root cause, but I will say that we have never seen it on
RHEL3, and we run it on a lot of machines in our test lab, so in all likelihood
completing the upgrade will eliminate the problem for you.

Comment 9 Mitchell Brandsma 2005-08-08 00:08:44 UTC

We have had no issues with kernel panics on EL3, same boxes.  The cause had to 
be somewhere in old SCSI or SMP related code.

If you're curious about the patch history of the kernel, take a look at the 
errata with:
rpm -q --changelog kernel
OR:
rpm -q --changelog -p <kernel-file-with-version>.rpm

I'm usually pleasantly surprised by the detail of bug fixes in there.
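
If you want to search that changelog rather than page through it, a small sketch along these lines may help. It wraps the same rpm -q --changelog command quoted above; the function names and the sample text are hypothetical and made up for illustration, not real errata entries. The filter is kept separate from the rpm call so it can be tried on any text.

```python
import subprocess


def changelog_lines(text, keyword):
    """Return the changelog lines that mention keyword (case-insensitive)."""
    needle = keyword.lower()
    return [line for line in text.splitlines() if needle in line.lower()]


def kernel_changelog(package="kernel"):
    """Run `rpm -q --changelog <package>` and return its output as text."""
    result = subprocess.run(
        ["rpm", "-q", "--changelog", package],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical sample text for illustration only; not real errata entries.
sample = """\
* Mon Jan 01 2001 Example Packager <packager@example.com>
- example entry mentioning slab cache handling
- example entry about an unrelated driver
"""

print(changelog_lines(sample, "slab"))
```

On a machine with rpm installed, changelog_lines(kernel_changelog(), "slab") would run the same filter over the installed kernel package's changelog.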

Comment 10 Larry Woodman 2005-09-29 01:55:57 UTC

I doubt we are going to fix this bug in AS2.1 at this point in its life cycle.
Please upgrade to RHEL3 and the problem will not occur.

Larry Woodman