Description of problem:
X335 server with dual Xeon 2G processors, hyperthreading enabled. We have Oracle RAC and Database software (9.2.0.3) installed and create a database on the cluster. The private backbone for all RAC activities is connected to a 100Mb switch. I can fairly reliably induce the kernel panic attached.

Version-Release number of selected component (if applicable):
kernel-2.4.9-e.12smp

How reproducible:
3 installs, 1 or 2 nodes crash on each occasion.

Steps to Reproduce:
1. 9.2.0.3 Oracle RAC cluster running, using hangcheck-timer, certified ocfs for 2.4.9e-12.smp etc.
2. Start ocfs/mount /oracle/oradata shared network storage.
3. Start oracm as root and ensure hangcheck-timer is loaded.
4. Remove database and re-add database using dbca.

Actual results:
The primary cluster node is not affected, but the other nodes may or may not panic. When they do, they always seem to produce the same error. Previous uptime is irrelevant. The primary node shows 100% completion of the database install.

Expected results:
Cluster should keep on going.

Additional info:
System has 1 GB of RAM and 3 GB of swap, and barely uses any swap, if at all.
invalid operand: 0000
Kernel 2.4.9-e.12smp
CPU:    3
EIP:    0010:[<c01386a8>]    Not tainted
EFLAGS: 00010086
EIP is at kmem_cache_reap [kernel] 0x1f8
eax: 0000001e   ebx: f6386020   ecx: c02f50a4   edx: 00005531
esi: f4b02d04   edi: f4b02d14   ebp: 00000000   esp: c2171f68
ds: 0018   es: 0018   ss: 0018
Process kswapd (pid: 10, stackpage=c2171000)
Stack: c0265a80 00000721 00000be0 f4b02d14 f4b02d0c c0137913 f4afb400 00000009
       00000000 00000000 00000000 00000000 000000c0 00000000 0008e000 c013c72f
       000000c0 00000000 00000001 00000000 c013c843 000000c0 00000000 00010f00
Call Trace: [<c0265a80>] .rodata.str1.1 [kernel] 0x2b7b
[<c0137913>] kmem_cache_shrink_nr [kernel] 0x53
[<c013c72f>] do_try_to_free_pages [kernel] 0x7f
[<c013c843>] kswapd [kernel] 0x103
[<c013c740>] kswapd [kernel] 0x0
[<c0105000>] stext [kernel] 0x0
[<c02f0018>] __kallsyms [kernel] 0x73e80
[<c0105000>] stext [kernel] 0x0
[<c0105836>] kernel_thread [kernel] 0x26
[<c013c740>] kswapd [kernel] 0x0

Code: 0f 0b 58 5a 8b 03 45 39 f8 75 dd 8b 4e 2c 89 ea 8b 7e 4c d3

CPU#0 is frozen.
CPU#2 is frozen.
CPU#1 is frozen.
< netdump activated - performing handshake with the client. >
I have just managed to get an almost identical kernel panic while running a relatively large query. Here's a diff between the error encountered at the end of database creation and this occurrence during the large query:

3c3
< CPU:    3
---
> CPU:    1
7,8c7,8
< eax: 0000001e   ebx: f6386020   ecx: c02f50a4   edx: 00005531
< esi: f4b02d04   edi: f4b02d14   ebp: 00000000   esp: c2171f68
---
> eax: 0000001e   ebx: f6f2b000   ecx: c02f50a4   edx: 000054b0
> esi: f5cbadf8   edi: f5cbae08   ebp: 00000000   esp: c2171f68
11,12c11,12
< Stack: c0265a80 00000721 00000be0 f4b02d14 f4b02d0c c0137913 f4afb400 00000009
<        00000000 00000000 00000000 00000000 000000c0 00000000 0008e000 c013c72f
---
> Stack: c0265a80 00000721 00000be0 f5cbae08 f5cbae00 c0137913 f5cab800 00000004
>        00000001 00000001 f5cbad04 00000000 000000c0 00000000 0008e000 c013c72f
29c29
< CPU#1 is frozen.
---
> CPU#3 is frozen.
I am seeing an identical panic on a Dell 2650 running 2.4.9-e.35. This is running a stress test. I am curious... what SCSI (RAID) controller were you using on the X335? We are very suspicious of the aacraid driver.

Tim
The SCSI controller for the local mirror is an LSI Logic 53c1030 (rev 07) using its built-in RAID1 capability; it works off the MPT Fusion drivers (mptscsih and mptbase). SAN access is through a QLA2312 HBA. I must say, I haven't seen this error in a long time, not since we moved to the 2.4.9e20 or so kernel and upgraded ocfs (currently 1.0.9-4 on the previously affected boxes).
We also had a very similar crash here:

kernel 2.4.9-e.49smp
Dual P4 Xeon with hyperthreading
local disks are on an Adaptec AIC-7899 Ultra 160/m SCSI host adapter (software raid1 for all filesystems and swap).
SAN access is also through QLA2312 HBAs (+ EMC PowerPath).
kernel modules were loaded, but no filesystem was mounted at the time of the crash.

The stack trace I typed off the screen looks almost exactly like the one posted by Mitchell (I didn't write down register values, etc.)

klogd managed to save the following fragment to syslog before the crash:

Nov 18 16:19:51 eldil kernel: ------------[ cut here ]------------
Nov 18 16:19:51 eldil kernel: kernel BUG at slab.c:1826!
Nov 18 16:19:51 eldil kernel: invalid operand: 0000
Nov 18 16:19:51 eldil kernel: Kernel 2.4.9-e.49smp
Nov 18 16:19:51 eldil kernel: CPU:    2
Nov 18 16:19:51 eldil kernel: EIP:    0010:[kmem_cache_reap+504/912]    Tainted: P
Nov 18 16:19:51 eldil kernel: EIP:    0010:[<c01389a8>]    Tainted: P
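For anyone comparing the two reports: the first trace gives the crash site as kmem_cache_reap [kernel] 0x1f8, while the klogd fragment gives kmem_cache_reap+504/912. A minimal sketch of the comparison follows, assuming klogd prints the offset and function size in decimal (an assumption on my part; the helper names are made up for illustration). Since the two traces come from different kernel builds (e.12 vs e.49), a matching offset is suggestive rather than conclusive.

```python
import re

def raw_offset(line):
    """Parse 'EIP is at <sym> [kernel] 0x<hex>' into (symbol, offset)."""
    m = re.search(r"EIP is at (\w+) \[kernel\] (0x[0-9a-fA-F]+)", line)
    if m is None:
        raise ValueError("no 'EIP is at' entry found")
    return m.group(1), int(m.group(2), 16)

def symbolic_offset(line):
    """Parse klogd-style '[<sym>+<off>/<size>]' into (symbol, offset, size).

    Assumption: klogd writes <off> and <size> in decimal.
    """
    m = re.search(r"\[(\w+)\+(\d+)/(\d+)\]", line)
    if m is None:
        raise ValueError("no symbol+offset/size entry found")
    return m.group(1), int(m.group(2)), int(m.group(3))

# Lines copied from the two traces in this report.
sym_a, off_a = raw_offset("EIP is at kmem_cache_reap [kernel] 0x1f8")
sym_b, off_b, size_b = symbolic_offset(
    "EIP:    0010:[kmem_cache_reap+504/912]    Tainted: P"
)

# True when both traces name the same function at the same offset.
same_site = (sym_a == sym_b) and (off_a == off_b)
print(sym_a, off_a, sym_b, off_b, same_site)
```

0x1f8 is 504 in decimal, so under that assumption both panics hit the same offset inside kmem_cache_reap.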
This bug report is getting a little old... I think whatever the problem was, it was fixed by going to the 2.4.9e20 kernel - it had to have been either a bug in the hangcheck timer, an SMP related bug in a SCSI driver (maybe the generic module?), or Oracle's 1.09-4 OCFS module. No panics like these found with AS2.1 Update 3 or above though.

Note for the latest entry, from slab.c's comments:

 * kmem_cache_destroy() CAN CRASH if you try to allocate from the cache
 * during kmem_cache_destroy(). The caller must prevent concurrent allocs.

Possibly SMP confusion with one CPU trying to allocate while another is doing kmem_cache_destroy? Possibly a missing lock in whatever driver module was active?
comment #5:
> it was fixed by going to the 2.4.9e20 kernel

I don't think so, as Tim Wright saw it with a 2.4.9-e.35 kernel and I with a 2.4.9-e.49 kernel. Of course it is possible that these are different bugs with similar symptoms.
We never got to root cause, but I will say that we have never seen it on RHEL3, and we run it on a lot of machines in our test lab, so in all likelihood, completing the upgrade will eliminate the problem for you.
We have had no issues with kernel panics on EL3, same boxes. The cause had to be somewhere in old SCSI or SMP related code. If you're curious about the patch history of the kernel, take a look at the errata with:

    rpm -q --changelog kernel

OR:

    rpm -q --changelog -p <kernel-file-with-version>.rpm

I'm usually pleasantly surprised by the detail of bug fixes in there.
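The changelog can be long, so here is a rough sketch of narrowing it to entries that might relate to this panic. The keyword list and the helper names are my own guesses for illustration, not anything taken from the errata; only the rpm invocation itself comes from the comment above.

```python
import subprocess

# Hypothetical search terms taken from the symbols in the traces above.
KEYWORDS = ("slab", "kmem", "kswapd")

def matching_entries(changelog_text, keywords=KEYWORDS):
    """Return changelog lines that mention any keyword (case-insensitive)."""
    hits = []
    for line in changelog_text.splitlines():
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append(line.strip())
    return hits

def installed_kernel_changelog():
    """Run `rpm -q --changelog kernel` and return its output as text."""
    result = subprocess.run(
        ["rpm", "-q", "--changelog", "kernel"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

# Usage on a box with rpm and a kernel package installed:
#   for entry in matching_entries(installed_kernel_changelog()):
#       print(entry)
```

A plain substring match will miss fixes described in other words, so an empty result does not mean nothing relevant was changed.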
I doubt we are going to fix this bug in AS2.1 at this point in its life cycle. Please upgrade to RHEL3 and the problem will not occur.

Larry Woodman