Bug 88870
| Field | Value |
|---|---|
| Summary | Kernel panic in kmem_cache_reap |
| Product | Red Hat Enterprise Linux 2.1 |
| Component | kernel |
| Version | 2.1 |
| Hardware | i386 |
| OS | Linux |
| Status | CLOSED NEXTRELEASE |
| Severity | medium |
| Priority | medium |
| Reporter | Mitchell Brandsma <mbrandsma> |
| Assignee | Larry Woodman <lwoodman> |
| CC | gina, hgarcia, hjp, tao, timw |
| Doc Type | Bug Fix |
| Last Closed | 2005-09-29 01:55:57 UTC |

Description
Mitchell Brandsma
2003-04-15 05:53:16 UTC
I have just managed to get an almost identical kernel panic while running a relatively large query. Here's a diff between the error encountered at the end of database creation and this occurrence during the large query:

    3c3
    < CPU: 3
    ---
    > CPU: 1
    7,8c7,8
    < eax: 0000001e ebx: f6386020 ecx: c02f50a4 edx: 00005531
    < esi: f4b02d04 edi: f4b02d14 ebp: 00000000 esp: c2171f68
    ---
    > eax: 0000001e ebx: f6f2b000 ecx: c02f50a4 edx: 000054b0
    > esi: f5cbadf8 edi: f5cbae08 ebp: 00000000 esp: c2171f68
    11,12c11,12
    < Stack: c0265a80 00000721 00000be0 f4b02d14 f4b02d0c c0137913 f4afb400 00000009
    < 00000000 00000000 00000000 00000000 000000c0 00000000 0008e000 c013c72f
    ---
    > Stack: c0265a80 00000721 00000be0 f5cbae08 f5cbae00 c0137913 f5cab800 00000004
    > 00000001 00000001 f5cbad04 00000000 000000c0 00000000 0008e000 c013c72f
    29c29
    < CPU#1 is frozen.
    ---
    > CPU#3 is frozen.

I am seeing an identical panic on a Dell 2650 running 2.4.9-e.35. This is running a stress test. I am curious... what SCSI (RAID) controller were you using on the X335? We are very suspicious of the aacraid driver.

Tim

The SCSI controller for the local mirror is an LSI Logic 53c1030 (rev 07), using the inbuilt RAID1 capability; it works off the MPT Fusion drivers (mptscsih and mptbase). The SAN access is through a QLA2312 HBA. I must say, I haven't seen this error in a long time, not since we hit the 2.4.9e20 or so kernel and upgraded ocfs, currently 1.0.9-4 on the previously affected boxes.

We also had a very similar crash here: kernel 2.4.9-e.49smp on a dual P4 Xeon with hyperthreading. Local disks are on an Adaptec AIC-7899 Ultra 160/m SCSI host adapter (software raid1 for all filesystems and swap). SAN access is also through QLA2312 HBAs (+ EMC PowerPath). Kernel modules were loaded, but no filesystem was mounted at the time of the crash. The stack trace I typed off the screen looks almost exactly like the one posted by Mitchell (I didn't write down register values, etc.).

klogd managed to save the following fragment to syslog before the crash:

    Nov 18 16:19:51 eldil kernel: ------------[ cut here ]------------
    Nov 18 16:19:51 eldil kernel: kernel BUG at slab.c:1826!
    Nov 18 16:19:51 eldil kernel: invalid operand: 0000
    Nov 18 16:19:51 eldil kernel: Kernel 2.4.9-e.49smp
    Nov 18 16:19:51 eldil kernel: CPU: 2
    Nov 18 16:19:51 eldil kernel: EIP: 0010:[kmem_cache_reap+504/912] Tainted: P
    Nov 18 16:19:51 eldil kernel: EIP: 0010:[<c01389a8>] Tainted: P

This bug report is getting a little old... I think whatever the problem was, it was fixed by going to the 2.4.9e20 kernel - it had to have been either a bug in the hangcheck timer, an SMP related bug in a SCSI driver (maybe the generic module?), or Oracle's 1.09-4 OCFS module. No panics like these found with AS2.1 Update 3 or above though.

Note for the latest entry, from slab.c's comments:

    * kmem_cache_destroy() CAN CRASH if you try to allocate from the cache during kmem_cache_destroy(). The caller must prevent concurrent allocs.

Possibly SMP confusion with one CPU trying to allocate while another is doing kmem_cache_destroy? Possibly a missing lock in whatever driver module was active?

comment #5:

> it was fixed by going to the 2.4.9e20 kernel

I don't think so, as Tim Wright saw it with a 2.4.9-e.35 kernel and I with a 2.4.9-e.49 kernel. Of course it is possible that these are different bugs with similar symptoms.

We never got to root cause, but I will say that we have never seen it on RHEL3, and we run it on a lot of machines in our test lab, so in all likelihood, completing the upgrade will eliminate the problem for you.

We have had no issues with kernel panics on EL3, same boxes. The cause had to be somewhere in old SCSI or SMP related code. If you're curious about the patch history of the kernel, take a look at the errata with:

    rpm -q --changelog kernel

OR:

    rpm -q --changelog -p <kernel-file-with-version>.rpm

I'm usually pleasantly surprised by the detail of bug fixes in there.

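For anyone who wants to narrow that changelog down to entries that might relate to this panic, here is a minimal Python sketch. It assumes the usual RPM changelog layout, where each entry starts with a header line beginning with `* `. The function names `matching_entries` and `read_kernel_changelog` and the keyword list are hypothetical choices for illustration, not anything from the kernel package or from this report.

```python
import re
import subprocess


def matching_entries(changelog_text, keywords):
    """Return changelog entries that mention any keyword (case-insensitive).

    Assumption: each entry starts with a header line beginning "* ",
    as in typical `rpm -q --changelog` output.
    """
    if not keywords:
        return []

    # Split the text into entries at each "* " header line.
    entries = []
    current = []
    for line in changelog_text.splitlines():
        if line.startswith("* ") and current:
            entries.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        entries.append("\n".join(current).strip())

    # Keywords are matched literally, not as regular expressions.
    pattern = re.compile(
        "|".join(re.escape(k) for k in keywords), re.IGNORECASE
    )
    return [e for e in entries if e and pattern.search(e)]


def read_kernel_changelog():
    """Run the rpm command quoted above; needs an RPM-based system."""
    result = subprocess.run(
        ["rpm", "-q", "--changelog", "kernel"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Example use on an RPM-based machine (keywords are only a guess at
# what might be relevant to this bug):
#
#     for entry in matching_entries(read_kernel_changelog(),
#                                   ["slab", "kmem_cache"]):
#         print(entry, end="\n\n")
```

Keeping the filtering separate from the `rpm` call means the same function can be pointed at changelog text saved from another machine, or at the output of the `-p` form for a package file that is not installed.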
I doubt we are going to fix this bug in AS2.1 at this point in its life cycle. Please upgrade to RHEL3 and the problem will not occur.

Larry Woodman