Description of problem:
Recently (seemingly since the Fedora kernel revision 2188 release) we've
been experiencing frequent kernel panics on one of our production
systems. They occur about 1-2 times a week, but today we've
experienced two of them in a row, the second one occurring only half an
hour after the first.
Using a digital camera I've managed to take a photo of the
console where a stack trace is visible.
The stack trace indicates that the panic occurred in the shrink_cache
function. The stack trace shows the following function calls:
I'll attach the photo and some files related to the system and hardware.
Version-Release number of selected component (if applicable):
Random, occurs about 1-2 times a week
Probably the most relevant additional info is that the system runs on
hardware RAID5 storage on 3Ware 8506-4LP controller.
The driver for that controller isn't the one shipped with the Fedora
kernel (3Ware driver v1.02.00.036) but the latest version manually
compiled from sources (v1.02.00.037); the difference between these
two driver versions is minimal (I'll attach a diff between them).
No other modifications to the stock Fedora kernel and kernel modules
have been made.
Created attachment 100261 [details]
Differences between 3Ware driver v1.02.00.036 and v1.02.00.037
v1.02.00.036 is in the stock Fedora kernel, but we have to use the latest
version.
We had to update the controller's firmware because it hard-locked, and
3Ware strongly advises updating the OS driver to the latest version even
before updating the controller's firmware.
Created attachment 100262 [details]
Photo of console with kernel stacktrace after panic
Created attachment 100263 [details]
dmesg file from the machine
Created attachment 100264 [details]
output from lspci -vv
Created attachment 100265 [details]
output from dmidecode
More detailed hardware specification:
CPU: dual P4 Xeon 2 GHz with Hyper-Threading (4 virtual CPUs)
RAM: 1 GB (2 x 512 MB Kingston DDR with parity control)
Motherboard: Intel SE7501BR2
NIC: Intel PRO/100 Server adapter integrated on the motherboard
Storage: hardware RAID 5 array on a 3Ware 8506-4LP controller, built
from 4 Seagate Serial ATA 120 GB drives
Created attachment 100266 [details]
Created attachment 100572 [details]
Another kernel panic
This one occurred today; this time the system was running in a
higher-resolution text mode, so I was able to capture the full text of
the kernel panic.
Possibly related is bug 121732...
Created attachment 100666 [details]
Another kernel panic that occurred today on kernel-2.4.22-1.2188.nptlsmp.
After this panic I installed the updated kernel 2.4.22-1.2190.nptlsmp,
which apparently resolves the problem (according to bug 121732).
Another panic in refile_inode occurred just today on
kernel-2.4.22-1.2190.nptlsmp.
The problem has not been resolved.
I'll attach a screenshot tomorrow morning.
Created attachment 100732 [details]
Yesterday's panic on kernel-2.4.22-1.2190.nptlsmp
Here's the text of the latest panic with 2190, for better searchability:
Unable to handle kernel NULL pointer dereference at virtual address
*pde = 0e723067
*pte = 00000000
e100 iptable_mangle ipt_REJECT ipt_multiport ipt_state ip_conntrack
iptable_filter ip_tables floppy sg microcode keybdev mousedev
hid input usb-uhci usbcore e
EIP: 0060:[<c01691ae>] Not tainted
EIP is at refile_inode [kernel] 0x4e (2.4.22-1.2190.nptlsmp)
eax: 00000000 ebx: e28fb900 ecx: 00000000 edx: e28fb908
esi: c0376028 edi: c0374fd8 ebp: 0000772e esp: c3193de0
ds: 0068 es: 0068 ss: 0068
Process spamd (pid: 31686, stackpage=c3193000)
Stack: c19187a0 e20fb9c4 c013c642 e28fb900 c19187a0 00000000 c19187a0
c19187a0 000001d2 c3192000 000005c3 000001d2 00000012 0000001d
c0374fd8 c0374fd8 c01464aa c3193e4c 000001d2 0000003c 00000020
Call Trace: [<c013c642>] __remove_inode_page [kernel] 0x82 (0xc3193de8)
[<c01461ba>] shrink_cache [kernel] 0x30a (0xc3193dfc)
[<c01464aa>] shrink_caches [kernel] 0x4a (0xc3193e28)
[<c0146522>] try_to_free_pages_zone [kernel] 0x62 (0xc3193e3c)
[<c0147102>] balance_classzone [kernel] 0x52 (0xc3193e60)
[<c0147438>] __alloc_pages [kernel] 0x188 (0xc3193e7c)
[<c010e968>] call_do_IRQ [kernel] 0x5 (0xc3193e88)
[<c0139b5f>] do_wp_page [kernel] 0x6f (0xc3193ebc)
[<c013a666>] handle_mm_fault [kernel] 0x106 (0xc3193ee0)
[<c011c94c>] do_page_fault [kernel] 0x14c (0xc3193f0c)
[<c011e9c0>] scheduler_tick [kernel] 0x120 (0xc3193f28)
[<c0107b3f>] __switch_to [kernel] 0x16f (0xc3193f44)
[<c011ed8f>] schedule [kernel] 0x7f (0xc3193f68)
[<c012e42e>] update_process_times [kernel] 0x3e (0xc3193f84)
[<c011c800>] do_page_fault [kernel] 0x0 (0xc3193fb0)
[<c0109c18>] error_code [kernel] 0x34 (0xc3193fb8)
Code: 89 01 c7 43 08 00 00 00 00 89 48 04 8b 06 89 50 04 89 43 08
Created attachment 100862 [details]
refile_inode kernel-2.4.22-1.2190 panic from today
Possible fix: a 3Ware support engineer has pointed out that this issue
may have been fixed in kernel 2.4.26:
"In the changelog for 2.4.26, there was a bug in refile_inode() that
was fixed.
I would recommend you try this kernel.
Below is the changelog:
Trond: Avoid refile_inode() from putting locked inodes on the dirty list
Changed EXTRAVERSION to -rc1"
That patch was merged in the 2190 kernel.
It made no difference.
I've asked the author of that patch about the new issue; here is his
reply:
On Thu, 17/06/2004 at 11:01, Aleksander Adamowski wrote:
>> I've seen that you've fixed a bug in the Linux 2.4 kernel related to
refile_inode() (fix applied to kernel-2.4.26).
>> There's still a related nasty crasher bug in refile_inode(); see
this Red Hat Bugzilla bug:
I'm not really a VFS person. That said, it looks to me from the dump you
sent that refile_inode is calling list_del(&inode->i_list) on an inode
that has already been removed from all lists. Normally, such an inode is
supposed to be marked as I_FREEING...
I couldn't find any code in the 2.4.27-pre series that appeared to be
able to put the inode in this bogus state. Somebody else will have to
audit the RedHat kernels to see if they have any such bugs. 8-)
We just saw the same bug; note that it is with the 2179 kernel with
the refile_inode patch. We can't duplicate this panic in QA yet, only
in production :(
I'll attach a decoded oops.
Created attachment 102818 [details]
decoded oops from this panic
Decoded oops from the panic we had, which appears to be the same as the
one reported by the submitter of this bug, and different from bug 121732.
Our system experiencing this problem is a HP DL380 G3 with:
* 2x2.8GHz Xeon processors with Hyperthreading on.
* 4GB RAM
* Internal HP hardware RAID-1 (cciss driver)
So it would appear that the 3Ware driver is not the problem.
Since Trond is rarely wrong, I'm assuming that the problem here has
been fixed in Linux 2.4.27, and that means the problem was fixed
sometime between the 2.4.22-23 series and 2.4.27, but that the fix was
not merged into the FC1 kernel.
Going through the kernel changelogs on kernel.org line by line, I
found two changesets that appear to be significant. Since I am not a
kernel hacker, I cannot confirm that the errors we're experiencing
could be caused by the lack of the two patches referenced below, but I
have a feeling that a kernel hacker with VFS knowledge could confirm
this relatively quickly. In particular, the first was fixed in
2.4.25-pre7 (2.4.25 release) by Rik van Riel with the comment:
"some more fixes for fs/inode.c inode reclaiming changes"
This bug does exactly what Trond refers to: it calls
list_del(&inode->i_list). The question is whether that inode could
already have been removed from all lists; that I do not know.
Rik's original post:
David Woodhouse's followup and approval:
A diff of the fs/inode.c code that is the result of the above mailing
list discussion:
In addition there is a second inode cache related bugfix that seems
like it belongs in the FC1 kernel, also from 2.4.25:
Fixed by David Woodhouse
"Do not leave inodes with stale waitqueue on slab cache"
Both of the above patches apply cleanly to the 2179-2199 kernels
(fs/inode.c wasn't changed between those versions).
My biggest problem right now is that I can't duplicate the oops
in a controlled environment. It happens once a week across all of our
dozen or so servers running this kernel. I've got a test machine now
running ltp, dbench, kernel compiles, and other processes to try to
duplicate the oops, but I haven't seen it in 2 straight days of
testing. It's not clear that the error was seen much (if at all) in
the wild; it looks like Rik fixed it before many people noticed.
From an email exchange with Aleksander, he can't duplicate this
problem in a controlled setting either, it happens about twice per
month for him.
Ideally, a VFS person could look at the above patches and just say
"yes, this patch needs to be applied to the FC1 kernel; it could cause
these panics."
On further inspection, the second fix, "do not leave inodes with stale
waitqueue on slab cache", was already applied in the FC1.2199 kernel.
I will attach the patch to FC1.2199 that implements Rik's fix, which
I'm testing now. Note that we do not use quotas, so I don't think the
second part of his fix is relevant to us.
Created attachment 102912 [details]
patch that implements Rik's inode reclaim fix from 2.4.25
Unfortunately I cannot test this fix, as I've switched to the RHEL
kernel on that machine to remedy the panics.
For the record, we're running the 2.4.21-15.ELsmp RHEL kernel to avoid
the panics.
Running with the FC1.2199 kernel that implements Rik's refile_inode
fix, we've had 4 weeks (28 days) of uptime without a crash. The best
we were doing before was 1 week and often less than that.
If anyone is still running/updating the FC1 kernel this would be a
good patch to use/apply...
Thanks for the bug report. However, Red Hat no longer maintains this
version of the product. Please upgrade to the latest version and open a
new bug if the problem persists.
The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases,
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/