Red Hat Bugzilla – Full Text Bug Listing
|Summary:||Kernel panic in shrink_cache / __remove_inode_page / refile_inode|
|Product:||[Fedora] Fedora||Reporter:||Aleksander Adamowski <bugs-redhat>|
|Component:||kernel||Assignee:||Arjan van de Ven <arjanv>|
|Status:||CLOSED WONTFIX||QA Contact:|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2004-09-29 16:26:29 EDT||Type:||---|
Description Aleksander Adamowski 2004-05-17 05:35:59 EDT
Description of problem:
Recently (seemingly since the Fedora kernel revision 2188 release) we've been experiencing frequent kernel panics on one of our production systems. The frequency is around 1-2 times a week, but today we've experienced two of them in a row, the second one occurring only half an hour after the first.

Using a digital camera I've managed to take a screenshot of the console where a stacktrace is visible. The stacktrace indicates that the panic occurred in the shrink_cache function, and shows the following function calls:

shrink_cache
shrink_caches
try_to_free_pages_zone
ip_local_deliver
balance_classzone
__alloc_pages
__alloc_pages
do_wp_page
do_swap_page
handle_mm_fault
do_page_fault
generic_file_new_read
file_read_actor
generic_file_read
sys_pread
smp_apic_timer_interrupt
do_page_fault
error_code

I'll attach the screenshot and some files related to system and hardware configuration shortly.

Version-Release number of selected component (if applicable):
2.4.22-1.2188.nptlsmp

How reproducible:
Random; occurs about 1-2 times a week.

Additional info:
Probably the most relevant additional info is that the system runs on hardware RAID5 storage on a 3Ware 8506-4LP controller. The driver for that controller isn't the one shipped with the Fedora kernel (3Ware driver v1.02.00.036) but the latest version manually compiled from sources (v1.02.00.037); the difference between these two versions of the driver is minimal (I'll attach a diff between them). No other modifications to the stock Fedora kernel and kernel modules have been made.
Comment 1 Aleksander Adamowski 2004-05-17 05:43:04 EDT
Created attachment 100261 [details] Differences between 3Ware driver v1.02.00.036 and v1.02.00.037

v1.02.00.036 is in the stock Fedora kernel, but we have to use the latest, v1.02.00.037. We had to update the controller's firmware because it hard-locked, and 3Ware strongly advises updating the OS driver to the latest version even before updating the controller's firmware.
Comment 2 Aleksander Adamowski 2004-05-17 05:46:11 EDT
Created attachment 100262 [details] Photo of console with kernel stacktrace after panic
Comment 3 Aleksander Adamowski 2004-05-17 05:53:05 EDT
Created attachment 100263 [details] dmesg file from the machine
Comment 4 Aleksander Adamowski 2004-05-17 05:53:58 EDT
Created attachment 100264 [details] output from lspci -vv
Comment 5 Aleksander Adamowski 2004-05-17 05:54:18 EDT
Created attachment 100265 [details] output from dmidecode
Comment 6 Aleksander Adamowski 2004-05-17 05:58:25 EDT
More detailed hardware specification:
CPU: dual Pentium 4 Xeon 2 GHz with Hyperthreading (4 virtual CPUs)
RAM: 1 GB (2 x 512 MB Kingston DDR with parity control)
Motherboard: Intel SE7501BR2
NIC: Intel PRO/100 Server adapter integrated on the motherboard
Storage: hardware RAID 5 array on a 3Ware 8506-4LP controller, built from 4 Seagate Serial ATA 120 GB drives
Comment 7 Aleksander Adamowski 2004-05-17 05:58:55 EDT
Created attachment 100266 [details] /etc/sysconfig/hwconf file
Comment 8 Aleksander Adamowski 2004-05-26 06:43:22 EDT
Created attachment 100572 [details] Another kernel panic

This one occurred today; this time the system was running in a higher-resolution text mode, so I was able to capture the full text of the kernel panic.
Comment 10 Aleksander Adamowski 2004-05-28 10:11:14 EDT
Created attachment 100666 [details] Another kernel panic that occurred today on kernel-2.4.22-1.2188.nptlsmp

After this panic I've installed the updated kernel 2.4.22-1.2190.nptlsmp, which apparently resolves the problem (according to bug 121732).
Comment 11 Aleksander Adamowski 2004-05-31 15:29:00 EDT
Another panic, in refile_inode, occurred just today on kernel-2.4.22-1.2190.nptlsmp. The problem has not been resolved. I'll attach a screenshot tomorrow morning.
Comment 12 Aleksander Adamowski 2004-06-01 05:05:42 EDT
Created attachment 100732 [details] Yesterday's panic on kernel-2.4.22-1.2190.nptlsmp
Comment 13 Aleksander Adamowski 2004-06-01 07:05:46 EDT
Here's the text of the latest panic with 2190, for better searchability and readability:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c01691ae
*pde = 0e723067
*pte = 00000000
Oops: 0002
e100 iptable_mangle ipt_REJECT ipt_multiport ipt_state ip_conntrack iptable_filter ip_tables floppy sg microcode keybdev mousedev hid input usb-uhci usbcore e
CPU:    2
EIP:    0060:[<c01691ae>]    Not tainted
EFLAGS: 00010246
EIP is at refile_inode [kernel] 0x4e (2.4.22-1.2190.nptlsmp)
eax: 00000000   ebx: e28fb900   ecx: 00000000   edx: e28fb908
esi: c0376028   edi: c0374fd8   ebp: 0000772e   esp: c3193de0
ds: 0068   es: 0068   ss: 0068
Process spamd (pid: 31686, stackpage=c3193000)
Stack: c19187a0 e20fb9c4 c013c642 e28fb900 c19187a0 00000000 c19187a0 c01461ba
       c19187a0 000001d2 c3192000 000005c3 000001d2 00000012 0000001d 000001d2
       c0374fd8 c0374fd8 c01464aa c3193e4c 000001d2 0000003c 00000020 c0146522
Call Trace:   [<c013c642>] __remove_inode_page [kernel] 0x82 (0xc3193de8)
  [<c01461ba>] shrink_cache [kernel] 0x30a (0xc3193dfc)
  [<c01464aa>] shrink_caches [kernel] 0x4a (0xc3193e28)
  [<c0146522>] try_to_free_pages_zone [kernel] 0x62 (0xc3193e3c)
  [<c0147102>] balance_classzone [kernel] 0x52 (0xc3193e60)
  [<c0147438>] __alloc_pages [kernel] 0x188 (0xc3193e7c)
  [<c010e968>] call_do_IRQ [kernel] 0x5 (0xc3193e88)
  [<c0139b5f>] do_wp_page [kernel] 0x6f (0xc3193ebc)
  [<c013a666>] handle_mm_fault [kernel] 0x106 (0xc3193ee0)
  [<c011c94c>] do_page_fault [kernel] 0x14c (0xc3193f0c)
  [<c011e9c0>] scheduler_tick [kernel] 0x120 (0xc3193f28)
  [<c0107b3f>] __switch_to [kernel] 0x16f (0xc3193f44)
  [<c011ed8f>] schedule [kernel] 0x7f (0xc3193f68)
  [<c012e42e>] update_process_times [kernel] 0x3e (0xc3193f84)
  [<c011c800>] do_page_fault [kernel] 0x0 (0xc3193fb0)
  [<c0109c18>] error_code [kernel] 0x34 (0xc3193fb8)
Code: 89 01 c7 43 08 00 00 00 00 89 48 04 8b 06 89 50 04 89 43 08
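[Editor's note, not part of the original comment: the "Code:" bytes above can be decoded with standard binutils; at the time, ksymoops did this automatically, and the path /tmp/oops-code.bin below is only an example. The first instruction, 89 01, is mov %eax,(%ecx); with ecx = 00000000 that is exactly the NULL-pointer write the oops reports.]

```shell
# Dump the raw opcode bytes from the "Code:" line and disassemble them as i386.
printf '\x89\x01\xc7\x43\x08\x00\x00\x00\x00\x89\x48\x04\x8b\x06\x89\x50\x04\x89\x43\x08' \
    > /tmp/oops-code.bin
objdump -D -b binary -m i386 /tmp/oops-code.bin
# first instruction: 89 01 -> mov %eax,(%ecx); with ecx=0 this is the NULL write
```

The remaining instructions (movl $0x0,0x8(%ebx); mov %ecx,0x4(%eax); ...) are pointer-relinking stores consistent with list manipulation inside refile_inode.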
Comment 14 Aleksander Adamowski 2004-06-04 05:30:52 EDT
Created attachment 100862 [details] refile_inode kernel-2.4.22-1.2190 panic from today
Comment 15 Aleksander Adamowski 2004-06-17 03:49:27 EDT
Possible fix: a 3Ware support engineer has pointed out that this issue may have been fixed in kernel 2.4.26:

"In the changelog for 2.4.26, there was a bug in refile_inode() that was fixed. I would recommend you try this kernel. Below is the changelog:

Marcelo Tosatti:
  Trond: Avoid refile_inode() from putting locked inodes on the dirty list
  Changed EXTRAVERSION to -rc1"
Comment 16 Dave Jones 2004-06-17 07:02:55 EDT
That patch was merged in the 2190 kernel. It made no difference.
Comment 17 Aleksander Adamowski 2004-06-18 03:56:24 EDT
I've asked the author of that patch about the new issue; here is his response:

---SNIP---
On Thu, 17/06/2004 at 11:01, Aleksander Adamowski wrote:
> Hi!
>
> I've seen that you've fixed a bug in the Linux 2.4 kernel related to
> refile_inode() (fix applied to kernel-2.4.26).
>
> There's still a related nasty crasher bug in refile_inode(); see this
> Red Hat Bugzilla bug:
> http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=123332

I'm not really a VFS person. That said, it looks to me from the dump you sent that refile_inode is calling list_del(&inode->i_list) on an inode that has already been removed from all lists. Normally, such an inode is supposed to be marked as I_FREEING...

I couldn't find any code in the 2.4.27-pre series that appeared to be able to put the inode in this bogus state. Somebody else will have to audit the RedHat kernels to see if they have any such bugs. 8-)

Cheers,
Trond
---SNIP---
Comment 18 Andrew Ryan 2004-08-17 20:36:01 EDT
We just saw the same bug; note that it is with the 2179 kernel with the refile_inode patch. We can't duplicate this panic in QA yet, only in production :( I'll attach a decoded oops.
Comment 19 Andrew Ryan 2004-08-17 20:37:37 EDT
Created attachment 102818 [details] decoded oops from this panic

Decoded oops from the panic we had, which appears to be the same as the one reported by the submitter of this bug, and different from issue 121732.
Comment 20 Andrew Ryan 2004-08-19 17:50:03 EDT
Our system experiencing this problem is an HP DL380 G3 with:
* 2x 2.8 GHz Xeon processors with Hyperthreading on
* 4 GB RAM
* internal HP hardware RAID-1 (cciss driver)

So it would appear that the 3Ware driver is not the problem. Since Trond is rarely wrong, I'm assuming that the problem has been fixed in Linux 2.4.27, which means it was fixed sometime between the 2.4.22-23 series and 2.4.27, but that the fix was not merged into the FC1 kernel. Going through the kernel changelogs on kernel.org line by line, I found two changesets that appear to be significant. Since I am not a kernel hacker I cannot confirm that the errors we're experiencing are caused by the lack of the two patches referenced below, but I have a feeling that a kernel hacker with VFS knowledge could confirm this relatively quickly.

In particular: fixed in 2.4.25-pre7 (2.4.25 release) by Rik van Riel, with the comment "some more fixes for fs/inode.c inode reclaiming changes". This bug does exactly what Trond refers to, calling list_del(&inode->i_list); the question is whether that inode could have already been removed from all lists, and that I do not know.

Rik's original post: http://www.ussg.iu.edu/hypermail/linux/kernel/0401.2/0962.html
David Woodhouse's followup and approval: http://www.ussg.iu.edu/hypermail/linux/kernel/0401.2/0970.html
A diff of the fs/inode.c code resulting from the above mailing list postings: http://source.scl.ameslab.gov:email@example.com?nav=index.html|ChangeSet@-9Mfirstname.lastname@example.org|hist/fs/inode.c

In addition there is a second inode-cache-related bugfix that seems like it belongs in the FC1 kernel, also from 2.4.25, fixed by David Woodhouse: "Do not leave inodes with stale waitqueue on slab cache":
http://source.scl.ameslab.gov:email@example.com?nav=index.html|ChangeSet@-9Mfirstname.lastname@example.org|hist/fs/inode.c

Both of the above patches apply cleanly to the 2179-2199 kernels (fs/inode.c wasn't changed between those versions).
My biggest problem right now is that I can't duplicate the oops in a controlled environment. It happens once a week across all of our dozen or so servers running this kernel. I've got a test machine now running ltp, dbench, kernel compiles, and other processes to try to duplicate this oops, but I haven't seen it in 2 straight days of testing. It's not clear that the error was seen much (if at all) in the wild; it looks like Rik fixed it before many people noticed. From an email exchange with Aleksander, he can't duplicate this problem in a controlled setting either; it happens about twice per month for him. Ideally a VFS person could look at the above patches and just say "yes, this patch needs to be applied to the FC1 kernel, it could cause that oops".
Comment 21 Andrew Ryan 2004-08-19 21:04:05 EDT
After further looking, I found that the second fix, "do not leave inodes with stale waitqueue on slab cache", was already included in the FC1.2199 kernel. I will attach the patch to FC1.2199 that implements Rik's fix, which I'm testing now. Note that we do not use quotas, so the second part of his fix is not relevant to us, I don't think.
Comment 22 Andrew Ryan 2004-08-19 21:05:08 EDT
Created attachment 102912 [details] patch which implements Rik's inode reclaim patch from 2.4.25
Comment 23 Aleksander Adamowski 2004-08-20 08:59:57 EDT
Unfortunately I cannot test this fix, as I've switched to the RHEL kernel on that machine to remedy the panics.
Comment 24 Aleksander Adamowski 2004-08-20 10:45:28 EDT
For the record, we're running the 2.4.21-15.ELsmp RHEL kernel to avoid the panics.
Comment 25 Andrew Ryan 2004-09-27 13:00:08 EDT
Running with the FC1.2199 kernel that implements Rik's refile_inode fix, we've had 4 weeks (28 days) of uptime without a crash. The best we were doing before was 1 week, and often less than that. If anyone is still running/updating the FC1 kernel, this would be a good patch to apply...
Comment 26 David Lawrence 2004-09-29 16:26:29 EDT
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/