Red Hat Bugzilla – Full Text Bug Listing
|Summary:||oops in refile_inode when running high load|
|Product:||[Fedora] Fedora||Reporter:||Andrew Ryan <andrewr>|
|Component:||kernel||Assignee:||Arjan van de Ven <arjanv>|
|Status:||CLOSED WONTFIX||QA Contact:|
|Version:||1||CC:||bugs-redhat, steved, tao|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2004-09-29 16:22:29 EDT||Type:||---|
Description Andrew Ryan 2004-04-26 16:47:11 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7b) Gecko/20040415

Description of problem:
While running load tests of Subversion with the repository on an NFS-mounted filesystem, we're getting reliable crashes in every kernel from Red Hat 9 through Fedora Core 1. I've attached the oops and will attach the ksymoops output shortly. The crash does not seem to occur when we use a repository mounted on local disk. I don't believe that it has anything to do with Subversion, but whatever load svn is generating is tickling a kernel bug.

The hardware is dual Xeon 3.0GHz, running hyperthreading, kernel 2.4.22-1.2179.nptlsmp.

The mount options in use are:
rw,tcp,nfsvers=3,rsize=32768,wsize=32768,intr

The NFS server is a NetApp. Both NFS client and server are running at 100Mb switched ethernet.

In the 2.4.26 kernel's Changelog (http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.26) I saw mention of a refile_inode bug fixed by Trond, which made me think perhaps this is what is affecting us, but I don't know.

A few minutes before the machine crashes, the virtual memory system seems to deteriorate rapidly, with large amounts of 'si' and especially 'so' traffic. I will also attach 'vmstat 30' output for the 30 or so minutes preceding the system crash.

The bug doesn't seem to affect us on a RH 7.2-based system running a vanilla 2.4.21 kernel that includes Trond's NFS-ALL patch cluster.
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c01690b7
*pde = 00000000
Oops: 0002
nfs lockd sunrpc iptable_filter ip_tables autofs tg3 keybdev mousedev hid input usb-ohci usbcore ext3 jbd cciss sd_mod scsi_mod
CPU:    3
EIP:    0060:[<c01690b7>]    Not tainted
EFLAGS: 00010246

EIP is at refile_inode [kernel] 0x47 (2.4.22-1.2179.nptlsmp)
eax: 00000000   ebx: dc141b80   ecx: 00000000   edx: dc141b88
esi: c0375ea8   edi: c0374e58   ebp: 00023354   esp: e76a5dd4
ds: 0068   es: 0068   ss: 0068
Process svnlook (pid: 2038, stackpage=e76a5000)
Stack: c17de430 dc141c44 c013c5e2 dc141b80 c17de430 00000000 c17de430 c01460ca
       c17de430 000001d2 e76a4000 00000a57 000001d2 00000019 00000020 000001d2
       c0374e58 c0374e58 c01463ba e76a5e40 000001d2 0000003c 00000020 c0146432
Call Trace:
[<c013c5e2>] __remove_inode_page [kernel] 0x82 (0xe76a5ddc)
[<c01460ca>] shrink_cache [kernel] 0x30a (0xe76a5df0)
[<c01463ba>] shrink_caches [kernel] 0x4a (0xe76a5e1c)
[<c0146432>] try_to_free_pages_zone [kernel] 0x62 (0xe76a5e30)
[<f885827b>] ext3_do_update_inode [ext3] 0x19b (0xe76a5e38)
[<c0147012>] balance_classzone [kernel] 0x52 (0xe76a5e54)
[<c0147348>] __alloc_pages [kernel] 0x188 (0xe76a5e70)
[<c013df51>] do_generic_file_read [kernel] 0x401 (0xe76a5eb0)
[<c013e3b0>] file_read_actor [kernel] 0x0 (0xe76a5ee0)
[<c013e575>] generic_file_new_read [kernel] 0xc5 (0xe76a5f00)
[<c013e3b0>] file_read_actor [kernel] 0x0 (0xe76a5f10)
[<c0163131>] do_select [kernel] 0x151 (0xe76a5f24)
[<c013e69f>] generic_file_read [kernel] 0x2f (0xe76a5f4c)
[<f89fd608>] nfs_file_read [nfs] 0x98 (0xe76a5f64)
[<c01504ba>] sys_pread [kernel] 0xca (0xe76a5f8c)
[<c0109b27>] system_call [kernel] 0x33 (0xe76a5fc0)

Code: 89 01 c7 43 08 00 00 00 00 89 48 04 8b 06 89 50 04 89 43 08

Version-Release number of selected component (if applicable):
kernel-smp-2.4.22-1.2179.nptl

How reproducible:
Always

Steps to Reproduce:
Right now we can reproduce this using our Subversion load testing with Silk Performer.
We are working on reproducing this with commonly available command-line tools.

Actual Results:
Kernel oops.

Expected Results:
Test completes.

Additional info:
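A note for anyone reading the trace above: the "Oops: 0002" value is the i386 page-fault error code, and its low three bits say what kind of access faulted. The sketch below decodes those bits. The macro names and the helper describe_oops_code() are made up for illustration and are not kernel identifiers; only the bit meanings are taken from the architecture.

```c
#include <stdio.h>

/* Low bits of the i386 page-fault error code printed after "Oops:".
 * The macro names are illustrative, not the kernel's own. */
#define PF_PROT  0x1u  /* 0: page not present, 1: protection violation */
#define PF_WRITE 0x2u  /* 0: read access,      1: write access */
#define PF_USER  0x4u  /* 0: kernel mode,      1: user mode */

/* Hypothetical helper: format a one-line description of `code` into `buf`
 * and return `buf` for convenience. */
static const char *describe_oops_code(unsigned int code, char *buf, size_t len)
{
    snprintf(buf, len, "%s, %s, %s",
             (code & PF_PROT)  ? "protection violation" : "page not present",
             (code & PF_WRITE) ? "write" : "read",
             (code & PF_USER)  ? "user mode" : "kernel mode");
    return buf;
}
```

Calling describe_oops_code(0x0002, buf, sizeof buf) yields "page not present, write, kernel mode", which fits a kernel-mode write through a NULL pointer as reported at virtual address 00000000.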
Comment 2 Andrew Ryan 2004-04-26 16:50:52 EDT
Created attachment 99699 [details] 'vmstat 30' output for period preceding crash
Comment 3 Andrew Ryan 2004-04-26 16:51:59 EDT
Created attachment 99700 [details] SysRq+T output from oopsed state
Comment 4 Andrew Ryan 2004-04-27 14:53:45 EDT
I submitted this to the linux-nfs mailing list, and according to Trond, this is a VM bug which should be fixed in FC1 kernels:
http://marc.theaimsgroup.com/?l=linux-nfs&m=108301692018612&w=2

That it showed up on tests where we were using an NFS-mounted filesystem is, apparently, just coincidental.

Subject: Re: [NFS] oops in FC1 update kernel, in refile_inode
From: Trond Myklebust <trond.myklebust () fys ! uio ! no>
Date: 2004-04-26 21:56:32

That is indeed a fix for a generic VFS/mm race. It has pretty much nothing to do with NFS itself but just happened to trigger on an NFS partition for someone.

As far as I can see, that patch hasn't yet been applied to the latest errata kernel (linux-2.4.22-1.2188.nptl). Have you tried it out to see if it fixes your Oops?

Steve, could you make sure that patch makes it into any future errata kernels?

Cheers,
  Trond

["linux-2.4.26-refile_inode.dif" (linux-2.4.26-refile_inode.dif)]

--- linux-2.4.26-up/fs/inode.c.orig 2004-03-19 17:12:46.000000000 -0500
+++ linux-2.4.26-up/fs/inode.c 2004-03-26 13:01:23.000000000 -0500
@@ -319,7 +319,8 @@ void refile_inode(struct inode *inode)
 	if (!inode)
 		return;
 	spin_lock(&inode_lock);
-	__refile_inode(inode);
+	if (!(inode->i_state & I_LOCK))
+		__refile_inode(inode);
 	spin_unlock(&inode_lock);
 }
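For readers who want to see the shape of the change outside the kernel tree, here is a minimal userspace sketch of the guard the patch above adds: take the lock, and only do the list move if the inode is not flagged I_LOCK. Everything here is a stand-in and an assumption for illustration: struct model_inode, model_refile_locked() and model_refile_inode() are hypothetical names, a pthread mutex replaces the kernel spinlock, and a counter replaces the real list manipulation.

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative flag value; treat it as an assumption, not a kernel header. */
#define I_LOCK 0x8u

/* Hypothetical, heavily reduced model of an inode. */
struct model_inode {
    unsigned int i_state;  /* state flags, including I_LOCK */
    int refile_count;      /* how many times the "list move" actually ran */
};

/* Userspace stand-in for the kernel's inode_lock spinlock. */
static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for __refile_inode(); the real function moves the inode
 * between lists, here we only count the call. Caller holds inode_lock. */
static void model_refile_locked(struct model_inode *inode)
{
    inode->refile_count++;
}

/* Mirrors the patched refile_inode(): skip the move while I_LOCK is set. */
static void model_refile_inode(struct model_inode *inode)
{
    if (!inode)
        return;
    pthread_mutex_lock(&inode_lock);
    if (!(inode->i_state & I_LOCK))
        model_refile_locked(inode);
    pthread_mutex_unlock(&inode_lock);
}
```

In this model, an inode with I_LOCK set is left untouched by model_refile_inode(), and the same inode is refiled once the flag is cleared. The flag is checked after taking the lock, as in the patch, so the test and the list move happen under the same lock hold.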
Comment 5 Andrew Ryan 2004-04-29 18:57:08 EDT
With the above patch applied to the FC1.2179 kernel, we have not seen the oops in 2 days of constant testing. For reference, we used to see this oops after 2-8 hours of stress testing.
Comment 6 Dave Jones 2004-04-30 07:17:04 EDT
patch is in cvs, and will be in the next update.
Comment 7 Aleksander Adamowski 2004-05-28 07:39:09 EDT
Can this be the same issue as in bug 123332? I've posted there 2 stacktraces from kernel panics, captured with a digital camera.
Comment 8 Aleksander Adamowski 2004-05-28 07:45:52 EDT
BTW, I forgot to mention: we're having those kernel panics on Fedora kernel 2.4.22-1.2188.nptlsmp, about once every 2 weeks. This is a production system, so unfortunately we cannot afford to stress-test it to reproduce this artificially. We also cannot connect a serial console, as the machine has only 1 serial port, which has to be connected to a UPS. But the stacktraces captured with the digital camera look exactly the same as the one reported here. We were suspecting this to be a hardware issue with the 3Ware controller that runs our RAID5 array, but in the light of this bug it seems more probable to be a kernel bug, right?
Comment 9 Dave Jones 2004-05-28 08:00:20 EDT
there should be a 2190 kernel in updates-testing, which should have this fixed.
Comment 10 Aleksander Adamowski 2004-05-28 09:21:42 EDT
Our system just crashed again; I've installed the 2.4.22-1.2190.nptlsmp kernel package from 2004-05-26. I'll let you know if it remedies the issue, but the testing period will be long since this crash occurs about twice a month on this particular system.
Comment 11 Aleksander Adamowski 2004-05-28 10:33:55 EDT
Does this issue affect Fedora 2's 2.6 kernel?
Comment 12 Dave Jones 2004-05-28 10:45:13 EDT
no. refile_inode doesn't exist there.
Comment 13 Aleksander Adamowski 2004-05-31 15:28:37 EDT
Another panic in refile_inode occurred just today on kernel-2.4.22-1.2190.nptlsmp. Either the problem has not been resolved, or this is a separate problem (in which case bug 123332 is not a dupe of this one).
Comment 14 Aleksander Adamowski 2004-06-01 05:12:42 EDT
BTW, looking at /usr/src/linux-2.4/fs/inode.c (from kernel-source-2.4.22-1.2190.nptl RPM) the fix from comment #3 is present there. But the panics still happen.
Comment 15 David Lawrence 2004-09-29 16:22:29 EDT
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/
Comment 16 Larry Troan 2004-10-19 18:04:22 EDT
Problem was found and fixed in RHEL3 U3.