From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7b) Gecko/20040415

Description of problem:
While running load tests of Subversion with the repository on an NFS-mounted filesystem, we get reliable crashes with every kernel from Red Hat 9 through Fedora Core 1. I've attached the oops and will attach the ksymoops output shortly. The hang does not seem to occur when we use a repository mounted on local disk. I don't believe it has anything to do with Subversion itself, but whatever load svn generates is tickling a kernel bug.

The hardware is a dual Xeon 3.0GHz with hyperthreading enabled, running kernel 2.4.22-1.2179.nptlsmp. The mount options in use are:

rw,tcp,nfsvers=3,rsize=32768,wsize=32768,intr

The NFS server is a NetApp. Both the NFS client and the server are on 100Mb switched Ethernet.

In the 2.4.26 kernel's ChangeLog (http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.26) I saw mention of a refile_inode bug fixed by Trond, which made me think this may be what is affecting us, but I don't know.

A few minutes before the machine crashes, the virtual memory system seems to deteriorate rapidly, with large amounts of 'si' and especially 'so' traffic. I will also attach 'vmstat 30' output for the 30 or so minutes preceding the system crash.

The bug doesn't seem to affect us on a RH 7.2-based system running a vanilla 2.4.21 kernel that includes Trond's NFS-ALL patch cluster.
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c01690b7
*pde = 00000000
Oops: 0002
nfs lockd sunrpc iptable_filter ip_tables autofs tg3 keybdev mousedev hid input usb-ohci usbcore ext3 jbd cciss sd_mod scsi_mod
CPU:    3
EIP:    0060:[<c01690b7>]    Not tainted
EFLAGS: 00010246

EIP is at refile_inode [kernel] 0x47 (2.4.22-1.2179.nptlsmp)
eax: 00000000   ebx: dc141b80   ecx: 00000000   edx: dc141b88
esi: c0375ea8   edi: c0374e58   ebp: 00023354   esp: e76a5dd4
ds: 0068   es: 0068   ss: 0068
Process svnlook (pid: 2038, stackpage=e76a5000)
Stack: c17de430 dc141c44 c013c5e2 dc141b80 c17de430 00000000 c17de430 c01460ca
       c17de430 000001d2 e76a4000 00000a57 000001d2 00000019 00000020 000001d2
       c0374e58 c0374e58 c01463ba e76a5e40 000001d2 0000003c 00000020 c0146432
Call Trace:
[<c013c5e2>] __remove_inode_page [kernel] 0x82 (0xe76a5ddc)
[<c01460ca>] shrink_cache [kernel] 0x30a (0xe76a5df0)
[<c01463ba>] shrink_caches [kernel] 0x4a (0xe76a5e1c)
[<c0146432>] try_to_free_pages_zone [kernel] 0x62 (0xe76a5e30)
[<f885827b>] ext3_do_update_inode [ext3] 0x19b (0xe76a5e38)
[<c0147012>] balance_classzone [kernel] 0x52 (0xe76a5e54)
[<c0147348>] __alloc_pages [kernel] 0x188 (0xe76a5e70)
[<c013df51>] do_generic_file_read [kernel] 0x401 (0xe76a5eb0)
[<c013e3b0>] file_read_actor [kernel] 0x0 (0xe76a5ee0)
[<c013e575>] generic_file_new_read [kernel] 0xc5 (0xe76a5f00)
[<c013e3b0>] file_read_actor [kernel] 0x0 (0xe76a5f10)
[<c0163131>] do_select [kernel] 0x151 (0xe76a5f24)
[<c013e69f>] generic_file_read [kernel] 0x2f (0xe76a5f4c)
[<f89fd608>] nfs_file_read [nfs] 0x98 (0xe76a5f64)
[<c01504ba>] sys_pread [kernel] 0xca (0xe76a5f8c)
[<c0109b27>] system_call [kernel] 0x33 (0xe76a5fc0)

Code: 89 01 c7 43 08 00 00 00 00 89 48 04 8b 06 89 50 04 89 43 08

Version-Release number of selected component (if applicable):
kernel-smp-2.4.22-1.2179.nptl

How reproducible:
Always

Steps to Reproduce:
Right now we can reproduce this using our Subversion load testing with Silk Performer.
We are working on reproducing this with commonly available command-line tools.

Actual Results:
Kernel oops.

Expected Results:
Test completes.

Additional info:
Created attachment 99698 [details] ksymoops output
Created attachment 99699 [details] 'vmstat 30' output for period preceding crash
Created attachment 99700 [details] SysRq+T output from oopsed state
I submitted this to the linux-nfs mailing list, and according to Trond, this is a VM bug which should be fixed in FC1 kernels:

http://marc.theaimsgroup.com/?l=linux-nfs&m=108301692018612&w=2

That it showed up on tests where we were using an NFS-mounted filesystem is, apparently, just coincidental.

Subject: Re: [NFS] oops in FC1 update kernel, in refile_inode
From: Trond Myklebust <trond.myklebust () fys ! uio ! no>
Date: 2004-04-26 21:56:32

That is indeed a fix for a generic VFS/mm race. It has pretty much nothing to do with NFS itself but just happened to trigger on an NFS partition for someone.

As far as I can see, that patch hasn't yet been applied to the latest errata kernel (linux-2.4.22-1.2188.nptl). Have you tried it out to see if it fixes your Oops?

Steve, could you make sure that patch makes it into any future errata kernels?

Cheers,
  Trond

["linux-2.4.26-refile_inode.dif" (linux-2.4.26-refile_inode.dif)]

--- linux-2.4.26-up/fs/inode.c.orig 2004-03-19 17:12:46.000000000 -0500
+++ linux-2.4.26-up/fs/inode.c 2004-03-26 13:01:23.000000000 -0500
@@ -319,7 +319,8 @@ void refile_inode(struct inode *inode)
         if (!inode)
                 return;
         spin_lock(&inode_lock);
-        __refile_inode(inode);
+        if (!(inode->i_state & I_LOCK))
+                __refile_inode(inode);
         spin_unlock(&inode_lock);
 }
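For readers who want to see what the one-line guard in the patch does, here is a minimal user-space sketch. It is not kernel code: `struct inode`, the `I_LOCK` value, the `spin_lock_model()`/`spin_unlock_model()` helpers, and `model_refile()` are simplified stand-ins invented for illustration. Only the shape of `refile_inode()` follows the diff quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag value; an assumption, not copied from kernel headers. */
#define I_LOCK 0x8UL

/* Hypothetical, stripped-down inode: just the state bits and a counter. */
struct inode {
    unsigned long i_state;
    int refile_count;   /* model only: how many times the inode was refiled */
};

/* Single-threaded stand-in for inode_lock; it only checks lock pairing. */
static int inode_lock_held;

static void spin_lock_model(void)
{
    assert(!inode_lock_held);
    inode_lock_held = 1;
}

static void spin_unlock_model(void)
{
    assert(inode_lock_held);
    inode_lock_held = 0;
}

/* Stand-in for __refile_inode(); the real function moves the inode
 * between lists, which this model reduces to a counter increment. */
static void model_refile(struct inode *inode)
{
    inode->refile_count++;
}

/* Same control flow as the patched refile_inode(): an inode whose
 * I_LOCK bit is set is left alone instead of being refiled. */
void refile_inode(struct inode *inode)
{
    if (!inode)
        return;
    spin_lock_model();
    if (!(inode->i_state & I_LOCK))
        model_refile(inode);
    spin_unlock_model();
}
```

In this model, calling `refile_inode()` on an inode with `I_LOCK` set leaves `refile_count` unchanged, while an inode without the bit is refiled once per call. A NULL inode is ignored, as in the original function.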
With the above patch applied to the FC1.2179 kernel, we have not seen the oops in 2 days of constant testing. For reference, we used to see this oops after 2-8 hours of stress testing.
patch is in cvs, and will be in the next update.
Could this be the same issue as in bug 123332? I've posted two stack traces there from kernel panics, captured with a digital camera.
BTW, I forgot to mention that we're seeing those kernel panics on Fedora kernel 2.4.22-1.2188.nptlsmp, about once every 2 weeks. This is a production system, so unfortunately we cannot afford to stress-test it to reproduce the problem artificially. We also cannot connect a serial console, as the machine has only 1 serial port, which has to be connected to a UPS. But the stack traces captured with the digital camera look exactly the same as the one reported here. We had suspected a hardware issue with the 3Ware controller that runs our RAID5 array, but in light of this bug a kernel bug seems more probable, right?
there should be a 2190 kernel in updates-testing, which should have this fixed.
Our system just crashed again; I've installed the 2.4.22-1.2190.nptlsmp kernel package from 2004-05-26. I'll let you know if it remedies the issue, but the testing period will be long, since this crash occurs about twice a month on this particular system.
Does this issue affect Fedora 2's 2.6 kernel?
no. refile_inode doesn't exist there.
Another panic in refile_inode occurred just today on kernel-2.4.22-1.2190.nptlsmp. Either the problem has not been resolved, or this is a separate problem (in which case bug 123332 is not a dupe of this one).
BTW, looking at /usr/src/linux-2.4/fs/inode.c (from kernel-source-2.4.22-1.2190.nptl RPM) the fix from comment #3 is present there. But the panics still happen.
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/
Problem was found and fixed in RHEL3 U3.