Bug 121732 (IT_41260) - oops in refile_inode when running high load
Summary: oops in refile_inode when running high load
Alias: IT_41260
Product: Fedora
Classification: Fedora
Component: kernel
Version: 1
Hardware: i386
OS: Linux
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact:
Depends On:
TreeView+ depends on / blocked
Reported: 2004-04-26 20:47 UTC by Andrew Ryan
Modified: 2007-11-30 22:10 UTC (History)
3 users (show)

Clone Of:
Last Closed: 2004-09-29 20:22:29 UTC

Attachments (Terms of Use)
ksymoops output (4.33 KB, text/plain)
2004-04-26 20:50 UTC, Andrew Ryan
no flags Details
'vmstat 30' output for period preceding crash (4.90 KB, text/plain)
2004-04-26 20:50 UTC, Andrew Ryan
no flags Details
SysRq+T output from oopsed state (96.47 KB, text/plain)
2004-04-26 20:51 UTC, Andrew Ryan
no flags Details

Description Andrew Ryan 2004-04-26 20:47:11 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7b)

Description of problem:
While running load tests of Subversion with the repository on an NFS
mounted filesystem, we're getting reliable crashes in every Redhat 9 -
through Fedora Core 1 kernel. I've attached the oops and will attach
the ksymoops output shortly. The hang does not seem to occur when we
use a repository mounted on local disk. I don't believe that it has
anything to do with Subversion, but whatever load svn is generating is
tickling a kernel bug.

The hardware is dual Xeon 3.0GHz, running hyperthreading, kernel
2.4.22-1.2179.nptlsmp. The mount options in use are:
The NFS server is a NetApp. Both NFS client and server are running at
100Mb switched ethernet.

In the 2.4.26 kernel's Changelog
(http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.26) I saw
mention of a refile_inode bug fixed by Trond, which made me think
perhaps this is what is affecting us, but I don't know.

A few minutes before the machine crashes, the virtual memory system
seems to deteriorate rapidly, with large amounts of 'si' and
especially 'so' traffic. I will also attach 'vmstat 30' output for the
30 or so minutes preceding the system crash.

The bug doesn't seem to affect us on a RH 7.2-based system running a
vanilla 2.4.21 kernel that includes Trond's NFS-ALL patch cluster.

Unable to handle kernel NULL pointer dereference at virtual address
 printing eip:
*pde = 00000000
Oops: 0002
nfs lockd sunrpc iptable_filter ip_tables autofs tg3 keybdev mousedev
hid input usb-ohci usbcore ext3 jbd cciss sd_mod scsi_mod
CPU:    3
EIP:    0060:[<c01690b7>]    Not tainted
EFLAGS: 00010246

EIP is at refile_inode [kernel] 0x47 (2.4.22-1.2179.nptlsmp)
eax: 00000000   ebx: dc141b80   ecx: 00000000   edx: dc141b88
esi: c0375ea8   edi: c0374e58   ebp: 00023354   esp: e76a5dd4
ds: 0068   es: 0068   ss: 0068
Process svnlook (pid: 2038, stackpage=e76a5000)
Stack: c17de430 dc141c44 c013c5e2 dc141b80 c17de430 00000000 c17de430
       c17de430 000001d2 e76a4000 00000a57 000001d2 00000019 00000020
       c0374e58 c0374e58 c01463ba e76a5e40 000001d2 0000003c 00000020
Call Trace:   [<c013c5e2>] __remove_inode_page [kernel] 0x82 (0xe76a5ddc)
[<c01460ca>] shrink_cache [kernel] 0x30a (0xe76a5df0)
[<c01463ba>] shrink_caches [kernel] 0x4a (0xe76a5e1c)
[<c0146432>] try_to_free_pages_zone [kernel] 0x62 (0xe76a5e30)
[<f885827b>] ext3_do_update_inode [ext3] 0x19b (0xe76a5e38)
[<c0147012>] balance_classzone [kernel] 0x52 (0xe76a5e54)
[<c0147348>] __alloc_pages [kernel] 0x188 (0xe76a5e70)
[<c013df51>] do_generic_file_read [kernel] 0x401 (0xe76a5eb0)
[<c013e3b0>] file_read_actor [kernel] 0x0 (0xe76a5ee0)
[<c013e575>] generic_file_new_read [kernel] 0xc5 (0xe76a5f00)
[<c013e3b0>] file_read_actor [kernel] 0x0 (0xe76a5f10)
[<c0163131>] do_select [kernel] 0x151 (0xe76a5f24)
[<c013e69f>] generic_file_read [kernel] 0x2f (0xe76a5f4c)
[<f89fd608>] nfs_file_read [nfs] 0x98 (0xe76a5f64)
[<c01504ba>] sys_pread [kernel] 0xca (0xe76a5f8c)
[<c0109b27>] system_call [kernel] 0x33 (0xe76a5fc0)

Code: 89 01 c7 43 08 00 00 00 00 89 48 04 8b 06 89 50 04 89 43 08

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
Right now we can reproduce this using our Subversion load testing with
Silk Performer. We are working on reproducing this with commonly
available command-line tools.

Actual Results:  Test completes.

Expected Results:  Kernel oops.

Additional info:

Comment 1 Andrew Ryan 2004-04-26 20:50:02 UTC
Created attachment 99698 [details]
ksymoops output

Comment 2 Andrew Ryan 2004-04-26 20:50:52 UTC
Created attachment 99699 [details]
'vmstat 30' output for period preceding crash

Comment 3 Andrew Ryan 2004-04-26 20:51:59 UTC
Created attachment 99700 [details]
SysRq+T output from oopsed state

Comment 4 Andrew Ryan 2004-04-27 18:53:45 UTC
I submitted this to the linux-nfs mailing list, and according to
Trond, this is a VM bug which should be fixed in FC1 kernels:

That it showed up on tests where we were using an NFS-mounted
filesystem is, apparently, just coincidental.

Subject:    Re: [NFS] oops in FC1 update kernel, in refile_inode
From:       Trond Myklebust <trond.myklebust () fys ! uio ! no>
Date:       2004-04-26 21:56:32

That is indeed a fix for a generic VFS/mm race. It has pretty much
nothing to do with NFS itself but just happened to trigger on an NFS
partition for someone.
As far as I can see, that patch hasn't yet been applied to the latest
errata kernel (linux-2.4.22-1.2188.nptl). Have you tried it out to see
if it fixes your Oops?

Steve, could you make sure that patch makes it into any future errata


["linux-2.4.26-refile_inode.dif" (linux-2.4.26-refile_inode.dif)]

--- linux-2.4.26-up/fs/inode.c.orig	2004-03-19 17:12:46.000000000 -0500
+++ linux-2.4.26-up/fs/inode.c	2004-03-26 13:01:23.000000000 -0500
@@ -319,7 +319,8 @@ void refile_inode(struct inode *inode)
 	if (!inode)
-	__refile_inode(inode);
+	if (!(inode->i_state & I_LOCK))
+		__refile_inode(inode);

Comment 5 Andrew Ryan 2004-04-29 22:57:08 UTC
With the above patch applied to the FC1.2179 kernel, we have not seen
the oops in 2 days of constant testing. For reference, we used to see
this oops after 2-8 hours of stress testing.

Comment 6 Dave Jones 2004-04-30 11:17:04 UTC
patch is in cvs, and will be in the next update.

Comment 7 Aleksander Adamowski 2004-05-28 11:39:09 UTC
Can this be the same issue as in bug 123332? I've posted there 2
stacktraces from kerlen panics, captured with a digital camera.

Comment 8 Aleksander Adamowski 2004-05-28 11:45:52 UTC
BTW, forgot to notice, we're having those kernel panics on Fedora
kernel 2.4.22-1.2188.nptlsmp, about once every 2 weeks. This is a
production system, so unfortunately we cannot afford to stress-test it
to reproduce this artificially.

We cannot also connect a serial console, as the machine has only 1
serial port that has to be connected to a UPS.

But the stacktraces captured with digital camera look exactly the same
as the one reported here.

We were suspecting this to be a hardware issue with 3Ware controller
that runs our RAID5 array, but in the light of this bug it seems more
probable to be a kernel bug, right?

Comment 9 Dave Jones 2004-05-28 12:00:20 UTC
there should be a 2190 kernel in updates-testing, which should have
this fixed.

Comment 10 Aleksander Adamowski 2004-05-28 13:21:42 UTC
Out system just crashed again;

I've installed the 2.4.22-1.2190.nptlsmp kernel package from
2004-05-26 - I'll let you know if it remedies the issue, but testing
period will be long since this crash occurs about twice a month on
this particular system.

Comment 11 Aleksander Adamowski 2004-05-28 14:33:55 UTC
Does this issue affect Fedora 2's 2.6 kernel?

Comment 12 Dave Jones 2004-05-28 14:45:13 UTC
no. refile_inode doesn't exist there.

Comment 13 Aleksander Adamowski 2004-05-31 19:28:37 UTC
Another panic in refile_inode occured just today on

The problem has not been resolved, or the problem is separate (in that
case, bug 123332 is not a dupe of this one).

Comment 14 Aleksander Adamowski 2004-06-01 09:12:42 UTC
BTW, looking at /usr/src/linux-2.4/fs/inode.c (from
kernel-source-2.4.22-1.2190.nptl RPM) the fix from comment #3 is
present there.

But the panics still happen.

Comment 15 David Lawrence 2004-09-29 20:22:29 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/

Comment 16 Larry Troan 2004-10-19 22:04:22 UTC
Problem was found and fixed in RHEL3 U3. 

Note You need to log in before you can comment on or make changes to this bug.