121732 – (IT_41260) oops in refile_inode when running high load

Summary: oops in refile_inode when running high load

Keywords:
Status:	CLOSED WONTFIX
Alias:	IT_41260
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	1
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-04-26 20:47 UTC by Andrew Ryan
Modified:	2007-11-30 22:10 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-09-29 20:22:29 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
ksymoops output (4.33 KB, text/plain) 2004-04-26 20:50 UTC, Andrew Ryan
'vmstat 30' output for period preceding crash (4.90 KB, text/plain) 2004-04-26 20:50 UTC, Andrew Ryan
SysRq+T output from oopsed state (96.47 KB, text/plain) 2004-04-26 20:51 UTC, Andrew Ryan

Description Andrew Ryan 2004-04-26 20:47:11 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7b)
Gecko/20040415

Description of problem:
While running load tests of Subversion with the repository on an
NFS-mounted filesystem, we're getting reliable crashes in every kernel
from Red Hat 9 through Fedora Core 1. I've included the oops below and
will attach the ksymoops output shortly. The crash does not seem to
occur when we use a repository mounted on local disk. I don't believe
that it has anything to do with Subversion, but whatever load svn is
generating is tickling a kernel bug.

The hardware is dual Xeon 3.0GHz, running hyperthreading, kernel
2.4.22-1.2179.nptlsmp. The mount options in use are:
rw,tcp,nfsvers=3,rsize=32768,wsize=32768,intr
The NFS server is a NetApp. Both NFS client and server are running at
100Mb switched ethernet.

In the 2.4.26 kernel's Changelog
(http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.26) I saw
mention of a refile_inode bug fixed by Trond, which made me think
perhaps this is what is affecting us, but I don't know.

A few minutes before the machine crashes, the virtual memory system
seems to deteriorate rapidly, with large amounts of 'si' and
especially 'so' traffic. I will also attach 'vmstat 30' output for the
30 or so minutes preceding the system crash.

The bug doesn't seem to affect us on a RH 7.2-based system running a
vanilla 2.4.21 kernel that includes Trond's NFS-ALL patch cluster.


Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c01690b7
*pde = 00000000
Oops: 0002
nfs lockd sunrpc iptable_filter ip_tables autofs tg3 keybdev mousedev
hid input usb-ohci usbcore ext3 jbd cciss sd_mod scsi_mod
CPU:    3
EIP:    0060:[<c01690b7>]    Not tainted
EFLAGS: 00010246

EIP is at refile_inode [kernel] 0x47 (2.4.22-1.2179.nptlsmp)
eax: 00000000   ebx: dc141b80   ecx: 00000000   edx: dc141b88
esi: c0375ea8   edi: c0374e58   ebp: 00023354   esp: e76a5dd4
ds: 0068   es: 0068   ss: 0068
Process svnlook (pid: 2038, stackpage=e76a5000)
Stack: c17de430 dc141c44 c013c5e2 dc141b80 c17de430 00000000 c17de430 c01460ca
       c17de430 000001d2 e76a4000 00000a57 000001d2 00000019 00000020 000001d2
       c0374e58 c0374e58 c01463ba e76a5e40 000001d2 0000003c 00000020 c0146432
Call Trace:   [<c013c5e2>] __remove_inode_page [kernel] 0x82 (0xe76a5ddc)
[<c01460ca>] shrink_cache [kernel] 0x30a (0xe76a5df0)
[<c01463ba>] shrink_caches [kernel] 0x4a (0xe76a5e1c)
[<c0146432>] try_to_free_pages_zone [kernel] 0x62 (0xe76a5e30)
[<f885827b>] ext3_do_update_inode [ext3] 0x19b (0xe76a5e38)
[<c0147012>] balance_classzone [kernel] 0x52 (0xe76a5e54)
[<c0147348>] __alloc_pages [kernel] 0x188 (0xe76a5e70)
[<c013df51>] do_generic_file_read [kernel] 0x401 (0xe76a5eb0)
[<c013e3b0>] file_read_actor [kernel] 0x0 (0xe76a5ee0)
[<c013e575>] generic_file_new_read [kernel] 0xc5 (0xe76a5f00)
[<c013e3b0>] file_read_actor [kernel] 0x0 (0xe76a5f10)
[<c0163131>] do_select [kernel] 0x151 (0xe76a5f24)
[<c013e69f>] generic_file_read [kernel] 0x2f (0xe76a5f4c)
[<f89fd608>] nfs_file_read [nfs] 0x98 (0xe76a5f64)
[<c01504ba>] sys_pread [kernel] 0xca (0xe76a5f8c)
[<c0109b27>] system_call [kernel] 0x33 (0xe76a5fc0)


Code: 89 01 c7 43 08 00 00 00 00 89 48 04 8b 06 89 50 04 89 43 08


Version-Release number of selected component (if applicable):
kernel-smp-2.4.22-1.2179.nptl

How reproducible:
Always

Steps to Reproduce:
Right now we can reproduce this using our Subversion load testing with
Silk Performer. We are working on reproducing this with commonly
available command-line tools.

Actual Results:  Kernel oops.

Expected Results:  Test completes.

Additional info:

Comment 1 Andrew Ryan 2004-04-26 20:50:02 UTC

Created attachment 99698 [details]
ksymoops output

Comment 2 Andrew Ryan 2004-04-26 20:50:52 UTC

Created attachment 99699 [details]
'vmstat 30' output for period preceding crash

Comment 3 Andrew Ryan 2004-04-26 20:51:59 UTC

Created attachment 99700 [details]
SysRq+T output from oopsed state

Comment 4 Andrew Ryan 2004-04-27 18:53:45 UTC

I submitted this to the linux-nfs mailing list, and according to
Trond, this is a VM bug which should be fixed in FC1 kernels:
http://marc.theaimsgroup.com/?l=linux-nfs&m=108301692018612&w=2

That it showed up on tests where we were using an NFS-mounted
filesystem is, apparently, just coincidental.

Subject:    Re: [NFS] oops in FC1 update kernel, in refile_inode
From:       Trond Myklebust <trond.myklebust () fys ! uio ! no>
Date:       2004-04-26 21:56:32

That is indeed a fix for a generic VFS/mm race. It has pretty much
nothing to do with NFS itself but just happened to trigger on an NFS
partition for someone.
As far as I can see, that patch hasn't yet been applied to the latest
errata kernel (linux-2.4.22-1.2188.nptl). Have you tried it out to see
if it fixes your Oops?

Steve, could you make sure that patch makes it into any future errata
kernels?

Cheers,
  Trond

["linux-2.4.26-refile_inode.dif" (linux-2.4.26-refile_inode.dif)]

--- linux-2.4.26-up/fs/inode.c.orig	2004-03-19 17:12:46.000000000 -0500
+++ linux-2.4.26-up/fs/inode.c	2004-03-26 13:01:23.000000000 -0500
@@ -319,7 +319,8 @@ void refile_inode(struct inode *inode)
 	if (!inode)
 		return;
 	spin_lock(&inode_lock);
-	__refile_inode(inode);
+	if (!(inode->i_state & I_LOCK))
+		__refile_inode(inode);
 	spin_unlock(&inode_lock);
 }

Comment 5 Andrew Ryan 2004-04-29 22:57:08 UTC

With the above patch applied to the FC1.2179 kernel, we have not seen
the oops in 2 days of constant testing. For reference, we used to see
this oops after 2-8 hours of stress testing.

Comment 6 Dave Jones 2004-04-30 11:17:04 UTC

patch is in cvs, and will be in the next update.

Comment 7 Aleksander Adamowski 2004-05-28 11:39:09 UTC

Could this be the same issue as in bug 123332? I've posted 2
stack traces there from kernel panics, captured with a digital camera.

Comment 8 Aleksander Adamowski 2004-05-28 11:45:52 UTC

BTW, forgot to mention, we're having those kernel panics on Fedora
kernel 2.4.22-1.2188.nptlsmp, about once every 2 weeks. This is a
production system, so unfortunately we cannot afford to stress-test it
to reproduce this artificially.

We also cannot connect a serial console, as the machine has only 1
serial port that has to be connected to a UPS.

But the stacktraces captured with digital camera look exactly the same
as the one reported here.

We were suspecting this to be a hardware issue with 3Ware controller
that runs our RAID5 array, but in the light of this bug it seems more
probable to be a kernel bug, right?

Comment 9 Dave Jones 2004-05-28 12:00:20 UTC

there should be a 2190 kernel in updates-testing, which should have
this fixed.

Comment 10 Aleksander Adamowski 2004-05-28 13:21:42 UTC

Our system just crashed again.

I've installed the 2.4.22-1.2190.nptlsmp kernel package from
2004-05-26 - I'll let you know if it remedies the issue, but testing
period will be long since this crash occurs about twice a month on
this particular system.

Comment 11 Aleksander Adamowski 2004-05-28 14:33:55 UTC

Does this issue affect Fedora 2's 2.6 kernel?

Comment 12 Dave Jones 2004-05-28 14:45:13 UTC

no. refile_inode doesn't exist there.

Comment 13 Aleksander Adamowski 2004-05-31 19:28:37 UTC

Another panic in refile_inode occurred just today on
kernel-2.4.22-1.2190.nptlsmp.

The problem has not been resolved, or the problem is separate (in that
case, bug 123332 is not a dupe of this one).

Comment 14 Aleksander Adamowski 2004-06-01 09:12:42 UTC

BTW, looking at /usr/src/linux-2.4/fs/inode.c (from
kernel-source-2.4.22-1.2190.nptl RPM) the fix from comment #4 is
present there.

But the panics still happen.

Comment 15 David Lawrence 2004-09-29 20:22:29 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/

Comment 16 Larry Troan 2004-10-19 22:04:22 UTC

Problem was found and fixed in RHEL3 U3.
