217072 – RHEL4 kernel crash (2.6.9, LVM, XFS, NFS)

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.4
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Eric Sandeen
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-11-23 18:01 UTC by Sidney Polyakoff
Modified:	2007-11-17 01:14 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-12-07 20:32:25 UTC
Target Upstream Version:
Embargoed:

Attachments
NFS+XFS inode corruption fix for 2.6.9-42.0.3 kernel (15.26 KB, text/plain) 2006-11-23 18:07 UTC, Sidney Polyakoff

Description Sidney Polyakoff 2006-11-23 18:01:40 UTC

Description of problem:
We have several Opteron-based servers running RHEL4.2 (updated to RHEL4.4).
Each server is connected to the SAN through a QLogic FC HBA to an IBM FAStT
storage controller.
We use 2-4 TB LVs with XFS filesystems on top of them. The main function of each
server is NFS service (version 3 only, over TCP).

So we've got RHEL4.4 + LVM2 + XFS + NFS 

After several weeks in production one of our servers crashed with the following Oops:
     Unable to handle kernel NULL pointer dereference ... (see attachment)
Such crashes have happened again and again. Their frequency is very irregular:
all servers work fine for a couple of days, and after that a server can crash
5 times a day.
The symptoms remain the same: the fault occurs while performing the nfsd_lookup
function.
We moved the BAD filesystem from one server to another several times with the
same results, so it is not a server hardware fault.

To find the cause we set up a netconsole + netdump server and captured a
couple of crashes successfully.
Crash analysis shows the last point of normal kernel execution: fs/namei.c:1041
in the __lookup_hash function.

	struct dentry *new = d_alloc(base, name);
	dentry = ERR_PTR(-ENOMEM);
	if (!new)
		goto out;
	dentry = inode->i_op->lookup(inode, new, nd);
>>>>>	if (!dentry)
		dentry = new;
	else
		dput(new);

Evidently the fault is in the call to the inode's lookup function.
I spent some time searching Google for the same or similar issues.
I think one of the relevant patches was made by Christoph Hellwig (sgi.com).
His post "Fix NFS inode data corruption" (SGI-PV: 923968; SGI-Modid:
xfs-linux:xfs_kern:185126a) was included in the vanilla 2.6.11 kernel.
I've patched the 2.6.9-42.0.3 kernel with the patch recommended by Christoph
(see attachment) and I hope it will fix the bug.
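
To illustrate the failure mode at the marked call in the __lookup_hash excerpt, here is a minimal user-space sketch. It is not kernel code and not a diagnosis: the sim_* types and functions are hypothetical, heavily simplified stand-ins for struct inode, struct inode_operations and struct dentry. It assumes the crash comes from calling through a missing operations table or a missing lookup method, which is only one possible reading of the Oops. The excerpt shows no check before the call, so the sketch adds one and returns -ENOTDIR instead of dereferencing NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; the real ones in
 * include/linux/fs.h and include/linux/dcache.h carry many more fields. */
struct sim_inode;
struct sim_dentry { const char *name; };

struct sim_inode_operations {
	/* NULL for inodes that are not directories in this sketch. */
	struct sim_dentry *(*lookup)(struct sim_inode *dir,
				     struct sim_dentry *entry);
};

struct sim_inode {
	const struct sim_inode_operations *i_op;
};

/* A lookup method that finds nothing, so the caller keeps its new dentry. */
struct sim_dentry *sim_negative_lookup(struct sim_inode *dir,
				       struct sim_dentry *entry)
{
	(void)dir;
	(void)entry;
	return NULL;
}

/* Same shape as the call in the excerpt, but the operations table and the
 * lookup method are validated before the indirect call is made. */
int sim_lookup(struct sim_inode *dir, struct sim_dentry *entry,
	       struct sim_dentry **result)
{
	struct sim_dentry *found;

	assert(result != NULL);
	if (dir == NULL || dir->i_op == NULL || dir->i_op->lookup == NULL)
		return -ENOTDIR;

	found = dir->i_op->lookup(dir, entry);
	*result = found ? found : entry;	/* mirrors "if (!dentry) dentry = new;" */
	return 0;
}
```

In this model an inode whose i_op has no lookup method (for example one that is really a regular file but was reached as a directory) is rejected with an error. The sketch ignores error pointers returned by lookup, which the real code has to handle.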


Version-Release number of selected component (if applicable):
We've seen crashes on kernels 2.6.9-34 and 2.6.9-42.0.3

How reproducible:
I don't know. I don't have a reliable test case. All the crashes we saw happened
in the daytime, when the interactive NFS load is high.

Steps to Reproduce:
1. One NFS server serves 20-30 clients from an NFSv3-exported share on a
logical volume with XFS on top.
2. Most clients read and write 1-2 GB files on the NFS share.
3. A couple of clients browse the NFS share and create/delete some
files/directories.
4. Repeat.
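
The client-side load in the steps above could be approximated with a small generator like the one below. This is only a sketch, not a confirmed reproducer (no reliable test case is known): the function names write_bulk_file and churn_metadata are hypothetical, and the directory passed in is assumed to be a client-side mount of the NFSv3 export.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Step 2: write `total` bytes of filler to `path` in 1 MiB chunks.
 * Returns 0 on success, -1 on any error. */
int write_bulk_file(const char *path, long long total)
{
	static char chunk[1 << 20];
	int fd;

	assert(path != NULL && total >= 0);
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	memset(chunk, 'x', sizeof(chunk));
	while (total > 0) {
		size_t want = total < (long long)sizeof(chunk)
				? (size_t)total : sizeof(chunk);
		ssize_t n = write(fd, chunk, want);

		if (n < 0 && errno == EINTR)
			continue;		/* interrupted, retry */
		if (n <= 0) {
			close(fd);
			return -1;
		}
		total -= n;
	}
	return close(fd);
}

/* Step 3: create a directory with one file in it, then delete both,
 * `rounds` times. Returns the number of operations that failed. */
int churn_metadata(const char *root, int rounds)
{
	char dir[4096], file[4200];
	int failures = 0;
	int i;

	assert(root != NULL && rounds >= 0);
	for (i = 0; i < rounds; i++) {
		int fd;

		snprintf(dir, sizeof(dir), "%s/churn-%d-%d",
			 root, (int)getpid(), i);
		snprintf(file, sizeof(file), "%s/f", dir);
		if (mkdir(dir, 0755) != 0) {
			failures++;
			continue;
		}
		fd = open(file, O_WRONLY | O_CREAT | O_EXCL, 0644);
		if (fd < 0) {
			failures++;
		} else {
			close(fd);
			if (unlink(file) != 0)
				failures++;
		}
		if (rmdir(dir) != 0)
			failures++;
	}
	return failures;
}
```

Running write_bulk_file from most clients and churn_metadata from a couple of others, in a loop, roughly matches steps 2 to 4. A non-zero return from churn_metadata on a healthy mount would itself be worth investigating.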
  
Actual results:
Kernel crashes

Expected results:
Do not crash

Additional info:
Sometimes between two crashes we've found obvious file corruption.
A client creates a file, but we end up with a directory with the file's name
and length 0.
The UID and GID of that directory don't match the owner's.
Changing into such a directory doesn't work ("It's not a directory").
Also, we can't delete it ("Directory is not empty" or something like that).
Umount + xfs_repair fixes the issue: afterwards we see a file with the right
name but zero length.
It seems to me this is another kind of XFS+NFS trouble in the 2.6.9 kernel.
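
The symptom described above (an entry that reports itself as a directory but cannot be entered) can be searched for between crashes with a check like the following. This is a minimal sketch under assumptions: count_inconsistent_dirs is a hypothetical helper, it scans one directory level only, and it treats ENOTDIR from opendir() on an entry that lstat() reports as a directory as the signature of this corruption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Count entries directly under `root` that lstat() reports as directories
 * but that opendir() refuses with ENOTDIR. Returns -1 if `root` itself
 * cannot be opened. */
int count_inconsistent_dirs(const char *root)
{
	DIR *d;
	struct dirent *e;
	char path[4096];
	int suspects = 0;

	assert(root != NULL);
	d = opendir(root);
	if (d == NULL)
		return -1;

	while ((e = readdir(d)) != NULL) {
		struct stat st;
		DIR *sub;
		int len;

		if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
			continue;
		len = snprintf(path, sizeof(path), "%s/%s", root, e->d_name);
		if (len < 0 || len >= (int)sizeof(path))
			continue;	/* path too long for this sketch */
		if (lstat(path, &st) != 0 || !S_ISDIR(st.st_mode))
			continue;	/* not reported as a directory */

		sub = opendir(path);
		if (sub != NULL) {
			closedir(sub);
		} else if (errno == ENOTDIR) {
			fprintf(stderr, "suspect entry: %s (size %lld)\n",
				path, (long long)st.st_size);
			suspects++;
		}
	}
	closedir(d);
	return suspects;
}
```

A healthy directory should yield 0. Any entry reported here is a candidate for the umount + xfs_repair treatment described above.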

Now we have to move ALL our filesystems to ext3. It's really hard work :(

Comment 1 Sidney Polyakoff 2006-11-23 18:07:05 UTC

Created attachment 142004
NFS+XFS inode corruption fix for 2.6.9-42.0.3 kernel

Patch, Oops and crash traceback

Comment 2 Eric Sandeen 2006-12-07 18:13:33 UTC

Of course xfs is not supported in RHEL4....  but anyway...

Are you using the xfs code that originally shipped with the RHEL4 kernel, or the
external xfs module rpm packages from CENTOS4, for example?

Whatever xfs code shipped with the original RHEL4 kernel is completely untested
in that kernel since xfs isn't supported.

The external xfs module I've put together already has this patch:

http://sandeen.net/rhel4_xfs/

You'll want kernel-module-xfs-2.6.9-42.0.2.EL-0.2-1.src.rpm for kernels beyond
2.6.9.42

Can you test the xfs module in that rpm?

Comment 3 Eric Sandeen 2006-12-07 20:32:25 UTC

Actually I'm going to have to mark this one CANTFIX, because xfs isn't even -in-
RHEL4, much less a supported component.

However, I think the rpms I pointed you at should solve the problem, and if they
don't, feel free to let me know, and in my spare time in the evenings perhaps I
can look into it further ;-)

-Eric

Comment 4 Sidney Polyakoff 2006-12-11 14:31:55 UTC

Thank you, Eric. I understand that Red Hat doesn't support XFS and obviously will
not. Anyway, I will post our results in a few weeks.

Sid
