Red Hat Bugzilla – Bug 217072
RHEL4 kernel crash (2.6.9, LVM, XFS, NFS)
Last modified: 2007-11-16 20:14:54 EST
Description of problem:
We have got several Opteron-based servers on RHEL4.2 (updated to RHEL4.4).
Each server is connected through a QLogic FC HBA to IBM FAStT storage on the SAN.
We use 2-4 TB LVs with XFS filesystems on top of them. The main function of each
server is NFS service (version 3 only, over TCP).
So we've got RHEL4.4 + LVM2 + XFS + NFS.
After several weeks in production one of our servers crashed with the following Oops:
Unable to handle kernel NULL pointer dereference ... (see attachment)
Such crashes happened again and again. Their frequency is very irregular: for a couple
of days all servers work fine, and after that a server can crash 5 times a day.
The symptoms remain the same: a failure while executing the nfsd_lookup function.
We reattached the BAD filesystem from one server to another several times with
the same results, so it is not a server hardware fault.
In order to find the cause we set up a netconsole+netdump server and have captured
a couple of crashes successfully.
Crash analysis shows the last point of normal kernel execution: fs/namei.c:1041
in the __lookup_hash function.
      struct dentry *new = d_alloc(base, name);
      dentry = ERR_PTR(-ENOMEM);
      dentry = inode->i_op->lookup(inode, new, nd);
>>>>> if (!dentry)
          dentry = new;
Evidently the fault is in the call to the inode's lookup function.
I've spent some time searching Google for the same or similar issues.
I think one of the relevant patches was made by Christoph Hellwig (sgi.com).
His post "Fix NFS inode data corruption" (SGI-PV: 923968; SGI-Modid:
xfs-linux:xfs_kern:185126a) was included in the vanilla 2.6.11 kernel.
I've patched the 2.6.9-42.0.3 kernel with the patch recommended by Christoph (see
attachment) and I hope it will fix the bug.
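For readers following the fragment above: a NULL dereference around this call can in principle come from any link of the `inode->i_op->lookup` chain, or from inside the filesystem's own lookup routine. The userspace sketch below only models that chain and the result handling of the quoted lines. The structs are simplified stand-ins (not the kernel definitions, and the `nd` argument is dropped), and `classify_lookup_chain`, `lookup_or_new` and `lookup_miss` are hypothetical names introduced here. It does not diagnose the crash reported in this bug.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumption: not the
 * real definitions, only enough to model the call chain). */
struct dentry { const char *name; };
struct inode;
struct inode_operations {
    struct dentry *(*lookup)(struct inode *dir, struct dentry *new_dentry);
};
struct inode { const struct inode_operations *i_op; };

/* Which link of inode->i_op->lookup is NULL, if any. */
enum lookup_fault { FAULT_NONE, FAULT_INODE, FAULT_I_OP, FAULT_LOOKUP };

enum lookup_fault classify_lookup_chain(const struct inode *inode)
{
    if (inode == NULL)
        return FAULT_INODE;
    if (inode->i_op == NULL)
        return FAULT_I_OP;
    if (inode->i_op->lookup == NULL)
        return FAULT_LOOKUP;
    return FAULT_NONE;
}

/* Hypothetical filesystem lookup that reports "not found" by returning
 * NULL, which tells the caller to keep the freshly allocated dentry. */
struct dentry *lookup_miss(struct inode *dir, struct dentry *new_dentry)
{
    (void)dir;
    (void)new_dentry;
    return NULL;
}

/* Mirrors the result handling of the quoted fragment: a NULL return from
 * lookup means the new dentry is used; any other value is returned as is. */
struct dentry *lookup_or_new(struct inode *inode, struct dentry *new_dentry)
{
    struct dentry *dentry;

    /* In this model a broken chain is a programming error. */
    assert(classify_lookup_chain(inode) == FAULT_NONE);

    dentry = inode->i_op->lookup(inode, new_dentry);
    if (!dentry)
        dentry = new_dentry;
    return dentry;
}
```

If all three links are valid, as this model assumes, the remaining suspect is the filesystem's lookup implementation itself, which is where the report points.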
Version-Release number of selected component (if applicable):
We've got crashes on kernels 2.6.9-34 and 2.6.9-42.0.3
How reproducible:
I don't know. I haven't got a reliable test case. All crashes we saw happened in
the daytime, when the interactive NFS load is high.
Steps to Reproduce:
1. One NFS server serves 20-30 clients from a logical volume with XFS on top, over NFSv3
2. Most clients read and write 1-2 GB files on the NFS share
3. A couple of clients browse that NFS share and create/delete some
Expected results:
The server does not crash.
Additional info:
Sometimes between two crashes we've found obvious file corruption.
A client creates a file, but we get a directory with the file's name and length 0.
The UID and GID of that directory don't correspond to the owner's attributes.
Changing into such a directory doesn't work ("It's not a directory").
Also, we can't delete it ("Directory is not empty" or something like that).
Umount + xfs_repair fixes this issue: afterwards we can see the file with the right name
but zero length.
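The symptom described above (an entry that stat reports as a directory but that cannot be entered) can be looked for mechanically. The following is a minimal sketch, assuming a POSIX system. The function names are hypothetical, and treating ENOTDIR from opendir() as the marker is an assumption inferred from the "It's not a directory" message quoted above. It is not a substitute for xfs_repair.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Returns 1 if lstat() says `path` is a directory but opendir() fails
 * with ENOTDIR, 0 if the entry looks consistent, -1 if it cannot be
 * stat'ed at all. */
int is_suspect_dir(const char *path)
{
    struct stat st;
    DIR *d;

    assert(path != NULL);

    if (lstat(path, &st) != 0)
        return -1;
    if (!S_ISDIR(st.st_mode))
        return 0;               /* not reported as a directory */

    d = opendir(path);
    if (d != NULL) {
        closedir(d);
        return 0;               /* a directory that really opens */
    }
    /* Other errors (for example EACCES) are not counted as corruption. */
    return errno == ENOTDIR ? 1 : 0;
}

/* Checks every entry directly under `dirpath` (no recursion), prints
 * each suspect and returns how many were found, or -1 if `dirpath`
 * itself cannot be opened. */
int count_suspect_entries(const char *dirpath)
{
    DIR *d;
    struct dirent *ent;
    char path[4096];
    int suspects = 0;

    assert(dirpath != NULL);

    d = opendir(dirpath);
    if (d == NULL)
        return -1;

    while ((ent = readdir(d)) != NULL) {
        int n;

        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        n = snprintf(path, sizeof path, "%s/%s", dirpath, ent->d_name);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;           /* path too long for this sketch; skipped */
        if (is_suspect_dir(path) == 1) {
            printf("suspect: %s\n", path);
            suspects++;
        }
    }
    closedir(d);
    return suspects;
}
```

Running count_suspect_entries() on an export directory between crashes would list candidates to examine before unmounting and running xfs_repair. A result of 0 only means this particular symptom was not seen.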
It seems to me this is another kind of XFS+NFS trouble in the 2.6.9 kernel.
Now we have to move ALL our filesystems to ext3. It's really hard work :(
Created attachment 142004
NFS+XFS inode corruption fix for 2.6.9-42.0.3 kernel
Patch, Oops and crash traceback
Of course xfs is not supported in RHEL4.... but anyway...
Are you using the xfs code that originally shipped with the RHEL4 kernel, or the
external xfs module rpm packages from CENTOS4, for example?
Whatever xfs code shipped with the original RHEL4 kernel is completely untested
in that kernel since xfs isn't supported.
The external xfs module I've put together already has this patch:
You'll want kernel-module-xfs-2.6.9-42.0.2.EL-0.2-1.src.rpm for kernels beyond
Can you test the xfs module in that rpm?
Actually I'm going to have to mark this one CANTFIX, because xfs isn't even -in-
RHEL4, much less a supported component.
However, I think the rpms I pointed you at should solve the problem, and if they
don't, feel free to let me know, and in my spare time in the evenings perhaps I
can look into it further ;-)
Thank you, Eric. I understand that Red Hat doesn't support XFS and obviously will not.
Anyway, I will post our results in a few weeks.