Red Hat Bugzilla – Bug 1735
nfs server crashes with high usage from solaris
Last modified: 2008-05-01 11:37:49 EDT
The NFS server shipped with Red Hat 5.2 crashes regularly under heavy
load. By crash I mean that the rpc.nfsd process dies (I have
also seen rpc.mountd die, but usually it is just
nfsd). The NFS clients then start getting `is a directory'
or `not a file' errors on plain files, clearly because the
NFS reads are failing.
I have seen nothing in the system logs to indicate what is
going on. I've tried running nfsd in debugging mode, but
haven't caught a crash yet.
Unfortunately, I have no recipe that is guaranteed to reproduce it.
I've seen it happen most often when running find over a deep
NFS-mounted partition, doing a cvs update on an NFS-mounted
partition, or performing some other operation that beats
on the filesystem.
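For anyone trying to reproduce this, the kind of load described above (a find over a deep NFS-mounted tree, touching every file) can be approximated with a short script. This is only a sketch, assuming Python 3 on a client with the export mounted; `stress_walk` is a hypothetical helper, not part of any NFS tool. It stats and reads every regular file under a directory and tallies failures by errno name, so errors such as EISDIR or ENOTDIR show up as counts instead of scrolling past.

```python
import errno
import os
from collections import Counter


def stress_walk(root, chunk_size=64 * 1024):
    """Walk `root` like find(1), stat every file and read it fully.

    Returns (files_read, bytes_read, errors), where `errors` counts the
    errno names (e.g. 'EISDIR', 'ENOTDIR') of operations that failed.
    """
    files_read = 0
    bytes_read = 0
    errors = Counter()

    def record(exc):
        # Map the numeric errno to its symbolic name when one exists.
        errors[errno.errorcode.get(exc.errno, str(exc.errno))] += 1

    # os.walk passes directory-listing failures to `onerror` instead of raising.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=record):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                os.stat(path)  # stat first, as find would
                with open(path, "rb") as f:
                    # Read in fixed-size chunks to generate sustained READ traffic.
                    while True:
                        chunk = f.read(chunk_size)
                        if not chunk:
                            break
                        bytes_read += len(chunk)
                files_read += 1
            except OSError as exc:
                record(exc)

    return files_read, bytes_read, errors
```

Pointing it at the mount point, e.g. `stress_walk("/mnt/home")` (path assumed), and running it in a loop should produce steady read traffic; a burst of errors around the time rpc.nfsd dies would line up with the symptoms above.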
In our present environment, the clients are Solaris 2.7
SPARCs and the server is a Dell PowerEdge running Red Hat
5.2. I've had NFS problems with Linux since the beginning,
though, including when the clients were also Linux/x86.
I realize this is not an ideal bug report, but I sure hope
the problem can be found. As it is, we're forced to move
our home directories to a Solaris host :(.
This is not an ideal response, but I can give you some useful
information, at least...
I am told (I have no first-hand knowledge) that there are some
bugs in the Solaris client NFS implementation that interact
badly with Linux and for which fixes are available from Sun.
I do not know if the bugs I was told about affect only the
2.2.x kernel-based nfsd or whether they affect the user-level
nfsd as well.
The latest 2.2.x kernels have a kernel-based NFS server and benefit
from a Connectathon session in which Linux NFS interoperability was
stress-tested and the server was improved. Therefore, you can expect
improvements here in the future.
Just a comment to say that I'm seeing similar problems with
Red Hat 5.2 and Solaris 2.5.1 with all recommended patches applied.
rpc.nfsd dies with a SIGSEGV in the middle of the glibc RPC handling
code. It's unclear at the moment whether the problem is the rpc.nfsd
code corrupting something that the libc code needs (or passing it bad
parameters), or whether libc is reacting badly to something received
across the network.
Snooping the packets going into our NFS server hasn't yet revealed
anything peculiar at the times that the crashes occur.
I'm going to continue investigating this, as using 6.0 isn't an
option for us at present, and if I find anything I'll update this bug.
I am running Solaris 2.6 as the client and Red Hat 6.0 as the NFS server. All I
have to do is copy a large file to the server and rpc.nfsd dies. When I try to
restart it, I get `nfssvc: Address already'. I have to actually reboot the server
to fix everything.
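The large-file case can likewise be scripted so it is repeatable. Again, this is a sketch under assumptions (Python 3 on the Solaris client; `copy_large_file` is a hypothetical name and the 1 MiB chunk size is arbitrary): it copies a file into a directory on the NFS mount in fixed-size chunks and calls fsync at the end so buffered writes are flushed out to the server before the function returns.

```python
import os


def copy_large_file(src, dest_dir, chunk_size=1024 * 1024):
    """Copy `src` into `dest_dir` in fixed-size chunks, then fsync.

    Returns the number of bytes written.
    """
    dest = os.path.join(dest_dir, os.path.basename(src))
    written = 0
    with open(src, "rb") as fin, open(dest, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            written += len(chunk)
        fout.flush()
        # Flush buffered data so the server receives the writes before we return.
        os.fsync(fout.fileno())
    return written
```

Running it repeatedly against a multi-hundred-megabyte file on the mount while watching whether rpc.nfsd is still alive gives a simple pass/fail check.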
unfsd was retired; however, the bug was eventually fixed.