Red Hat Bugzilla – Bug 129861
Hard mounted NFS clients don't recover once server recovers
Last modified: 2007-11-30 17:07:03 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4b)
Gecko/20030516 Mozilla Firebird/0.6
Description of problem:
NFS server dies... it was down for about 7 hours last night.
All NFS clients mounted "hard".
Server recovers, RHEL3-U2 clients do not, and must be rebooted (other
distro/kernel clients recover w/o reboot).
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Mount NFS clients using "hard" option.
2. Kill NFS server for 7 hours.
3. Restart NFS server.
4. See if client re-connects.
Actual Results: Clients never reconnect to server
Expected Results: Clients connect to server w/o rebooting
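For step 4, here is a minimal sketch of a probe that cannot itself get stuck, assuming Python is available on the client. The function name mount_responds, the timeout values and the path in the usage note are hypothetical, not from this report. The stat runs in a child process because a stat on a dead hard mount can block indefinitely; the parent only polls and gives up after a deadline.

```python
import subprocess
import sys
import time


def mount_responds(path, timeout_s=5.0, poll_s=0.1):
    """Return True if a stat of `path` completes successfully within timeout_s."""
    # Run the stat in a child so a hung NFS call cannot block this process.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rc = child.poll()
        if rc is not None:
            return rc == 0  # child finished: 0 means the stat succeeded
        time.sleep(poll_s)
    # Best effort only: a child stuck in uninterruptible sleep may ignore this.
    child.kill()
    return False
```

Calling mount_responds("/mnt/nfs") (hypothetical mount point) before and after restarting the server would show whether the client came back without a reboot.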
Are the clients busy? If so, doing what (e.g. reads, writes, readdirs)?
Clients weren't busy at the time. Just the normal background daemons.
Very little CPU use on the clients for the three hours prior to the
server going down.
Not to say "no process on the client system had open files on the NFS
mounted directories"... I'm sure some tasks did. The "hard" mount
should hang those apps, and bring them back to life when the server
recovers.
The clients were not dead/hung... as long as one's PATH didn't conflict
with the NFS mounts, you could work on the clients. I have successfully
recovered from this state before, using "umount -l -f ..." on the
mounted partitions, but this is dirty and can cause other problems...
the "hard" mount is the clean solution.
Note: We've been seeing this problem with Update3 - 2.4.21-20.EL. Both
the server and client system were recently updated to U3 (though the
-15 kernel had been skipped on both servers, so this shouldn't rule
out U2), and three times in the last five days, a hang caused by the
NFS server resulted in the client system being unable to recover
(requiring a reboot).
Another data point:
On the server in question, setting 'intr' did *not* allow us to break
the read/write request if the mount type was 'hard'; basically, the
processes hung as if the share had *not* been mounted with intr.
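When comparing clients, it can help to confirm which options the kernel actually applied to each NFS mount. A minimal sketch, assuming the usual /proc/mounts layout (device, mount point, filesystem type, comma-separated options). The function names and the sample line in the usage note are hypothetical, and the exact option strings shown vary by kernel version, so "hard" is inferred here from the absence of "soft".

```python
def nfs_mount_options(mounts_text):
    """Map each NFS mount point to the set of options listed for it."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type in ("nfs", "nfs4"):
            result[mount_point] = set(options.split(","))
    return result


def is_hard(options):
    """Hard is the NFS default, so treat any mount without 'soft' as hard."""
    return "soft" not in options


def is_interruptible(options):
    """True if 'intr' was listed for the mount."""
    return "intr" in options
```

On a client, nfs_mount_options(open("/proc/mounts").read()) returns one entry per NFS mount. For a hypothetical line such as "server:/export /mnt/nfs nfs rw,hard,intr 0 0", both is_hard and is_interruptible would return True.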
"intr" is also my default mount option. My client mount options are:
I've found that the problem only occurs if there are apps, with open
files in the NFS mounted fs, actively running; the amount of time the
server was down seems unrelated. If I kill those apps, then the
client system returns to normal, and new apps can see files in the
mount point again... which is broken behavior, and I might as well
reboot as all the work those apps were doing is now lost. I am
convinced that this is an RHEL kernel issue, as RH7.3 and SuSE9 based
clients, with the same mount options, don't exhibit this behavior.
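To see which apps are stuck before deciding whether to kill them, one option is to list processes in uninterruptible sleep (state "D"), which is where tasks blocked on a hard NFS mount normally sit. A minimal sketch, assuming a Linux /proc filesystem; the function names are hypothetical, and a "D" state alone does not prove the process is waiting on NFS.

```python
import os


def parse_state(stat_line):
    """Extract (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command name may contain spaces or ')', so split on the last ')'.
    head, _, tail = stat_line.rpartition(")")
    pid = int(head.split(" ", 1)[0])
    comm = head.split("(", 1)[1]
    state = tail.split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Return (pid, command) for every process currently in state 'D'."""
    found = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "stat")) as fh:
                pid, comm, state = parse_state(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the entry is unreadable
        if state == "D":
            found.append((pid, comm))
    return found
```

Running uninterruptible_processes() while the server is down, and again after it recovers, would show whether the same tasks stay blocked.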
See my comments in bug 126598; these two seem related, if not the
same issue altogether.
There is a fix for bug 129861 (which is similar to this bug)
that is in the U4 beta kernel which is available through the
RHN beta channel. I believe it's kernel-2.4.21-22.EL (or 23.EL).
Please give that a try to see if the problem clears up.
Steve, please provide the correct bug number that you
were writing about in the above comment.
Correction.... the fix is in bz 118839
Just an update: the new 2.4.21-22.EL kernel did NOT clear up the
problem. I've attached details to bug 126598 (alt-sysrq-t trace output).
Could this be a duplicate of 139101?
I see there seem to be kernel 2.6.9 fixes for this. Any chance
of a back-port to RHEL kernels?
Created attachment 115492
Unfortunately, I've had no success reproducing this problem. Running the
program (which uses most of the major NFS operations) from RHEL3, RHEL4
and Solaris 10 clients against a RHEL3 server, I was not able to get any of
them to hang when I (constantly) rebooted the RHEL3 server. I used both UDP
and TCP, and I both crashed the machine (via the SysRq-B command) and rebooted
it cleanly, and I was still unable to reproduce this...
I used the mount options in Comment #5 with the Linux clients, and just the
defaults with the Solaris client. To see if this is even a valid test problem,
could someone who can reproduce this hang run the test program to
see if it hangs? tia...
Development NAK - no customer activity and lack of reproducer. Moving to the