From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6

Description of problem:
NFS server dies... it was down for about 7 hours last night. All NFS clients were mounted "hard". The server recovers; RHEL3-U2 clients do not, and must be rebooted (other distro/kernel clients recover w/o reboot).

Version-Release number of selected component (if applicable):
2.4.21-15smp

How reproducible:
Sometimes

Steps to Reproduce:
1. Mount NFS clients using the "hard" option.
2. Kill the NFS server for 7 hours.
3. Restart the NFS server.
4. See if the client reconnects.

Actual Results:
Clients never reconnect to the server.

Expected Results:
Clients reconnect to the server w/o rebooting.
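For step 4, a small probe can tell whether a client has reconnected without risking the shell that runs it. The following is only a sketch, not something from this report: the function name `mount_responds` and the 10-second default timeout are assumptions made for illustration. It runs `stat()` in a child process, so that if the hard mount is still hung only the child blocks.

```python
import subprocess
import sys


def mount_responds(path, timeout_s=10.0):
    """Return True if stat() on `path` completes within timeout_s seconds.

    Hypothetical helper: the stat runs in a child process, so a hung
    hard mount blocks only the child and never this process.
    """
    cmd = [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path]
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Ask the kernel to kill the child, but do not wait for it:
        # a process stuck in uninterruptible NFS I/O may never exit.
        proc.kill()
        return False
    # A non-zero exit means stat() failed (e.g. the path does not exist).
    return returncode == 0
```

Calling `mount_responds("/mnt/nfs")` (a placeholder path) before the server goes down and again after it restarts would show whether the client recovered. Note that a `False` result only means the stat failed or did not return in time; it does not say why.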
Are the clients busy? If so, doing what (e.g. reads, writes, readdirs, etc.)?
Clients weren't busy at the time. Just the normal background daemons. Very little CPU use on the clients for the three hours prior to the server going down. That is not to say that no process on the client system had open files on the NFS-mounted directories... I'm sure some tasks did. The "hard" mount should hang those apps and bring them back to life when the server reappears. The clients were not dead/hung... as long as one's PATH didn't conflict with the NFS mounts, you could work on the clients. I have successfully recovered from this state before, using "umount -l -f ..." on the mounted partitions, but this is dirty and can cause other problems... the "hard" mount is the clean solution.
Note: We've been seeing this problem with Update 3 (2.4.21-20.EL). Both the server and client systems were recently updated to U3 (though the -15 kernel had been skipped on both servers, so this shouldn't rule out U2), and three times in the last five days, a hang caused by the NFS server resulted in the client system being unable to recover (requiring a reboot).
Another data point: On the server in question, setting 'intr' did *not* allow us to break the read/write request if the mount type was 'hard'; basically, the processes hung as if the share had *not* been mounted with intr.
"intr" is also my default mount option. My client mount options are: bg,nocto,intr,vers=3,rsize=32768,wsize=32768,hard,retrans=1000,timeo=3,nolock,async I've found that the problem only occurs if there are apps, with open files in the NFS mounted fs, actively running; the amount of time the server was down seems unrelated. If I kill those apps, then the client system returns to normal, and new apps can see files in the mount point again... which is broken behavior, and I might as well reboot as all the work those apps were doing is now lost. I am convinced that this is an RHEL kernel issue, as RH7.3 and SuSE9 based clients, with the same mount options, don't exhibit this behavior.
See my comments in bug 126598; these two seem related, if not the same issue altogether.
There is a fix for bug 129861 (which is similar to this bug) in the U4 beta kernel, which is available through the RHN beta channel. I believe it's kernel-2.4.21-22.EL (or 23.EL). Please give that a try to see if the problem clears up.
Steve, please provide the correct bug number that you were writing about in the above comment.
Correction.... the fix is in bz 118839
Just an update: the new 2.4.21-22.EL kernel did NOT clear up the problem. I've attached details (alt-sysrq-t trace output) to bug 126598.
Could this be a duplicate of 139101?
It looks like there are kernel 2.6.9 fixes for this. Any chance of a backport to RHEL kernels?
Created attachment 115492: test program

Unfortunately, I've had no success reproducing this problem. Running the attached test program (which uses most of the major NFS operations) from RHEL3, RHEL4 and Solaris 10 clients against a RHEL3 server, I was not able to get any of the clients to hang when I (constantly) rebooted the RHEL3 server. I used both UDP and TCP, and I both crashed the machine (via the SysRq-B command) and rebooted it nicely, and I was still unable to reproduce this... I used the mount options in Comment #5 with the Linux clients, and just the defaults with the Solaris client. To see if this is even a valid test program, could someone who can reproduce this hang run the test program to see if it hangs? tia...
Development NAK - no customer activity and lack of reproducer. Moving to the NAK list.