Bug 129861 - Hard mounted NFS clients don't recover once server recovers
Summary: Hard mounted NFS clients don't recover once server recovers
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Steve Dickson
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 170445
Reported: 2004-08-13 15:23 UTC by Chris Worley
Modified: 2007-11-30 22:07 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-14 04:17:25 UTC
Target Upstream Version:
Embargoed:


Attachments
test program (897 bytes, text/plain)
2005-06-15 17:26 UTC, Steve Dickson

Description Chris Worley 2004-08-13 15:23:03 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4b)
Gecko/20030516 Mozilla Firebird/0.6

Description of problem:

NFS server dies... it was down for about 7 hours last night.

All NFS clients mounted "hard".

Server recovers, RHEL3-U2 clients do not, and must be rebooted (other
distro/kernel clients recover w/o reboot).

Version-Release number of selected component (if applicable):
2.4.21-15smp

How reproducible:
Sometimes

Steps to Reproduce:
1. Mount NFS clients using "hard" option.
2. Kill NFS server for 7 hours.
3. Restart NFS server.
4. See if client re-connects.
    

Actual Results:  Clients never reconnect to server

Expected Results:  Clients connect to server w/o rebooting
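
A minimal sketch of how step 4 ("see if client re-connects") could be checked without hanging the shell that runs the check. This is an illustration only, not something used in this report: the function name `responds` and the 5-second timeout are assumptions. The stat call runs in a child process, so if it blocks on a hard mount only the child is stuck.

```python
import subprocess
import sys
import time


def responds(path, timeout_s=5.0):
    """Return True if os.stat(path) completes successfully within timeout_s.

    The stat runs in a child process so that a blocked NFS call cannot
    hang the caller. (Hypothetical helper, not part of any NFS tooling.)
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if child.poll() is not None:
            return child.returncode == 0
        time.sleep(0.05)
    # The kill may have no effect on a process stuck in uninterruptible
    # sleep, so we deliberately do not wait for the child afterwards.
    child.kill()
    return False


# "." stands in for the NFS mount point (e.g. a hypothetical /mnt/nfs).
print(responds("."))
```

Run periodically after the server restarts, a check like this would show whether the mount ever becomes responsive again.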

Comment 1 Steve Dickson 2004-08-13 17:33:40 UTC
Are the clients busy? If so, doing what (e.g. reads, writes, readdirs, etc.)?

Comment 2 Chris Worley 2004-08-13 18:50:51 UTC
Clients weren't busy at the time.  Just the normal background daemons.
 Very little CPU use on the clients for the three hours prior to the
server going down.

Not to say "no process on the client system had open files on the NFS
mounted directories"... I'm sure some tasks did.  The "hard" mount
should hang those apps, and bring them back to life when the server
reappears.

The clients were not dead/hung... as long as one's PATH didn't conflict
with the NFS mounts, you could work on the clients.  I have successfully
recovered from this state before, using "umount -l -f ..." on the
mounted partitions, but this is dirty and can cause other problems...
the "hard" mount is the clean solution.



Comment 3 Ken Snider 2004-09-15 16:35:59 UTC
Note: We've been seeing this problem with Update3 - 2.4.21-20.EL. Both
the server and client system were recently updated to U3 (though the
-15 kernel had been skipped on both servers, so this shouldn't rule
out U2), and three times in the last five days, a hang caused by the
NFS server resulted in the client system being unable to recover
(requiring a reboot).

Comment 4 Ken Snider 2004-09-16 19:51:14 UTC
Another data point:

On the server in question, setting 'intr' did *not* allow us to break
the read/write request if the mount type was 'hard'; basically, the
processes hung as if the share had *not* been mounted with intr.

Comment 5 Chris Worley 2004-09-28 16:18:05 UTC
"intr" is also my default mount option.  My client mount options are:

bg,nocto,intr,vers=3,rsize=32768,wsize=32768,hard,retrans=1000,timeo=3,nolock,async

I've found that the problem only occurs if there are apps, with open
files in the NFS mounted fs, actively running; the amount of time the
server was down seems unrelated.  If I kill those apps, then the
client system returns to normal, and new apps can see files in the
mount point again... which is broken behavior, and I might as well
reboot as all the work those apps were doing is now lost.  I am
convinced that this is an RHEL kernel issue, as RH7.3 and SuSE9 based
clients, with the same mount options, don't exhibit this behavior.
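
For reference, a small sketch that splits the option string above into its parts, to make the retry settings easier to read. The helper name `parse_mount_options` is made up for this illustration. The only outside fact it relies on is that `timeo` is given in tenths of a second, per nfs(5).

```python
def parse_mount_options(opts):
    """Split a comma-separated mount option string into a dict.

    Flag options (e.g. "hard") map to True; key=value options keep
    their value, converted to int when it is purely numeric.
    """
    parsed = {}
    for item in opts.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            parsed[key] = True
        elif value.isdigit():
            parsed[key] = int(value)
        else:
            parsed[key] = value
    return parsed


# The option string quoted in this comment.
OPTIONS = ("bg,nocto,intr,vers=3,rsize=32768,wsize=32768,"
           "hard,retrans=1000,timeo=3,nolock,async")

opts = parse_mount_options(OPTIONS)
# timeo is expressed in tenths of a second, so timeo=3 is 0.3 s.
initial_timeout_s = opts["timeo"] / 10
print(opts["hard"], opts["intr"], opts["retrans"], initial_timeout_s)
```

With these options the initial request timeout is very short (0.3 seconds) and the retransmission count very high; whether that combination contributes to the behavior described here is not established in this report.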
 

Comment 6 Ken Snider 2004-11-01 19:42:31 UTC
See my comments in bug 126598 - these two seem related, if not the
same issue altogether.

Comment 7 Steve Dickson 2004-11-02 20:22:49 UTC
There is a fix for bug 129861 (which is similar to this bug)
that is in the U4 beta kernel, which is available through the
RHN beta channel. I believe it's kernel-2.4.21-22.EL (or 23.EL).

Please give that a try to see if the problem clears up.



Comment 8 Ernie Petrides 2004-11-02 22:05:47 UTC
Steve, please provide the correct bug number that you
were writing about in the above comment.


Comment 9 Steve Dickson 2004-11-02 23:30:42 UTC
Correction.... the fix is in bz 118839

Comment 10 Ken Snider 2004-11-08 20:36:51 UTC
Just an update: the new 2.4.21-22.EL kernel did NOT clear up the
problem. I've attached details to bug 126598 (alt-sysrq-t trace output)

Comment 11 Chris Worley 2004-11-19 13:51:33 UTC
Could this be a duplicate of 139101?

Comment 12 Chris Worley 2004-12-07 15:05:29 UTC
There seem to be fixes for this in kernel 2.6.9.  Any chance
of a back-port to RHEL kernels? 

Comment 15 Steve Dickson 2005-06-15 17:26:16 UTC
Created attachment 115492
test program

Unfortunately, I've had no success reproducing this problem. Running the
attached test program (which uses most of the major NFS operations) from
RHEL3, RHEL4 and Solaris 10 clients to a RHEL3 server, I was not able to get
any of the clients to hang when I (constantly) rebooted the RHEL3 server. I
used both UDP and TCP, and both crashed the machine (via the SysRq-B command)
and rebooted it nicely, and I was still unable to reproduce this...

I used the mount options in Comment #5 with the Linux clients, and just the
defaults with the Solaris client. To see if this is even a valid test problem,
could someone who can reproduce this hang run the test program to
see if it hangs? tia...
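
The attachment itself is not reproduced on this page. As a rough idea of what a test of this kind might look like, here is a hypothetical sketch (not the attached program) that cycles through create, write, read, stat, rename, readdir and remove under a given directory. The function name `exercise`, the file names and the 32768-byte buffer (chosen to match rsize/wsize in Comment #5) are all assumptions.

```python
import os
import tempfile


def exercise(base_dir, rounds=1):
    """Run a create/write/read/rename/readdir/remove cycle under base_dir.

    Returns the number of completed rounds. On a hard mount with the
    server down, a call is expected to block rather than return an error.
    """
    payload = b"x" * 32768  # one buffer of the rsize/wsize from Comment #5
    done = 0
    for i in range(rounds):
        work = os.path.join(base_dir, "nfstest.%d.%d" % (os.getpid(), i))
        os.mkdir(work)
        src = os.path.join(work, "a")
        dst = os.path.join(work, "b")
        with open(src, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data to the server
        with open(src, "rb") as f:
            assert f.read() == payload
        os.stat(src)
        os.rename(src, dst)
        assert os.listdir(work) == ["b"]
        os.remove(dst)
        os.rmdir(work)
        done += 1
    return done


# A local scratch directory stands in for the NFS mount point here.
with tempfile.TemporaryDirectory() as scratch:
    completed = exercise(scratch, rounds=3)
print(completed)
```

To use it against a real mount, pass a directory on the NFS file system to `exercise` and run it in a loop while the server is taken down and brought back.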

Comment 17 Peter Martuccelli 2005-10-21 15:26:27 UTC
Development NAK - no customer activity and lack of reproducer.  Moving to the
NAK list.



