Bug 126598 - nfs: server not responding, still trying.. server OK
Summary: nfs: server not responding, still trying.. server OK
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Steve Dickson
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-06-23 17:29 UTC by WhidbeyNet
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: 2.6.9-5.0.5.ELsmp
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-10-19 19:23:38 UTC
Target Upstream Version:
Embargoed:


Attachments
tcpdump of the most recent hang. (11.49 KB, text/plain)
2004-10-20 03:49 UTC, Geronimo A. Ordanza II
alt-sysrq-t capture from client and server (2.74 KB, text/plain)
2004-10-20 17:00 UTC, Ken Snider
ethereal capture from the same time period (2.00 KB, application/octet-stream)
2004-10-20 17:01 UTC, Ken Snider
full sysrq from client server (51.60 KB, text/plain)
2004-10-20 20:39 UTC, Ken Snider
logs before 18/out/2004 with kernel 21-4.ELsmp and after with kernel 21-20.ELsmp (70.71 KB, text/plain)
2004-10-25 17:54 UTC, Sergio Basto
alt-sysrq-t update from client running 2.4.21-22.EL (51.69 KB, text/plain)
2004-11-08 20:25 UTC, Ken Snider
Upstream patch that fixes hangs in NFS unlink code. (792 bytes, patch)
2004-11-09 14:51 UTC, Steve Dickson
Trace of hung NFS client running modified kernel above (2.4.21-23.EL w/patch) (51.53 KB, text/plain)
2004-11-24 15:53 UTC, Ken Snider
second trace from same client, 15 minutes after reboot for previous attachment, same symptoms (51.02 KB, text/plain)
2004-11-24 15:54 UTC, Ken Snider
unlink-deadlock patch (2.07 KB, text/plain)
2004-11-29 12:15 UTC, Steve Dickson

Description WhidbeyNet 2004-06-23 17:29:05 UTC
We've been running stock Enterprise 3 kernels (2.4.21-4.ELsmp) because of a 
problem we began experiencing with 2.4.21-9.0.1.ELsmp.  The original kernel does 
not exhibit a problem.

The issue exists in every kernel update since 9.0.1, including the latest 15.0.2.

Our problem is continuous errors in /var/log/messages generated by the kernel:

Jun 23 10:07:48 mail5 kernel: nfs: server ecluster not responding, still trying
Jun 23 10:07:48 mail5 kernel: nfs: server ecluster OK
Jun 23 10:07:53 mail5 kernel: nfs: server ecluster not responding, still trying
Jun 23 10:07:53 mail5 kernel: nfs: server ecluster OK

Server 'ecluster' is a RedHat Enterprise 3 AS NFS cluster.

The "timeo" option in /etc/fstab seems to have no effect. We've tried the following 
fstab entries:

ecluster:/var/mail      /var/mail               nfs     rw,async 0 0
ecluster:/var/mail      /var/mail               nfs     rw,async,hard,intr 0 0
ecluster:/var/mail      /var/mail               nfs     rw,async,hard,intr,timeo=90 0 0
ecluster:/var/mail      /var/mail               nfs     rw,async,wsize=1024,rsize=1024 0 0
ecluster:/var/mail      /var/mail               nfs     rw,async,wsize=8192,rsize=8192 0 0

Others have told us this problem is the result of an incorrect NFS timeout value in the 
kernel, and is fixed in later (non-Enterprise) kernels:

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=M3oM.78t.9%40gated-at.bofh.it

However, we could find no bug ID, so we're submitting this to see if there's a solution 
in 2.4.21-15.0.2.ELsmp.

Version-Release number of selected component (if applicable):
2.4.21-15.0.2.ELsmp

How reproducible:
Always

Steps to Reproduce:
1.  Install RedHat Enterprise 3 from the original media.
2.  Run up2date on the kernel to bring it to 2.4.21-15.0.2.ELsmp.
3.  Mount a volume from a RedHat Enterprise 3 NFS cluster.
4.  Generate a light load of traffic to the volume.
    
Actual Results:  Errors are generated in /var/log/messages:

Jun 23 10:07:48 mail5 kernel: nfs: server ecluster not responding, still trying
Jun 23 10:07:48 mail5 kernel: nfs: server ecluster OK
Jun 23 10:07:53 mail5 kernel: nfs: server ecluster not responding, still trying
Jun 23 10:07:53 mail5 kernel: nfs: server ecluster OK

Expected Results:  No errors, as in kernel-2.4.21-4.ELsmp, which works flawlessly.

Additional info:

Comment 1 Steve Dickson 2004-07-16 01:10:47 UTC
Try removing the async option to see if the messages go away.

Comment 2 WhidbeyNet 2004-07-16 17:50:26 UTC
We had already tried that, but we tried it again, and the messages persist:

Jul 16 10:32:08 mail14 kernel: nfs: server ecluster not responding, still trying
Jul 16 10:32:08 mail14 kernel: nfs: server ecluster OK
Jul 16 10:40:31 mail14 kernel: nfs: server ecluster not responding, still trying
Jul 16 10:40:31 mail14 kernel: nfs: server ecluster OK

Comment 3 Steve Dickson 2004-07-30 14:09:35 UTC
Does nfsstat -rc show an abundant number of retrans?
Also, does ifconfig eth? show any errors?

Comment 4 WhidbeyNet 2004-07-30 15:06:50 UTC
nfsstat -rc:

Fri Jul 30 08:03:00 PDT 2004
calls      retrans    authrefrsh
34597602   207082     0

Fri Jul 30 08:04:00 PDT 2004
calls      retrans    authrefrsh
34600977   207108     0 

ifconfig eth1 (dedicated to NFS):
          RX packets:64320794 errors:0 dropped:0 overruns:0 frame:0
          TX packets:39745262 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:3136520972 (2991.2 Mb)  TX bytes:1599758946 (1525.6 Mb)
          Interrupt:29 

Isn't this just a case of changing the timeout back to a higher value?  We see 
no adverse effects, and have no problems with the stock kernels.
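
For reference, the two nfsstat samples above can be turned into a retransmission rate. This is a minimal sketch in Python, not something posted by the reporter; the helper name retrans_rate is hypothetical, and the counters are copied from the output above.

```python
def retrans_rate(calls: int, retrans: int) -> float:
    """Return retransmissions as a fraction of RPC calls."""
    return retrans / calls if calls else 0.0

# Counters copied from the two "nfsstat -rc" samples, taken one minute apart.
first = {"calls": 34597602, "retrans": 207082}
second = {"calls": 34600977, "retrans": 207108}

# Cumulative rate since the counters were last reset.
cumulative = retrans_rate(second["calls"], second["retrans"])

# Rate over the one-minute interval between the two samples.
delta_calls = second["calls"] - first["calls"]
delta_retrans = second["retrans"] - first["retrans"]
interval = retrans_rate(delta_calls, delta_retrans)

print(f"cumulative: {cumulative:.2%}, last minute: {interval:.2%}")
```

With these numbers, 26 of the 3375 calls in the sampled minute were retransmitted, which is under one percent both cumulatively and over the interval.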

Comment 5 Steve Dickson 2004-08-11 16:00:39 UTC
wrt
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=M3oM.78t.9%40gated-at.bofh.it
we already have all the patches this is referring to... 

Who is the server and are you using soft mounts?

Comment 6 WhidbeyNet 2004-08-11 16:48:27 UTC
The NFS server is a Red Hat Enterprise AS 3 2.4.21-4.0.2.ELsmp in a 
passive/active SCSI cluster.

The nfs client servers are using hard mounts; the default.  We've tried several 
other fstab options, with no change (such as setting a high timeo value).

The previous thread was an old example. Here is one from July 21st, right on 
Red Hat's forum:

http://www.redhat.com/archives/taroon-list/2004-July/msg00296.html

Also:

http://www.experts-exchange.com/Networking/Linux_Networking/Q_21005301.html



Comment 7 Steve Dickson 2004-08-11 17:37:12 UTC
It seems nfs over TCP works in this scenario, why isn't
that a valid option? It would drop your retrans from
207082 to zero because TCP would be doing the retrans...



Comment 8 WhidbeyNet 2004-08-11 19:22:34 UTC
NFS over TCP is not an option:

http://www.redhat.com/docs/manuals/enterprise/RHEL-3-Manual/cluster-suite/ch-netfs-service.html

"NFS cannot be configured to run over TCP with Red Hat Cluster Manager. 
For proper failover capabilities, NFS must run over the default UDP. "

Comment 9 Steve Dickson 2004-08-13 17:49:44 UTC
I'm going to reassign this to Lon, our Cluster guy, to see if he might
see something I'm missing.... but in general these
messages are due to a flaky network or an NFS server
that is spending a lot of time talking to the local filesystem.

Would it be possible to run a "top -i" on the server to capture
what is happening on the server when these messages appear
on the client?

Comment 10 Lon Hohberger 2004-08-26 15:46:35 UTC
This really doesn't look like a cluster bug, particularly if there's
no stress testing+continuous relocation happening.

Comment 11 Lon Hohberger 2004-08-26 16:08:21 UTC
This looks like there's slow disk access or high load on the clustered
machines, preventing them from serving the NFS requests.

Comment 12 WhidbeyNet 2004-08-26 18:22:28 UTC
[root@eclu2 ~]$ uptime
 11:20:05  up 119 days, 12:18,  6 users,  load average: 0.33, 0.56, 0.45

Please let us re-iterate that:

* We do not see the NFS timeouts when a client is using a kernel prior to 
2.4.21-9.0.1.ELsmp.    We have 5 identical RH ES 3 client servers, connected 
to the same cluster, who do not display timeouts, with 2.4.21-4.ELsmp.

* There are no tangible performance impacts or other errors; it's the flood of 
timeout messages we're concerned about.  Others have reported this same 
scenario (see previous redhat forum link).

Thanks for continuing to look into this.  We are stuck using old kernels for 
now.

Comment 15 Michael Kearey 2004-09-28 07:22:32 UTC
I have a customer asking about a problem identical to this bugzilla.
Any chance of an update on where a potential solution stands?

Comment 16 Steve Dickson 2004-10-15 11:50:43 UTC
Is this still happening in more recent RHEL kernels?

Comment 17 Ken Snider 2004-10-15 14:15:12 UTC
We had this problem occur with RHEL3-U3, so yes.

Comment 18 Steve Dickson 2004-10-15 15:48:41 UTC
To get another view, would it be possible to post
an ethereal trace of these timeouts? Also, are there
any errors (other than the timeout messages)
being reported in /var/log/messages?

Comment 19 Ken Snider 2004-10-15 18:18:10 UTC
I cannot, unfortunately, as we've switched to NFS over TCP to get
around the problem -- however, hopefully the original reporter can?

Comment 20 Geronimo A. Ordanza II 2004-10-20 03:49:47 UTC
Created attachment 105492
tcpdump of the most recent hang.

Comment 21 Geronimo A. Ordanza II 2004-10-20 03:58:37 UTC
Here's an update from one of my customers who is encountering the same
problem:

----------------
Here's a tcpdump (attached above) of our most recent hang, requiring a
reboot.

Note that because this is TCP, there are no timeouts per se, but the
connections for these files (being accessed via vim) end up hanging
indefinitely.

In the capture, the file being opened by UID 2005 is one that hangs
indefinitely. This is a small text-only README file (only a few k).

.81 is the server, and .42 is the client with the hung connections.

It should be pointed out that shortened timeouts seem to have no
effect, and setting hard,intr does *not* allow us to kill the
applications; essentially we're left with no recourse but to reboot
the server.

Comment 22 Steve Dickson 2004-10-20 11:15:56 UTC
Would it be possible to get an AltSysRq-t trace of the hung process and/or
of the nfsd on the server (if applicable)?

Comment 23 Ken Snider 2004-10-20 17:00:31 UTC
Created attachment 105531
alt-sysrq-t capture from client and server

Comment 24 Ken Snider 2004-10-20 17:01:33 UTC
Created attachment 105532
ethereal capture from the same time period

Comment 25 Ken Snider 2004-10-20 17:03:33 UTC
The sysrq file contains traces first from the client server, then one
of the nfsd processes on the server (they were all in identical places
in the stack at the time of the trace, though I cannot guarantee it
was this specific nfsd servicing the request obviously).

the ethereal capture was taken as the "cat" was being executed.

Comment 26 Steve Dickson 2004-10-20 19:05:02 UTC
The ethereal trace shows normal "cat" like traffic, although it
does not show the NFS_READ, which means either the data is in
the local page cache or the process hung before the RPC to
the server.

The alt-sysrq-t shows the cat process hung waiting for
a lock on a page. So the question is who has that lock
and why? Could you please post the entire alt-sysrq-t?

Comment 27 Ken Snider 2004-10-20 20:39:51 UTC
Created attachment 105547
full sysrq from client server

This contains the full trace dump from the NFS client. Note that there were
several clients, all hung in the D state - including cc1plus, which had been
hung for a good half hour at the time this trace was taken.

Comment 28 Ken Snider 2004-10-21 15:01:24 UTC
The box hung again today, and it seems to mainly occur under moderate
IO load (specifically compiling).

That would seem to correlate with it being a locking issue, since
there's a lot of read/write traffic.

Comment 29 Larry Woodman 2004-10-21 15:27:01 UTC
Ken, when this box locks up is it just nfs accesses that stop or local
file systems as well?

Larry Woodman


Comment 30 Ken Snider 2004-10-21 16:19:27 UTC
Exclusively NFS. The admin accounts have local home directories and are
unaffected when we log in to look at the issue.

The problem happens as follows:

- Box boots, everything works normally
- at an indeterminate point, an application will hang on a read
request for NFS (usually shows itself during a compile, since that's
obviously IO/Lock intensive, but can also rear its head with a simple
cat or vim)
- after this point, *any* NFS access *may* hang; this
includes commands like "df". Local disk access continues normally
- At this point, the hung processes are unkillable, even though the
NFS share is mounted hard/intr, and the only way to clear the issue is
a reboot.
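
Processes stuck like this show up in state D (uninterruptible sleep), and a process in that state does not act on signals until its wait ends. The following is a minimal Python sketch for listing such processes on a Linux client by reading /proc/<pid>/stat; the helper names and the sample strings are hypothetical and were not part of the traces attached to this bug.

```python
import glob

def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')' rather than on whitespace.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state

def d_state(stat_texts):
    """Return (pid, comm) for every process in uninterruptible sleep."""
    parsed = (parse_stat(text) for text in stat_texts)
    return [(pid, comm) for pid, comm, state in parsed if state == "D"]

def read_all_stats():
    """Yield the stat contents of every process currently visible in /proc."""
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                yield handle.read()
        except OSError:
            continue  # the process exited while we were scanning

# Hypothetical sample data, shaped like the first fields of /proc/<pid>/stat.
samples = ["4321 (cc1plus) D 4300 4321 4300", "4400 (bash) S 1 4400 4400"]
print(d_state(samples))
print(d_state(read_all_stats()))
```

On a healthy machine the second list is usually empty or short-lived; entries that stay there across repeated runs match the symptom described above.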

Comment 31 Kostas Georgiou 2004-10-23 13:52:28 UTC
Ken this looks like bug 118839.

Comment 32 Sergio Basto 2004-10-25 17:54:00 UTC
Created attachment 105734
logs before 18/out/2004 with kernel 21-4.ELsmp and after with kernel 21-20.ELsmp

Hi, 
Yes, I have exactly the same problem: with 2.4.21-4.ELsmp there are none of these
messages; with 2.4.21-20.ELsmp they appear all day.
In my case the server is just an NFS client of a Tru64 cluster acting as NFS
server; the warnings begin when it is a little more loaded.
I have other servers with the same hardware but running Slackware, and at the same
time they do not complain. 

I had periodic hangs (every 4-6 days) with kernel 2.4.21-9, so we went back to
the stock kernel, which stayed up for all 150 days.
In the meantime I found that the Gigabit Ethernet driver was out of date and
updated it to e1000-5.2.52.tar.gz . 
Now, with kernel 2.4.21-20.ELsmp, I see that Red Hat has also updated the Intel
driver, so I gave the new kernel a try; for the past 7 days it has been running
without a hang so far ...

Comment 34 Ken Snider 2004-10-26 04:15:47 UTC
In regard to comment #31, this specific problem doesn't *seem* to be
the problem in bug 118839 - the RPC messages don't show up in dmesg,
and this box isn't under the kinds of client pressure suggested in
that bug, this is one nfs server talking to one nfs client.

The client is experiencing the issue about once a day though. This is
a real pain for us because these are developer home directories
mounted to our compile server, and as a result about once a day
someone's compile process ends up hanging due to this bug and the
server has to be rebooted.

I'm almost at the point where I want to switch back to the kernel that
did *not* experience this problem, only if I do I won't be able to
assist in future testing.

Larry, was my full trace output of any use?

Comment 35 Ken Snider 2004-11-01 19:45:05 UTC
bug 129861 appears related to the issue I've been describing -- if not
the same issue entirely.

Comment 36 Steve Dickson 2004-11-02 20:19:47 UTC
Well (supposedly) the fix for bug 129861 is in the U4 beta 
kernel which is available through the RHN beta 
channel. I believe it's kernel-2.4.21-22.EL (or 23.EL).

Please give that a try to see if the problem clears up.

Comment 37 Steve Dickson 2004-11-02 23:30:45 UTC
Correction.... the fix is in bz 118839

Comment 38 Ken Snider 2004-11-08 20:23:07 UTC
Just wanted to verify that the fix in bug 118839 (in kernel
2.4.21-22.EL) did *not* correct the problem. The client machine I
upgraded on Thursday finally failed today, with the same hangs. I will
be attaching an updated trace momentarily.

Comment 39 Ken Snider 2004-11-08 20:25:30 UTC
Created attachment 106300
alt-sysrq-t update from client running 2.4.21-22.EL

Comment 40 Steve Dickson 2004-11-09 14:11:05 UTC
OK, Ken, thanks for trying that out for us....

There has been some recent talk about unlink issues
on a number of the NFS mailing lists, I'm looking
into if they may or may not apply in this case...

Comment 41 Steve Dickson 2004-11-09 14:51:14 UTC
Created attachment 106333
Upstream patch that fixes hangs in NFS unlink code.

It seems the rpciod (pid 678) in the AltSysRq-t trace in comment #38
is stuck in nfs_complete_unlink(), which *could* mean that a wake up was
missed, which in turn would cause the rest of the nfs processes to hang. 
It just so happens that a 2.4 patch appeared recently that fixes missed 
wake ups in the unlink code. Please try the attached patch to see 
if it clears up the hang.

Comment 42 Chris Stankaitis 2004-11-09 20:15:19 UTC
On behalf of Ken (we work together) I grabbed the src.rpm for
2.4.21-23.EL and added the patch from Steve, rebuilt the RPM with
patch and installed on the server in question. 

The kernel is now live, we'll see how this works with regards to the
NFS issue.

Comment 43 Ken Snider 2004-11-18 16:01:07 UTC
Update for everyone.

We applied this patch to the U4-beta kernel (2.4.21-23.EL) and applied
this kernel to the NFS client (only) on the 9th.

The server ran without issue until the 16th (so much so that I was
about to report "Problem Solved!", as this was a significant change
from the "fail every 12-48 hours" rate we had seen in the past),
however, on the 16th, the server again failed exactly the same way,
hung NFS connections, and things as simple as vim-ing a 2kb file on
the server would hang in the D state.

So, it appears the problem has been made, at minimum, less common,
however it has *not* been eliminated. Unfortunately, the admin that
handled the server rebooted it without grabbing a trace, but I will
post one as soon as the problem manifests itself again.

Comment 44 Ken Snider 2004-11-19 18:16:26 UTC
note: bug 139101 seems to report the same issue as bug 129861, which
doesn't seem to be the same issue of this bug, so it would appear as
though this issue is in fact different from the two bugs stated above.

That just leaves the question of whether bug 118839 and this bug are
related. :)

Comment 45 Ken Snider 2004-11-24 15:42:18 UTC
I can now verify that the bug isn't resolved. We've seen this problem
pop up twice in the last two hours.

I'll attach traces again for both incidents - you'll notice they seem
to, again, be locked on the same kernel read-lock issue as before.

The kernel running on this client is 2.4.21-23.EL, with the patch from
comment #39 installed.

The server is stock RHEL3U3.

Comment 46 Ken Snider 2004-11-24 15:53:09 UTC
Created attachment 107395
Trace of hung NFS client running modified kernel above (2.4.21-23.EL w/patch)

Comment 47 Ken Snider 2004-11-24 15:54:20 UTC
Created attachment 107396
second trace from same client, 15 minutes after reboot for previous attachment, same symptoms

Comment 48 Steve Dickson 2004-11-29 12:15:50 UTC
Created attachment 107532
unlink-deadlock patch

Please try the attached patch. This was the fix to a 
similar problem so hopefully it will help with this 
one.

Comment 49 Chris Stankaitis 2004-11-29 13:41:09 UTC
Steve, does this obsolete the patch to unlink.c which was listed on
2004-11-09 @ 09:51? Should we build the kernel with or without the
other patch in addition to the one you just posted?

Comment 50 Steve Dickson 2004-11-29 13:59:32 UTC
No.... both seem to be needed... 

Comment 51 Chris Stankaitis 2004-11-29 19:01:32 UTC
rebuilt kernel-2.4.21-23.EL with the following two patches dated:

2004-11-09 09:51
2004-11-29 07:15

installed on problematic system and rebooted.  Ken or I will need to
update this ticket after the box runs for a while under the new kernel.

Comment 52 jason andrade 2004-12-01 05:23:50 UTC
hi,

i'm also interested in resolution of this as we see this on a bunch of
our nfs clients also.  we don't get a hang but access times out, load
jumps up significantly and performance drops for a while.

we mostly seem to see this on the system which has read/write access
via nfs - we have 10 nfs clients, 1 server and 9 of them mount readonly.

i'm running the kernel from rhel3qu4b2 (2.4.21-25 smp) 
regards,

-jason

Comment 53 Ken Snider 2004-12-10 15:06:51 UTC
Well, knock on wood, but the kernel as installed by Chris on Nov 29
has been running, without incident, on our NFS client - I'm prepared
to *tentatively* declare this one beat - it would appear that the
locking issue has been resolved.

Comment 54 Steve Dickson 2004-12-11 01:07:29 UTC
Just to be clear: applying both patches (Comment #41
and Comment #48) stops the client from hanging, right?
Are you still seeing the ".... not responding" messages?

Comment 55 Ken Snider 2004-12-11 15:07:43 UTC
We haven't seen the "not responding" messages since switching to NFS
over TCP, doing that is what switched us from those messages to the
hung tcp connection (more or less as expected..)

Comment 56 Steve Dickson 2004-12-13 13:30:42 UTC
Ken,

the patches did solve the hanging problem you were seeing, correct?

Comment 57 Ken Snider 2004-12-13 14:57:33 UTC
They appear to have, yes.

Comment 58 WhidbeyNet 2004-12-14 22:46:40 UTC
I applied the patches in Comment #41 and Comment #48 to the latest Enterprise release 
(2.4.21-20.0.1.EL).   

Unfortunately, we continue to see the errors:

Dec 14 14:38:02 mail5 kernel: nfs: server ecluster not responding, still trying
Dec 14 14:38:02 mail5 kernel: nfs: server ecluster OK
Dec 14 14:41:32 mail5 kernel: nfs: server ecluster not responding, still trying
Dec 14 14:41:32 mail5 kernel: nfs: server ecluster OK

Please remember this is connecting to a RedHat NFS Cluster system, and as such, it cannot 
use TCP connections.

I reverted again to 2.4.21-4.EL, which doesn't exhibit this problem.

Comment 59 Sergio Basto 2004-12-16 14:34:59 UTC
Continuing comment #58

Neither does the vanilla kernel 2.4.28, so maybe using kernel 2.4.28 as the
base of the kernel would be a good idea.

Comment 60 Sergio Basto 2005-01-28 16:27:14 UTC
Hi, more news about this issue; we are still running kernel 2.4.21-20.
We have one "Nortel Networks Alteon 2424 SSL", and if we put the client
or the server behind the Alteon for load balancing, NFS stops working and we get
very high loads. 
But with Slackware 10 and kernel 2.4.28, none of this happens. 
We have many Slackware machines and one Red Hat machine; the problem only appears on Red Hat.
We also have problems with NFS on Tru64, but the Tru64 machines are very old and
so are their NICs.

Comment 61 Steve Dickson 2005-02-01 15:25:53 UTC
Sergio,

are you using UDP or TCP? If UDP, please try TCP.

Comment 62 WhidbeyNet 2005-02-01 15:50:14 UTC
See comment #8.

TCP is not an option for those serving NFS with a RedHat Enterprise Cluster.

Comment 63 Lon Hohberger 2005-02-01 16:56:13 UTC
Actually, using TCP NFS exports with Cluster Manager should work fine.

Comment 64 Sergio Basto 2005-02-03 11:55:42 UTC
Just a clarification: I am talking about an NFS client connecting to a Unix NFS server.
I believe, though I am not sure, that the connections are made over TCP.

Comment 65 Ivan Olvera 2005-04-05 19:49:47 UTC
Hi all, I have FC3  2.6.10-1.770_FC3smp #1 SMP Thu Feb 24 14:20:06 EST 2005 i686
i686 i386 GNU/Linux

I analysed the network traffic and discovered this:

When starting the pivot to the root filesystem, the client asks via the NFS
protocol for an RPC ACCESS Call, FH: 0x3d74bff4

The server responds that the RPC is accepted, BUT THE UDP PACKET DOESN'T HAVE
A CORRECT CHECKSUM, SO THE PACKET IS DISCARDED.

Then the client sends the same RPC ACCESS Call again and again; the
server responds every time, but the client discards the packets because the
checksum is not correct.
Also, the network file system makes a LOOKUP Call to DH 0xc17cbff4; it is
of type Linux knfsd (new).

The server accepts the procedure, but again the UDP packet has a bad checksum.

That's why it takes so long to accept the connection.

If you want, I can send you my packet capture.

So I think the problem is in how the server builds the UDP packets. Do
you know how this can be solved?

regard
Ivan Olvera
MTIA
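
The check a receiver applies to each UDP datagram can be sketched as follows. This is an illustrative Python version of the standard UDP checksum over the IPv4 pseudo-header (RFC 768), not code from any kernel discussed here; the helper names, addresses, ports and payload are all hypothetical and do not come from the capture mentioned above.

```python
import socket
import struct

UDP_PROTO = 17  # IP protocol number for UDP

def _ones_complement_sum(data):
    """16-bit one's-complement sum of data, zero-padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return total

def _pseudo_header(src_ip, dst_ip, udp_len):
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, UDP_PROTO, udp_len))

def udp_checksum(src_ip, dst_ip, segment):
    """Checksum for a UDP segment whose checksum field (bytes 6-7) is zero."""
    pseudo = _pseudo_header(src_ip, dst_ip, len(segment))
    csum = ~_ones_complement_sum(pseudo + segment) & 0xFFFF
    return csum or 0xFFFF  # an all-zero field means "no checksum" in IPv4

def udp_checksum_ok(src_ip, dst_ip, segment):
    """True if a received segment passes the check (or carries no checksum)."""
    if segment[6:8] == b"\x00\x00":
        return True
    pseudo = _pseudo_header(src_ip, dst_ip, len(segment))
    return _ones_complement_sum(pseudo + segment) == 0xFFFF

def build_segment(src_ip, dst_ip, sport, dport, payload):
    length = 8 + len(payload)
    header = struct.pack("!HHHH", sport, dport, length, 0)
    csum = udp_checksum(src_ip, dst_ip, header + payload)
    return struct.pack("!HHHH", sport, dport, length, csum) + payload

good = build_segment("192.0.2.81", "192.0.2.42", 2049, 800, b"rpc reply")
bad = good[:-1] + bytes([good[-1] ^ 0x01])  # corrupt one payload bit
print(udp_checksum_ok("192.0.2.81", "192.0.2.42", good))
print(udp_checksum_ok("192.0.2.81", "192.0.2.42", bad))
```

A datagram that fails this check is dropped before the RPC layer sees it, so from the client's point of view the reply never arrived and the call is retransmitted.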




Comment 66 Steve Dickson 2005-04-13 11:00:48 UTC
Ivan,

Yes, a bzip2-ed ethereal trace (i.e. tethereal -w /tmp/dump.pcap)
would be good... also, what type of load is occurring when
this happens? Failover, heavy io (i.e. reads or writes)?

Comment 67 Michael Tonn 2005-05-06 14:52:48 UTC
I am experiencing the same NFS hangs with Version 3 Update 4.  I have non-clustered 
servers using NFS over TCP to a NetApp filer.  strace of the "df" command 
shows that the hang starts as soon as df accesses the NFS mount point.

Comment 68 WhidbeyNet 2005-05-06 15:00:33 UTC
We upgraded our NFS client servers to Enterprise 4.  After running for over a
week, we are fairly confident this problem has been resolved.

Comment 69 Ken Snider 2005-05-06 15:04:13 UTC
I'm fairly certain that the 2.4 and 2.6 knfsd's have little to do with one
another, it may be premature to declare the bug squished in RHEL3 just because
the 2.6-kernel-based RHEL 4 doesn't exhibit the bug.

Comment 70 Ernie Petrides 2005-05-06 17:58:09 UTC
Reopening, since this is a RHEL3 bugzilla report.

Comment 71 Steve Dickson 2005-05-09 07:45:48 UTC
A bzip2-ed ethereal trace (i.e. tethereal -w /tmp/dump.pcap)
of this problem would be good.... 

Comment 79 Sev Binello 2005-08-04 20:25:14 UTC
So.......
Is there a solution to the initial "not responding" error message under UDP ?? 
We also have the same problem (is it really indicating a problem ?).
Would be interested in an explanation/solution.

Comment 80 Steve Dickson 2005-08-08 10:29:48 UTC
At this point we believe the initial problem was caused by 
the server's cluster hardware not responding in a timely
manner.

But in general, the "nfs: server not responding" message
is such a generic warning message that it's very hard to 
tell if the client or the server is having problems.
We really need more to go on to decisively tell 
what is going on... 

Comment 81 Sev Binello 2005-08-08 14:03:12 UTC
Well, I'm not running a cluster (yet)
and I  get these messages all the time.
They go away under tcp, but as mentioned earlier in 
bugzilla, tcp is not supported under cluster suite,
and that's exactly what we are going to try to implement.
Didn't get these messages until we upgraded from 7.0 to RHEL 3.
Is there other info that I can send, that could help you analyze the problem ?? 
All our clients produce these messages no matter what server or subnet they access.

Comment 82 Steve Dickson 2005-08-08 14:59:33 UTC
Have you tried increasing the timeo mount option to something
like timeo=14? See nfs man page for details...
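
As a rough illustration of what a larger timeo buys: the nfs man page of this era describes timeo as a value in tenths of a second that is doubled after each timeout, up to a 60 second ceiling, with a major timeout (and the "not responding" message) once retrans retransmissions have occurred. The Python sketch below assumes exactly that behavior; backoff_schedule is a hypothetical helper, the default of 3 for retrans is taken from the man page rather than from this report, and the real client may adapt its timeouts differently.

```python
from typing import List

def backoff_schedule(timeo_ds: int, retrans: int = 3, cap_ds: int = 600) -> List[int]:
    """Waits, in tenths of a second, before each retransmission.

    Assumes the timeout doubles after every minor timeout and is
    capped at cap_ds (60 seconds), as the nfs man page describes.
    """
    waits = []
    wait = timeo_ds
    for _ in range(retrans):
        waits.append(min(wait, cap_ds))
        wait *= 2
    return waits

# 7 is the documented UDP default; 14 and 90 are values tried in this report.
for timeo in (7, 14, 90):
    waits = backoff_schedule(timeo)
    total_s = sum(waits) / 10
    print(f"timeo={timeo}: waits {waits} ds, about {total_s:.1f} s in total")
```

Under these assumptions, doubling timeo from 7 to 14 roughly doubles the time a request may go unanswered before the message is logged.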

Comment 83 Sev Binello 2005-08-09 18:32:56 UTC
Yes (at least on the clients).
Any other ideas ?
Most of the not responding messages
are accompanied by an OK within the same second.
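
To put a number on "within the same second", the message pairs in /var/log/messages can be matched up and the delay between them measured. The following is a small Python sketch, assuming the syslog line format quoted earlier in this report; the regular expression and the helper name recovery_delays are hypothetical, and the sample lines are the ones from the original description.

```python
import re
from datetime import datetime

LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s\S+\skernel: nfs: server (?P<srv>\S+) "
    r"(?P<what>not responding, still trying|OK)\s*$"
)

def recovery_delays(lines):
    """Yield seconds between each 'not responding' and the next 'OK' per server."""
    pending = {}
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue
        # syslog timestamps carry no year; a fixed year is fine for differences
        stamp = datetime.strptime("2004 " + match.group("ts"), "%Y %b %d %H:%M:%S")
        server = match.group("srv")
        if match.group("what") == "OK":
            start = pending.pop(server, None)
            if start is not None:
                yield (stamp - start).total_seconds()
        else:
            pending.setdefault(server, stamp)

sample = [
    "Jun 23 10:07:48 mail5 kernel: nfs: server ecluster not responding, still trying",
    "Jun 23 10:07:48 mail5 kernel: nfs: server ecluster OK",
    "Jun 23 10:07:53 mail5 kernel: nfs: server ecluster not responding, still trying",
    "Jun 23 10:07:53 mail5 kernel: nfs: server ecluster OK",
]
delays = list(recovery_delays(sample))
print(delays)
```

A run of zero-second delays points at requests that were answered almost as soon as the warning fired, while long delays would indicate a server that was really unreachable for a while.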

Comment 84 You Yuanxia 2005-08-25 02:13:20 UTC
Our cluster also has a huge number of "nfs not responding... OK" problems.
We use UDP, 2.4.11


Comment 85 Juanjo Villaplana 2005-09-01 12:56:51 UTC
Hi Lon,

In comment #63 you said "TCP NFS exports with Cluster Manager should work fine".
Is this a tested feature? Which versions support it?

Has anyone tried this?

Regards,
            Juanjo

Comment 87 RHEL Program Management 2007-10-19 19:23:38 UTC
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet that criteria, it is now being closed.
 
For more information of the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.

