We've been running stock Enterprise 3 kernels (2.4.21-4.ELsmp) because of a problem we began experiencing with 2.4.21-9.0.1.ELsmp. The original kernel does not exhibit the problem. The issue exists in every kernel update since 9.0.1, including the latest 15.0.2. Our problem is continuous errors in /var/log/messages generated by the kernel:

Jun 23 10:07:48 mail5 kernel: nfs: server ecluster not responding, still trying
Jun 23 10:07:48 mail5 kernel: nfs: server ecluster OK
Jun 23 10:07:53 mail5 kernel: nfs: server ecluster not responding, still trying
Jun 23 10:07:53 mail5 kernel: nfs: server ecluster OK

Server 'ecluster' is a RedHat Enterprise 3 AS NFS cluster. The "timeo" option in /etc/fstab seems to have no effect. We've tried the following fstab entries:

ecluster:/var/mail /var/mail nfs rw,async 0 0
ecluster:/var/mail /var/mail nfs rw,async,hard,intr 0 0
ecluster:/var/mail /var/mail nfs rw,async,hard,intr,timeo=90 0 0
ecluster:/var/mail /var/mail nfs rw,async,wsize=1024,rsize=1024 0 0
ecluster:/var/mail /var/mail nfs rw,async,wsize=8192,rsize=8192 0 0

Others have told us this problem is the result of an incorrect NFS timeout value in the kernel, and is fixed in later (non-Enterprise) kernels:

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=M3oM.78t.9%40gated-at.bofh.it

However, we could find no bug ID, so we're submitting this to see if there's a solution in 2.4.21-15.0.2.ELsmp.

Version-Release number of selected component (if applicable): 2.4.21-15.0.2.ELsmp

How reproducible: Always

Steps to Reproduce:
1. Install RedHat Enterprise 3 from the original media.
2. Run up2date on the kernel to bring it to 2.4.21-15.0.2.ELsmp.
3. Mount a volume from a RedHat Enterprise 3 NFS cluster.
4. Generate a light load of traffic to the volume.
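As a sanity check when an fstab option seems to be ignored, the options the kernel actually applied can be read back from /proc/mounts. A minimal sketch, parsing a sample mount line with hypothetical option values (on a live system you would grep /proc/mounts for the mount point directly):

```shell
#!/bin/sh
# On a live system: grep ' /var/mail ' /proc/mounts
# Here we parse a sample line (hypothetical values) to show where the
# effective timeo/retrans options appear.
line='ecluster:/var/mail /var/mail nfs rw,v3,udp,hard,intr,timeo=90,retrans=3,addr=ecluster 0 0'
opts=$(echo "$line" | awk '{print $4}')
# Pull the effective timeo and retrans values out of the option string.
timeo=$(echo "$opts" | tr ',' '\n' | sed -n 's/^timeo=//p')
retrans=$(echo "$opts" | tr ',' '\n' | sed -n 's/^retrans=//p')
echo "timeo=$timeo retrans=$retrans"
```

If the values printed here don't match what was put in /etc/fstab, the option never took effect (e.g. a typo, or the mount wasn't remounted after editing fstab).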
Actual Results: Errors are generated in /var/log/messages:

Jun 23 10:07:48 mail5 kernel: nfs: server ecluster not responding, still trying
Jun 23 10:07:48 mail5 kernel: nfs: server ecluster OK
Jun 23 10:07:53 mail5 kernel: nfs: server ecluster not responding, still trying
Jun 23 10:07:53 mail5 kernel: nfs: server ecluster OK

Expected Results: No errors, as in kernel-2.4.21-4.ELsmp, which works flawlessly.

Additional info:
Try removing the async option to see if the messages go away.
We had already tried that, but we tried it again, and the messages persist:

Jul 16 10:32:08 mail14 kernel: nfs: server ecluster not responding, still trying
Jul 16 10:32:08 mail14 kernel: nfs: server ecluster OK
Jul 16 10:40:31 mail14 kernel: nfs: server ecluster not responding, still trying
Jul 16 10:40:31 mail14 kernel: nfs: server ecluster OK
What does nfsstat -rc show -- an abundant amount of retrans? Also, does ifconfig eth? show any errors?
nfsstat -rc:

Fri Jul 30 08:03:00 PDT 2004
calls      retrans    authrefrsh
34597602   207082     0
Fri Jul 30 08:04:00 PDT 2004
calls      retrans    authrefrsh
34600977   207108     0

ifconfig eth1 (dedicated to NFS):

RX packets:64320794 errors:0 dropped:0 overruns:0 frame:0
TX packets:39745262 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3136520972 (2991.2 Mb) TX bytes:1599758946 (1525.6 Mb)
Interrupt:29

Isn't this just a case of changing the timeout back to a higher value? We see no adverse effects, and have no problems with the stock kernels.
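For what it's worth, the calls and retrans counters from nfsstat -rc are cumulative since boot, so the meaningful figure is the per-interval delta. A small sketch computing the per-minute rate from the two samples quoted above:

```shell
#!/bin/sh
# The nfsstat -rc counters are cumulative; subtract two samples taken
# one minute apart to get the rate. Figures taken from the samples above.
calls1=34597602 retrans1=207082
calls2=34600977 retrans2=207108
dcalls=$((calls2 - calls1))
dretrans=$((retrans2 - retrans1))
echo "calls/min=$dcalls retrans/min=$dretrans"
```

By this arithmetic the client above is retransmitting roughly 26 RPCs per minute out of ~3375 calls, i.e. under 1% -- enough to trigger the log messages without any tangible performance impact.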
wrt http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=M3oM.78t.9%40gated-at.bofh.it we already have all the patches this is referring to... Who is the server and are you using soft mounts?
The NFS server is a Red Hat Enterprise AS 3 2.4.21-4.0.2.ELsmp in a passive/active SCSI cluster. The NFS client servers are using hard mounts, the default. We've tried several other fstab options, with no change (such as setting a high timeo value). The previous thread was an old example. Here is one from July 21st, right on RedHat's forum:

http://www.redhat.com/archives/taroon-list/2004-July/msg00296.html

Also: http://www.experts-exchange.com/Networking/Linux_Networking/Q_21005301.html
It seems nfs over TCP works in this scenario, why isn't that a valid option? It would drop your retrans from 207082 to zero because TCP would be doing the retrans...
NFS over TCP is not an option:

http://www.redhat.com/docs/manuals/enterprise/RHEL-3-Manual/cluster-suite/ch-netfs-service.html

"NFS cannot be configured to run over TCP with Red Hat Cluster Manager. For proper failover capabilities, NFS must run over the default UDP."
I'm going to reassign this to Lon, our Cluster guy, to see if he might see something I'm missing... but in general these messages are due to a flaky network or an NFS server that is spending a lot of time talking to the local filesystem. Would it be possible to run a "top -i" on the server to capture what's happening on the server when these messages appear on the client?
This really doesn't look like a cluster bug, particularly if there's no stress testing+continuous relocation happening.
This looks like there's slow disk access or high load on the clustered machines, preventing them from serving the NFS requests.
[root@eclu2 ~]$ uptime
11:20:05 up 119 days, 12:18, 6 users, load average: 0.33, 0.56, 0.45

Please let us reiterate that:

* We do not see the NFS timeouts when a client is using a kernel prior to 2.4.21-9.0.1.ELsmp. We have 5 identical RH ES 3 client servers, connected to the same cluster, that do not display timeouts with 2.4.21-4.ELsmp.
* There are no tangible performance impacts or other errors; it's the flood of timeout messages we're concerned about.

Others have reported this same scenario (see previous redhat forum link). Thanks for continuing to look into this. We are stuck using old kernels for now.
I have a customer asking about an identical problem to this bugzilla. Any chance of an update to see where a potential solution stands?
Is this still happening in more recent RHEL kernels?
We had this problem occur with RHEL3-U3, so yes.
To get another view, would it be possible to post an ethereal trace of these timeouts? Also, are there any other errors (besides the timeout messages) being reported in /var/log/messages?
I cannot, unfortunately, as we've switched to NFS over TCP to get around the problem -- however, hopefully the original reporter can?
Created attachment 105492 [details] tcpdump of the most recent hang.
Here's an update from one of my customers who is encountering the same problem:
----------------
Here's a tcpdump (attached above) of our most recent hang, requiring a reboot. Note that because this is TCP, there are no timeouts per se, but the connections for these files (being accessed via vim) end up hanging indefinitely. In the capture, the file being opened by UID 2005 is one that hangs indefinitely. This is a small text-only README file (only a few k). .81 is the server, and .42 is the client with the hung connections. It should be pointed out that shortened timeouts seem to have no effect, and setting hard,intr does *not* allow us to kill the applications; essentially we're left with no recourse but to reboot the server.
Would it be possible to get an AltSysRq-t trace of the hung process and/or of the nfsd on the server (if applicable)?
Created attachment 105531 [details] alt-sysrq-t capture from client and server
Created attachment 105532 [details] ethereal capture from the same time period
The sysrq file contains traces first from the client server, then from one of the nfsd processes on the server (they were all at identical places in the stack at the time of the trace, though I obviously cannot guarantee it was this specific nfsd servicing the request). The ethereal capture was taken as the "cat" was being executed.
The ethereal trace shows normal "cat"-like traffic, although it does not show the NFS_READ, which means either the data is in the local page cache or the process hung before the RPC to the server. The alt-sysrq-t shows the cat process hung waiting for a lock on a page. So the question is: who has that lock and why? Could you please post the entire alt-sysrq-t?
Created attachment 105547 [details] full sysrq from client server This contains the full trace dump from the NFS client. Note that there were several clients, all hung in the D state - including cc1plus, which had been hung for a good half hour at the time this trace was taken.
The box hung again today, and it seems to mainly occur under moderate IO load (specifically compiling). That would seem to correlate with it being a locking issue, since there's a lot of read/write traffic.
Ken, when this box locks up is it just nfs accesses that stop or local file systems as well? Larry Woodman
Exclusively NFS. The admin accounts have local home directories and are unaffected when we log in to look at the issue. The problem happens as follows:

- Box boots, everything works normally.
- At an indeterminate point, an application will hang on a read request for NFS (this usually shows itself during a compile, since that's obviously IO/lock intensive, but can also rear its head with a simple cat or vim).
- After this point, *any* NFS read *may* hang, including commands like "df". Local disk access continues normally.
- At this point, the hung processes are unkillable, even though the NFS share is mounted hard,intr, and the only way to clear the issue is a reboot.
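Processes hung this way sit in uninterruptible sleep ("D" state), which is why they ignore signals even on an intr mount. A minimal sketch for spotting them, filtering a sample of ps output (the PIDs, wait channels, and command names below are hypothetical; on a live box the input would come from `ps -eo pid,stat,wchan:30,comm`):

```shell
#!/bin/sh
# List processes stuck in uninterruptible sleep (state "D"), the
# signature of a hung NFS operation. Sample ps output with
# hypothetical values stands in for: ps -eo pid,stat,wchan:30,comm
sample='  678 S    rpciod
 1234 D    nfs_wait_on_request           cc1plus
 1240 D    nfs_wait_on_request           vim
 2001 R    -                             top'
# Print PID and command for every D-state entry.
echo "$sample" | awk '$2 ~ /^D/ {print $1, $4}'
```

The wait channel column (wchan) is often enough on its own to tell an NFS hang from a local-disk hang before bothering with a full AltSysRq-t dump.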
Ken this looks like bug 118839.
Created attachment 105734 [details]
logs before 18/Oct/2004 with kernel 21-4.ELsmp and after with kernel 21-20.ELsmp

Hi, I have absolutely the same problem with 2.4.21-4.ELsmp, and none of the messages with 2.4.21-20.ELsmp all day. In my case my server is just one NFS client of a Tru64 cluster acting as the NFS server; when it is a little more loaded, the warnings begin. I have other servers with the same hardware but running Slackware, and at the same time they don't complain. I had periodic hangs (every 4-6 days) with the 2.4.21-9 kernels, so we went back to the stock kernel, which stays up a full 150 days. In the meanwhile I found the Gigabit Ethernet driver was out of date and updated it to e1000-5.2.52.tar.gz. Now with kernel 2.4.21-20.ELsmp, I see that Red Hat has also updated the Intel driver, so I'm giving the new kernel a try; it has been running for 7 days without a hang so far...
In regards to Comment #31, this specific problem doesn't *seem* to be the problem in bug 118839 - the RPC messages don't show up in dmesg, and this box isn't under the kinds of client pressure suggested in that bug; this is one NFS server talking to one NFS client. The client is experiencing the issue about once a day, though. This is a real pain for us because these are developer home directories mounted on our compile server, and as a result about once a day someone's compile process ends up hanging due to this bug and the server has to be rebooted. I'm almost at the point where I want to switch back to the kernel that did *not* experience this problem, but if I do I won't be able to assist in future testing. Larry, was my full trace output of any use?
bug 129861 appears related to the issue I've been describing -- if not the same issue entirely.
Well, (supposedly) the fix for bug 129861 is in the U4 beta kernel, which is available through the RHN beta channel. I believe it's kernel-2.4.21-22.EL (or 23.EL). Please give that a try to see if the problem clears up.
Correction.... the fix is in bz 118839
Just wanted to verify that the fix in bug 118839 (in kernel 2.4.21-22.EL) did *not* correct the problem. The client machine I upgraded on Thursday finally failed today, with the same hangs. I will be attaching an updated trace momentarily.
Created attachment 106300 [details] alt-sysrq-t update from client running 2.4.21-22.EL
OK, Ken, thanks for trying that out for us... There has been some recent talk about unlink issues on a number of the NFS mailing lists; I'm looking into whether they may or may not apply in this case...
Created attachment 106333 [details]
Upstream patch that fixes hangs in NFS unlink code.

It seems the rpciod (pid 678) in the AltSysRq-t trace in comment #38 is stuck in nfs_complete_unlink(), which *could* mean that a wake-up was missed, which in turn would cause the rest of the NFS processes to hang. It just so happens that a 2.4 patch appeared recently that fixes missed wake-ups in the unlink code. Please try the attached patch to see if it clears up the hang.
On behalf of Ken (we work together) I grabbed the src.rpm for 2.4.21-23.EL and added the patch from Steve, rebuilt the RPM with patch and installed on the server in question. The kernel is now live, we'll see how this works with regards to the NFS issue.
Update for everyone. We applied this patch to the U4-beta kernel (2.4.21-23.EL) and applied this kernel to the NFS client (only) on the 9th. The server ran without issue until the 16th (so much so that I was about to report "Problem Solved!", as this was a significant change from the "fail every 12-48 hours" rate we had seen in the past). However, on the 16th, the server again failed in exactly the same way: hung NFS connections, and things as simple as vim-ing a 2kb file on the server would hang in the D state. So it appears the problem has been made, at minimum, less common; however, it has *not* been eliminated. Unfortunately, the admin that handled the server rebooted it without grabbing a trace, but I will post one as soon as the problem manifests itself again.
note: bug 139101 seems to report the same issue as bug 129861, which doesn't seem to be the same issue as this bug, so it would appear this issue is in fact different from the two bugs stated above. That just leaves the question of whether bug 118839 and this bug are related. :)
I can now verify that the bug isn't resolved. We've seen this problem pop up twice in the last two hours. I'll attach traces again for both incidents - you'll notice they seem to, again, be locked on the same kernel read-lock issue as before. The kernel running on this client is 2.4.21-23.EL, with the patch from comment #39 installed. The server is stock RHEL3U3.
Created attachment 107395 [details] Trace of hung NFS client running modified kernel above (2.4.21-23.EL w/patch)
Created attachment 107396 [details] second trace from same client, 15 minutes after reboot for previous attachment, same symptoms
Created attachment 107532 [details]
unlink-deadlock patch

Please try the attached patch. This was the fix to a similar problem, so hopefully it will help with this one.
Steve, does this obsolete the patch to unlink.c which was posted on 2004-11-09 @ 09:51? Should we build the kernel with or without the other patch in addition to the one you just posted?
No.... both seem to be needed...
Rebuilt kernel-2.4.21-23.EL with the two patches dated:
2004-11-09 09:51
2004-11-29 07:15
Installed on the problematic system and rebooted. Ken or I will update this ticket after the box runs for a while under the new kernel.
Hi, I'm also interested in a resolution of this, as we see it on a bunch of our NFS clients too. We don't get a hang, but access times out, load jumps up significantly, and performance drops for a while. We mostly seem to see this on the system which has read/write access via NFS - we have 10 NFS clients and 1 server, and 9 of the clients mount read-only. I'm running the kernel from RHEL3 QU4 beta 2 (2.4.21-25 smp). Regards, -jason
Well, knock on wood, but the kernel installed by Chris on Nov 29 has been running without incident on our NFS client - I'm prepared to *tentatively* declare this one beat - it would appear that the locking issue has been resolved.
Just to be clear: applying both patches (Comment #41 and Comment #48) stops the client from hanging, right? Are you still seeing the "... not responding" messages?
We haven't seen the "not responding" messages since switching to NFS over TCP; doing that is what switched us from those messages to the hung TCP connections (more or less as expected...).
Ken, the patches did solve the hanging problem you were seeing, correct?
They appear to have, yes.
I applied the patches in Comment #41 and Comment #48 to the latest Enterprise release (2.4.21-20.0.1.EL). Unfortunately, we continue to see the errors:

Dec 14 14:38:02 mail5 kernel: nfs: server ecluster not responding, still trying
Dec 14 14:38:02 mail5 kernel: nfs: server ecluster OK
Dec 14 14:41:32 mail5 kernel: nfs: server ecluster not responding, still trying
Dec 14 14:41:32 mail5 kernel: nfs: server ecluster OK

Please remember this is connecting to a RedHat NFS Cluster system, and as such, it cannot use TCP connections. I reverted again to 2.4.21-4.EL, which doesn't exhibit this problem.
Continuing comment #58: the vanilla 2.4.28 kernel doesn't exhibit this problem either, so maybe using kernel 2.4.28 as the base kernel could be a good idea.
Hi, more news about this issue; still running with kernel 2.4.21-20. We have a "Nortel Networks Alteon 2424 SSL", and if we put the client or the server behind the Alteon for load balancing, NFS stops working and we get very high loads. But with Slackware 10 and kernel 2.4.28, none of this happens. We have many Slackware boxes and one Red Hat box, and the problem only appears on the Red Hat one. We also have problems with NFS on Tru64, but the Tru64 systems are much older, and so are their NICs.
Sergio, are you using UDP or TCP? If UDP, please try TCP.
See comment #8. TCP is not an option for those serving NFS with a RedHat Enterprise Cluster.
Actually, using TCP NFS exports with Cluster Manager should work fine.
Just a clarification: I am talking about an NFS client connecting to a Unix NFS server. I had the idea, though I'm not sure, that the connections are made over TCP.
Hello all, I have FC3:

2.6.10-1.770_FC3smp #1 SMP Thu Feb 24 14:20:06 EST 2005 i686 i686 i386 GNU/Linux

I analyzed the network and discovered this: when starting the pivot to the root filesystem, the client issues an NFS RPC Access call (FH: 0x3d74bff4). The server responds with RPC accepted, BUT THE UDP PACKET DOES NOT HAVE A CORRECT CHECKSUM, SO THE PACKET IS DISCARDED. The client then sends the same RPC Access call again and again; the server responds every time, but the client discards the packets because the checksum is not correct. The network file system also makes a Lookup call to FH 0xc17cbff4 (the server is the new Linux knfsd type); the server accepts the procedure, but again the UDP packet has a bad checksum. That's why it takes so long to accept the connection. If you want, I can send you my packet capture. So I think the problem is in how the server builds the UDP packets. Do you know how this can be solved? Regards, Ivan Olvera. Many thanks in advance.
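If bad UDP checksums are suspected, tcpdump's verbose mode verifies checksums and annotates corrupt packets, which avoids eyeballing hex dumps. A sketch that greps for the annotation in a sample of verbose output (the addresses, handles, and checksum values below are made up, and the exact annotation text varies by tcpdump version):

```shell
#!/bin/sh
# On a live system, tcpdump verifies checksums in verbose mode:
#   tcpdump -vv -s 0 udp port 2049 | grep 'bad udp cksum'
# Here the same grep runs over a sample of verbose output
# (hypothetical addresses and checksum values).
sample='10.0.0.81.2049 > 10.0.0.42.800: reply ok 96 access c 0001 (bad udp cksum 1f3a!)
10.0.0.42.800 > 10.0.0.81.2049: 112 access fh Unknown/1 0001 (udp cksum ok)'
# Count packets flagged with a bad UDP checksum.
echo "$sample" | grep -c 'bad udp cksum'
```

One caveat: NICs with checksum offload make outgoing packets look "bad" when captured on the sending host itself, so a capture taken on the receiving side is more trustworthy for this kind of diagnosis.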
Ivan, yes, a bzip2-ed ethereal trace (i.e. tethereal -w /tmp/dump.pcap) would be good... Also, what type of load is occurring when this happens? Failover, heavy IO (i.e. reads or writes)?
I am experiencing the same NFS hangs with Version 3 Update 4. I have non-clustered servers using NFS over TCP to a NetApp filer. An strace of the "df" command shows that the hang starts as soon as df accesses the NFS mount point.
We upgraded our NFS client servers to Enterprise 4. After running for over a week, we are fairly confident this problem has been resolved.
I'm fairly certain that the 2.4 and 2.6 knfsd's have little to do with one another, it may be premature to declare the bug squished in RHEL3 just because the 2.6-kernel-based RHEL 4 doesn't exhibit the bug.
Reopening, since this is a RHEL3 bugzilla report.
A bzip2-ed ethereal trace (i.e. tethereal -w /tmp/dump.pcap) of this problem would be good....
So... is there a solution to the initial "not responding" error message under UDP? We also have the same problem (is it really indicating a problem?). We would be interested in an explanation/solution.
At this point we believe the initial problem was caused by the server's cluster hardware not responding in a timely manner. But in general, the "nfs: server not responding" message is such a generic warning that it's very hard to tell whether the client or the server is having problems. We really need more to go on to tell decisively what is going on...
Well, I'm not running a cluster (yet) and I get these messages all the time. They go away under TCP but, as mentioned earlier in this bugzilla, TCP is not supported under Cluster Suite, and that's exactly what we are going to try to implement. We didn't get these messages until we upgraded from 7.0 to RHEL 3. Is there other info that I can send that could help you analyze the problem? All our clients produce these messages no matter what server or subnet they access.
Have you tried increasing the timeo mount option to something like timeo=14? See nfs man page for details...
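For reference, timeo on UDP mounts is expressed in tenths of a second (so timeo=14 means 1.4 seconds before the first retransmission, doubling on each retry up to the retrans count before the "not responding" message is logged). A hypothetical fstab line with both options set explicitly:

```
ecluster:/var/mail  /var/mail  nfs  rw,hard,intr,timeo=14,retrans=5  0 0
```

The values here are illustrative only; the right numbers depend on the network round-trip time to the server.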
Yes (at least on the clients). Any other ideas? Most of the "not responding" messages are accompanied by an OK within the same second.
Our cluster also has a huge number of nfs "not responding ... OK" problems. We use UDP, 2.4.11.
Hi Lon, in comment #63 you said "TCP NFS exports with Cluster Manager should work fine". Is this a tested feature? Which versions support it? Has anyone tried this? Regards, Juanjo
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet that criteria, it is now being closed. For more information of the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.