Bug 77553 - NFS server stops honoring mount requests - stuck NFS UDP queue
Summary: NFS server stops honoring mount requests - stuck NFS UDP queue
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2002-11-08 22:22 UTC by Need Real Name
Modified: 2008-08-01 16:22 UTC
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-09-30 15:40:10 UTC
Embargoed:



Description Need Real Name 2002-11-08 22:22:40 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.75 [en] (X11; U; HP-UX B.11.11 9000/785)

Description of problem:
Our NFS server stopped honoring mount requests.  The UDP queue for the NFS
service appeared stuck.  Mount requests from all servers timed out repeatedly.

The NFS server's log showed "rpc.mountd: authenticated mount request", but
mountd apparently failed to complete the request.

Version-Release number of selected component (if applicable):


How reproducible:
Didn't try

Steps to Reproduce:
1. Run NFS
2.
3.
	

Additional info:

The output from the netstat command showed:

>netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
...
udp   184288      0 *:nfs                   *:*                                 
udp     1728      0 *:32769                 *:*   

The permanently stuck queue was the 'nfs' queue shown above, with 184288 bytes
queued.
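The stuck-queue symptom can be checked for without eyeballing the whole netstat listing.  Below is a small sketch that pulls the UDP Recv-Q figure for the nfs socket out of netstat-style output; it is fed the captured sample above so it runs offline, but in practice you would pipe in live `netstat -a` output and watch whether the value ever drains.

```shell
# Extract the UDP Recv-Q (in bytes) for the *:nfs socket from
# netstat-style output.  A captured sample stands in for live output
# here so the sketch is self-contained; substitute `netstat -a` in
# real use.
sample='udp   184288      0 *:nfs                   *:*
udp     1728      0 *:32769                 *:*'

recvq=$(printf '%s\n' "$sample" | awk '$1 == "udp" && $4 == "*:nfs" { print $2 }')
echo "nfs Recv-Q: $recvq bytes"
```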

Note that udp port 32769 is opened by mountd.  This queue would build up for a
while and then drop to zero.  Doing an 'strace' on the mountd process showed
that it saw the mount requests, so mountd was not deadlocked.

I noticed these messages in the log, but am not sure if they are related to the
problem:
Nov  8 14:19:59 genetics iptables: Flushing all chains: succeeded
Nov  8 14:19:59 genetics iptables: Removing user defined chains: succeeded
Nov  8 14:19:59 genetics iptables: Resetting built-in chains to the default
ACCEPT policy succeeded

Stopping and restarting NFS via the /etc/rc3.d/ script cleared the problem.
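(For reference, on Red Hat Linux 7.x the restart that clears the stuck queue amounts to invoking the SysV init script directly; the /etc/rc3.d/ entry is normally a symlink to it.  Paths assume the stock layout and may differ on other installs.)

```shell
# Restart the NFS service via the stock SysV init script (Red Hat Linux
# 7.x layout assumed; /etc/rc3.d/S60nfs is a symlink to this script).
/etc/rc.d/init.d/nfs restart

# Then confirm the UDP receive queue on the nfs port has drained:
netstat -anu | grep 2049
```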

Comment 1 Arjan van de Ven 2002-11-11 17:07:57 UTC
ok so what kernel version are you using and which network card?

Comment 2 Stephen Tweedie 2002-11-11 17:25:36 UTC
Please show a "tcpdump" of the mount port, both from the client and from the
server.  You should first do a

  /usr/sbin/rpcinfo -p

on the server to see which port it is using for mountd, then 

  /usr/sbin/tcpdump port <portnum>

to capture all traffic to or from that port.
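The two steps above can be chained.  Here is a sketch that parses `rpcinfo -p` output for mountd's UDP port and builds the matching tcpdump command; a captured sample stands in for live rpcinfo output so it runs offline, and the tcpdump command is printed rather than executed.

```shell
# Find mountd's UDP port in `rpcinfo -p` output, then construct the
# tcpdump invocation.  Sample data substitutes for a live rpcinfo run.
sample='   program vers proto   port
    100005    1   udp  32784  mountd
    100005    1   tcp  37175  mountd'

port=$(printf '%s\n' "$sample" | awk '$5 == "mountd" && $3 == "udp" { print $4; exit }')
echo "/usr/sbin/tcpdump port $port"
```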

Comment 3 Need Real Name 2002-11-12 00:30:30 UTC
Thanks for helping us.

Our NIC hardware is this one:
   eth0: Intel Corp. 82557/8/9 [Ethernet Pro 100], 00:06:5B:88:1D:2A, IRQ 19.

We are using kernel version 2.4.18-10smp.



Comment 4 Need Real Name 2002-11-12 03:11:02 UTC
Here is the client info:

[root@hydra /etc]#  /usr/sbin/rpcinfo -p
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100011    1   udp    801  rquotad
    100011    2   udp    801  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100021    1   udp  32768  nlockmgr
    100021    3   udp  32768  nlockmgr
    100021    4   udp  32768  nlockmgr
    100005    1   udp  32770  mountd
    100005    1   tcp  32768  mountd
    100005    2   udp  32770  mountd
    100005    2   tcp  32768  mountd
    100005    3   udp  32770  mountd
    100005    3   tcp  32768  mountd
    100024    1   udp  32771  status
    100024    1   tcp  32769  status


Here is the server info:

[root@genetics log]#  /usr/sbin/rpcinfo -p
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp  32768  status
    100024    1   tcp  32768  status
    391002    2   tcp  32769  sgi_fam
    100011    1   udp    985  rquotad
    100011    2   udp    985  rquotad
    100011    1   tcp    988  rquotad
    100011    2   tcp    988  rquotad
    100005    1   udp  32784  mountd
    100005    1   tcp  37175  mountd
    100005    2   udp  32784  mountd
    100005    2   tcp  37175  mountd
    100005    3   udp  32784  mountd
    100005    3   tcp  37175  mountd
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100021    1   udp  32785  nlockmgr
    100021    3   udp  32785  nlockmgr
    100021    4   udp  32785  nlockmgr


Here is the output of tcpdump for the server:

[root@genetics dtrusty]# tcpdump port 32784
tcpdump: listening on eth0
21:34:50.991743 hydra.swmed.edu.689 > genetics.swmed.edu.32784:  udp 120 (DF)
21:34:50.992573 genetics.swmed.edu.32784 > hydra.swmed.edu.689:  udp 56 (DF)

The server syslog seems happy:

Nov 11 21:34:50 genetics rpc.mountd: authenticated mount request from
hydra.swmed.edu:689 for /mnt/databases (/mnt/databases) 


I get nothing on the client tcpdump:

[root@hydra log]# tcpdump port 32770
Kernel filter, protocol ALL, TURBO mode (575 frames), datagram packet socket
tcpdump: listening on all devices






Comment 5 Need Real Name 2002-11-12 03:22:00 UTC
I'm finding out a few more symptoms.  

The order of the mounts seems to affect it greatly.  If I mount a filesystem
from another server first, the next mounts from the previously timing-out
server work!  It is 100% repeatable.

Any ideas?


Comment 6 Need Real Name 2002-11-12 05:33:39 UTC
I think we should concentrate on the client, since that machine is 
where we can repeat the error.  The client is running kernel 2.4.18-smp.

It has these network interfaces:

eth0: Intel Corp. 82557 [Ethernet Pro 100], 00:D0:B7:A9:2C:73, IRQ 21.
   - internal network

eth1: NetGear GA620 Gigabit Ethernet at 0xf4000000, irq 18
   - connected to the NFS servers

Any help will be greatly appreciated.


Comment 7 Stephen Tweedie 2002-11-12 10:00:34 UTC
For the tcpdump on the client, please use the same port as on the server:
we're looking for packets to or from the server's mountd port, not the
client's.  We need to see whether those packets are getting back to the
client OK.

Comment 8 Need Real Name 2002-11-18 22:38:54 UTC
We are seeing some other errors on the client today:
  Nov 17 05:20:12 hydra kernel: nfs: server morpheus not responding, still
trying
  Nov 18 18:30:43 hydra kernel: nfs: task 10873 can't get a request slot
  Nov 18 18:32:43 hydra kernel: nfs_statfs: statfs error = 512
  Nov 18 18:38:16 hydra kernel: nfs_statfs: statfs error = 512

The host 'morpheus' seems alive and well.

Any ideas as to the cause?

David

Comment 9 Need Real Name 2002-11-18 23:19:33 UTC
I did a trace on the 'nfs' port (on the client) and I am getting these messages:
tcpdump 'port 2049 && host morpheus'
Kernel filter, protocol ALL, TURBO mode (575 frames), datagram packet socket
tcpdump: listening on all devices
19:30:48.767198 eth1 > hydra.swmed.edu.661373190 > morpheus.swmed.edu.nfs: 108
getattr fh 0,32/655360 (DF)
19:30:48.767577 eth1 < morpheus.swmed.edu.nfs > hydra.swmed.edu.661373190: reply
ok 28 getattr ERROR: Communication error on send (DF)
19:31:48.760169 eth1 > hydra.swmed.edu.661373190 > morpheus.swmed.edu.nfs: 108
getattr fh 0,32/655360 (DF)
19:31:48.762127 eth1 < morpheus.swmed.edu.nfs > hydra.swmed.edu.661373190: reply
ok 28 getattr ERROR: Communication error on send (DF)
19:32:11.157577 eth1 > hydra.swmed.edu.661373190 > morpheus.swmed.edu.nfs: 108
getattr fh 0,32/655360 (DF)
19:32:11.157972 eth1 < morpheus.swmed.edu.nfs > hydra.swmed.edu.661373190: reply
ok 28 getattr ERROR: Communication error on send (DF)


Comment 10 Need Real Name 2002-11-18 23:26:41 UTC
Here is the trace from the server.  It seems to be responding properly (ok 28
message).

tcpdump: listening on lan0
17:21:53.324681 order.swmed.edu.661373190 > morpheus.swmed..nfs: 108 getattr
[|nfs] (DF)
17:21:53.339951 morpheus.swmed..nfs > order.swmed.edu.661373190: reply ok 28
getattr [|nfs] (DF)
17:21:53.340014 order.swmed.edu > morpheus.swmed.edu: icmp: order.swmed.edu udp
port 929 unreachable [tos 0xc0] 
17:21:53.340108 order.swmed.edu.22 > morpheus.swmed.edu.54670: P
2035704829:2035704973(144) ack 1603569760 win 9648 (DF) [tos 0x10] 
17:21:53.344672 order.swmed.edu.22 > morpheus.swmed.edu.54670: P 144:320(176)
ack 1 win 9648 (DF) [tos 0x10] 
17:21:53.344721 morpheus.swmed.edu.54670 > order.swmed.edu.22: . ack 320 win
32768 (DF) [tos 0x10] 
17:22:53.317428 order.swmed.edu.661373190 > morpheus.swmed..nfs: 108 getattr
[|nfs] (DF)
17:22:53.317571 morpheus.swmed..nfs > order.swmed.edu.661373190: reply ok 28
getattr [|nfs] (DF)
17:22:53.318068 order.swmed.edu.22 > morpheus.swmed.edu.54670: P 320:464(144)
ack 1 win 9648 (DF) [tos 0x10] 
17:22:53.318135 order.swmed.edu > morpheus.swmed.edu: icmp: order.swmed.edu udp
port 929 unreachable [tos 0xc0] 
17:22:53.318596 order.swmed.edu.22 > morpheus.swmed.edu.54670: P 464:640(176)
ack 1 win 9648 (DF) [tos 0x10] 
17:22:53.318646 morpheus.swmed.edu.54670 > order.swmed.edu.22: . ack 640 win
32768 (DF) [tos 0x10] 

Any ideas?

David


Comment 11 Michael Pelletier 2004-01-29 01:37:49 UTC
Found a possible related page, regarding write-caching: 
http://article.gmane.org/gmane.linux.nfs/4249

We found similar statfs error 512 log entries on a Red Hat 8.0 system
today, kernel 2.4.18-19.8.0, shortly after increasing the wsize on
TCP-based NFS mounts.

Is the client running ClearCase, by any chance?

Comment 12 Bugzilla owner 2004-09-30 15:40:10 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/


