From Bugzilla Helper: User-Agent: Mozilla/4.75 [en] (X11; U; HP-UX B.11.11 9000/785) Description of problem: Our NFS server stopped honoring mount requests. The UDP queue for the NFS service appeared stuck. Mount requests from all servers timed-out repeatedly. The NFS server's log showed "rpc.mountd: authenticated mount request" but apparently failed to complete the request. Version-Release number of selected component (if applicable): How reproducible: Didn't try Steps to Reproduce: 1.Run NFS 2. 3. Additional info: The output from the netstat command showed: >netstat -a Active Internet connections (servers and established) Proto Recv-Q Send-Q Local Address Foreign Address State ... udp 184288 0 *:nfs *:* udp 1728 0 *:32769 *:* The permanently stuck queue was the 'nfs' queue shown above, with 184288 bytes queued. Note that udp port 32769 is opened by mountd. This queue would build up for a while and then drop to zero. Doing an 'strace' on the mountd process showed that it saw the mount requests, so mountd was not deadlocked. I noticed these messages in the log, but am not sure if they are related to the problem: Nov 8 14:19:59 genetics iptables: Flushing all chains: succeeded Nov 8 14:19:59 genetics iptables: Removing user defined chains: succeeded Nov 8 14:19:59 genetics iptables: Resetting built-in chains to the default ACCEPT policy succeeded Stopping and restarting NFS via the /etc/rc3.d/ script cleared the problem.
ok so what kernel version are you using and which network card?
Please show a "tcpdump" of the mount port, both from the client and from the server. You should first do a /usr/sbin/rpcinfo -p on the server to see which port it is using for mountd, then /usr/sbin/tcpdump port <portnum> to capture all traffic to or from that port.
Thanks for helping us. Our NIC hardware is this one: eth0: Intel Corp. 82557/8/9 [Ethernet Pro 100], 00:06:5B:88:1D:2A, IRQ 19. We are using kernel version 2.4.18-10smp.
Here is the client info: [root@hydra /etc]# /usr/sbin/rpcinfo -p program vers proto port 100000 2 tcp 111 portmapper 100000 2 udp 111 portmapper 100011 1 udp 801 rquotad 100011 2 udp 801 rquotad 100003 2 udp 2049 nfs 100003 3 udp 2049 nfs 100021 1 udp 32768 nlockmgr 100021 3 udp 32768 nlockmgr 100021 4 udp 32768 nlockmgr 100005 1 udp 32770 mountd 100005 1 tcp 32768 mountd 100005 2 udp 32770 mountd 100005 2 tcp 32768 mountd 100005 3 udp 32770 mountd 100005 3 tcp 32768 mountd 100024 1 udp 32771 status 100024 1 tcp 32769 status Here is the server info: [root@genetics log]# /usr/sbin/rpcinfo -p program vers proto port 100000 2 tcp 111 portmapper 100000 2 udp 111 portmapper 100024 1 udp 32768 status 100024 1 tcp 32768 status 391002 2 tcp 32769 sgi_fam 100011 1 udp 985 rquotad 100011 2 udp 985 rquotad 100011 1 tcp 988 rquotad 100011 2 tcp 988 rquotad 100005 1 udp 32784 mountd 100005 1 tcp 37175 mountd 100005 2 udp 32784 mountd 100005 2 tcp 37175 mountd 100005 3 udp 32784 mountd 100005 3 tcp 37175 mountd 100003 2 udp 2049 nfs 100003 3 udp 2049 nfs 100021 1 udp 32785 nlockmgr 100021 3 udp 32785 nlockmgr 100021 4 udp 32785 nlockmgr Here is the output of tcpdump for the server: [root@genetics dtrusty]# tcpdump port 32784 tcpdump: listening on eth0 21:34:50.991743 hydra.swmed.edu.689 > genetics.swmed.edu.32784: udp 120 (DF) 21:34:50.992573 genetics.swmed.edu.32784 > hydra.swmed.edu.689: udp 56 (DF) The server syslog seems happy: Nov 11 21:34:50 genetics rpc.mountd: authenticated mount request from hydra.swmed.edu:689 for /mnt/databases (/mnt/databases) I get nothing on the client tcpdump: [root@hydra log]# tcpdump port 32770 Kernel filter, protocol ALL, TURBO mode (575 frames), datagram packet socket tcpdump: listening on all devices
I'm finding out a few more symptoms. The order of the mounts seems to affect it greatly. If I mount a filesystem from another server first, the next mounts from the previously timing-out server work! It is 100% repeatable. Any ideas?
I think we should concentrate on the client, since that machine is where we can repeat the error. The client is running kernel 2.4.18-smp. It has these network interfaces: eth0: Intel Corp. 82557 [Ethernet Pro 100], 00:D0:B7:A9:2C:73, IRQ 21. - internal network eth1: NetGear GA620 Gigabit Ethernet at 0xf4000000, irq 18 - connected to the NFS servers Any help will be greatly appreciated.
For the tcpdump on the client, please use the same port as on the server --- we're looking for packets to or from the server's mountd port, not the client's one. We need to see if those packets are getting back to the client OK.
We are seeing some other errors on the client today: Nov 17 05:20:12 hydra kernel: nfs: server morpheus not responding, still trying Nov 18 18:30:43 hydra kernel: nfs: task 10873 can't get a request slot Nov 18 18:32:43 hydra kernel: nfs_statfs: statfs error = 512 Nov 18 18:38:16 hydra kernel: nfs_statfs: statfs error = 512 The host 'morpheus' seems alive and well. Any ideas as to the cause? David
I did a trace on the 'nfs' port (on the client) and I am getting these messages: tcpdump 'port 2049 && host morpheus' Kernel filter, protocol ALL, TURBO mode (575 frames), datagram packet socket tcpdump: listening on all devices 19:30:48.767198 eth1 > hydra.swmed.edu.661373190 > morpheus.swmed.edu.nfs: 108 getattr fh 0,32/655360 (DF) 19:30:48.767577 eth1 < morpheus.swmed.edu.nfs > hydra.swmed.edu.661373190: reply ok 28 getattr ERROR: Communication error on send (DF) 19:31:48.760169 eth1 > hydra.swmed.edu.661373190 > morpheus.swmed.edu.nfs: 108 getattr fh 0,32/655360 (DF) 19:31:48.762127 eth1 < morpheus.swmed.edu.nfs > hydra.swmed.edu.661373190: reply ok 28 getattr ERROR: Communication error on send (DF) 19:32:11.157577 eth1 > hydra.swmed.edu.661373190 > morpheus.swmed.edu.nfs: 108 getattr fh 0,32/655360 (DF) 19:32:11.157972 eth1 < morpheus.swmed.edu.nfs > hydra.swmed.edu.661373190: reply ok 28 getattr ERROR: Communication error on send (DF)
Here is the trace from the server. It seems to be responding properly (ok 28 message). tcpdump: listening on lan0 17:21:53.324681 order.swmed.edu.661373190 > morpheus.swmed..nfs: 108 getattr [|nfs] (DF) 17:21:53.339951 morpheus.swmed..nfs > order.swmed.edu.661373190: reply ok 28 getattr [|nfs] (DF) 17:21:53.340014 order.swmed.edu > morpheus.swmed.edu: icmp: order.swmed.edu udp port 929 unreachable [tos 0xc0] 17:21:53.340108 order.swmed.edu.22 > morpheus.swmed.edu.54670: P 2035704829:2035704973(144) ack 1603569760 win 9648 (DF) [tos 0x10] 17:21:53.344672 order.swmed.edu.22 > morpheus.swmed.edu.54670: P 144:320(176) ack 1 win 9648 (DF) [tos 0x10] 17:21:53.344721 morpheus.swmed.edu.54670 > order.swmed.edu.22: . ack 320 win 32768 (DF) [tos 0x10] 17:22:53.317428 order.swmed.edu.661373190 > morpheus.swmed..nfs: 108 getattr [|nfs] (DF) 17:22:53.317571 morpheus.swmed..nfs > order.swmed.edu.661373190: reply ok 28 getattr [|nfs] (DF) 17:22:53.318068 order.swmed.edu.22 > morpheus.swmed.edu.54670: P 320:464(144) ack 1 win 9648 (DF) [tos 0x10] 17:22:53.318135 order.swmed.edu > morpheus.swmed.edu: icmp: order.swmed.edu udp port 929 unreachable [tos 0xc0] 17:22:53.318596 order.swmed.edu.22 > morpheus.swmed.edu.54670: P 464:640(176) ack 1 win 9648 (DF) [tos 0x10] 17:22:53.318646 morpheus.swmed.edu.54670 > order.swmed.edu.22: . ack 640 win 32768 (DF) [tos 0x10] Any ideas? David
Found a possible related page, regarding write-caching: http://article.gmane.org/gmane.linux.nfs/4249 We found similiar statfs error 512 log entries on a RedHat 8.0 system today, kernel 2.4.18-19.8.0, shortly after increasing the wsize on TCP-based NFS mounts. Is the client running ClearCase, by any chance?
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/