Red Hat Bugzilla – Full Text Bug Listing
|Summary:||nfs client hangs when repeatedly mounting/unmounting disk on IRIX server|
|Product:||[Fedora] Fedora||Reporter:||Dimitri Papadopoulos <dimitri.papadopoulos>|
|Component:||nfs-utils||Assignee:||Steve Dickson <steved>|
|Status:||CLOSED RAWHIDE||QA Contact:||Ben Levenson <benl>|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2005-09-01 07:50:45 EDT||Type:||---|
Description Dimitri Papadopoulos 2005-06-15 11:19:59 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; fr-FR; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

Description of problem:
Mounting/unmounting the same remote directory repeatedly hangs the client:

    # mount -t nfs pelles:/outils/linux/local /usr/local
    # mount | fgrep pelles
    pelles:/outils/linux/local on /usr/local type nfs (rw,addr=172.16.4.204)
    # umount /usr/local
    #
    # mount -t nfs pelles:/outils/linux/local /usr/local
    # mount | fgrep pelles
    pelles:/outils/linux/local on /usr/local type nfs (rw,addr=172.16.4.204)
    # umount /usr/local
    #
    # mount -t nfs pelles:/outils/linux/local /usr/local
    [***hangs here for a few minutes***]
    mount: pelles:/outils/linux/local: can't read superblock
    # mount | fgrep pelles
    #
    # mount -t nfs pelles:/outils/linux/local /usr/local
    # mount | fgrep pelles
    pelles:/outils/linux/local on /usr/local type nfs (rw,addr=172.16.4.204)
    # umount /usr/local
    #

The NFS server is running IRIX:

    $ uname -aR
    IRIX64 pelles 6.5 6.5.23f 01080747 IP27
    $

I cannot reproduce the client hang with a Solaris 8 NFS server:

    $ uname -a
    SunOS uriens 5.8 Generic_108528-22 sun4u sparc SUNW,Sun-Fire-480R
    $

This could be a problem with IRIX, but on the other hand we have been running Red Hat 9, Fedora Core 2, and Solaris 8 client workstations without such problems for years.

By the way, I'm not mounting/unmounting the same disk for fun. The initial problem was that I was trying to mount a few partitions from the IRIX server. The first mount would succeed, but the second or third mount would always fail with the same "superblock" message.

Version-Release number of selected component (if applicable):
nfs-utils-1.0.7-8

How reproducible:
Sometimes

Steps to Reproduce:
1. mount -t nfs irix:/bar /mnt/foo ; umount /mnt/foo
2. repeat 1 until client hangs

Additional info:
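The steps to reproduce can be automated with a small loop. This is only a sketch: it assumes the mount point has an /etc/fstab entry and that the failure shows up as a nonzero exit status from mount (as with the "can't read superblock" message). `MOUNT_CMD`, `UMOUNT_CMD`, and `repro_loop` are hypothetical names introduced here, not part of nfs-utils.

```shell
#!/bin/sh
# Sketch: repeat mount/umount until a mount attempt fails.
# MOUNT_CMD and UMOUNT_CMD are hypothetical override hooks so the loop can be
# dry-run without root; they default to the real mount and umount commands.
MOUNT_CMD=${MOUNT_CMD:-mount}
UMOUNT_CMD=${UMOUNT_CMD:-umount}

# repro_loop MOUNTPOINT MAX_TRIES
# Prints the number of the first failing attempt and returns 1,
# or returns 0 if every attempt succeeded.
repro_loop() {
    mnt=$1
    max=$2
    i=1
    while [ "$i" -le "$max" ]; do
        if ! $MOUNT_CMD "$mnt"; then
            echo "mount failed on attempt $i"
            return 1
        fi
        # Stop if the unmount itself fails, so the next mount is not misleading.
        $UMOUNT_CMD "$mnt" || return 2
        i=$((i + 1))
    done
    echo "no failure in $max attempts"
    return 0
}

# Example (as root, with /mnt/linux listed in /etc/fstab):
#   repro_loop /mnt/linux 20
```

Because the hang lasts a few minutes before mount gives up, the loop simply waits for the exit status rather than trying to time the call.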
Comment 1 Dimitri Papadopoulos 2005-06-15 11:24:00 EDT
Created attachment 115484 [details]
/var/log/messages

These are the kind of messages I find in the logs when the NFS client hangs.
Comment 2 Dimitri Papadopoulos 2005-06-15 11:49:38 EDT
*** Bug 160513 has been marked as a duplicate of this bug. ***
Comment 3 Dimitri Papadopoulos 2005-06-15 11:50:39 EDT
*** Bug 160512 has been marked as a duplicate of this bug. ***
Comment 4 Dimitri Papadopoulos 2005-06-16 05:11:35 EDT
This may be totally unrelated, but I thought I'd report it since I see RPC errors in the logs. Running up2date also generates many RPC errors:

    # up2date -l
    [...]
    do_ypcall: clnt_call: RPC: Timed out
    do_ypcall: clnt_call: RPC: Timed out
    do_ypcall: clnt_call: RPC: Timed out
    [...]
    do_ypcall: clnt_call: RPC: Timed out
    do_ypcall: clnt_call: RPC: Timed out
    [...]
    #

Again, maybe these are unrelated RPC problems. I'm reporting them for the sake of completeness.
Comment 5 Dimitri Papadopoulos 2005-06-16 05:46:43 EDT
I've read IBM's excellent "System Management Guide" for AIX which describes how to troubleshoot NFS and NIS problems:

http://publib.boulder.ibm.com/infocenter/pseries/topic/com.ibm.aix.doc/aixbman/commadmn/nfs_problem.htm
http://publib.boulder.ibm.com/infocenter/pseries/topic/com.ibm.aix.doc/aixbman/nisplus/trbl_nis.htm

Paragraph 5 is interesting:

    5. Verify that the mountd, portmap and nfsd daemons are running on the NFS server by entering the following commands at the client shell prompt:

Below are the results against our IRIX 6.5.23 server:

    # rpcinfo -u pelles mount
    program 100005 version 1 ready and waiting
    rpcinfo: RPC: Program/version mismatch; low version = 1, high version = 3
    program 100005 version 2 is not available
    program 100005 version 3 ready and waiting
    #
    # rpcinfo -u pelles portmap
    rpcinfo: RPC: Timed out
    program 100000 version 0 is not available
    #
    # rpcinfo -u pelles nfs
    program 100003 version 2 ready and waiting
    program 100003 version 3 ready and waiting
    #

Now if I try the same commands against our Solaris 8 server:

    # rpcinfo -u uriens mount
    program 100005 version 1 ready and waiting
    program 100005 version 2 ready and waiting
    program 100005 version 3 ready and waiting
    # rpcinfo -u uriens portmap
    program 100000 version 2 ready and waiting
    program 100000 version 3 ready and waiting
    program 100000 version 4 ready and waiting
    # rpcinfo -u uriens nfs
    program 100003 version 2 ready and waiting
    program 100003 version 3 ready and waiting
    #

On the other hand I get similar results from Fedora Core 2, Red Hat 9, and Solaris 8 clients. NFS and NIS work smoothly on such clients.
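The three rpcinfo probes above can be wrapped in one helper so that different servers are compared the same way. This is a rough sketch that assumes rpcinfo exits with a nonzero status when any probed version is unavailable or the call times out; `check_rpc` and the `RPCINFO` override are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch: probe the mount, portmap and nfs RPC programs on a server over UDP.
# RPCINFO is a hypothetical override hook; it defaults to the real rpcinfo.
RPCINFO=${RPCINFO:-rpcinfo}

# check_rpc HOST
# Prints one "service: ok" or "service: FAILED" line per program and
# returns 1 if at least one probe failed (assumes rpcinfo signals failure
# through its exit status).
check_rpc() {
    host=$1
    failed=0
    for svc in mount portmap nfs; do
        if $RPCINFO -u "$host" "$svc" >/dev/null 2>&1; then
            echo "$svc: ok"
        else
            echo "$svc: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Example:
#   check_rpc pelles
#   check_rpc uriens
```

A FAILED line is only a hint to rerun the plain rpcinfo command and read its full output, since the helper discards the per-version detail.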
Comment 6 Steve Dickson 2005-06-16 08:36:17 EDT
Could you please post a system backtrace by echoing a t into /proc/sysrq-trigger (i.e. echo t > /proc/sysrq-trigger). Note: kernel.sysrq has to be set in /etc/sysctl.conf, or echo 1 > /proc/sys/kernel/sysrq will also work. Also, if possible, could you get a bzip2-compressed tethereal binary network trace (i.e. on the client: tethereal -w /tmp/data.pcap host servername).
Comment 7 Dimitri Papadopoulos 2005-06-16 10:41:20 EDT
Created attachment 115542 [details]
/var/log/messages since last reboot

I issued these commands:

    # up2date
    do_ypcall: clnt_call: RPC: Timed out
    do_ypcall: clnt_call: RPC: Timed out
    [...]
    #
    # mount /mnt/linux ; umount /mnt/linux
    # mount /mnt/linux ; umount /mnt/linux
    # mount /mnt/linux ; umount /mnt/linux
    mount: pelles:/outils/linux: can't read superblock
    umount: /mnt/linux: not mounted
    #

and issued a few "echo t > /proc/sysrq-trigger" while:
- mount was hanging, before its "can't read superblock" message,
- up2date was issuing its "RPC: Timed out" messages.

I hope that's what you need. Please do not hesitate to ask for more information. I'll see what I can do about the ethereal logs.
Comment 8 Dimitri Papadopoulos 2005-06-16 10:55:35 EDT
Created attachment 115545 [details]
tethereal -w data.pcap host pelles

This network trace covers the following commands:

    # mount /mnt/linux ; umount /mnt/linux
    # mount /mnt/linux ; umount /mnt/linux
    # mount /mnt/linux ; umount /mnt/linux
    mount: pelles:/outils/linux: can't read superblock
    umount: /mnt/linux: not mounted
    #
Comment 9 Steve Dickson 2005-06-16 11:05:31 EDT
Hmm... there is really nothing unusual in the ethereal trace. By chance, are you using soft mounts (i.e. the -o soft mount option)? If so, try using hard mounts to see if the problem goes away.
Comment 10 Dimitri Papadopoulos 2005-06-17 05:08:35 EDT
We're not using soft mounts. Here are the relevant lines from /etc/fstab:

    uriens:/usr2                /usr2            nfs auto,ro,intr   0 0
    uriens:/var/mail            /var/spool/mail  nfs auto,intr      0 0
    uriens:/export/prod/product /product         nfs auto,intr      0 0
    pelles:/outils/linux/local  /usr/local       nfs auto,intr,suid 0 0
    pelles:/outils/linux        /mnt/linux       nfs auto,intr      0 0

Actually this NFS issue is only part of the picture. We're experiencing lots of problems:
* slow boot (it looks like it's related to NFS or NIS problems, which is why I'm trying to identify some reproducible problem to start with)
* general slowness when using the workstation
* takes minutes to restart X11 and get to the login screen after Ctrl+Alt+Backspace
* up2date is extremely slow (RPC errors as already pointed out)

Maybe this is not directly caused by NFS/NIS problems after all. I'll continue investigating: I'll try booting without NFS and NIS again, and see whether I can identify some other problem.
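The soft-mount question can be answered for a whole fstab at once. The following is a minimal sketch that relies on hard being the NFS default when neither option is given; `nfs_mount_types` is a hypothetical helper name, and the optional file argument exists only so the function can be tried on a copy of the file.

```shell
#!/bin/sh
# Sketch: list NFS entries from an fstab-style file and say whether each one
# is a soft or a hard mount. Hard is assumed when "soft" is not listed.

# nfs_mount_types [FILE]   (FILE defaults to /etc/fstab)
nfs_mount_types() {
    awk '
        # Skip comments and lines too short to be fstab entries.
        NF < 4 || $1 ~ /^#/ { next }
        $3 == "nfs" {
            kind = "hard"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "soft") kind = "soft"
            print $1, $2, kind
        }
    ' "${1:-/etc/fstab}"
}

# Example:
#   nfs_mount_types
#   nfs_mount_types /tmp/fstab.copy
```

With the entries quoted above, every line would be reported as hard, which matches the answer given in this comment.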
Comment 11 Dimitri Papadopoulos 2005-06-17 06:37:40 EDT
Could this be a problem with glibc or the kernel?

1) Shutting down both NFS and NIS services:

    # up2date
    [GUI appears immediately]
    #

2) Starting NIS service:

    # up2date
    do_ypcall: clnt_call: RPC: Timed out
    do_ypcall: clnt_call: RPC: Timed out
    [message repeated 20 times before GUI appears]
    #

3) Shutting down NIS service, starting NFS services:

    # mount /mnt/linux ; umount /mnt/linux
    # mount /mnt/linux ; umount /mnt/linux
    # mount /mnt/linux ; umount /mnt/linux
    mount: pelles:/outils/linux: can't read superblock
    umount: /mnt/linux: not mounted
    #

As you can see, both NIS and NFS have problems on this workstation. I suspect these problems may be related, in which case the cause is probably in glibc or the kernel. If that's the case, I can't explain why I wasn't able to find reports of similar NFS or NIS problems on Google or the Fedora site.

Any clue? Should I file this bug against another component? kernel? glibc?
Comment 12 Steve Dickson 2005-06-17 09:04:07 EDT
Boy, it sure looks like you're having network problems.... Does 'ifconfig eth?' show any errors on the interface? Also, does 'nfsstat -rc' show a ton of retrans?
Comment 13 Dimitri Papadopoulos 2005-06-17 09:43:56 EDT
On the other hand ping works fine:

    # ping -f pelles
    PING pelles.shfj.cea.fr (172.16.4.204) 56(84) bytes of data.
    --- pelles.shfj.cea.fr ping statistics ---
    378 packets transmitted, 378 received, 0% packet loss, time 4169ms
    rtt min/avg/max/mdev = 0.177/0.664/1.407/0.252 ms, pipe 2, ipg/ewma 11.059/0.624 ms
    #

Also FTP retrieves large files at maximal speed:

    # wget ftp://ftp.lip6.fr/pub/linux/distributions/fedora/4/i386/iso/FC4-i386-DVD.iso
    --15:38:36-- ftp://ftp.lip6.fr/pub/linux/distributions/fedora/4/i386/iso/FC4-i386-DVD.iso
    => `FC4-i386-DVD.iso'
    Resolving ftp.lip6.fr... 22.214.171.124
    Connecting to ftp.lip6.fr[126.96.36.199]:21... connected.
    Logging in as anonymous ... Logged in!
    ==> SYST ... done. ==> PWD ... done.
    ==> TYPE I ... done. ==> CWD /pub/linux/distributions/fedora/4/i386/iso ... done.
    ==> PASV ... done. ==> RETR FC4-i386-DVD.iso ... done.
    Length: 2,750,582,784 (unauthoritative)
    0% [ ] 1,590,252 421.71K/s ETA 1:46:24
    [...]
    #

Once I shut down NIS, up2date works just fine. Once I remove all references to the IRIX server from /etc/fstab, the workstation is as fast as, if not much faster than, our FC2 workstations. No more general slowness. So this can't be a general network issue.

Here is the information you've asked for:

    # ifconfig eth0
    eth0 Link encap:Ethernet HWaddr 00:B0:D0:F9:3E:B2
         inet addr:172.16.4.82 Bcast:172.16.7.255 Mask:255.255.252.0
         inet6 addr: fe80::2b0:d0ff:fef9:3eb2/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
         RX packets:14485 errors:0 dropped:0 overruns:1 frame:0
         TX packets:4850 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:3579270 (3.4 MiB) TX bytes:521818 (509.5 KiB)
         Interrupt:169 Base address:0xec80
    #

I'll have to look into nfsstat in more detail; we need to compare the output with and without the IRIX server.
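Picking the nonzero error counters out of the ifconfig output can be scripted. The sketch below assumes the older net-tools layout shown above ("RX packets:N errors:N ..."); `iface_errors` is a hypothetical helper name and it reads the ifconfig output from standard input.

```shell
#!/bin/sh
# Sketch: print every nonzero RX/TX error-type counter found in old-style
# ifconfig output read from stdin (format "RX packets:N errors:N ...").

iface_errors() {
    awk '
        ($1 == "RX" || $1 == "TX") && $2 ~ /^packets:/ {
            # Fields 3..NF look like "errors:0", "overruns:1", and so on.
            for (i = 3; i <= NF; i++) {
                split($i, kv, ":")
                if (kv[2] + 0 > 0) print $1, kv[1], kv[2]
            }
        }
    '
}

# Example:
#   ifconfig eth0 | iface_errors
```

Run against the output quoted in this comment, the only counter that is not zero is the single RX overrun.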
Comment 14 Dimitri Papadopoulos 2005-06-17 09:57:13 EDT
Mounting a directory from the Sun server and issuing 'ls -R':

    # mount /usr2
    # ls -R /usr2 >/dev/null
    # umount /usr2
    # nfsstat -rc
    Client rpc stats:
    calls      retrans    authrefrsh
    704        0          0
    #

Mounting a directory from the Sun server and issuing 'ls -R':

    # mount /mnt/linux
    # ls -R /mnt/linux >/dev/null
    [...quite slow, this directory is much larger, but still...]
    # umount /mnt/linux
    # nfsstat -rc
    Client rpc stats:
    calls      retrans    authrefrsh
    70780      0          0
    #
Comment 15 Dimitri Papadopoulos 2005-06-17 09:58:33 EDT
Ooops... I meant 'IRIX server' instead of 'Sun server' in the second case (/mnt/linux).
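The retransmission check requested in comment 12 can be reduced to a single percentage, which makes runs against different servers easier to compare. This sketch assumes the three-line 'nfsstat -rc' layout shown in comment 14 (a header line starting with "calls retrans" followed by the values); `retrans_pct` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read "nfsstat -rc" output on stdin and print the share of RPC calls
# that were retransmitted, as a percentage with two decimals.

retrans_pct() {
    awk '
        # The line right after the "calls retrans ..." header holds the values.
        prev_header && NF >= 2 {
            printf "%.2f\n", ($1 > 0 ? 100 * $2 / $1 : 0)
            exit
        }
        { prev_header = ($1 == "calls" && $2 == "retrans") }
    '
}

# Example:
#   nfsstat -rc | retrans_pct
```

For both runs in comment 14 the result is 0.00, which agrees with the observation that retransmissions are not the problem here.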
Comment 16 Steve Dickson 2005-06-17 11:25:48 EDT
Ok.. Again looking at the message file in comment #7, the following error appears:

    rpc.idmapd: nfsdreopen: Opening '' failed: errno 2 (No such file or directory)

which means you need to upgrade your nfs-utils to the latest and greatest (nfs-utils-1.0.7-8 I believe)... Now this probably will have little effect on the current problems, but it's a known data corruption problem that you really want to fix....

Again looking back at Comment #5 (the rpcinfo -u pelles portmap in particular), it appears the portmapper is dead, which is very bad... That would explain why both NFS and NIS are having timeouts talking to pelles. Could you please verify that the portmapper is alive and well? If it isn't, there *should be* some type of error logged in the /var/log/messages file.
Comment 17 Dimitri Papadopoulos 2005-06-20 05:34:00 EDT
I already have nfs-utils-1.0.7-8. This is the version initially released with FC4 and there have been no updates as far as I know.

About the portmap issue, I get the same results on Red Hat 9 and Fedora Core 2 workstations: 'rpcinfo' reports a timeout but 'portmap' is actually running:

    # cat /etc/redhat-release
    Red Hat Linux release 9 (Shrike)
    # rpcinfo -u pelles portmap
    rpcinfo: RPC: Timed out
    program 100000 version 0 is not available
    # /etc/init.d/portmap status
    portmap (pid 3671) is running...
    #

    # cat /etc/redhat-release
    Fedora Core release 2 (Tettnang)
    # rpcinfo -u pelles portmap
    rpcinfo: RPC: Timed out
    program 100000 version 0 is not available
    # /etc/init.d/portmap status
    portmap (pid 2471) is running...
    #

NFS and NIS work without problems on our Red Hat 9 and Fedora Core 2 workstations.
Comment 18 Dimitri Papadopoulos 2005-06-20 09:34:22 EDT
Mmmh... Today I'm not able to reproduce the general system slowness with NIS shut down. I'm still able to reproduce the general slowness with NIS running. I really don't know, maybe I had mixed things up. Here is what I can reproduce today:

* When the NIS client is running, general system slowness. Slowness disappears when the NIS client is shut down. Since I'm interested in getting the NFS issues fixed first, I'll be working with NIS shut down from now on.

* When the NFS client is running (and NIS is shut down), no system slowness, even when IRIX disks are mounted. Actually NFS is much faster in Fedora Core 4 than in Fedora Core 2; it looks like caches are better implemented.

* On the other hand I *still* see the problem with the IRIX server where issuing successive mount/umount commands results in:

    mount: pelles:/outils/linux: can't read superblock

* From time to time, I do see a portmap error when shutting down the system:

    Sending all processes the TERM signal...
    RPC: sendmsg call retruned error 101
    portmap: RPC call returned error 101
    RPC: failed to contact portmap (errno -101)
    Sending all processes the KILL signal...

  This seems to happen when an NFS disk is left mounted when shutting down. I'm not sure it's related to the mount/umount problem though. It may be related to another problem I have which I haven't reported yet: NFS shares with 'auto' in /etc/fstab don't get mounted automatically when starting the system. They probably don't get unmounted when shutting down either... Strangely I don't find these errors in the logs after reboot.
Comment 19 Dimitri Papadopoulos 2005-06-21 11:48:24 EDT
FYI, we've logged a call with SGI; hopefully they'll find something wrong on the server side about this mount/umount issue.

Note that 'mount -a' often doesn't work when there are entries from the IRIX server in /etc/fstab. I think it's the exact same problem:

    # mount -a
    # umount pelles:/outils/linux/local pelles:/outils/linux
    # mount -a
    mount: pelles:/outils/linux: can't read superblock
    #

I'll probably open a different bug report about the NIS problem.

Do you know whether it's expected that entries with the 'auto' option are not automatically mounted at boot-time? Maybe a side-effect of the 'mount -a' problem? I wasn't able to find anything about that in the FC4 manuals and release notes.
Comment 20 Dimitri Papadopoulos 2005-08-30 05:36:59 EDT
I guess we'll never know whether this was a Fedora Core 4 bug or an SGI bug:

1) Our Fedora Core 4 workstation has been upgraded to the latest updates available.

    # rpm -q kernel nfs-utils
    kernel-2.6.12-1.1447_FC4
    nfs-utils-1.0.7-10

2) Our SGI server has been upgraded as well.

    $ uname -R
    6.5 6.5.26f

I can't reproduce the problem anymore:

    # mount /mnt/linux ; umount /mnt/linux
    # mount /mnt/linux ; umount /mnt/linux
    # mount /mnt/linux ; umount /mnt/linux
    # mount /mnt/linux ; umount /mnt/linux
    # mount /mnt/linux ; umount /mnt/linux
    # mount /mnt/linux ; umount /mnt/linux
    # mount /mnt/linux ; umount /mnt/linux
    #
    [...]
Comment 21 Steve Dickson 2005-09-01 07:50:45 EDT
So it had to be an SGI server bug.... ;-)