Bug 160514
Summary: nfs client hangs when repeatedly mounting/unmounting disk on IRIX server
Product: [Fedora] Fedora
Component: nfs-utils
Version: 4
Hardware: i386
OS: Linux
Status: CLOSED RAWHIDE
Severity: medium
Priority: medium
Reporter: Dimitri Papadopoulos <dimitri.papadopoulos>
Assignee: Steve Dickson <steved>
QA Contact: Ben Levenson <benl>
Doc Type: Bug Fix
Last Closed: 2005-09-01 11:50:45 UTC
Description
Dimitri Papadopoulos
2005-06-15 15:19:59 UTC
Created attachment 115484 [details]
/var/log/messages
These are the kinds of messages I find in the logs when I experience the NFS client
hanging.
*** Bug 160513 has been marked as a duplicate of this bug. ***

*** Bug 160512 has been marked as a duplicate of this bug. ***

This may be totally unrelated, but I thought I'd report it since I see RPC errors in the logs. Running up2date also generates many RPC errors:
# up2date -l
[...]
do_ypcall: clnt_call: RPC: Timed out
do_ypcall: clnt_call: RPC: Timed out
do_ypcall: clnt_call: RPC: Timed out
[...]
do_ypcall: clnt_call: RPC: Timed out
do_ypcall: clnt_call: RPC: Timed out
[...]
#
Again, maybe these are unrelated RPC problems. I'm reporting them for the sake of completeness.

I've read IBM's excellent "System Management Guide" for AIX, which describes how to troubleshoot NFS and NIS problems:
http://publib.boulder.ibm.com/infocenter/pseries/topic/com.ibm.aix.doc/aixbman/commadmn/nfs_problem.htm
http://publib.boulder.ibm.com/infocenter/pseries/topic/com.ibm.aix.doc/aixbman/nisplus/trbl_nis.htm
Paragraph 5 is interesting:
5. Verify that the mountd, portmap and nfsd daemons are running on the NFS server by entering the following commands at the client shell prompt:
Below are the results against our IRIX 6.5.23 server:
# rpcinfo -u pelles mount
program 100005 version 1 ready and waiting
rpcinfo: RPC: Program/version mismatch; low version = 1, high version = 3
program 100005 version 2 is not available
program 100005 version 3 ready and waiting
#
# rpcinfo -u pelles portmap
rpcinfo: RPC: Timed out
program 100000 version 0 is not available
#
# rpcinfo -u pelles nfs
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting
#
Now if I try the same commands against our Solaris 8 server:
# rpcinfo -u uriens mount
program 100005 version 1 ready and waiting
program 100005 version 2 ready and waiting
program 100005 version 3 ready and waiting
# rpcinfo -u uriens portmap
program 100000 version 2 ready and waiting
program 100000 version 3 ready and waiting
program 100000 version 4 ready and waiting
# rpcinfo -u uriens nfs
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting
#
On the other hand, I get similar results from Fedora Core 2, Red Hat 9, and Solaris 8 clients. NFS and NIS work smoothly on such clients.

Could you please post a system backtrace by echoing a t into /proc/sysrq-trigger (i.e. echo t > /proc/sysrq-trigger)? Note: kernel.sysrq has to be set in /etc/sysctl.conf, or echo 1 > /proc/sys/kernel/sysrq will also work. Also, if possible, could you get a bzip2-compressed tethereal binary network trace (i.e. on the client: tethereal -w /tmp/data.pcap host servername)?

Created attachment 115542 [details]
/var/log/messages since last reboot
I issued these commands:
# up2date
do_ypcall: clnt_call: RPC: Timed out
do_ypcall: clnt_call: RPC: Timed out
[...]
#
# mount /mnt/linux ; umount /mnt/linux
# mount /mnt/linux ; umount /mnt/linux
# mount /mnt/linux ; umount /mnt/linux
mount: pelles:/outils/linux: can't read superblock
umount: /mnt/linux: not mounted
#
and issued a few "echo t > /proc/sysrq-trigger" while:
- mount was hanging, before its "can't read superblock" message,
- up2date was issuing its "RPC: Timed out" messages.
I hope that's what you need. Please do not hesitate to ask for more
information.
I'll see what I can do about the ethereal logs.
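The repeated mount/umount reproduction above can be scripted so that it stops at the first failure and reports which cycle failed. This is only a minimal sketch: the function name `repro_mount_loop` and the `MOUNT_CMD` / `UMOUNT_CMD` override variables are hypothetical helpers introduced here (they are not part of mount(8)), and the loop assumes the mount point already has an /etc/fstab entry, as /mnt/linux does in this report.

```shell
# Hypothetical override hooks so the loop can be dry-run without root;
# by default the real mount and umount commands are used.
MOUNT_CMD=${MOUNT_CMD:-mount}
UMOUNT_CMD=${UMOUNT_CMD:-umount}

# repro_mount_loop MOUNTPOINT [MAX_TRIES]
# Repeats "mount MOUNTPOINT; umount MOUNTPOINT" up to MAX_TRIES times
# (default 10). Prints the cycle at which mount first failed and returns 1,
# or returns 0 if every cycle succeeded.
repro_mount_loop() {
    mountpoint=$1
    max=${2:-10}
    i=1
    while [ "$i" -le "$max" ]; do
        if ! "$MOUNT_CMD" "$mountpoint"; then
            echo "mount failed on iteration $i"
            return 1
        fi
        "$UMOUNT_CMD" "$mountpoint"
        i=$((i + 1))
    done
    echo "no failure in $max iterations"
}
```

Run as root, `repro_mount_loop /mnt/linux 20` would give a convenient moment to issue "echo t > /proc/sysrq-trigger" from another terminal while a mount is hanging.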
Created attachment 115545 [details]
tethereal -w data.pcap host pelles
This network trace covers the following commands:
# mount /mnt/linux ; umount /mnt/linux
# mount /mnt/linux ; umount /mnt/linux
# mount /mnt/linux ; umount /mnt/linux
mount: pelles:/outils/linux: can't read superblock
umount: /mnt/linux: not mounted
#
Hmm... there is really nothing unusual in the ethereal trace. By any chance are you using soft mounts (i.e. using the -o soft mount option)? If so, try using hard mounts to see if the problem goes away.

We're not using soft mounts. Here are the relevant lines from /etc/fstab:
uriens:/usr2 /usr2 nfs auto,ro,intr 0 0
uriens:/var/mail /var/spool/mail nfs auto,intr 0 0
uriens:/export/prod/product /product nfs auto,intr 0 0
pelles:/outils/linux/local /usr/local nfs auto,intr,suid 0 0
pelles:/outils/linux /mnt/linux nfs auto,intr 0 0
Actually this NFS issue is only part of the picture. We're experiencing lots of problems:
* slow boot (it looks like it's related to NFS or NIS problems, which is why I'm trying to identify some reproducible problem to start with)
* general slowness when using the workstation
* it takes minutes to restart X11 and get to the login screen after Ctrl+Alt+Backspace
* up2date is extremely slow (RPC errors as already pointed out)
Maybe this is not directly caused by NFS/NIS problems after all. I'll continue investigating. I'll try booting without NFS and NIS again, and see whether I can identify some other problem.

Could this be a problem with glibc or the kernel?
1) Shutting down both NFS and NIS services:
# up2date
[GUI appears immediately]
#
2) Starting NIS service:
# up2date
do_ypcall: clnt_call: RPC: Timed out
do_ypcall: clnt_call: RPC: Timed out
[message repeated 20 times before GUI appears]
#
3) Shutting down NIS service, starting NFS services:
# mount /mnt/linux ; umount /mnt/linux
# mount /mnt/linux ; umount /mnt/linux
# mount /mnt/linux ; umount /mnt/linux
mount: pelles:/outils/linux: can't read superblock
umount: /mnt/linux: not mounted
#
As you can see, both NIS and NFS have problems on this workstation. I suspect these problems may be related, in which case the cause is probably in glibc or the kernel. If that's the case, I can't explain why I wasn't able to find reports of similar NFS or NIS problems on Google or the Fedora site. Any clue?
Should I file this bug against another component? kernel? glibc?

Boy, it sure looks like you're having network problems... Does 'ifconfig eth?' show any errors on the interface? Also, does 'nfsstat -rc' show a ton of retrans?

On the other hand ping works fine:
# ping -f pelles
PING pelles.shfj.cea.fr (172.16.4.204) 56(84) bytes of data.
--- pelles.shfj.cea.fr ping statistics ---
378 packets transmitted, 378 received, 0% packet loss, time 4169ms
rtt min/avg/max/mdev = 0.177/0.664/1.407/0.252 ms, pipe 2, ipg/ewma 11.059/0.624 ms
#
Also FTP retrieves large files at maximal speed:
# wget ftp://ftp.lip6.fr/pub/linux/distributions/fedora/4/i386/iso/FC4-i386-DVD.iso
--15:38:36-- ftp://ftp.lip6.fr/pub/linux/distributions/fedora/4/i386/iso/FC4-i386-DVD.iso
=> `FC4-i386-DVD.iso'
Resolving ftp.lip6.fr... 195.83.118.1
Connecting to ftp.lip6.fr[195.83.118.1]:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD /pub/linux/distributions/fedora/4/i386/iso ... done.
==> PASV ... done. ==> RETR FC4-i386-DVD.iso ... done.
Length: 2,750,582,784 (unauthoritative)
0% [ ] 1,590,252 421.71K/s ETA 1:46:24
[...]
#
Once I shut down NIS, up2date works just fine. Once I remove all references to the IRIX server from /etc/fstab, the workstation is as fast as, if not much faster than, our FC2 workstations. No more general slowness. So this can't be a general network issue.
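The "ton of retrans" question can be turned into a number by computing retransmissions as a percentage of calls from the 'nfsstat -rc' output. The sketch below assumes the three-line layout that nfsstat prints in this thread (a "calls retrans authrefrsh" header followed by one line of counters); the function name `retrans_percent` is a hypothetical helper, not part of nfs-utils.

```shell
# retrans_percent: read `nfsstat -rc` output on stdin and print the
# retransmissions as a percentage of calls, with two decimals.
# Exits non-zero if the "calls retrans ..." header is never followed
# by a line of counters.
retrans_percent() {
    awk '
        header {
            # This is the line of counters right after the header.
            if ($1 > 0) printf "%.2f\n", 100 * $2 / $1
            else print "0.00"
            found = 1
            exit
        }
        $1 == "calls" && $2 == "retrans" { header = 1 }
        END { if (!found) exit 1 }
    '
}
```

Typical use would be `nfsstat -rc | retrans_percent` after a test such as 'ls -R' on the mounted directory. What counts as "a ton" is a judgment call; a result near zero at least rules out heavy packet loss on the RPC path.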
Here is the information you've asked for:
# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:B0:D0:F9:3E:B2
inet addr:172.16.4.82 Bcast:172.16.7.255 Mask:255.255.252.0
inet6 addr: fe80::2b0:d0ff:fef9:3eb2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:14485 errors:0 dropped:0 overruns:1 frame:0
TX packets:4850 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3579270 (3.4 MiB) TX bytes:521818 (509.5 KiB)
Interrupt:169 Base address:0xec80
#
I'll have to look into nfsstat in more detail; we need to compare the output with and without the IRIX server.

Mounting a directory from the Sun server and issuing 'ls -R':
# mount /usr2
# ls -R /usr2 >/dev/null
# umount /usr2
# nfsstat -rc
Client rpc stats:
calls retrans authrefrsh
704 0 0
#
Mounting a directory from the Sun server and issuing 'ls -R':
# mount /mnt/linux
# ls -R /mnt/linux >/dev/null
[...quite slow, this directory is much larger, but still...]
# umount /mnt/linux
# nfsstat -rc
Client rpc stats:
calls retrans authrefrsh
70780 0 0
#

Ooops... I meant 'IRIX server' instead of 'Sun server' in the second case (/mnt/linux).

OK. Looking again at the messages file in comment #7, the following error appears:
rpc.idmapd: nfsdreopen: Opening '' failed: errno 2 (No such file or directory)
which means you need to upgrade your nfs-utils to the latest and greatest (nfs-utils-1.0.7-8, I believe). This will probably have little effect on the current problems, but it's a known data corruption problem that you really want to fix.

Looking back at comment #5 (the rpcinfo -u pelles portmap in particular), it appears the portmapper has died, which is very bad. That would explain why both NFS and NIS are having timeouts talking to pelles. Could you please verify that the portmapper is alive and well? If it isn't, there *should be* some type of error logged in the /var/log/messages file.

I already have nfs-utils-1.0.7-8.
This is the version initially released with FC4 and there have been no updates as far as I know.

About the portmap issue, I get the same results on Red Hat 9 and Fedora Core 2 workstations: 'rpcinfo' reports a timeout but 'portmap' is actually running:
# cat /etc/redhat-release
Red Hat Linux release 9 (Shrike)
# rpcinfo -u pelles portmap
rpcinfo: RPC: Timed out
program 100000 version 0 is not available
# /etc/init.d/portmap status
portmap (pid 3671) is running...
#
# cat /etc/redhat-release
Fedora Core release 2 (Tettnang)
# rpcinfo -u pelles portmap
rpcinfo: RPC: Timed out
program 100000 version 0 is not available
# /etc/init.d/portmap status
portmap (pid 2471) is running...
#
NFS and NIS work without problems on our Red Hat 9 and Fedora Core 2 workstations.

Mmmh... Today I'm not able to reproduce the general system slowness with NIS shut down. I'm still able to reproduce the general slowness with NIS running. I really don't know, maybe I had mixed things up. Here is what I can reproduce today:
* When the NIS client is running, there is general system slowness. The slowness disappears when the NIS client is shut down. Since I'm interested in getting the NFS issues fixed first, I'll be working with NIS shut down from now on.
* When the NFS client is running (and NIS is shut down), there is no system slowness, even when IRIX disks are mounted. Actually NFS is much faster in Fedora Core 4 than in Fedora Core 2; it looks like caches are better implemented.
* On the other hand, I *still* see the problem with the IRIX server where issuing successive mount/umount commands results in:
mount: pelles:/outils/linux: can't read superblock
* From time to time, I do see a portmap error when shutting down the system:
Sending all processes the TERM signal...
RPC: sendmsg call retruned error 101
portmap: RPC call returned error 101
RPC: failed to contact portmap (errno -101)
Sending all processes the KILL signal...
This seems to happen when an NFS disk is left mounted when shutting down.
I'm not sure it's related to the mount/umount problem though. It may be related to another problem I have which I haven't reported yet: NFS shares with 'auto' in /etc/fstab don't get mounted automatically when starting the system. They probably don't get unmounted when shutting down either... Strangely, I don't find these errors in the logs after reboot.

FYI, we've logged a call with SGI; hopefully they'll find something wrong on the server side about this mount/umount issue.

Note that 'mount -a' often doesn't work when there are entries from the IRIX server in /etc/fstab. I think it's the exact same problem:
# mount -a
# umount pelles:/outils/linux/local pelles:/outils/linux
# mount -a
mount: pelles:/outils/linux: can't read superblock
#
I'll probably open a different bug report about the NIS problem.

Do you know whether it's expected that entries with the 'auto' option are not automatically mounted at boot time? Maybe a side effect of the 'mount -a' problem? I wasn't able to find anything about that in the FC4 manuals and release notes.

I guess we'll never know whether this was a Fedora Core 4 bug or an SGI bug:
1) Our Fedora Core 4 workstation has been upgraded to the latest updates available.
# rpm -q kernel nfs-utils
kernel-2.6.12-1.1447_FC4
nfs-utils-1.0.7-10
2) Our SGI server has been upgraded as well.
$ uname -R
6.5 6.5.26f
I can't reproduce the problem anymore:
# mount /mnt/linux ; umount /mnt/linux
# mount /mnt/linux ; umount /mnt/linux
# mount /mnt/linux ; umount /mnt/linux
# mount /mnt/linux ; umount /mnt/linux
# mount /mnt/linux ; umount /mnt/linux
# mount /mnt/linux ; umount /mnt/linux
# mount /mnt/linux ; umount /mnt/linux
# [...]
So it had to be an SGI server bug.... ;-)
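For anyone re-checking a server after an upgrade like this one, the 'rpcinfo -u' probes used earlier in this report can be condensed into a list of the versions that answered. This is a rough sketch that assumes the "program N version V ready and waiting" line format shown in this thread; `ready_versions` is a hypothetical helper name, not an rpcinfo feature.

```shell
# ready_versions: read `rpcinfo -u HOST PROGRAM` output on stdin and print
# the version numbers reported as "ready and waiting", space separated.
# Lines such as "version 2 is not available" or RPC error messages are ignored.
ready_versions() {
    awk '
        $1 == "program" && $3 == "version" && /ready and waiting/ {
            sep = (out == "") ? "" : " "
            out = out sep $4
        }
        END { print out }
    '
}
```

With the IRIX output from this report, `rpcinfo -u pelles mount | ready_versions` would print "1 3", while the Solaris server would give "1 2 3". An empty line for portmap matches the timeout seen against pelles.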