Bug 160514

Summary: NFS client hangs when repeatedly mounting/unmounting a disk from an IRIX server
Product: [Fedora] Fedora
Reporter: Dimitri Papadopoulos <dimitri.papadopoulos>
Component: nfs-utils
Assignee: Steve Dickson <steved>
Status: CLOSED RAWHIDE
QA Contact: Ben Levenson <benl>
Severity: medium
Priority: medium    
Version: 4   
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2005-09-01 11:50:45 UTC
Attachments:
Description                            Flags
/var/log/messages                      none
/var/log/messages since last reboot    none
tethereal -w data.pcap host pelles     none

Description Dimitri Papadopoulos 2005-06-15 15:19:59 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; fr-FR; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

Description of problem:
Mounting/unmounting the same remote directory repeatedly hangs the client:

# mount -t nfs pelles:/outils/linux/local /usr/local
# mount | fgrep pelles
pelles:/outils/linux/local on /usr/local type nfs (rw,addr=172.16.4.204)
# umount /usr/local
# 
# mount -t nfs pelles:/outils/linux/local /usr/local
# mount | fgrep pelles
pelles:/outils/linux/local on /usr/local type nfs (rw,addr=172.16.4.204)
# umount /usr/local
# 
# mount -t nfs pelles:/outils/linux/local /usr/local
[***hangs here for a few minutes***]
mount: pelles:/outils/linux/local: can't read superblock
# mount | fgrep pelles
#
# mount -t nfs pelles:/outils/linux/local /usr/local
# mount | fgrep pelles
pelles:/outils/linux/local on /usr/local type nfs (rw,addr=172.16.4.204)
# umount /usr/local
# 

The NFS server is running IRIX:
$ uname -aR
IRIX64 pelles 6.5 6.5.23f 01080747 IP27
$ 

I cannot reproduce the client hanging with a Solaris 8 NFS server:
$ uname -a
SunOS uriens 5.8 Generic_108528-22 sun4u sparc SUNW,Sun-Fire-480R
$ 

This could be a problem with IRIX, but on the other hand we have been running Red Hat 9, Fedora Core 2, and Solaris 8 client workstations without such problems for years.

By the way, I'm not mounting/unmounting the same disk for fun. The initial problem was that I was trying to mount a few partitions from the IRIX server. The first mount would succeed, but the second or third mount would always fail with the same "superblock" message.


Version-Release number of selected component (if applicable):
nfs-utils-1.0.7-8

How reproducible:
Sometimes

Steps to Reproduce:
1. mount -t nfs irix:/bar /mnt/foo ; umount /mnt/foo
2. repeat 1 until client hangs
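
The two steps above can be scripted. The following is only a minimal sketch and not part of the original report: the function name repro_loop, the MOUNT/UMOUNT override variables and the attempt limit are illustrative assumptions, and irix:/bar and /mnt/foo are the placeholders from step 1.

```shell
#!/bin/sh
# Sketch: repeat a mount/umount cycle until mount fails.
# repro_loop, MOUNT and UMOUNT are hypothetical names introduced for illustration;
# MOUNT/UMOUNT default to the real commands and can be stubbed for a dry run.
repro_loop() {
    export_path=$1      # e.g. irix:/bar
    mount_point=$2      # e.g. /mnt/foo
    max=${3:-50}        # give up after this many successful cycles
    i=1
    while [ "$i" -le "$max" ]; do
        if ! ${MOUNT:-mount} -t nfs "$export_path" "$mount_point"; then
            echo "mount failed on attempt $i"
            return 1
        fi
        ${UMOUNT:-umount} "$mount_point" || return 2
        i=$((i + 1))
    done
    echo "no failure in $max attempts"
}

# Usage (as root): repro_loop irix:/bar /mnt/foo 20
```

Since the failure is intermittent, a run that ends with "no failure" does not prove the problem is gone; it only bounds how many cycles were tried.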


Additional info:

Comment 1 Dimitri Papadopoulos 2005-06-15 15:24:00 UTC
Created attachment 115484 [details]
/var/log/messages

These are the kind of messages I find in the logs when I experience the NFS client
hanging.

Comment 2 Dimitri Papadopoulos 2005-06-15 15:49:38 UTC
*** Bug 160513 has been marked as a duplicate of this bug. ***

Comment 3 Dimitri Papadopoulos 2005-06-15 15:50:39 UTC
*** Bug 160512 has been marked as a duplicate of this bug. ***

Comment 4 Dimitri Papadopoulos 2005-06-16 09:11:35 UTC
This is maybe totally unrelated, but I thought I'd report this since I see RPC
errors in the logs. Running up2date also generates many RPC errors:
	# up2date -l
	[...]
	do_ypcall: clnt_call: RPC: Timed out
	do_ypcall: clnt_call: RPC: Timed out
	do_ypcall: clnt_call: RPC: Timed out
	[...]
	do_ypcall: clnt_call: RPC: Timed out
	do_ypcall: clnt_call: RPC: Timed out
	[...]
	# 
Again, maybe these are unrelated RPC problems. I'm reporting them for the sake of
completeness.


Comment 5 Dimitri Papadopoulos 2005-06-16 09:46:43 UTC
I've read IBM's excellent "System Management Guide" for AIX which describes how
to troubleshoot NFS and NIS problems:
http://publib.boulder.ibm.com/infocenter/pseries/topic/com.ibm.aix.doc/aixbman/commadmn/nfs_problem.htm
http://publib.boulder.ibm.com/infocenter/pseries/topic/com.ibm.aix.doc/aixbman/nisplus/trbl_nis.htm

Paragraph 5 is interesting:

5. Verify that the mountd, portmap and nfsd daemons are running on the NFS
server by entering the following commands at the client shell prompt:

Below are the results against our IRIX 6.5.23 server:

# rpcinfo -u pelles mount
program 100005 version 1 ready and waiting
rpcinfo: RPC: Program/version mismatch; low version = 1, high version = 3
program 100005 version 2 is not available
program 100005 version 3 ready and waiting
# 
# rpcinfo -u pelles portmap
rpcinfo: RPC: Timed out
program 100000 version 0 is not available
# 
# rpcinfo -u pelles nfs
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting
# 

Now if I try the same commands against our Solaris 8 server:

# rpcinfo -u uriens mount
program 100005 version 1 ready and waiting
program 100005 version 2 ready and waiting
program 100005 version 3 ready and waiting
# rpcinfo -u uriens portmap
program 100000 version 2 ready and waiting
program 100000 version 3 ready and waiting
program 100000 version 4 ready and waiting
# rpcinfo -u uriens nfs
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting
# 

On the other hand I get similar results from Fedora Core 2, Red Hat 9, and
Solaris 8 clients. NFS and NIS work smoothly on such clients.
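
To compare servers more quickly, the rpcinfo output can be reduced to the list of versions that answered. This is a sketch that assumes only the "program N version V ready and waiting" line format shown in this comment; ready_versions is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list the program versions that `rpcinfo -u` reports as ready.
# ready_versions is a hypothetical helper; it only parses text on stdin,
# in the "program N version V ready and waiting" format shown above.
ready_versions() {
    awk '$1 == "program" && $3 == "version" && /ready and waiting/ { print $4 }'
}

# Usage: rpcinfo -u pelles mount 2>&1 | ready_versions
```

Fed the IRIX "mount" output above, it prints 1 and 3; an empty result corresponds to the timed-out portmap query.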


Comment 6 Steve Dickson 2005-06-16 12:36:17 UTC
Could you please post a system backtrace by echoing
a 't' into /proc/sysrq-trigger (i.e. echo t > /proc/sysrq-trigger).
Note: kernel.sysrq has to be set in /etc/sysctl.conf, or
echo 1 > /proc/sys/kernel/sysrq will also work.

Plus, if possible, could you get a bzip2-compressed tethereal binary
network trace (i.e. on the client: tethereal -w /tmp/data.pcap host servername).
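
The request above could be wrapped up roughly as follows. This is a sketch under stated assumptions, not an official procedure: sysrq_enabled and dump_tasks are hypothetical helper names, and the /proc paths are passed as parameters so the logic can be tried against scratch files without root.

```shell
#!/bin/sh
# Sketch: enable magic SysRq if needed, then request a task dump.
# sysrq_enabled and dump_tasks are hypothetical helper names. The file paths
# default to the real /proc entries named in the comment above.
sysrq_enabled() {
    ctl=${1:-/proc/sys/kernel/sysrq}
    # Assumption: any non-zero value is treated as "enabled".
    [ -r "$ctl" ] && [ "$(cat "$ctl")" != "0" ]
}

dump_tasks() {
    ctl=${1:-/proc/sys/kernel/sysrq}
    trigger=${2:-/proc/sysrq-trigger}
    sysrq_enabled "$ctl" || echo 1 > "$ctl"
    echo t > "$trigger"     # the backtraces are written to the kernel log
}

# Usage (as root), while mount is hanging:
#   dump_tasks
#   tethereal -w /tmp/data.pcap host servername   # then: bzip2 /tmp/data.pcap
```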



Comment 7 Dimitri Papadopoulos 2005-06-16 14:41:20 UTC
Created attachment 115542 [details]
/var/log/messages since last reboot

I issued these commands:

	# up2date
	do_ypcall: clnt_call: RPC: Timed out
	do_ypcall: clnt_call: RPC: Timed out
	[...]
	# 

	# mount /mnt/linux ; umount /mnt/linux
	# mount /mnt/linux ; umount /mnt/linux
	# mount /mnt/linux ; umount /mnt/linux
	mount: pelles:/outils/linux: can't read superblock
	umount: /mnt/linux: not mounted
	# 

and issued a few "echo t > /proc/sysrq-trigger" while:
- mount was hanging, before its "can't read superblock" message,
- up2date was issuing its "RPC: Timed out" messages.

I hope that's what you need. Please do not hesitate to ask for more
information.

I'll see what I can do about the ethereal logs.

Comment 8 Dimitri Papadopoulos 2005-06-16 14:55:35 UTC
Created attachment 115545 [details]
tethereal -w data.pcap host pelles

This network trace covers the following commands:
	# mount /mnt/linux ; umount /mnt/linux
	# mount /mnt/linux ; umount /mnt/linux
	# mount /mnt/linux ; umount /mnt/linux
	mount: pelles:/outils/linux: can't read superblock
	umount: /mnt/linux: not mounted
	#

Comment 9 Steve Dickson 2005-06-16 15:05:31 UTC
Hmm... there is really nothing unusual in the ethereal trace....
By chance, are you using soft mounts (i.e. using the -o soft mount option)?
If so, try using hard mounts to see if the problem goes away...
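
One way to check for soft mounts is to scan /etc/fstab for NFS entries carrying the "soft" option. The sketch below does only that; soft_nfs_mounts is a hypothetical helper name, and it will not see options given by hand with mount -o.

```shell
#!/bin/sh
# Sketch: print NFS entries in an fstab-format file that use the "soft" option.
# soft_nfs_mounts is a hypothetical helper name; the file defaults to /etc/fstab.
soft_nfs_mounts() {
    # Wrap the option list in commas so "soft" only matches as a whole option.
    awk '$1 !~ /^#/ && $3 == "nfs" && index("," $4 ",", ",soft,") { print $1, $2 }' \
        "${1:-/etc/fstab}"
}

# Usage: soft_nfs_mounts            # no output means no soft NFS entries
```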

Comment 10 Dimitri Papadopoulos 2005-06-17 09:08:35 UTC
We're not using soft mounts, here are the relevant lines from /etc/fstab:

uriens:/usr2                /usr2               nfs     auto,ro,intr    0 0
uriens:/var/mail            /var/spool/mail     nfs     auto,intr       0 0
uriens:/export/prod/product /product            nfs     auto,intr       0 0
pelles:/outils/linux/local  /usr/local          nfs     auto,intr,suid  0 0
pelles:/outils/linux        /mnt/linux          nfs     auto,intr       0 0

Actually this NFS issue is only part of the picture. We're experiencing lots of
problems:
* slow boot (it looks like it's related to NFS or NIS problems, which is why I'm
trying to identify some reproducible problem to start with)
* general slowness when using the workstation
* takes minutes to restart X11 and get to the login screen after Ctrl+Alt+Backspace
* up2date is extremely slow (RPC errors as already pointed out)

Maybe this is not directly caused by NFS/NIS problems after all.

I'll continue investigating, I'll try booting without NFS and NIS again, and see
whether I can identify some other problem.


Comment 11 Dimitri Papadopoulos 2005-06-17 10:37:40 UTC
Could this be a problem with glibc or the kernel?

1) Shutting down both NFS and NIS services:
   # up2date
   [GUI appears immediately]
   # 

2) Starting NIS service:
   # up2date
   do_ypcall: clnt_call: RPC: Timed out
   do_ypcall: clnt_call: RPC: Timed out
   [message repeated 20 times before GUI appears]
   # 

3) Shutting down NIS service, starting NFS services:
   # mount /mnt/linux ; umount /mnt/linux
   # mount /mnt/linux ; umount /mnt/linux
   # mount /mnt/linux ; umount /mnt/linux
   mount: pelles:/outils/linux: can't read superblock
   umount: /mnt/linux: not mounted
   # 

As you can see both NIS and NFS have problems on this workstation. I suspect
these problems may be related, in which case the cause is probably in glibc or
the kernel. If that's the case, I can't explain why I wasn't able to find
reports of similar NFS or NIS problems on Google or the Fedora site.

Any clue? Should I file this bug against another component? kernel? glibc?


Comment 12 Steve Dickson 2005-06-17 13:04:07 UTC
Boy, it sure looks like you're having network problems....

Does 'ifconfig eth?' show any errors on the interface?
also does 'nfsstat -rc' show a ton of retrans?
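
To put a number on "a ton of retrans", the counters can be turned into a percentage. This is a sketch; retrans_pct is a hypothetical helper name and it assumes the "calls retrans authrefrsh" header followed by one line of counters, as printed by nfsstat -rc.

```shell
#!/bin/sh
# Sketch: compute the retransmission percentage from `nfsstat -rc` output on stdin.
# retrans_pct is a hypothetical helper name; it expects a header line starting
# with "calls retrans" followed by a line holding the matching counters.
retrans_pct() {
    LC_ALL=C awk '$1 == "calls" && $2 == "retrans" {
        if ((getline) > 0 && $1 > 0) printf "%.2f\n", 100 * $2 / $1
        else print "0.00"
    }'
}

# Usage: nfsstat -rc | retrans_pct
```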



Comment 13 Dimitri Papadopoulos 2005-06-17 13:43:56 UTC
On the other hand ping works fine:
# ping -f pelles
PING pelles.shfj.cea.fr (172.16.4.204) 56(84) bytes of data.
--- pelles.shfj.cea.fr ping statistics ---
378 packets transmitted, 378 received, 0% packet loss, time 4169ms
rtt min/avg/max/mdev = 0.177/0.664/1.407/0.252 ms, pipe 2, ipg/ewma 11.059/0.624 ms
#

Also FTP retrieves large files at maximal speed:
# wget ftp://ftp.lip6.fr/pub/linux/distributions/fedora/4/i386/iso/FC4-i386-DVD.iso
--15:38:36-- 
ftp://ftp.lip6.fr/pub/linux/distributions/fedora/4/i386/iso/FC4-i386-DVD.iso
           => `FC4-i386-DVD.iso'
Resolving ftp.lip6.fr... 195.83.118.1
Connecting to ftp.lip6.fr[195.83.118.1]:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD /pub/linux/distributions/fedora/4/i386/iso ... done.
==> PASV ... done.    ==> RETR FC4-i386-DVD.iso ... done.
Length: 2,750,582,784 (unauthoritative)

 0% [                                    ] 1,590,252    421.71K/s  ETA 1:46:24
[...]
# 

Once I shut down NIS, up2date works just fine.

Once I remove all references to the IRIX server from /etc/fstab, the workstation
is as fast as, if not much faster than, our FC2 workstations. No more general slowness.


So this can't be a general network issue.

Here is the information you've asked for:
# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:B0:D0:F9:3E:B2
          inet addr:172.16.4.82  Bcast:172.16.7.255  Mask:255.255.252.0
          inet6 addr: fe80::2b0:d0ff:fef9:3eb2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:14485 errors:0 dropped:0 overruns:1 frame:0
          TX packets:4850 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3579270 (3.4 MiB)  TX bytes:521818 (509.5 KiB)
          Interrupt:169 Base address:0xec80
# 

I'll have to look into nfsstat in more detail, we need to compare the output
with and without the IRIX server.


Comment 14 Dimitri Papadopoulos 2005-06-17 13:57:13 UTC
Mounting a directory from the Sun server and issuing 'ls -R':

	# mount /usr2
	# ls -R /usr2 >/dev/null
	# umount /usr2
	# nfsstat -rc
	Client rpc stats:
	calls      retrans    authrefrsh
	704        0          0
	# 

Mounting a directory from the Sun server and issuing 'ls -R':

	# mount /mnt/linux
	# ls -R /mnt/linux >/dev/null
	[...quite slow, this directory is much larger, but still...]
	# umount /mnt/linux
	# nfsstat -rc
	Client rpc stats:
	calls      retrans    authrefrsh
	70780      0          0
	# 


Comment 15 Dimitri Papadopoulos 2005-06-17 13:58:33 UTC
Ooops... I meant 'IRIX server' instead of 'Sun server' in the second case
(/mnt/linux).

Comment 16 Steve Dickson 2005-06-17 15:25:48 UTC
Ok.. Again looking at the message file in comment #7 there appears the following
error: rpc.idmapd: nfsdreopen: Opening '' failed: errno 2 (No such file or
directory)

which means you need to upgrade your nfs-utils to the latest and greatest.
(nfs-utils-1.0.7-8 I believe)...

Now this probably will have little effect on the current problems, but it's a known
data corruption problem that you really want to fix....

Again looking back at Comment #5 (the rpcinfo -u pelles portmap in
particular), it appears the portmapper is dead, which is very bad... That would
explain why both NFS and NIS are having timeouts talking to pelles.
Could you please verify that the portmapper is alive and well, and if
it isn't, there *should be* some type of error logged in the /var/log/messages
file


Comment 17 Dimitri Papadopoulos 2005-06-20 09:34:00 UTC
I already have nfs-utils-1.0.7-8. This is the version initially released with
FC4 and there have been no updates as far as I know.

About the portmap issue, I get the same results on Red Hat 9 and Fedora Core 2
workstations, 'rpcinfo' reports a timeout but 'portmap' is actually running:

	# cat /etc/redhat-release
	Red Hat Linux release 9 (Shrike)
	# rpcinfo -u pelles portmap
	rpcinfo: RPC: Timed out
	program 100000 version 0 is not available
	# /etc/init.d/portmap status
	portmap (pid 3671) is running...
	# 

	# cat /etc/redhat-release
	Fedora Core release 2 (Tettnang)
	# rpcinfo -u pelles portmap
	rpcinfo: RPC: Timed out
	program 100000 version 0 is not available
	# /etc/init.d/portmap status
	portmap (pid 2471) is running...
	# 

NFS and NIS work without problems on our Red Hat 9 and Fedora Core 2 workstations.


Comment 18 Dimitri Papadopoulos 2005-06-20 13:34:22 UTC
Mmmh...

Today I'm not able to reproduce the general system slowness with NIS shut down.
I'm still able to reproduce the general slowness with NIS running. I really
don't know, maybe I had mixed up things.

Here is what I can reproduce today:

* When NIS client is running, general system slowness. Slowness disappears when
NIS client is shut down. Since I'm interested in getting the NFS issues fixed
first, I'll be working with NIS shut down from now on.

* When NFS client is running (and NIS is shut down) no system slowness, even
when IRIX disks are mounted. Actually NFS is much faster in Fedora Core 4 than
in Fedora Core 2, it looks like caches are better implemented.

* On the other hand I *still* see the problem with the IRIX server where issuing
successive mount/umount commands results in:
	mount: pelles:/outils/linux: can't read superblock

* From time to time, I do see a portmap error when shutting down the system:
	Sending all processes the TERM signal...
	RPC: sendmsg call returned error 101
	portmap: RPC call returned error 101
	RPC: failed to contact portmap (errno -101)
	Sending all processes the KILL signal...
This seems to be happening when an NFS disk is left mounted when shutting down.
I'm not sure it's related to the mount/umount problem though. It may be related
to another problem I have which I haven't reported yet: NFS shares with 'auto'
in /etc/fstab don't get mounted automatically when starting the system. They
probably don't get unmounted when shutting down either... Strangely I don't find
these errors in the logs after reboot.


Comment 19 Dimitri Papadopoulos 2005-06-21 15:48:24 UTC
FYI, we've logged a call with SGI, hopefully they'll find something wrong on the
server side about this mount/umount issue. Note that 'mount -a' often doesn't
work when there are entries from the Irix server in /etc/fstab, I think it's the
exact same problem:
	# mount -a
	# umount pelles:/outils/linux/local pelles:/outils/linux
	# mount -a
	mount: pelles:/outils/linux: can't read superblock
	# 

I'll probably open a different bug report about the NIS problem.

Do you know whether it's expected that entries with the 'auto' option are not
automatically mounted at boot-time? Maybe a side-effect of the 'mount -a'
problem? I wasn't able to find anything about that in the FC4 manuals and
release notes.


Comment 20 Dimitri Papadopoulos 2005-08-30 09:36:59 UTC
I guess we'll never know whether this was a Fedora Core 4 bug or an SGI bug:
1) Our Fedora Core 4 workstation has been upgraded to the latest updates available.
   	# rpm -q kernel nfs-utils
   	kernel-2.6.12-1.1447_FC4
   	nfs-utils-1.0.7-10
2) Our SGI server has been upgraded as well.
   	$ uname -R
   	6.5 6.5.26f

I can't reproduce the problem anymore:
	# mount /mnt/linux ; umount /mnt/linux
	# mount /mnt/linux ; umount /mnt/linux
	# mount /mnt/linux ; umount /mnt/linux
	# mount /mnt/linux ; umount /mnt/linux
	# mount /mnt/linux ; umount /mnt/linux
	# mount /mnt/linux ; umount /mnt/linux
	# mount /mnt/linux ; umount /mnt/linux
	# [...]


Comment 21 Steve Dickson 2005-09-01 11:50:45 UTC
So it had to be a SGI server bug.... ;-)