146629 – Failure to mount from multiple servers in a short period of time

Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	autofs
Version:	3
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Jeff Moyer

Reported:	2005-01-31 03:50 UTC by Stuart Anderson
Modified:	2007-11-30 22:10 UTC
CC List:	4 users
Last Closed:	2005-04-20 17:38:01 UTC
Type:	---

Attachments
/etc/auto.master (383 bytes, text/plain) 2005-01-31 19:34 UTC, Stuart Anderson
/etc/auto.data (6.44 KB, text/plain) 2005-01-31 19:35 UTC, Stuart Anderson
Output of netstat before automounting (2.25 KB, text/plain) 2005-01-31 19:36 UTC, Stuart Anderson
Output of netstat after automount mounted 55 out of a requested 60 filesystems (43.54 KB, text/plain) 2005-01-31 19:37 UTC, Stuart Anderson
syslog output from running automount with --debug (41.00 KB, text/plain) 2005-04-20 03:04 UTC, Stuart Anderson
/etc/exports from one of the NFS servers (34 bytes, text/plain) 2005-04-20 03:05 UTC, Stuart Anderson

Description Stuart Anderson 2005-01-31 03:50:34 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3)
Gecko/20041020

Description of problem:
On a dual-Xeon 4GB server running FC3 I am unable to successfully
issue more than 52 automount requests to unique Linux (also FC3)
NFS servers in less than a minute.

After 52 mounts I get the following error. This is reproducible.

Jan 30 10:40:22 beowulf automount[22882]: >> nfs bindresvport: Address
already in use
Jan 30 10:40:22 beowulf automount[22882]: mount(nfs): nfs: mount
failure node54:/usr1 on /data/node54
Jan 30 10:40:22 beowulf automount[22882]: failed to mount /data/node54

After waiting ~1 minute I am able to mount another 52 filesystems
before getting the same error again.

This problem does not occur on a RedHat 9 client machine nor on
a Solaris 9 client.

This limitation is breaking an application on a large Beowulf cluster
that cross-mounts data between 290 GigE attached nodes. I have also
reproduced this problem with a script that explicitly calls
/bin/mount rather than relying on automount.

Version-Release number of selected component (if applicable):
autofs-4.1.3-28

How reproducible:
Always

Steps to Reproduce:
1. setup /etc/auto.data to have /data/node* point to 290 Linux NFS
servers.
2. umount /data/node*
3. ls -l /data/node*/known_file
    

Actual Results:  After 52 successful mounts and ls results, the
following shows up in /var/log/messages,

Jan 30 19:42:01 beowulf automount[23509]: >> nfs bindresvport: Address
already in use
Jan 30 19:42:01 beowulf automount[23509]: mount(nfs): nfs: mount
failure node53:/usr1 on /data/node53
Jan 30 19:42:01 beowulf automount[23509]: failed to mount /data/node53
... (for the remaining nodes)

Expected Results:  The ls output for 290 files.

Additional info:

Comment 1 Jeff Moyer 2005-01-31 13:41:45 UTC

This is a known issue.  See bug #128966.  The last two attachments
(patches) should help with the problem:

https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=105300&action=view
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=107079&action=view

The first one should already be applied to 4.1.3-28.  Please apply the
second patch and let me know if this resolves the issue for you.

Please also note that mount now does some amount of probing to detect
the versions of NFS the server supports.  This also uses up reserved
ports, so fixing automount is only part of the solution.

Comment 2 Stuart Anderson 2005-01-31 17:43:56 UTC

I tried autofs-4.1.4_beta1 but that did not help.
Is that a sufficient test?

Both of these patches are to automount, yet you indicate there is
an additional problem with NFS probing to detect versions. Is there
a patch to fix that?

Comment 3 Jeff Moyer 2005-01-31 17:50:00 UTC

4.1.4 beta1 should have these patches included.  When you say it did
not help, do you mean that your results did not change at all?  I
would find that unusual.

There are no patches available for the mount command that I know of. 
You could try an older version of util-linux.  2.11y would be fine,
for sure.

Comment 4 Stuart Anderson 2005-01-31 18:19:53 UTC

No change. It mounts the first 52 and then fails.

# md5sum autofs-4.1.4_beta1.tar.bz2 
29562f435b219e328d0f2736b91ab5cc  autofs-4.1.4_beta1.tar.bz2

# /usr/sbin/automount --version
Linux automount version 4.1.4_beta1

Jan 31 10:13:37 beowulf automount[27036]: >> nfs bindresvport: Address
already in use
Jan 31 10:13:37 beowulf automount[27036]: mount(nfs): nfs: mount
failure node53:/usr1 on /data/node53
Jan 31 10:13:37 beowulf automount[27036]: failed to mount /data/node53

According to netstat it still allocated port 800 for node1 and then
started working its way down until node52 had port 720.
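The netstat observation above can be checked mechanically. The following is a minimal sketch and not part of the original report: the function name count_reserved_ports is hypothetical, and it assumes `netstat -tan`-style input in which the local address is the fourth column. It treats any local port below 1024 as reserved.

```shell
# Count sockets whose local port is below 1024 in netstat-style input.
# Reads from stdin so it can be fed live output or a saved attachment.
# Assumption: column 4 holds the local address as "addr:port".
count_reserved_ports() {
    awk '$1 ~ /^(tcp|udp)/ {
             n = split($4, parts, ":")   # port is the last ":"-separated piece
             port = parts[n] + 0
             if (n > 1 && port > 0 && port < 1024) count++
         }
         END { print count + 0 }'
}

# Hypothetical usage against a live system:
#   netstat -tan | count_reserved_ports
```

Running this before and after a burst of mounts would show how many reserved ports the burst left in use.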

Here is how I installed it on the FC3 client machine:

# tar xjvf autofs-4.1.4_beta1.tar.bz2 
# cd autofs-4.1.4_beta1
# ./configure
# make
# /etc/init.d/autofs stop
# mv /usr/sbin/automount /usr/sbin/automount.orig
# cp daemon/automount /usr/sbin
# mv /usr/lib/autofs /usr/lib/autofs.orig
# mkdir /usr/lib/autofs
# install -c lookup_yp.so lookup_file.so lookup_program.so
lookup_userhome.so lookup_multi.so parse_sun.so mount_generic.so
mount_nfs.so mount_afs.so mount_autofs.so mount_changer.so
mount_bind.so mount_ext2.so lookup_hesiod.so parse_hesiod.so
lookup_nisplus.so lookup_ldap.so -m 755 /usr/lib/autofs
# rm -f /usr/lib/autofs/mount_smbfs.so
# ln -fs mount_ext2.so /usr/lib/autofs/mount_ext3.so
# /etc/init.d/autofs start

I did not change anything on the NFS server machines.

Comment 5 Stuart Anderson 2005-01-31 19:09:38 UTC

Perhaps automount needs the equivalent of the one-line fix
in bug 141773 to reduce the number of temporary TCP connections
during the mount operation?

Comment 6 Jeff Moyer 2005-01-31 19:26:36 UTC

Autofs already starts with UDP and falls back to TCP if there are no
responses from the UDP probe.

Please post your auto.master and related files.

Comment 7 Stuart Anderson 2005-01-31 19:34:29 UTC

Created attachment 110450
/etc/auto.master

Comment 8 Stuart Anderson 2005-01-31 19:35:11 UTC

Created attachment 110451
/etc/auto.data

Comment 9 Stuart Anderson 2005-01-31 19:36:01 UTC

Created attachment 110452
Output of netstat before automounting

Comment 10 Stuart Anderson 2005-01-31 19:37:20 UTC

Created attachment 110453
Output of netstat after automount mounted 55 out of a requested 60 filesystems

Comment 11 Jeff Moyer 2005-04-11 21:24:38 UTC

This should be resolved in the current release.  I'm closing this bug.  Please
feel free to re-open if you are still experiencing problems.  (my tests allowed
me to mount up to 800 file systems very quickly from a single NFS server).

Comment 12 Stuart Anderson 2005-04-11 22:58:50 UTC

Please define "current release". I would like to upgrade whatever subsystems are
needed and test this fix. Thanks.

Comment 13 Jeff Moyer 2005-04-12 13:55:00 UTC

Sorry, I should have double checked this.  The new version has not yet been
pushed out.  It will be 4.1.3-114, I believe.  I'll re-open the bug, and allow
you to close it when your environment proves stable.

Thanks.

Comment 14 Jeff Moyer 2005-04-12 14:00:43 UTC

You could try the testing package and provide feedback.  It can be found here:

http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/3/i386/autofs-4.1.3-114.i386.rpm

Comment 15 Stuart Anderson 2005-04-12 17:17:41 UTC

It still fails for me.

# rpm -qf /usr/sbin/automount
autofs-4.1.3-114

# df /oink
Filesystem           1K-blocks      Used Available Use% Mounted on
automount(pid23862)          0         0         0   -  /oink

# lsof /usr/sbin/automount
COMMAND     PID USER  FD   TYPE DEVICE  SIZE   NODE NAME
automount 23862 root txt    REG    3,2 36508 519226 /usr/sbin/automount

(run script to /bin/ls 290 automount entries in /oink)

First 30 work and then,
ls: /oink/node31: No such file or directory
ls: /oink/node32: No such file or directory
ls: /oink/node33: No such file or directory

Looking at /var/log/messages,

automount[24473]: mount(nfs): nfs: mount failure node31:/usr1 on /oink/node31
automount[24473]: failed to mount /oink/node31
automount[24475]: >> nfs bindresvport: Address already in use
automount[24475]: mount(nfs): nfs: mount failure node31:/usr1 on /oink/node31
automount[24475]: failed to mount /oink/node31
automount[24478]: >> nfs bindresvport: Address already in use
automount[24478]: mount(nfs): nfs: mount failure node32:/usr1 on /oink/node32
automount[24478]: failed to mount /oink/node32
automount[24480]: >> nfs bindresvport: Address already in use
automount[24480]: mount(nfs): nfs: mount failure node32:/usr1 on /oink/node32
automount[24480]: failed to mount /oink/node32
automount[24483]: >> nfs bindresvport: Address already in use

...

Are there any other changes such as nfs-utils needed in addition to
autofs-4.1.3-114?
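Because each mount point can fail more than once in the log above (node31 and node32 each appear twice), a count of distinct failures is more useful than a raw line count. The following is a rough sketch under the assumption that failures are logged as "automount[PID]: failed to mount PATH", as in the excerpts in this report; the function name failed_mount_points is hypothetical.

```shell
# Print each distinct mount point that automount reported as failed.
# Reads syslog text on stdin; lines that do not match are ignored.
failed_mount_points() {
    sed -n 's/.*automount\[[0-9]*]: failed to mount \(.*\)$/\1/p' |
        LC_ALL=C sort -u
}

# Hypothetical usage:
#   failed_mount_points < /var/log/messages | wc -l
```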

Comment 16 Jeff Moyer 2005-04-12 18:25:35 UTC

util-linux had a bug which caused it to eat up reserved ports.  This was
resolved in: util-linux-2.12a-19.  Please give that a try and let us know if
that addresses your mount issues.  Thanks.

Comment 17 Stuart Anderson 2005-04-12 19:39:33 UTC

I upgraded to util-linux-2.12a-23 on the client side only (as was the case
for autofs; if I need to update anything on the NFS servers please let me know),
and I now typically get 205-260 mounts before the bindresvport error.
Note that in my case, each mount request is to a distinct server.

Comment 18 Jeff Moyer 2005-04-20 00:07:37 UTC

I don't think it should matter whether you're mounting from one server or
multiple.  I'll have to double check that, though.

This is very strange.  Autofs should not be doing any probing of the NFS servers
in your case, since you are not using replicated servers.

Could you please take a look at the section on "Filing bug reports" on my people
page?  It has steps to gather debug information.  You've provided most of the
info, but I'd like to get the debug logs from autofs.  Reproduce the issue, and
then attach /var/log/debug to this bugzilla.  Also, could you attach the
/etc/exports from the NFS server?

Thanks.

Comment 19 Jeff Moyer 2005-04-20 00:14:07 UTC

I just realized I didn't include a URL for my people page.  You can find it here:

  http://people.redhat.com/jmoyer/

Thanks.

Comment 20 Stuart Anderson 2005-04-20 03:00:29 UTC

# rpm -q autofs
autofs-4.1.3-114

# uname -r
2.6.11-1.14_FC3smp

# cat /etc/auto.master
#
# $Id: auto.master,v 1.3 2003/09/29 08:22:35 raven Exp $
#
# Sample auto.master file
# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# For details of the format look at autofs(5).
#/misc  /etc/auto.misc --timeout=60
#/misc  /etc/auto.misc
#/net   /etc/auto.net
# auto.master/data
/data   /etc/auto.data  --timeout 60
/imports /etc/auto.imports

# cat /etc/auto.data
ide1    datacache1:/raid0
ide2    datacache1:/raid1
ide3    datacache2:/raid0
ide4    datacache2:/raid1
ide5    datacache3:/raid0
ide6    datacache3:/raid1
ide7    datacache4:/raid0
ide8    datacache4:/raid1
ide9    datacache5:/raid0
ide10   datacache5:/raid1
ide11   datacache6:/raid0
ide12   datacache6:/raid1
#ide13  datacache7:/raid0
#ide14  datacache7:/raid1
ide15   datacache8:/raid0
ide16   datacache8:/raid1
ide17   datacache9:/raid0
#ide18  datacache9:/raid1
node1   node1:/usr1
node2   node2:/usr1
node3   node3:/usr1

...

node288 node288:/usr1
node289 node289:/usr1
node290 node290:/usr1

Steps to reproduce:
1) verify autofs is controlling the mount point,

# df /data
Filesystem           1K-blocks      Used Available Use% Mounted on
automount(pid4073)           0         0         0   -  /data

2) list all the mount points to trigger an automount
# cd /data
# cat /etc/auto.data | awk '{print $1}' | xargs ls

(lots of output but not all 290 filesystems are mounted)

3) Count the number of mount points (should be 290).
# df | grep node | wc -l
251


4) Add syslog/autofs logging and repeat step 2)

This resulted in different behavior. Rather than getting ~200 mounts and
then the bindresvport errors, the xargs ls command blocked after mounting
just 7 nodes. Automounting is now stuck: I was able to umount /data/node*
but I am unable to trigger any new automount requests,

# ls /data/node1
(just hangs)

There is no additional output in /var/log/debug, see attached file,
however, there are now 2 automount processes and one of them
appears to be stuck in a futex(),

# ps -ef | grep auto
root      7418     1  0 19:19 pts/2    00:00:00 /usr/sbin/automount --timeout=60
/imports file /etc/auto.imports
root      7365     1  0 19:19 pts/2    00:00:00 /usr/sbin/automount --timeout=60
--debug /data file /etc/auto.data
root      7852  5193  0 19:34 pts/0    00:00:00 grep auto
# strace -p 7365
Process 7365 attached - interrupt to quit
futex(0xb7fcd28c, FUTEX_WAIT, 2, NULL


The auto.data instance of automount would not shut down,

# /etc/init.d/autofs stop
Stopping automount:umount2: Device or resource busy
umount: /data: device is busy
                                                           [FAILED]
however,

# ps -ef | grep auto
root      7365     1  0 19:19 pts/2    00:00:00 /usr/sbin/automount --timeout=60
--debug /data file /etc/auto.data

This process would not respond to SIGTERM, but it did exit with SIGKILL.


I restarted autofs with debugging and it got stuck again in futex() after
48 mounts,

# strace -p 8987
Process 8987 attached - interrupt to quit
futex(0xb7fcd28c, FUTEX_WAIT, 2, NULL



Turning off debugging and restarting syslog/autofs results in 267 rapid
automounts before getting the usual,

ls: node268: No such file or directory
ls: node269: No such file or directory

(syslog)
automount[10947]: >> nfs bindresvport: Address already in use
automount[10947]: mount(nfs): nfs: mount failure node268:/usr1 on /data/node268

...

Comment 21 Stuart Anderson 2005-04-20 03:04:02 UTC

Created attachment 113382
syslog output from running automount with --debug

The automount process was found to be stuck in futex() after
mounting node7:/usr1 as /data/node7.

Comment 22 Stuart Anderson 2005-04-20 03:05:44 UTC

Created attachment 113383
/etc/exports from one of the NFS servers

Comment 23 Jeff Moyer 2005-04-20 12:30:53 UTC

Okay, the new problem you are running into is autofs calling syslog in a signal
handler.  That is a no-no.  I have a separate bug tracking that issue.  Turning
on debug often exacerbates the situation.

Based on the information that did make it to the logs, it appears that autofs is
not issuing any RPC pings to the NFS server.  This leads me to believe that the
issue is still with mount.

So, the next step is to try this without automount.  You can run this little
(untested) bash script to mount everything in /etc/auto.data:

cat /etc/auto.data | while read mntpt server; do
    mkdir -p "/tmp/mnts/$mntpt"
    mount -t nfs "$server" "/tmp/mnts/$mntpt"
done

This will unmount everything:

cat /etc/auto.data | while read mntpt server; do
    umount "/tmp/mnts/$mntpt"
    rmdir "/tmp/mnts/$mntpt"
done
rmdir /tmp/mnts
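One caveat with the loops above: they read every line of the map, including commented-out entries such as "#ide13" in the /etc/auto.data shown earlier. The following is a small sketch of a filter that could sit in front of such a loop. The function name map_entries is hypothetical, and it assumes the plain two-column "key location" layout used in this report; entries with mount options or multiple locations are not handled.

```shell
# Print "key location" pairs from a simple autofs file map read on stdin,
# skipping blank lines and entries whose key starts with "#".
# Assumption: exactly the two-column layout, with no mount options.
map_entries() {
    awk 'NF >= 2 && $1 !~ /^#/ { print $1, $2 }'
}

# Hypothetical usage (run as root), mirroring the mount loop above:
#   map_entries < /etc/auto.data | while read -r mntpt server; do
#       mkdir -p "/tmp/mnts/$mntpt" && mount -t nfs "$server" "/tmp/mnts/$mntpt"
#   done
```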

Thanks for the quick turnaround on testing.  We'll get to the bottom of this.  =)

Comment 24 Stuart Anderson 2005-04-20 16:23:08 UTC

Many thanks for your help. I have also been tracking the corresponding static
mount problem, https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=141773

and I just reproduced it again today. In particular, after adding the
290 mount points to /etc/fstab and then running grep | xargs mount
I am able to get 237 successful mounts before finding,
nfs bindresvport: Address already in use
nfs bindresvport: Address already in use
nfs bindresvport: Address already in use
...
for the remainder of the 290 requests.

$ rpm -qf /bin/mount
util-linux-2.12a-23

Comment 25 Jeff Moyer 2005-04-20 16:28:14 UTC

OK.  If you agree with me that the number remains consistent both when mounting
is done by autofs and by hand, then please close this bug (and be sure to add
yourself to the CC list of 141773).

Note that I'm working with the Red Hat nfs maintainer to get this resolved.

Thanks.

Comment 26 Stuart Anderson 2005-04-20 16:46:09 UTC

I am quite new to the usage policy on Bugzilla. My concern is that 141773
was opened against RHEL whereas I am working on FC3, and some of the postings
there seem to indicate that some people think it has been fixed, and perhaps
it has for RHEL. I have posted my FC3 failures at 141773 but I have not yet seen
an acknowledgement that that group is going to pick it up.

I still have the problem that I cannot automount which is what this PR is
about, so it seems strange to me to close this before automounting works.
I admit I don't understand the relationship between RPC, static and automounting.

Comment 27 Jeff Moyer 2005-04-20 17:28:04 UTC

OK, I didn't realize that bug was for RHEL 3.  I will clone it for FC3.

Here's the deal with automount vs mount:

Automount takes care of determining what to mount and when.  It also provides
for automatically unmounting when there are no users.  To do the actual
mounting, it hands off to mount.

The automount daemon tries to be smart about what it attempts to mount.  In the
case of a replicated server entry (this means that you can mount a single
directory from any one of 2 or more servers), the daemon issues an RPC ping (a
null procedure call) to each of the NFS servers listed.  It then sorts the
servers using an algorithm that takes the ping time into account.
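As an illustration only, ordering candidates by response time could look like the sketch below. The actual autofs selection algorithm is not described in this report, so this shows nothing more than a plain sort on ping time. The function name rank_by_ping and the "server milliseconds" input format are assumptions.

```shell
# Rank candidate servers by measured round-trip time, fastest first.
# Input on stdin: one "server milliseconds" pair per line (assumed format).
rank_by_ping() {
    LC_ALL=C sort -k2,2n | awk '{ print $1 }'
}

# Hypothetical usage with made-up timings:
#   printf '%s\n' 'datacache1 12.5' 'datacache2 3.1' | rank_by_ping
```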

When doing this probing, the code used to use up more reserved ports than was
necessary.  That is why this particular bugzilla was filed.  The code has since
been modified to not use reserved ports for operations which don't require it.

The problem you are seeing is that the mount command has the same bits of
polling logic.  In the case of mount, it does the polling to determine whether
the host is reachable, and what protocols and versions of NFS are supported. 
It is in this code that I believe we are using reserved ports where we could get
away without them.

So, there are two bugs, in two different packages that needed resolving.  The
bug in the autofs code has been fixed.  This is precisely why I want to close
this bugzilla.  The bug for mount is still there, and being worked on.

I understand that your problem is with automounting (after all, who mounts a
slew of directories quite quickly without it?).  However, the real bug lies with
the mount program in util-linux.

So, here is what I am going to do:

- clone bug number 141773 and file it against FC3.
- Cut and paste some of the relevant comments from here into that bugzilla
- Add you to the CC list of the cloned bug.
- close this bug

I'm sorry if this process seems confusing.  But, I assure you, we won't let this
slip through the cracks.

Thanks!

Jeff

Comment 28 Jeff Moyer 2005-04-20 17:38:01 UTC

bz #155470 created to track the issue with mount.

I'm closing this bug, as the autofs portions have been addressed.
