Bug 146629
Summary: | Failure to mount from multiple servers in a short period of time | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Stuart Anderson <anderson> |
Component: | autofs | Assignee: | Jeff Moyer <jmoyer> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 3 | CC: | axel.thimm, cfeist, ram, stevec |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2005-04-20 17:38:01 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Description
Stuart Anderson
2005-01-31 03:50:34 UTC
This is a known issue. See bug #128966. The last two attachments (patches) should help with the problem:

https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=105300&action=view
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=107079&action=view

The first one should already be applied to 4.1.3-28. Please apply the second patch and let me know if this resolves the issue for you. Please also note that mount now does some amount of probing to detect the versions of NFS the server supports. This also uses up reserved ports, so fixing automount is only part of the solution.

I tried autofs-4.1.4_beta1 but that did not help. Is that a sufficient test? Both of these patches are to automount, yet you indicate there is an additional problem with NFS probing to detect versions. Is there a patch to fix that?

4.1.4 beta1 should have these patches included. When you say it did not help, do you mean that your results did not change at all? I would find that unusual. There are no patches available for the mount command that I know of. You could try an older version of util-linux. 2.11y would be fine, for sure.

No change. It mounts the first 52 and then fails.

# md5sum autofs-4.1.4_beta1.tar.bz2
29562f435b219e328d0f2736b91ab5cc autofs-4.1.4_beta1.tar.bz2
# /usr/sbin/automount --version
Linux automount version 4.1.4_beta1

Jan 31 10:13:37 beowulf automount[27036]: >> nfs bindresvport: Address already in use
Jan 31 10:13:37 beowulf automount[27036]: mount(nfs): nfs: mount failure node53:/usr1 on /data/node53
Jan 31 10:13:37 beowulf automount[27036]: failed to mount /data/node53

According to netstat it still allocated port 800 for node1 and then started working its way down until node52 had port 720.

Here is how I installed it on the FC3 client machine:

# tar xjvf autofs-4.1.4_beta1.tar.bz2
# cd autofs-4.1.4_beta1
# ./configure
# make
# /etc/init.d/autofs stop
# mv /usr/sbin/automount /usr/sbin/automount.orig
# cp daemon/automount /usr/sbin
# mv /usr/lib/autofs /usr/lib/autofs.orig
# mkdir /usr/lib/autofs
# install -c lookup_yp.so lookup_file.so lookup_program.so lookup_userhome.so lookup_multi.so parse_sun.so mount_generic.so mount_nfs.so mount_afs.so mount_autofs.so mount_changer.so mount_bind.so mount_ext2.so lookup_hesiod.so parse_hesiod.so lookup_nisplus.so lookup_ldap.so -m 755 /usr/lib/autofs
# rm -f /usr/lib/autofs/mount_smbfs.so
# ln -fs mount_ext2.so /usr/lib/autofs/mount_ext3.so
# /etc/init.d/autofs start

I did not change anything on the NFS server machines. Perhaps automount needs the equivalent of the 1 line fix in bug 141773 to reduce the number of temporary TCP connections during the mount operation?

Autofs already starts with UDP and falls back to TCP if there are no responses from the UDP probe. Please post your auto.master and related files.

Created attachment 110450 [details]
/etc/auto.master
Created attachment 110451 [details]
/etc/auto.data
Created attachment 110452 [details]
Output of netstat before automounting
Created attachment 110453 [details]
Output of netstat after automount mounted 55 out of a requested 60 filesystems
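The netstat attachments above are the evidence for reserved ports being used up. A minimal sketch of how such output could be tallied follows. It assumes `netstat -tan`-style columns (local address in the fourth field) and treats ports below 1024 as reserved; the function name `count_reserved_ports` is hypothetical and is not part of autofs or util-linux.

```shell
# Count distinct local ports below 1024 in netstat-style output read from stdin.
# Assumption: field 4 is the local address in ADDR:PORT form.
count_reserved_ports() {
    awk '
        $1 ~ /^(tcp|udp)/ {
            n = split($4, parts, ":")      # the port is the last colon-separated piece
            port = parts[n] + 0            # non-numeric ports such as "*" become 0
            if (port > 0 && port < 1024) seen[port] = 1
        }
        END { c = 0; for (p in seen) c++; print c }
    '
}

# Possible usage on a client, before and after triggering the automounts:
#   netstat -tan | count_reserved_ports
```

Comparing the count before and after a batch of mounts gives a rough view of how many reserved ports each mount holds on to.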
This should be resolved in the current release. I'm closing this bug. Please feel free to re-open if you are still experiencing problems. (My tests allowed me to mount up to 800 file systems very quickly from a single NFS server.)

Please define "current release". I would like to upgrade whatever subsystems are needed and test this fix. Thanks.

Sorry, I should have double checked this. The new version has not yet been pushed out. It will be 4.1.3-114, I believe. I'll re-open the bug, and allow you to close it when your environment proves stable. Thanks.

You could try the testing package and provide feedback. It can be found here:
http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/3/i386/autofs-4.1.3-114.i386.rpm

It still fails for me.

# rpm -qf /usr/sbin/automount
autofs-4.1.3-114
# df /oink
Filesystem 1K-blocks Used Available Use% Mounted on
automount(pid23862) 0 0 0 - /oink
# lsof /usr/sbin/automount
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
automount 23862 root txt REG 3,2 36508 519226 /usr/sbin/automount

(run script to /bin/ls 290 automount entries in /oink)

The first 30 work and then:

ls: /oink/node31: No such file or directory
ls: /oink/node32: No such file or directory
ls: /oink/node33: No such file or directory

Looking at /var/log/messages:

automount[24473]: mount(nfs): nfs: mount failure node31:/usr1 on /oink/node31
automount[24473]: failed to mount /oink/node31
automount[24475]: >> nfs bindresvport: Address already in use
automount[24475]: mount(nfs): nfs: mount failure node31:/usr1 on /oink/node31
automount[24475]: failed to mount /oink/node31
automount[24478]: >> nfs bindresvport: Address already in use
automount[24478]: mount(nfs): nfs: mount failure node32:/usr1 on /oink/node32
automount[24478]: failed to mount /oink/node32
automount[24480]: >> nfs bindresvport: Address already in use
automount[24480]: mount(nfs): nfs: mount failure node32:/usr1 on /oink/node32
automount[24480]: failed to mount /oink/node32
automount[24483]: >> nfs bindresvport: Address already in use
...

Are there any other changes, such as nfs-utils, needed in addition to autofs-4.1.3-114?

util-linux had a bug which caused it to eat up reserved ports. This was resolved in util-linux-2.12a-19. Please give that a try and let us know if that addresses your mount issues. Thanks.

I upgraded to util-linux-2.12a-23 on the client side only (as was the case for autofs; if I need to update anything on the NFS servers please let me know), and I now typically get 205-260 mounts before the bindresvport error. Note that, in my case, each mount request is to a distinct server.

I don't think it should matter whether you're mounting from one server or multiple. I'll have to double check that, though.

This is very strange. Autofs should not be doing any probing of the NFS servers in your case, since you are not using replicated servers. Could you please take a look at the section on "Filing bug reports" on my people page? It has steps to gather debug information. You've provided most of the info, but I'd like to get the debug logs from autofs. Reproduce the issue, and then attach /var/log/debug to this bugzilla. Also, could you attach the /etc/exports from the NFS server? Thanks.

I just realized I didn't include a URL for my people page. You can find it here: http://people.redhat.com/jmoyer/ Thanks.

# rpm -q autofs
autofs-4.1.3-114
# uname -r
2.6.11-1.14_FC3smp
# cat /etc/auto.master
#
# $Id: auto.master,v 1.3 2003/09/29 08:22:35 raven Exp $
#
# Sample auto.master file
# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# For details of the format look at autofs(5).
#/misc /etc/auto.misc --timeout=60
#/misc /etc/auto.misc
#/net /etc/auto.net
# auto.master/data
/data /etc/auto.data --timeout 60
/imports /etc/auto.imports
# cat /etc/auto.data
ide1 datacache1:/raid0
ide2 datacache1:/raid1
ide3 datacache2:/raid0
ide4 datacache2:/raid1
ide5 datacache3:/raid0
ide6 datacache3:/raid1
ide7 datacache4:/raid0
ide8 datacache4:/raid1
ide9 datacache5:/raid0
ide10 datacache5:/raid1
ide11 datacache6:/raid0
ide12 datacache6:/raid1
#ide13 datacache7:/raid0
#ide14 datacache7:/raid1
ide15 datacache8:/raid0
ide16 datacache8:/raid1
ide17 datacache9:/raid0
#ide18 datacache9:/raid1
node1 node1:/usr1
node2 node2:/usr1
node3 node3:/usr1
...
node288 node288:/usr1
node289 node289:/usr1
node290 node290:/usr1

Steps to reproduce:

1) Verify autofs is controlling the mount point:
# df /data
Filesystem 1K-blocks Used Available Use% Mounted on
automount(pid4073) 0 0 0 - /data

2) List all the mount points to trigger an automount:
# cd /data
# cat /etc/auto.data | awk '{print $1}' | xargs ls
(lots of output but not all 290 filesystems are mounted)

3) Count the number of mount points (should be 290):
# df | grep node | wc -l
251

4) Add syslog/autofs logging and repeat step 2). This resulted in different behavior: rather than getting ~200 mounts and then the bindresvport errors, the xargs ls command blocked after mounting just 7 nodes.

Automounting is now stuck. I was able to umount /data/node* but I am unable to trigger any new automount requests:

# ls /data/node1
(just hangs)

There is no additional output in /var/log/debug (see attached file). However, there are now 2 automount processes and one of them appears to be stuck in a futex():

# ps -ef | grep auto
root 7418 1 0 19:19 pts/2 00:00:00 /usr/sbin/automount --timeout=60 /imports file /etc/auto.imports
root 7365 1 0 19:19 pts/2 00:00:00 /usr/sbin/automount --timeout=60 --debug /data file /etc/auto.data
root 7852 5193 0 19:34 pts/0 00:00:00 grep auto
# strace -p 7365
Process 7365 attached - interrupt to quit
futex(0xb7fcd28c, FUTEX_WAIT, 2, NULL

The auto.data instance of automount would not shut down:

# /etc/init.d/autofs stop
Stopping automount:umount2: Device or resource busy
umount: /data: device is busy [FAILED]

However:

# ps -ef | grep auto
root 7365 1 0 19:19 pts/2 00:00:00 /usr/sbin/automount --timeout=60 --debug /data file /etc/auto.data

This process would not respond to SIGTERM, but it did exit with SIGKILL. I restarted autofs with debugging and it got stuck again in futex() after 48 mounts:

# strace -p 8987
Process 8987 attached - interrupt to quit
futex(0xb7fcd28c, FUTEX_WAIT, 2, NULL

Turning off debugging and restarting syslog/autofs results in 267 rapid automounts before getting the usual:

ls: node268: No such file or directory
ls: node269: No such file or directory

(syslog)
automount[10947]: >> nfs bindresvport: Address already in use
automount[10947]: mount(nfs): nfs: mount failure node268:/usr1 on /data/node268
...

Created attachment 113382 [details]
syslog output from running automount with --debug
The automount process was found to be stuck in futex() after
mounting node7:/usr1 as /data/node7.
Created attachment 113383 [details]
/etc/exports from one of the NFS servers
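Step 3 of the reproduction only counts mounts, so it does not show which entries failed. The following is a sketch of one way to list the missing keys. It assumes two-column map entries like those in /etc/auto.data and a mount table in /proc/mounts format (mount point in the second field); `unmounted_keys` is a hypothetical helper name, not an autofs tool.

```shell
# Print every key in an autofs map that has no matching mount point.
# Usage: unmounted_keys MAP MOUNT_ROOT [MOUNT_TABLE]
# Assumption: MOUNT_TABLE defaults to /proc/mounts and lists the mount point in field 2.
unmounted_keys() {
    map=$1
    root=$2
    table=${3:-/proc/mounts}
    awk -v root="$root" '
        FILENAME == ARGV[1] { mounted[$2] = 1; next }   # first file: record mount points
        /^[ \t]*(#|$)/      { next }                    # map: skip comments and blank lines
        {
            mp = root "/" $1
            if (!(mp in mounted)) print $1              # key has no mount under MOUNT_ROOT
        }
    ' "$table" "$map"
}

# Possible usage after step 2:
#   unmounted_keys /etc/auto.data /data
```

Running it right after the `xargs ls` step would show whether the failures always start at the same key or move around between runs.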
Okay, the new problem you are running into is autofs doing a syslog in a signal handler. That is a no-no. I have a separate bug tracking that issue. Turning on debug often exacerbates the situation. Based on the information that did make it to the logs, it appears that autofs is not issuing any RPC pings to the NFS server. This leads me to believe that the issue is still with mount. So, the next step is to try this without automount. You can run this little (untested) bash script to mount everything in /etc/auto.data:

cat /etc/auto.data | (while read mntpt server; do mkdir -p "/tmp/mnts/$mntpt"; mount -t nfs "$server" "/tmp/mnts/$mntpt"; done)

This will unmount everything:

cat /etc/auto.data | (while read mntpt server; do umount "/tmp/mnts/$mntpt"; rmdir "/tmp/mnts/$mntpt"; done)
rmdir /tmp/mnts

Thanks for the quick turnaround on testing. We'll get to the bottom of this. =)

Many thanks for your help. I have also been tracking the corresponding static mount problem, https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=141773 and I just reproduced it again today. In particular, after adding the 290 mount points to /etc/fstab and then running grep | xargs mount, I am able to get 237 successful mounts before finding:

nfs bindresvport: Address already in use
nfs bindresvport: Address already in use
nfs bindresvport: Address already in use
...

for the remainder of the 290 requests.

$ rpm -qf /bin/mount
util-linux-2.12a-23

OK. If you agree with me that the number remains consistent both when mounting is done by autofs and by hand, then please close this bug (and be sure to add yourself to the CC list of 141773). Note that I'm working with the Red Hat NFS maintainer to get this resolved. Thanks.

I am quite new to the usage policy on Bugzilla. My concern is that 141773 was opened against RHEL whereas I am working on FC3, and some of the postings there seem to indicate that some people think it has been fixed, and perhaps it has for RHEL. I have posted my FC3 failures at 141773 but I have not yet seen an acknowledgement that that group is going to pick it up. I still have the problem that I cannot automount, which is what this PR is about, so it seems strange to me to close this before automounting works. I admit I don't understand the relationship between RPC, static mounting and automounting.

OK, I didn't realize that bug was for RHEL 3. I will clone it for FC3.

Here's the deal with automount vs mount: Automount takes care of determining what to mount and when. It also provides for automatically unmounting when there are no users. To do the actual mounting, it hands off to mount.

The automount daemon tries to be smart about what it attempts to mount. In the case of a replicated server entry (this means that you can mount a single directory from any one of 2 or more servers), the daemon issues an RPC ping (a null procedure call) to each of the NFS servers listed. It then sorts the servers using an algorithm that takes the ping time into account. When doing this probing, the code used to use up more reserved ports than was necessary. That is why this particular bugzilla was filed. The code has since been modified to not use reserved ports for operations which don't require them.

The problem you are seeing is that the mount command has the same bits of polling logic. In the case of mount, it does the polling to determine whether the host is reachable, and what protocols and versions of NFS are supported. It is in this code that I believe we are using reserved ports where we could get away without them.

So, there are two bugs, in two different packages, that needed resolving. The bug in the autofs code has been fixed. This is precisely why I want to close this bugzilla. The bug for mount is still there, and being worked on. I understand that your problem is with automounting (after all, who mounts a slew of directories quite quickly without it?). However, the real bug lies with the mount program in util-linux.

So, here is what I am going to do:
- Clone bug number 141773 and file it against FC3.
- Cut and paste some of the relevant comments from here into that bugzilla.
- Add you to the CC list of the cloned bug.
- Close this bug.

I'm sorry if this process seems confusing. But, I assure you, we won't let this slip through the cracks. Thanks! Jeff

bz #155470 created to track the issue with mount. I'm closing this bug, as the autofs portions have been addressed.
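The manual mount test suggested earlier in this thread reads every line of /etc/auto.data, including the commented-out entries shown in the map listing. Below is a hedged sketch of a dry-run variant that skips those lines and only prints the commands it would run. It assumes plain two-column entries (key, then server:/path) with no mount options; `print_mount_commands` and the default target directory are illustrative choices, not something taken from autofs or util-linux.

```shell
# Emit (but do not run) one mkdir/mount pair per live entry of an autofs map.
# Usage: print_mount_commands MAP [TARGET_DIR]
# Assumption: each live line is "key location" with no option field.
print_mount_commands() {
    map=$1
    target=${2:-/tmp/mnts}
    while read -r key location; do
        case $key in
            ''|'#'*) continue ;;   # skip blank lines and commented-out entries
        esac
        printf 'mkdir -p %s/%s && mount -t nfs %s %s/%s\n' \
            "$target" "$key" "$location" "$target" "$key"
    done < "$map"
}

# Possible usage: review the output first, then run it as root if it looks right.
#   print_mount_commands /etc/auto.data > /tmp/mount-all.sh
```

Printing the commands instead of executing them keeps the test reviewable, and the generated file can be rerun unchanged to compare how many mounts succeed each time.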