From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7) Gecko/20040626 Firefox/0.8

Description of problem:
I'm having to check on an issue with one of our server farms (1200+ machines) that NFS-mount from a central NFS box (Linux). The problem stems from a case where the directory is not mounted even though there would not appear to be any reason it shouldn't have mounted, and indeed successive attempts will succeed in mounting the directory. (NOTE: I'm ignoring the 800-max-mounts issue in this scenario; that's a different problem entirely.) I've attached the packet trace that shows the error codes being passed from the server to the client, which indicates this is a server-side issue.

Version-Release number of selected component (if applicable): nfs-utils-1.0.6-22

How reproducible: Always

Steps to Reproduce:
(NOTE: we're using autofs to generate the loading, but the actual work is being done by rpc.mountd/mount)
1. Set up the NFS master to export an FS, e.g. /test
2. On the NFS master, create 1024 directories/files in the exported FS using this piece of shell:
   for n in `seq 1 1024`; do mkdir $n ; touch $n/$n; done
3. Make a mount point on the client box for the NFS mounts, e.g. /test
4. Also on the client side, add this line (or similar) to /etc/auto.master:
   /test /etc/auto.test tcp --ghost --verbose intr,rsize=8192,wsize=8192
5. On the client (replace <IP> with the server IP):
   for n in `seq 1 1024`; do echo "$n <IP>:/export_fs/$n" >> /etc/auto.test ; done
6. Ensure that NFS services are running on the master box, then issue:
   for n in `seq 1 1024`; do ls /test/$n ; done
7. The autofs daemon will quickly service the requests and generate the corresponding mount commands.
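For reference, steps 2 and 5 above can be sketched as a single script. This version writes to temporary locations so it can be dry-run without root or an NFS server, and keeps <IP> as a placeholder exactly as in the report:

```shell
#!/bin/sh
# Dry-run sketch of the reproduction setup. Temporary paths stand in for
# the exported FS and for /etc/auto.test; <IP> stays a placeholder.
exportfs=$(mktemp -d)     # stands in for the exported FS on the server
map=$(mktemp)             # stands in for /etc/auto.test on the client

# Step 2: create 1024 directories, each containing one file
for n in $(seq 1 1024); do
    mkdir "$exportfs/$n"
    touch "$exportfs/$n/$n"
done

# Step 5: generate the automount map entries
for n in $(seq 1 1024); do
    echo "$n <IP>:/export_fs/$n" >> "$map"
done

echo "map entries: $(wc -l < "$map")"
```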
Rapidly you will see the first 100+ mounts being serviced; after that, however, you'll see failures which, if left to continue, will start to succeed and then alternate between failing and succeeding again.

Actual Results: After 115+ mounts it starts failing.

Expected Results: It should keep mounting happily until ~800 mounts, when a different issue crops up.

Additional info: With Ethereal attached and filtering, I see the mounts succeeding fine, but when the mounts FAIL, the Ethereal trace shows that the NFS server passes back an ERR_ACCESS failure. The relevant trace piece is appended.
Created attachment 102348 [details] Text trace output of the packet dump
This is a packet dump of the NFS failure. The packets of note are packets 1 and 5.
Exact kernels in use?
Varies... I get the same result with RHEL3 2.4.21-15.0.3.ELsmp, but in this particular test it was with FC2 2.6.6-1.435.2.3smp. Cheers, Phil =--=
If autofs is removed from the picture, do things still break? Since these are insecure mounts, it's probably not a port-space issue, so it's either a mount problem (which is nfs-utils) or an autofs problem (since it's the one calling mount).
OK... a quick bit of editing and I have a shell script to mount 1024 FS's... (cough) Anyway, running that seems to have worked up until I ran out of low ports:

[root@localhost test]# sh /tmp/mountfs
<some errors..>
[root@localhost test]# df | grep -c 192.168.1.100
796
mount: wrong fs type, bad option, bad superblock on 192.168.1.100:/compile/nfs/1024, or too many mounted file systems

(ahem, really... innocent whistle) So it would seem happy enough to mount 800-ish FS's before we hit the ports issue. This now looks like an autofs issue. Whoops. Apologies for tarring and feathering you, Steve. Though quite why the *server* side in the packet dump spits out ERR_ACCESS I don't know... maybe the mount request gets mangled, but packet 1 seems to be correct in the trace. Phil =--=
Shell script to generate the shell to mount 1024 directories (replace <IP> with the NFS server IP, and edit to suit):

for n in `seq 1 1024`; do echo "mount <IP>:/exportfs/$n /test/$n" >> /tmp/mountfs ; done

then just 'sh /tmp/mountfs'. Phil =--=
Oh, as it's relevant: the autofs version being used is autofs-4.1.3-4. Phil =--=
Actually I saw there was a 4.1.3-9 and I tried it as well, same results Phil =--=
Is there any relevant information in the logs? Please attach if so. If not, re-run your tests using the debug option to autofs. You can specify this by editing /etc/sysconfig/autofs and adding --debug to the DAEMONOPTIONS variable. Please ensure you have a syslog entry which captures debug information.
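For concreteness, enabling the debug option can look like the sketch below. This is an illustrative fragment to be run as root on the client, not a tested recipe: the timeout value is carried over from the earlier comments, and the /var/log/autofs-debug destination is an arbitrary example, not something from this report.

```shell
# Sketch only: run as root on the client.
# Add --debug to the options the autofs init script passes to automount
echo 'DAEMONOPTIONS="--timeout=60 --debug"' >> /etc/sysconfig/autofs

# Make sure syslog captures daemon debug messages somewhere
# (/var/log/autofs-debug is an arbitrary example destination)
echo 'daemon.debug    /var/log/autofs-debug' >> /etc/syslog.conf

# Restart both services so the changes take effect
service syslog restart
service autofs restart
```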
OK, re-run with --debug. I've added it as an attachment (see attachments). Phil =--=
Created attachment 102416 [details] syslog output for autofs run with --debug
Oh... not all the output is 100% the same. I also found this in the log:

Aug 4 12:31:56 localhost automount[4686]: attempting to mount entry /test/164
Aug 4 12:31:56 localhost automount[4686]: attempting to mount entry /test/165
Aug 4 12:31:56 localhost automount[5252]: >> nfs bindresvport: Address already in use
Aug 4 12:31:56 localhost automount[5252]: mount(nfs): nfs: mount failure 192.168.1.100:/compile/nfs/165 on /test/165
Aug 4 12:31:56 localhost automount[5252]: failed to mount /test/165

noting the nfs bindresvport error. Phil =--=
Could you please compress your log files and attach them in their entirety. You have not included the bits that I am interested in seeing.
Sorry if it sounds stupid, but what are you expecting to see? I don't see anything especially new.

Client: syslogd.conf: *.* /var/log/all

root 4006 0.0 0.0 1676 840 ? S 13:23 0:00 /usr/sbin/automount --timeout=60 --ghost --verbose /test file /etc/auto.test tcp intr,rsize=8192,wsize=8192 --debug

Server: syslogd.conf: *.* /var/log/all

(attaching logs) Phil =--=
Created attachment 102479 [details] syslog output for the client
Created attachment 102480 [details] syslog output from the server side
Was this in line with what you're expecting to see? Phil =--=
Jeff, I'm still seeing this behaviour in 2.4.21-18EL (EL3 beta). Unfortunately this area isn't my strong point, but I'm gonna make a guess. From the logs it looks like autofs is busy mounting FS's and is happy; then *ANOTHER* autofs process seems to get fired off and tries to bind to the same port, at which point bad things happen. Quite what prompts autofs to start a new instance of itself I don't know, though. Phil =--=
Autofs does this for every mount. From the code:

info("attempting to mount entry %s", buf);
sigprocmask(SIG_BLOCK, &lock_sigs, &oldsig);
f = fork();

The child later goes on to fork/exec mount. The daemon is written this way so that it can process multiple mount requests simultaneously. Ah ha! The problem appears to be in the rpc_subs.c file. We do an rpc_ping to test a host's availability. Unfortunately, the socket used for this test is never closed! I'll let you know when I have a patch that is ready for testing. Thanks for your help in diagnosing this! -Jeff
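In shell terms, the fork-per-mount behaviour described above looks roughly like this sketch. The mount command is only echoed, never executed, <IP> stays a placeholder, and the loop is cut to three entries for brevity:

```shell
#!/bin/sh
# Rough shell analogue of automount's behaviour: each mount request is
# handed to a forked child so several can be serviced at once.
outdir=$(mktemp -d)
for n in 1 2 3; do
    (
        # child process, like the result of automount's fork();
        # it only records the mount command it would run
        echo "mount <IP>:/export_fs/$n /test/$n" > "$outdir/$n"
    ) &
done
wait    # parent reaps the children, as the daemon does
cat "$outdir"/1 "$outdir"/2 "$outdir"/3
```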
Created attachment 102851 [details] Fix an rpc socket leak Can you try out this patch? Or do you need me to roll another rpm for you? If you need me to roll an rpm, what distribution would you like to test on? -Jeff
Nah, that's OK 8) I used to work for RHAT in the engineering dept making rpms all the time 8) I'll have this testing in about 10 minutes when I get the rig set up again. Cheers. Phil - aka Bryce, formerly bryce; I'm probably still blamed for something. =--=
Aug 18 18:59:12 localhost automount[8746]: attempting to mount entry /test/947
Aug 18 18:59:12 localhost automount[9050]: >> nfs bindresvport: Address already in use
Aug 18 18:59:12 localhost automount[9050]: mount(nfs): nfs: mount failure 192.168.1.100:/compile/nfs/947 on /test/947

SO CLOSE! It's doing DOUBLE the number of NFS mounts it used to before it bailed (230-ish vs 115), so you're on the right track. Cheers, Phil =--=
Created attachment 105296 [details] patch to fix remaining holes in autofs socket leaks

Jeff, I hope you don't mind, but I have an issue-tracker ticket that's waiting on this same bugzilla, so I took a look at your patch and the code to see if anything was missing. I found a few things:

1) Creating sockets for UDP uses RPC_ANYSOCK, which for some reason makes the C library use bindresvport to get a port to bind to. I was concerned that, since we only have 1024 reserved ports and there were so many automounts going on in this issue, we might have a problem running out of ports. I figured that since create_tcp_client didn't need to use a reserved port, we didn't need to do so in create_udp_client either, so I changed the code to match what create_tcp_client does, by creating and binding its own socket. This is also helpful, as it avoids the double socket-descriptor close that occurs if you call clnt_destroy and close on a client for which the C library created the socket for you (as is the case when using RPC_ANYSOCK).

2) You added the extra close into portmap_getport, which closed that leak, but rpc_ping_proto suffered from the same leak condition, so I added a close there.

Of course, all your changes are maintained in this patch, and in addition I did a small cleanup on create_tcp_client so that it didn't need the local fd variable and instead used the one already in the info pointer.

Disclaimer: I haven't tested this yet. It builds cleanly without any warnings, and I intend to test it shortly, but since there are people waiting on this, I wanted to get it posted. If anyone out there has the time and inclination to do some field testing on it, that would be great. Thanks!
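The leak being closed here is the classic pattern where every operation opens a descriptor and nothing closes it. A minimal shell illustration, using plain files in place of RPC sockets, shows how the per-process descriptor count grows until the limit is hit:

```shell
#!/bin/sh
# Minimal illustration of a descriptor leak, using /dev/null opens in
# place of RPC sockets: each "rpc_ping" opens an fd, and without a
# matching close they pile up toward the process fd limit.
count_fds() { ls /proc/$$/fd | wc -l; }

before=$(count_fds)
exec 7</dev/null    # "rpc_ping" #1: opened, never closed
exec 8</dev/null    # "rpc_ping" #2: opened, never closed
after=$(count_fds)
echo "descriptors grew by $((after - before))"

exec 7<&- 8<&-      # the fix: close what you open
final=$(count_fds)
echo "after closing: $((final - before)) extra remain"
```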
Hi, Neil, Thanks for the help! Unfortunately, there is a newer version of the patch I had posted, which is much simpler. I do like the idea of not using reserved ports, and I will incorporate that. I'll attach the last patch I sent upstream here in a moment. -Jeff
Created attachment 105300 [details] Fix socket descriptor leak in rpc_subs.c
Created attachment 107079 [details] Patch to keep rpc_subs from using reserved ports.

Hi, Neil, I've finally come back to this. Your patch used the server address to bind a local socket. The idea is good, so I re-implemented it. I've tested this patch and it works for me. This patch should be applied in addition to the previous patch I posted. This will make the next set of update releases (for RHEL 3, this means update 5). -Jeff
Several fixes went into autofs to deal with this problem. It should be resolved in FC3 and beyond. Please re-open this bug if you can still trigger the problem. Thanks!
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-177.html