Description of problem:

We automount NFS directories in /data from various NFS servers using NFSv4. It appears that since adding NFSv4 capability (to /etc/exports) on the servers, this has become much less reliable. All machines are up-to-date FC3.

Ran the following test on a client:

#!/usr/bin/perl
while (1) {
    open(IN,"< /data/radar3/saber/SABER_SPECTRA_2002_090_130M.DAT")
        or print `date` . ": Failed to open\n";
    close(IN);
    sleep(70);
}

Errors in the client logs started at 00:49 and went to 04:13:

Apr 11 00:49:25 earth automount[7414]: failed to mount /data/radar3/saber
Apr 11 00:50:35 earth automount[7690]: failed to mount /data/radar3/saber
...
Apr 11 04:12:26 earth automount[17304]: failed to mount /data/radar3/saber
Apr 11 04:13:36 earth automount[17555]: failed to mount /data/radar3/saber

After that it fixed itself and was fine. I'm somewhat surprised to see "/data/radar3/saber" instead of "/data/radar3", since the latter is the mount point.

The server was pretty quiet at these times:

# sar
Linux 2.6.10-1.770_FC3smp (alexandria.cora.nwra.com)    04/11/2005

12:00:01 AM       CPU     %user     %nice   %system   %iowait     %idle
12:10:01 AM       all      0.01      0.00      0.02      0.07     99.91
12:20:01 AM       all      0.00      0.00      0.01      0.01     99.97
12:30:01 AM       all      0.01      0.00      0.01      0.01     99.97
12:40:02 AM       all      0.01      0.00      0.01      0.01     99.97
12:50:01 AM       all      0.01      0.00      0.01      0.01     99.97
01:00:01 AM       all      0.00      0.00      0.01      0.00     99.98
01:10:01 AM       all      0.10      0.00      0.06      0.15     99.69
01:20:01 AM       all      0.12      0.00      0.14      0.95     98.79
01:30:01 AM       all      0.05      0.00      0.14      0.06     99.76
01:40:02 AM       all      0.04      0.00      0.14      0.06     99.76
01:50:01 AM       all      0.05      0.00      0.13      0.06     99.76
02:00:01 AM       all      0.05      0.00      0.14      0.10     99.72
02:10:01 AM       all      0.05      0.00      0.14      0.05     99.76
02:20:02 AM       all      0.07      0.00      0.15      0.44     99.34
02:30:01 AM       all      0.05      0.00      0.16      0.08     99.71
02:40:01 AM       all      0.05      0.00      0.17      0.13     99.65
02:50:02 AM       all      0.05      0.00      0.16      0.07     99.72
03:00:02 AM       all      0.05      0.00      0.16      0.05     99.74
03:10:01 AM       all      0.05      0.00      0.16      0.06     99.74
03:20:01 AM       all      0.05      0.00      0.16      0.05     99.74
03:30:01 AM       all      0.10      0.00      0.18      0.34     99.38
03:40:01 AM       all      0.08      0.00      0.18      0.65     99.08

(at 01:00 it starts a series of rsyncs)

Now, I'm not sure, but this issue seems to have coincided with starting to use NFSv4 alongside NFSv3. /etc/exports on the server looks like:

/export             *.cora.nwra.com(rw,sync,fsid=0)
/export/amanda      *.cora.nwra.com(rw,sync,nohide,no_root_squash)
/export/data1       *.cora.nwra.com(rw,sync,nohide,no_root_squash)
/export/clouds2     *.cora.nwra.com(rw,sync,nohide)
/export/radar3      *.cora.nwra.com(rw,sync,nohide)
/export/soldyn2     *.cora.nwra.com(rw,sync,nohide)
/export/bigmeadow   *.cora.nwra.com(rw,sync,nohide)
/export/waves2      *.cora.nwra.com(rw,sync,nohide)

I added the first line and the nohide option when I added NFSv4 support. The automount map looks like:

soldyn2     -rsize=8192,wsize=8192   alexandria:/export/soldyn2
clouds2     -rsize=8192,wsize=8192   alexandria:/export/clouds2
waves2      -rsize=8192,wsize=8192   alexandria:/export/waves2
sw1         -rsize=8192,wsize=8192   alexandria:/export/data1
radar3      -rsize=8192,wsize=8192   alexandria:/export/radar3
bigmeadow   -rsize=8192,wsize=8192   alexandria:/export/bigmeadow

Version-Release number of selected component (if applicable):
nfs-utils-1.0.7-1 on the server (compiled on FC3 to fix some issues), but I have seen issues with the stock nfs-utils on other systems.
nfs-utils-1.0.6-52 on the client.

How reproducible:
Fairly.

Additional info:
Let me know what other debug info I can provide. This is really causing some major headaches for us.
Don't see mount requests on the server during the problem period:

Apr 11 00:48:15 alexandria rpc.mountd: authenticated mount request from earth.cora.nwra.com:1015 for /export/radar3 (/export/radar3)
Apr 11 00:49:25 alexandria rpc.mountd: authenticated unmount request from earth.cora.nwra.com:805 for /export/radar3 (/export/radar3)

Messages stop there until:

Apr 11 04:10:34 alexandria rpc.mountd: authenticated mount request from hawk.cora.nwra.com:680 for /export/data1 (/export/data1)
Apr 11 04:11:35 alexandria rpc.mountd: authenticated unmount request from hawk.cora.nwra.com:707 for /export/data1 (/export/data1)
Apr 11 04:14:46 alexandria rpc.mountd: authenticated mount request from earth.cora.nwra.com:998 for /export/radar3 (/export/radar3)
I think this may be an autofs issue:

Apr 11 00:49:25 earth automount[7414]: failed to mount /data/radar3/saber
Apr 11 00:50:35 earth automount[7690]: failed to mount /data/radar3/saber
...
Apr 11 04:12:26 earth automount[17304]: failed to mount /data/radar3/saber
Apr 11 04:13:36 earth automount[17555]: failed to mount /data/radar3/saber

I've installed the autofs update from updates/testing and am testing that to try to reproduce.
Please update the bug when you have updates. If the problem persists with the latest autofs, please do the following:

Modify /etc/syslog.conf to contain a line like so:

*.*     /var/log/debug

Add --debug to the end of the line in the auto.master map for the problematic mount point (example sketched below). Please also attach your auto.master and the corresponding maps.

HUP syslogd, and restart autofs. Make sure autofs actually restarts (it may not if any mounts are busy). Then, when you've reproduced the problem, please send me the contents of /var/log/debug. Thanks.
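A minimal sketch of an auto.master entry with the flag appended (the mount point, map file name, and timeout here are assumptions; only the trailing --debug is the point):

    # /etc/auto.master -- hypothetical entry
    /data   /etc/auto.data   --timeout=60 --debug

With syslogd HUP'd and autofs restarted, automount's debug messages for that mount point end up in /var/log/debug via the *.* catch-all above.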
I meant to say please attach the contents of /var/log/debug at the end of comment #3.
Created attachment 113019 [details]
debug log output from automount

Reproduced with autofs-4.1.3-114:

Mon Apr 11 17:20:49 MDT 2005 : Failed to open

I've attached the debug log for the minutes of 17:20-21, failure followed by success. Let me know if you need more. Looks like a race condition between expiring the mount and accessing it?
Created attachment 114032 [details]
Fix a race in the expire code

Would you be willing to test a kernel patch? Ian Kent recently released a patch that addressed a race in the expire code, and I think it may apply here. This patch will apply to the latest FC-3 kernel. If you need me to do a build for you, let me know.
I can test this out. Thanks.
Got this kernel installed on two test machines. We'll see how it goes.
Still see the error. Running with a patched version of 2.6.11-1.20_FC3 and autofs-4.1.4-8. Unfortunately it happened with a different directory that I wasn't running under debug. I'll change that and monitor:

May  5 09:59:36 hawk automount[31689]: failed to mount /home/dbraun/public_html

I don't know if it is a clue or not, but "/home/dbraun" is the real mount, though not the file being accessed.
Created attachment 114163 [details]
automount debug output

This is with the expire-race-patched kernel and autofs-4.1.4-8. At May 6 21:31:03 it starts with "failed to mount /home/meyer/public_html".
Let me know if you need any more debug output. Still seeing this fairly regularly.
Any progress on this? It's really causing me a lot of headaches. Basically it seems that once a mount attempt fails, automount simply cannot mount the directory again when the same path as the original failure is accessed. E.g., if a program keeps asking for /data/sw1/backup and it fails once, any future attempts to access /data/sw1/backup will fail. However, if you do "ls /data/sw1", it mounts fine.
Sorry, no progress to report. It may be helpful to bring this up on the autofs mailing list. The list address is autofs@linux.kernel.org. I can post the problem report for you, if you like (indeed, I am more than happy to). Please let me know if you would prefer to report the problem yourself. Thanks.
If you wouldn't mind posting it, I'd be grateful. I'll try to follow the discussion there and chip in if needed.
Okay, I've posted the problem report, and Ian Kent responded. He would like to see debug output from the kernel module. This will require recompiling the autofs4 module. You basically need to edit fs/autofs4/autofs_i.h to uncomment the line which defines DEBUG. I'll attach a patch for this. Then, the real issue is the sheer amount of logs that will be generated. It's enormous. Let me know if you are willing to give this a try. You'd have to reproduce the problem with the debug module, and attach the logs for the time period when the problem occurs.
Created attachment 114751 [details]
enable debugging in the autofs4 module
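The patch itself isn't quoted here, but the change being described probably amounts to a single line in fs/autofs4/autofs_i.h (the exact context varies by kernel version, so treat this as a sketch):

    /* fs/autofs4/autofs_i.h -- uncomment the DEBUG define to enable verbose debug output */
    #define DEBUG

After rebuilding and reloading the autofs4 module, the extra kernel messages land in the system log, which is where the very large log volume mentioned above comes from.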
I have one other question. When you're running your test script, are there other apps that may be accessing the automounted directories?
I'm building the new kernel now and hopefully can test tomorrow. I think I've got a pretty surefire way to reproduce (sketched as commands below):

- bring the NFS server down
- try to access a subdir, e.g. /data/sw1/fedora; autofs will complain
- bring the NFS server up
- try to access the subdir again; autofs seems to keep complaining until I access /data/sw1 first

As for my original test, there may be one or two odd accesses to the automounted directories, but they don't seem to correspond to the failures.
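Sketched as shell commands, assuming the NFS service on the server can simply be stopped and restarted (hostname and paths are the ones from this report; the service commands are illustrative):

    # on the server (alexandria)
    service nfs stop

    # on the client -- autofs complains
    ls /data/sw1/fedora

    # on the server
    service nfs start

    # on the client -- keeps failing...
    ls /data/sw1/fedora

    # ...until the top-level mount point is accessed first
    ls /data/sw1
    ls /data/sw1/fedora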
No luck reproducing so far. My expire race test script has run without error so far, but I will keep trying. Log volume isn't *too* bad :-). The server-down trick did not appear to work either, though that is a common failure mode I'm seeing here with my FC3 clients. I brought down one of our main disk servers recently, and folks couldn't access subdirectories in automounted directories until I first accessed the top-level mount directory. It *seemed* like trying to access the directory while the server was down left things in a bad state.
Still no luck reproducing with debugging on. Perhaps it prevents the race? I'm going to have to stop after today as I'm going on vacation.

I've got another question though. I'm having problems where one of my servers gets busy and some automount attempts fail:

May 27 05:21:06 cynosure automount[4026]: attempting to mount entry /data/sw1
May 27 05:21:48 cynosure automount[24502]: >> mount: RPC: Timed out
May 27 05:21:48 cynosure automount[24502]: mount(nfs): nfs: mount failure alexandria:/export/data1 on /data/sw1
May 27 05:21:48 cynosure automount[24502]: failed to mount /data/sw1
May 27 05:21:48 cynosure automount[24502]: umount_multi: no mounts found under /data/sw1

Is there any way I can up the RPC timeout or ask it to retry a couple of times? The man pages don't have any information about this if there is.
Thanks for all of the testing! I guess we'll have to track this bug down the old-fashioned way. =)

On to your other question. Unfortunately, there is no good way to increase the number of retries that the mount command will do. The "bg" mount option tells mount to go off and retry a number of times in the background, but this is definitely not something I would recommend when using autofs! If this is functionality that would be useful to you, please file a feature request. Basically, you want automount to retry indefinitely until a mount succeeds. I don't think it makes sense to just bump up the number of retries, since this sort of thing is supposed to be transparent anyway. Thanks.
Created attachment 114991 [details]
Fix for failed mounts

Could you please try the following patch, from the upstream maintainer? Here is his description:

---
This happens for browsable maps that request a mount at the same time as an expire is happening. My understanding of the lookup was slightly incorrect in that, when the directory exists, the revalidate succeeds when it should fail in order to ensure that lookup is called. Instead, lookup is called on the next path component, which should be within the mount. Unfortunately, the parent isn't mounted. This patch seems to fix the problem.
---

Is ghosting enabled in your automounter setup?
Ah, ghosting. My previous attempt to get things to fail did not have ghosting turned on, and we indeed started seeing more problems once I turned ghosting on.

I've just run a new test. The kernel has the expire-race patch, the comment 22 patch, and debugging turned on. Got it to fail fairly quickly:

Wed Jun 8 10:34:04 MDT 2005
drwxr-xr-x  13 root source 4096 Apr 19 14:42 /data/sw1/fedora
Wed Jun 8 10:35:11 MDT 2005
drwxr-xr-x  13 root source 4096 Apr 19 14:42 /data/sw1/fedora
Wed Jun 8 10:36:18 MDT 2005
drwxr-xr-x  13 root source 4096 Apr 19 14:42 /data/sw1/fedora
Wed Jun 8 10:37:27 MDT 2005
drwxr-xr-x  13 root source 4096 Apr 19 14:42 /data/sw1/fedora
Wed Jun 8 10:38:34 MDT 2005
ls: /data/sw1/fedora: No such file or directory
Wed Jun 8 10:40:43 MDT 2005
ls: /data/sw1/fedora: No such file or directory
Wed Jun 8 10:41:50 MDT 2005
ls: /data/sw1/fedora: No such file or directory
Wed Jun 8 10:42:57 MDT 2005
ls: /data/sw1/fedora: No such file or directory

The system is in such a state that access to /data/sw1/fedora will fail, but accessing /data/sw1 should restore things to rights. I'll attach debug logs next. They are big, and much bigger with ghosting on!
Created attachment 115232 [details]
debug output

From 10:37-10:39. Last successful access at 10:37:27. Failure occurs at 10:38:34.
Any progress on this? This is just killing us, causing all kinds of problems.
OK, re comment #20: you should be able to specify the mount option retry=n, where n is the number of minutes mount keeps retrying before giving up. Bump this up on the entries for the servers that time out (see the sketch below).

For the main problem, it may be that you're running into the negative dentry caching problems that have recently been discovered. Ian, do you have time to take a look at the logs posted in comment #24?
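As a sketch, the option would simply be appended to the map entry's option string; the entry below is one of the ones from the original report with an arbitrary retry value added:

    sw1   -rsize=8192,wsize=8192,retry=5   alexandria:/export/data1

autofs passes these options through to mount(8), so anything accepted there should work here as well.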
(In reply to comment #25)
> Any progress on this? This is just killing us, causing all kinds of problems.

Very interesting. I've had a look at the log and I think it demonstrates the problem very well. I need more time to digest this.

Ian
(In reply to comment #24)
> Created an attachment (id=115232)
> debug output
>
> From 10:37-10:39. Last successful access at 10:37:27. Failure occurs at
> 10:38:34.

Orion, can you give me a little bit of a description of what's seen in userspace during the time this happens (from about 10:39 - 10:40), please? If you can still remember.

Ian
Well, comment #23 really has it all. My test is basically:

while true
do
  date
  ls /data/sw1/fedora
  sleep 65
done

The sleep time is adjusted to provoke maximum overlap with the expire time of the mount. The output you see is the output from ls. That's what userspace sees. Let me know if you need other info.

FYI - at this point I've turned --ghost off for all of our mounts to avoid problems.
(In reply to comment #29)
>
> The sleep time is adjusted to provoke maximum overlap with the expire time of
> the mount. The output you see is the output from ls. That's what userspace
> sees. Let me know if you need other info.

OK, so no hang, just fail messages.

Did you also say that these mounts return to normal after a while?

Reading the call I got the impression you were running FC3, but I see it's FC4. Can you tell me what kernel version you currently have, please?
(In reply to comment #30)
> Did you also say that these mounts return to normal after a while?

Not on their own. If you keep trying to access the same file it will always error. However, you can fix the problem by accessing the base mount directory (and perhaps others) to mount the directory.

> Reading the call I got the impression you were running FC3, but I see
> it's FC4. Can you tell me what kernel version you currently have, please?

We're mostly up to date here, although, as I mentioned above, I've disabled --ghost everywhere to work around the problem. I could try enabling it on a few machines to test.
(In reply to comment #31)
> (In reply to comment #30)
> > Did you also say that these mounts return to normal after a while?
>
> Not on their own. If you keep trying to access the same file it will always
> error. However, you can fix the problem by accessing the base mount directory
> (and perhaps others) to mount the directory.

OK.

> > Reading the call I got the impression you were running FC3, but I see
> > it's FC4. Can you tell me what kernel version you currently have, please?
>
> We're mostly up to date here, although, as I mentioned above, I've disabled
> --ghost everywhere to work around the problem. I could try enabling it on a few
> machines to test.

Yes, I forgot: enabling ghosting makes the problem happen more quickly, right. But it still happens when it's disabled, correct?

And your kernel version would be 2.6.16-1.2108_FC4, yes?
(In reply to comment #32)
> Yes, I forgot: enabling ghosting makes the problem happen more
> quickly, right. But it still happens when it's disabled, correct?

Actually, I'm not sure the problem occurs at all without ghosting. I haven't had the problem occur in recent memory after turning it off.

> And your kernel version would be 2.6.16-1.2108_FC4, yes?

Well, it could be :-). We've got about 50 machines running various kernels, but they are generally up to date. Let me know what you'd like me to test and I could probably set it up. E.g.:

apapane:    2.6.16-1.2096_FC4smp
apollo:     2.6.16-1.2069_FC4smp
armstrong:  2.6.16-1.2096_FC4
aspen:      2.6.16-1.2107_FC4smp
coop00:     2.6.16-1.2069_FC4smp
coop01:     2.6.16-1.2108_FC4smp
cynosure:   2.6.16-1.2096_FC5
denali:     2.6.16-1.2069_FC4
earth:      2.6.13-1.1532_FC4
hammer:     2.6.16-1.2107_FC5xen0
hawk:       2.6.16-1.2108_FC4smp
hobbes:     2.6.16-1.2107_FC4smp
iago:       2.6.16-1.2107_FC4
irimi:      2.6.16-1.2111_FC5
kolea:      2.6.16-1.2107_FC4
laplata:    2.6.16-1.2107_FC4
lego:       2.6.16-1.2069_FC4
(In reply to comment #33)
> (In reply to comment #32)
> > Yes, I forgot: enabling ghosting makes the problem happen more
> > quickly, right. But it still happens when it's disabled, correct?
>
> Actually, I'm not sure the problem occurs at all without ghosting. I haven't had
> the problem occur in recent memory after turning it off.

OK. Let me know if you think it does. Looking at the kernel log I think it should, but I'll have to rethink that then.

> > And your kernel version would be 2.6.16-1.2108_FC4, yes?
>
> Well, it could be :-). We've got about 50 machines running various
> kernels, but they are generally up to date. Let me know what you'd like me to test
> and I could probably set it up.

Cool. I asked because you seem to have a good handle on what's going on, and I have a kernel patch I developed late last year while working on another problem. When I was developing it I realized that it would help with this problem as well. From reading your kernel log, the problem looks like a long-standing autofs issue of not being able to return an error code when the kernel walks the path (in this particular case).

I tested the patch when I wrote it and it seemed to work OK. I stopped work on it because it is a significant change and I was working on version 5 of autofs, which also introduces significant change. But that's done now. If you have the time to test this I'll make a patch for whichever kernel you have on your test machine.

Ian
I'm happy to test whatever version is easiest. FC4 or FC5.
(In reply to comment #35)
> I'm happy to test whatever version is easiest. FC4 or FC5.

Great.

I've redone the patch against 2.6.16-1.2108_FC4. Give me a while to put it through some basic sanity checks, rethink the cases to make sure it does what I expect, and review your logs again; then I'll post it for you to test.

Ian
(In reply to comment #20)
> I've got another question though. I'm having problems where one of my servers
> gets busy and some automount attempts fail:
>
> May 27 05:21:06 cynosure automount[4026]: attempting to mount entry /data/sw1
> May 27 05:21:48 cynosure automount[24502]: >> mount: RPC: Timed out
> May 27 05:21:48 cynosure automount[24502]: mount(nfs): nfs: mount failure
> alexandria:/export/data1 on /data/sw1
> May 27 05:21:48 cynosure automount[24502]: failed to mount /data/sw1
> May 27 05:21:48 cynosure automount[24502]: umount_multi: no mounts found under
> /data/sw1
>
> Is there any way I can up the RPC timeout or ask it to retry a couple of times?
> The man pages don't have any information about this if there is.

Do these mounts use UDP or TCP for the transport?
(In reply to comment #37)
> (In reply to comment #20)
> > Is there any way I can up the RPC timeout or ask it to retry a couple of times?
> > The man pages don't have any information about this if there is.
>
> Do these mounts use UDP or TCP for the transport?

TCP
(In reply to comment #36)
> (In reply to comment #35)
> > I'm happy to test whatever version is easiest. FC4 or FC5.
>
> Great.
>
> I've redone the patch against 2.6.16-1.2108_FC4.
> Give me a while to put it through some basic sanity checks,
> rethink the cases to make sure it does what I expect,
> and review your logs again; then I'll post it for you to
> test.

The patch has failed my analysis. I thought I'd thought it through quite thoroughly when I did it, but clearly not.

However, reviewing the history I see that there is another patch that has gone into recent FC4 kernels that should partially address this issue.

FC4 kernels from 2.6.14_1.1633_FC4 have this patch. It won't address the problem of the lookup continuing after a mount failure (which is what the above patch would have resolved), but it should allow subsequent mounts to function as normal without having to access the mount point directory to make them work again.

Most of your machines have kernels with this patch, so, when you have some time, could you check whether the original problem still exists, please?

Ian
(In reply to comment #38)
> (In reply to comment #37)
> > (In reply to comment #20)
> > > Is there any way I can up the RPC timeout or ask it to retry a couple of times?
> > > The man pages don't have any information about this if there is.
> >
> > Do these mounts use UDP or TCP for the transport?
>
> TCP

Is it actually likely that the server isn't able to complete a response within 60 seconds, given that in this case the server may be bogged down?
> Is it actually likely that the server isn't able to complete
> a response within 60 seconds, given that in this case the
> server may be bogged down?

I first started seeing this when I set up a system that uses rsync to back up our machines to an NFS directory. It got started by cron.daily, and it was one of the first things run, so every machine kicked off at the same time. So the server would get something like 30-40 mount requests followed by heavy disk I/O. I eventually added a random sleep before the backup ran to reduce the load, and I rarely see this problem anymore. But it would still be nice to have a retry option.
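For what it's worth, the staggering approach can be as simple as a randomized delay at the top of the backup script; this is a made-up sketch (script path, window, and rsync arguments are all illustrative):

    #!/bin/bash
    # /etc/cron.daily/nfs-backup -- hypothetical wrapper
    # spread client start times over up to 30 minutes to avoid a mount storm
    sleep $(( RANDOM % 1800 ))
    rsync -a /local/data/ /data/backup/$(hostname)/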
(In reply to comment #41)
> > Is it actually likely that the server isn't able to complete
> > a response within 60 seconds, given that in this case the
> > server may be bogged down?
>
> I first started seeing this when I set up a system that uses rsync to back up our
> machines to an NFS directory. It got started by cron.daily, and it was one of
> the first things run, so every machine kicked off at the same time. So the server
> would get something like 30-40 mount requests followed by heavy disk I/O. I
> eventually added a random sleep before the backup ran to reduce the load, and I
> rarely see this problem anymore. But it would still be nice to have a retry option.

The problem with that is that it would cause failures for other mounts that time out waiting for the lock file, so you could get several failures instead of the one. If we did this, then autofs mount locking would need to be disabled by default, which it's not at the moment. I'll give it some more thought.
(In reply to comment #39)
>
> However, reviewing the history I see that there is another
> patch that has gone into recent FC4 kernels that should
> partially address this issue.
>
> FC4 kernels from 2.6.14_1.1633_FC4 have this patch.
> It won't address the problem of the lookup continuing
> after a mount failure (which is what the above patch would
> have resolved), but it should allow subsequent mounts to
> function as normal without having to access the mount point
> directory to make them work again.
>
> Most of your machines have kernels with this patch, so, when
> you have some time, could you check whether the original
> problem still exists, please?

Well, ran a test over the weekend that failed to produce a failure, so perhaps things are more robust now. I'll start re-enabling ghosting on some of our machines and see how it goes.
(In reply to comment #43)
> (In reply to comment #39)
> >
> > However, reviewing the history I see that there is another
> > patch that has gone into recent FC4 kernels that should
> > partially address this issue.
> >
> > FC4 kernels from 2.6.14_1.1633_FC4 have this patch.
> > It won't address the problem of the lookup continuing
> > after a mount failure (which is what the above patch would
> > have resolved), but it should allow subsequent mounts to
> > function as normal without having to access the mount point
> > directory to make them work again.
> >
> > Most of your machines have kernels with this patch, so, when
> > you have some time, could you check whether the original
> > problem still exists, please?
>
> Well, ran a test over the weekend that failed to produce a failure, so perhaps
> things are more robust now. I'll start re-enabling ghosting on some of our
> machines and see how it goes.

Orion,

Have you been able to check this?

Ian
(In reply to comment #44)
>
> Have you been able to check this?

Still no problems so far. I'll try turning ghosting back on in more places.
(In reply to comment #45)
> (In reply to comment #44)
> >
> > Have you been able to check this?
>
> Still no problems so far. I'll try turning ghosting back on in more places.

Orion, did you end up getting time to test this further?
So far so good. I'm going to turn --ghost back on on all my systems and see what happens.
This report targets the FC3 or FC4 products, which have now been EOL'd. Could you please check that it still applies to a current Fedora release, and either update the target product or close it? Thanks.
Appears resolved in FC5+