Bug 154393 - automount temporarily unable to mount NFS directory
Status: CLOSED NEXTRELEASE
Product: Fedora
Classification: Fedora
Component: autofs
Version: 4
Hardware/OS: All Linux
Priority: medium   Severity: medium
Assigned To: Ian Kent
QA Contact: Brock Organ
Reported: 2005-04-11 09:44 EDT by Orion Poplawski
Modified: 2007-11-30 17:11 EST
CC: 3 users

Doc Type: Bug Fix
Last Closed: 2007-01-22 12:27:17 EST

Attachments
debug log output from automount (9.15 KB, text/plain)
2005-04-12 10:53 EDT, Orion Poplawski
Fix a race in the expire code. (509 bytes, patch)
2005-05-04 13:34 EDT, Jeffrey Moyer
automount debug output (8.77 KB, application/octet-stream)
2005-05-09 11:10 EDT, Orion Poplawski
enable debugging in the autofs4 module (408 bytes, patch)
2005-05-23 17:53 EDT, Jeffrey Moyer
Fix for failed mounts (592 bytes, patch)
2005-05-31 09:49 EDT, Jeffrey Moyer
debug output (170.20 KB, application/octet-stream)
2005-06-08 14:06 EDT, Orion Poplawski

Description Orion Poplawski 2005-04-11 09:44:02 EDT
Description of problem:

We automount NFS directories in /data from various NFS servers using NFSv4.  It
appears that since adding NFSv4 capability (to /etc/exports) on the servers,
this has become much less reliable.  All machines are up-to-date FC3.

Ran the following test on a client:

#!/usr/bin/perl
while (1) {
   open(IN,"< /data/radar3/saber/SABER_SPECTRA_2002_090_130M.DAT")
      or print `date` . ": Failed to open\n";
   close(IN);
   sleep(70);
}

Errors in the client logs started at 00:49 and went to 04:13:

Apr 11 00:49:25 earth automount[7414]: failed to mount /data/radar3/saber
Apr 11 00:50:35 earth automount[7690]: failed to mount /data/radar3/saber
...
Apr 11 04:12:26 earth automount[17304]: failed to mount /data/radar3/saber
Apr 11 04:13:36 earth automount[17555]: failed to mount /data/radar3/saber

After that it fixed itself and was fine.  I'm kind of surprised to see
"/data/radar3/saber" instead of "/data/radar3" since that is the mount point.

Server was pretty quiet at these times:

# sar
Linux 2.6.10-1.770_FC3smp (alexandria.cora.nwra.com)    04/11/2005

12:00:01 AM       CPU     %user     %nice   %system   %iowait     %idle
12:10:01 AM       all      0.01      0.00      0.02      0.07     99.91
12:20:01 AM       all      0.00      0.00      0.01      0.01     99.97
12:30:01 AM       all      0.01      0.00      0.01      0.01     99.97
12:40:02 AM       all      0.01      0.00      0.01      0.01     99.97
12:50:01 AM       all      0.01      0.00      0.01      0.01     99.97
01:00:01 AM       all      0.00      0.00      0.01      0.00     99.98
01:10:01 AM       all      0.10      0.00      0.06      0.15     99.69
01:20:01 AM       all      0.12      0.00      0.14      0.95     98.79
01:30:01 AM       all      0.05      0.00      0.14      0.06     99.76
01:40:02 AM       all      0.04      0.00      0.14      0.06     99.76
01:50:01 AM       all      0.05      0.00      0.13      0.06     99.76
02:00:01 AM       all      0.05      0.00      0.14      0.10     99.72
02:10:01 AM       all      0.05      0.00      0.14      0.05     99.76
02:20:02 AM       all      0.07      0.00      0.15      0.44     99.34
02:30:01 AM       all      0.05      0.00      0.16      0.08     99.71
02:40:01 AM       all      0.05      0.00      0.17      0.13     99.65
02:50:02 AM       all      0.05      0.00      0.16      0.07     99.72
03:00:02 AM       all      0.05      0.00      0.16      0.05     99.74
03:10:01 AM       all      0.05      0.00      0.16      0.06     99.74
03:20:01 AM       all      0.05      0.00      0.16      0.05     99.74
03:30:01 AM       all      0.10      0.00      0.18      0.34     99.38
03:40:01 AM       all      0.08      0.00      0.18      0.65     99.08

(at 01:00 it starts a series of rsyncs)

Now, I'm not sure, but this issue seems to have coincided with starting to use
NFSv4 as well as NFSv3.  /etc/exports on the server looks like:

/export                 *.cora.nwra.com(rw,sync,fsid=0)
/export/amanda          *.cora.nwra.com(rw,sync,nohide,no_root_squash)
/export/data1           *.cora.nwra.com(rw,sync,nohide,no_root_squash)
/export/clouds2         *.cora.nwra.com(rw,sync,nohide)
/export/radar3          *.cora.nwra.com(rw,sync,nohide)
/export/soldyn2         *.cora.nwra.com(rw,sync,nohide)
/export/bigmeadow       *.cora.nwra.com(rw,sync,nohide)
/export/waves2          *.cora.nwra.com(rw,sync,nohide)

I added the first line and the nohide option when I added NFSv4 support.

Automount map looks like:

soldyn2 -rsize=8192,wsize=8192  alexandria:/export/soldyn2
clouds2 -rsize=8192,wsize=8192  alexandria:/export/clouds2
waves2 -rsize=8192,wsize=8192   alexandria:/export/waves2
sw1 -rsize=8192,wsize=8192      alexandria:/export/data1
radar3 -rsize=8192,wsize=8192   alexandria:/export/radar3
bigmeadow -rsize=8192,wsize=8192        alexandria:/export/bigmeadow



Version-Release number of selected component (if applicable):
nfs-utils-1.0.7-1 on the server (compiled on FC3 to fix some issues), but I have
seen issues with stock nfs-utils on other systems.
nfs-utils-1.0.6-52 on the client


How reproducible:
Fairly.

Additional info:

Let me know what other debug info I can provide.  This is really causing some
major headaches for us.
Comment 1 Orion Poplawski 2005-04-11 09:48:51 EDT
I don't see mount requests on the server during the problem period:

Apr 11 00:48:15 alexandria rpc.mountd: authenticated mount request from earth.co
ra.nwra.com:1015 for /export/radar3 (/export/radar3)
Apr 11 00:49:25 alexandria rpc.mountd: authenticated unmount request from earth.
cora.nwra.com:805 for /export/radar3 (/export/radar3)

Messages stop there until:

Apr 11 04:10:34 alexandria rpc.mountd: authenticated mount request from hawk.cor
a.nwra.com:680 for /export/data1 (/export/data1)
Apr 11 04:11:35 alexandria rpc.mountd: authenticated unmount request from hawk.c
ora.nwra.com:707 for /export/data1 (/export/data1)
Apr 11 04:14:46 alexandria rpc.mountd: authenticated mount request from earth.co
ra.nwra.com:998 for /export/radar3 (/export/radar3)
Comment 2 Orion Poplawski 2005-04-11 13:13:51 EDT
I think this may be an autofs issue:

Apr 11 00:49:25 earth automount[7414]: failed to mount /data/radar3/saber
Apr 11 00:50:35 earth automount[7690]: failed to mount /data/radar3/saber
...
Apr 11 04:12:26 earth automount[17304]: failed to mount /data/radar3/saber
Apr 11 04:13:36 earth automount[17555]: failed to mount /data/radar3/saber

I've installed the autofs update from updates/testing and am testing that to try
to reproduce.
Comment 3 Jeffrey Moyer 2005-04-11 13:24:11 EDT
Please update the bug when you have updates.

If the problem persists with the latest autofs, please do the following:

Modify /etc/syslog.conf to contain a line like so:
*.*        /var/log/debug

Add --debug to the end of the line in the auto.master map for the problematic
mount point.  Please also attach your auto.master and corresponding maps.

HUP syslogd, and restart autofs.  Make sure autofs actually restarts (it may not
if any mounts are busy).

Then, when you've reproduced the problem, please send me the contents of
/var/log/debug.
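
For reference, a minimal sketch of those steps (the map name and timeout below
are placeholders; use whatever your auto.master actually contains):

# /etc/syslog.conf: log everything, including daemon debug output
*.*        /var/log/debug

# /etc/auto.master: append --debug to the problematic mount point's line
/data   /etc/auto.data  --timeout=60 --debug

# pick up both changes
killall -HUP syslogd
service autofs restart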

Thanks.
Comment 4 Jeffrey Moyer 2005-04-11 13:24:55 EDT
I meant to say please attach the contents of /var/log/debug at the end of
comment #3.
Comment 5 Orion Poplawski 2005-04-12 10:53:24 EDT
Created attachment 113019
debug log output from automount

Reproduced with autofs-4.1.3-114:

Mon Apr 11 17:20:49 MDT 2005
: Failed to open

I've attached the debug log for the minutes of 17:20-21, failure followed by
success.  Let me know if you need more.  Looks like a race condition between
expiring the mount and accessing it?
Comment 6 Jeffrey Moyer 2005-05-04 13:34:06 EDT
Created attachment 114032
Fix a race in the expire code.

Would you be willing to test a kernel patch?  Ian Kent recently released a
patch that addressed a race in the expire code, and I think it may apply here. 
This patch will apply to the latest FC-3 kernel.  If you need me to do a build
for you, let me know.
Comment 7 Orion Poplawski 2005-05-04 13:38:41 EDT
I can test this out.  Thanks.
Comment 8 Orion Poplawski 2005-05-04 19:00:11 EDT
Got this kernel installed on two test machines.  We'll see how it goes.
Comment 9 Orion Poplawski 2005-05-06 14:16:50 EDT
Still see the error. Running with a patched version of 2.6.11-1.20_FC3 and
autofs-4.1.4-8.  Unfortunately it happened with a different directory that I
wasn't running under debug.  I'll change that and monitor:

May  5 09:59:36 hawk automount[31689]: failed to mount /home/dbraun/public_html

I don't know if it is a clue or not, but "/home/dbraun" is the real mount,
though not the file being accessed.
Comment 10 Orion Poplawski 2005-05-09 11:10:35 EDT
Created attachment 114163
automount debug output

This is with the expire race patched kernel and autofs-4.1.4-8.  At May  6
21:31:03 it starts with "failed to mount /home/meyer/public_html".
Comment 11 Orion Poplawski 2005-05-16 12:25:04 EDT
Let me know if you need any more debug output.  Still seeing this fairly regularly.
Comment 12 Orion Poplawski 2005-05-23 11:10:21 EDT
Any progress on this?  It's really causing me a lot of headaches.  Basically, it
seems that once a mount attempt fails, automount simply cannot mount the
directory when the same dir as the original failure is accessed.  For example,
if a program keeps asking for /data/sw1/backup and it fails once, any future
attempt to access /data/sw1/backup will fail.  However, if you do "ls
/data/sw1", it mounts fine.
Comment 13 Jeffrey Moyer 2005-05-23 11:17:31 EDT
Sorry, no progress to report.  It may be helpful to bring this up on the autofs
mailing list.  The list address is autofs@linux.kernel.org.  I can post the
problem report for you, if you like (indeed, I am more than happy to).  Please
let me know if you would prefer to report the problem yourself.

Thanks.
Comment 14 Orion Poplawski 2005-05-23 11:22:12 EDT
If you wouldn't mind posting it, I'd be grateful.  I'll try to follow the
discussion there and chip in if needed.
Comment 15 Jeffrey Moyer 2005-05-23 17:52:54 EDT
Okay, I've posted the problem report, and Ian Kent responded.  He would like to
see debug output from the kernel module.  This will require recompiling the
autofs4 module.  You basically need to edit fs/autofs4/autofs_i.h to uncomment
the line which defines DEBUG.  I'll attach a patch for this.
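
In case it helps, a rough sketch of the change (illustrative only; the exact
line and its surrounding context vary by kernel version, so treat the attached
patch as authoritative):

cd /path/to/kernel-source                # placeholder path
# uncomment the DEBUG define so the module's debug printk output is compiled in
sed -i 's|/\* *#define DEBUG *\*/|#define DEBUG|' fs/autofs4/autofs_i.h
# then rebuild and install the autofs4 module for the running kernel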

Then, the real issue is the sheer amount of logs that will be generated.  It's
enormous.

Let me know if you are willing to give this a try.  You'd have to reproduce the
problem with the debug module, and attach the logs for the time period when the
problem occurs.
Comment 16 Jeffrey Moyer 2005-05-23 17:53:47 EDT
Created attachment 114751
enable debugging in the autofs4 module
Comment 17 Jeffrey Moyer 2005-05-23 18:04:05 EDT
I have one other question.  When you're running your test script, are there
other apps that may be accessing the automounted directories?
Comment 18 Orion Poplawski 2005-05-23 19:00:47 EDT
I'm building the new kernel now and hopefully can test tomorrow.  I think I've
got a pretty surefire way to reproduce it (sketched below):

- bring nfs server down
- try to access a subdir, e.g.: /data/sw1/fedora.  autofs will complain
- bring nfs server up
- try to access the subdir again; autofs seems to keep complaining until I
access /data/sw1 first.
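
Roughly, as commands (the mount point and export are the ones from my map
above; "service nfs" is what I'd run on the FC3 server):

# on the NFS server:
service nfs stop
# on the client; this fails and logs "failed to mount":
ls /data/sw1/fedora
# bring the server back:
service nfs start
# on the client again; this keeps failing...
ls /data/sw1/fedora
# ...until the top-level mount point is touched first:
ls /data/sw1
ls /data/sw1/fedora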

As for my original test, there may be one or two odd accesses to the automounted
directories, but they don't seem to correspond to the failures.
Comment 19 Orion Poplawski 2005-05-25 14:30:16 EDT
No luck reproducing yet.  My expire-race test script has run without error so
far, but I'll keep trying.  Log volume isn't *too* bad :-).

The server down trick did not appear to work either, though that is a common
failure mode I'm seeing here with my FC3 clients.  I brought down one of our
main disk servers recently, and folks couldn't access subdirectories in
automounted directories until I first accessed the top-level mount directory.
It *seemed* like trying to access the directory while the server was down left
things in a bad state.
Comment 20 Orion Poplawski 2005-05-27 10:37:21 EDT
Still no luck reproducing with debugging on.  Perhaps it prevents the race?  I'm
going to have to stop after today as I'm going on vacation.

I've got another question though.  I'm having problems where one of my servers
gets busy and some automount attempts fail:

May 27 05:21:06 cynosure automount[4026]: attempting to mount entry /data/sw1
May 27 05:21:48 cynosure automount[24502]: >> mount: RPC: Timed out
May 27 05:21:48 cynosure automount[24502]: mount(nfs): nfs: mount failure
alexandria:/export/data1 on /data/sw1
May 27 05:21:48 cynosure automount[24502]: failed to mount /data/sw1
May 27 05:21:48 cynosure automount[24502]: umount_multi: no mounts found under
/data/sw1

Is there any way I can up the RPC timeout or ask it to retry a couple of times?
The man pages don't seem to have any information about this.
Comment 21 Jeffrey Moyer 2005-05-27 10:53:44 EDT
Thanks for all of the testing!  I guess we'll have to track this bug down the
old fashioned way.  =)

On to your other question.  Unfortunately, there is no good way to up the number
of retries that the mount command will do.  The "bg" mount option tells mount to
go off and retry a bunch of times in the background, but this is definitely not
something I would recommend when using autofs!

If this is functionality that would be useful to you, please file a feature
request.  Basically, you want automount to retry indefinitely until a mount
succeeds.  I don't think it makes sense to just up the number of retries, since
this sort of thing is supposed to be transparent, anyway.

Thanks.
Comment 22 Jeffrey Moyer 2005-05-31 09:49:43 EDT
Created attachment 114991
Fix for failed mounts

Could you please try the following patch, from the upstream maintainer?  Here
is his description:

---
This happens for browsable maps that request a mount at the same time as
an expire is happening. My understanding of the lookup was slightly
incorrect in that, when the directory exists, the revalidate succeeds when
it should fail, in order to ensure that lookup is called. Instead, lookup
is called on the next path component, which should be within the mount.
Unfortunately, the parent isn't mounted.

This patch seems to fix the problem.
---

Is ghosting enabled in your automounter setup?
Comment 23 Orion Poplawski 2005-06-08 14:02:25 EDT
Ah, ghosting.  My previous attempt to get things to fail did not have ghosting
turned on, and we indeed started seeing more problems once I turned ghosting on.

I've just run a new test.  The kernel has the expire-race patch, the comment 22
patch, and debugging turned on.  Got it to fail fairly quickly:

Wed Jun  8 10:34:04 MDT 2005
drwxr-xr-x  13 root source 4096 Apr 19 14:42 /data/sw1/fedora
Wed Jun  8 10:35:11 MDT 2005
drwxr-xr-x  13 root source 4096 Apr 19 14:42 /data/sw1/fedora
Wed Jun  8 10:36:18 MDT 2005
drwxr-xr-x  13 root source 4096 Apr 19 14:42 /data/sw1/fedora
Wed Jun  8 10:37:27 MDT 2005
drwxr-xr-x  13 root source 4096 Apr 19 14:42 /data/sw1/fedora
Wed Jun  8 10:38:34 MDT 2005
ls: /data/sw1/fedora: No such file or directory
Wed Jun  8 10:40:43 MDT 2005
ls: /data/sw1/fedora: No such file or directory
Wed Jun  8 10:41:50 MDT 2005
ls: /data/sw1/fedora: No such file or directory
Wed Jun  8 10:42:57 MDT 2005
ls: /data/sw1/fedora: No such file or directory

The system is now in such a state that access to /data/sw1/fedora will fail, but
accessing /data/sw1 should restore things to rights.

I'll attach debug logs next.  They are big, much bigger with ghosting on!
Comment 24 Orion Poplawski 2005-06-08 14:06:48 EDT
Created attachment 115232
debug output

From 10:37-10:39.  Last successful access at 10:37:27.	Failure occurs at
10:38:34.
Comment 25 Orion Poplawski 2005-10-10 15:48:29 EDT
Any progress on this?  This is just killing us, causing all kinds of problems.
Comment 26 Jeffrey Moyer 2006-05-08 18:58:41 EDT
OK, re comment 20: you should be able to specify the mount option retry=n, where
n is the number of retries.  Bump this up on the map entries for the servers that
time out.
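
As a sketch against the map posted earlier, that would look something like the
line below (the retry value is arbitrary; see nfs(5) for its exact semantics):

sw1     -rsize=8192,wsize=8192,retry=5  alexandria:/export/data1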

For the main problem, it may be that you're running into the negative dentry
caching problems that have recently been discovered.

Ian, do you have time to take a look at the logs posted in comment #24?
Comment 27 Ian Kent 2006-05-09 01:11:19 EDT
(In reply to comment #25)
> Any progress on this?  This is just killing us, causing all kinds of problems.

Very interesting.

I've had a look at the log and I think it demonstrates the problem
very well. I need more time to digest this.

Ian
Comment 28 Ian Kent 2006-05-09 01:16:38 EDT
(In reply to comment #24)
> Created an attachment (id=115232)
> debug output
> 
> From 10:37-10:39.  Last successful access at 10:37:27.	Failure occurs at
> 10:38:34.

Orion, can you give me a bit of a description of what's
seen in userspace during the time this happens (from about
10:39 to 10:40), please?  If you can still remember.

Ian
Comment 29 Orion Poplawski 2006-05-09 11:14:10 EDT
Well, comment #23 really has it all.  My test is basically:

while true
do
   date
   ls /data/sw1/fedora
   sleep 65
done

The sleep time is adjusted to provoke maximum overlap with the expire time of
the mount.  The output you see is the output from ls.  That's what userspace
sees.  Let me know if you need other info.

FYI - at this point I've turned --ghost off on all of our mounts to avoid problems.
Comment 30 Ian Kent 2006-05-09 12:02:20 EDT
(In reply to comment #29)

> 
> The sleep time is adjusted to provoke maximum overlap with the expire time of
> the mount.  The output you see is the output from ls.  That's what userspace
> sees.  Let me know if you need other info.

OK, so no hang, just failure messages.
Did you also say that these mounts return to normal after a while?
 
Reading the call I got the impression you were running FC3 but I see
it's FC4. Can you tell me what kernel version you currently have please?

Comment 31 Orion Poplawski 2006-05-09 12:08:11 EDT
(In reply to comment #30)
> Did you also say that these mounts return to normal after a while?

Not on their own.  If you keep trying to access the same file it will always
error.  However, you can fix the problem by accessing the base mount directory 
(and perhaps others) to mount the directory.
  
> Reading the call I got the impression you were running FC3 but I see
> it's FC4. Can you tell me what kernel version you currently have please?

We're mostly up to date here.  Although, as I mentioned above, I've disabled
--ghost everywhere to work around the problem.  I could try enabling it on a few
machines to test.
Comment 32 Ian Kent 2006-05-09 12:17:08 EDT
(In reply to comment #31)
> (In reply to comment #30)
> > Did you also say that these mounts return to normal after a while?
> 
> Not on their own.  If you keep trying to access the same file it will always
> error.  However, you can fix the problem by accessing the base mount directory 
> (and perhaps others) to mount the directory.

OK.

>   
> > Reading the call I got the impression you were running FC3 but I see
> > it's FC4. Can you tell me what kernel version you currently have please?
> 
> We're mostly up to date here.  Although, as I mentioned above, I've disabled
> --ghost everywhere to work around the problem.  I could try enabling it on a few
> machines to test.

Yes. I forgot, enabling ghosting makes the problem happen more
quickly, right. But it still happens when it's disabled, correct?

And your kernel version would be 2.6.16-1.2108_FC4 yes?


Comment 33 Orion Poplawski 2006-05-09 12:26:37 EDT
(In reply to comment #32)
> Yes. I forgot, enabling ghosting makes the problem happen more
> quickly, right. But it still happens when it's disabled, correct?

Actually, I'm not sure the problem occurs at all without ghosting. I haven't had
the problem occur in recent memory after turning it off.
  
> And your kernel version would be 2.6.16-1.2108_FC4 yes?

Well, it could be :-).  We've got about 50 machines running various
kernels, but are generally up-to-date.  Let me know what you'd like me to test
and I could probably set it up.

e.g.:
apapane: 2.6.16-1.2096_FC4smp
apollo: 2.6.16-1.2069_FC4smp
armstrong: 2.6.16-1.2096_FC4
aspen: 2.6.16-1.2107_FC4smp
coop00: 2.6.16-1.2069_FC4smp
coop01: 2.6.16-1.2108_FC4smp
cynosure: 2.6.16-1.2096_FC5
denali: 2.6.16-1.2069_FC4
earth: 2.6.13-1.1532_FC4
hammer: 2.6.16-1.2107_FC5xen0
hawk: 2.6.16-1.2108_FC4smp
hobbes: 2.6.16-1.2107_FC4smp
iago: 2.6.16-1.2107_FC4
irimi: 2.6.16-1.2111_FC5
kolea: 2.6.16-1.2107_FC4
laplata: 2.6.16-1.2107_FC4
lego: 2.6.16-1.2069_FC4
Comment 34 Ian Kent 2006-05-09 13:01:08 EDT
(In reply to comment #33)
> (In reply to comment #32)
> > Yes. I forgot, enabling ghosting makes the problem happen more
> > quickly, right. But it still happens when it's disabled, correct?
> 
> Actually, I'm not sure the problem occurs at all without ghosting. I haven't had
> the problem occur in recent memory after turning it off.

OK. Let me know if you think it does.
Looking at the kernel log I think it should but I'll have to
rethink that then.

>   
> > And your kernel version would be 2.6.16-1.2108_FC4 yes?
> 
> Well, it could be :-).  We've got about 50 machines running various
> kernels, but are generally up-to-date.  Let me know what you'd like me to test
> and I could probably set it up.

Cool.

I asked because you seem to have a good handle on what's
going on and I have a kernel patch I developed late last
year while working on another problem. When I was developing
it I realized that it would help with this problem as well.

From reading your kernel log, the problem looks like a long-standing
autofs issue of not being able to return an error
code when the kernel walks the path (in this particular case).

I tested the patch when I wrote it and it seemed to work ok.
I stopped work on it because it is a significant change and
I was working on version 5 of autofs which also introduces
significant change. But that's done now.

If you have the time to test this I'll make a patch for whichever
kernel you have on your test machine.

Ian
Comment 35 Orion Poplawski 2006-05-09 13:44:29 EDT
I'm happy to test whatever version is easiest.  FC4 or FC5.
Comment 36 Ian Kent 2006-05-10 05:42:26 EDT
(In reply to comment #35)
> I'm happy to test whatever version is easiest.  FC4 or FC5.

Great.

I've redone the patch against 2.6.16-1.2108_FC4.
Give me a while to put it through some basic sanity checks
and rethink the cases to make sure it does what I expect
and review your logs again then I'll post it for you to
test.

Ian

Comment 37 Ian Kent 2006-05-10 05:47:55 EDT
(In reply to comment #20)
> I've got another question though.  I'm having problems where one of my servers
> gets busy and some automount attempts fail:
> 
> May 27 05:21:06 cynosure automount[4026]: attempting to mount entry /data/sw1
> May 27 05:21:48 cynosure automount[24502]: >> mount: RPC: Timed out
> May 27 05:21:48 cynosure automount[24502]: mount(nfs): nfs: mount failure
> alexandria:/export/data1 on /data/sw1
> May 27 05:21:48 cynosure automount[24502]: failed to mount /data/sw1
> May 27 05:21:48 cynosure automount[24502]: umount_multi: no mounts found under
> /data/sw1
> 
> Is there any way I can up the RPC timeout or ask it to retry a couple of times?
> The man pages don't seem to have any information about this.

Do these mounts use UDP or TCP for the transport?


Comment 38 Orion Poplawski 2006-05-10 10:27:39 EDT
(In reply to comment #37)
> (In reply to comment #20)
> > Is there any way I can up the RPC timeout or ask it to retry a couple of times?
> > The man pages don't seem to have any information about this.
> 
> Do these mounts use UDP or TCP for the transport?

TCP

Comment 39 Ian Kent 2006-05-11 21:35:13 EDT
(In reply to comment #36)
> (In reply to comment #35)
> > I'm happy to test whatever version is easiest.  FC4 or FC5.
> 
> Great.
> 
> I've redone the patch against 2.6.16-1.2108_FC4.
> Give me a while to put it through some basic sanity checks
> and rethink the cases to make sure it does what I expect
> and review your logs again then I'll post it for you to
> test.

The patch has failed my analysis. I thought I'd thought it
through quite thoroughly when I did it but clearly not.

However, reviewing the history I see that there is another
patch that has gone into recent FC4 kernels that should
partially address this issue.

FC4 kernels from 2.6.14_1.1633_FC4 have this patch.
It won't address the problem of the lookup continuing
after a mount fail (which was what the above patch would
have resolved) but it should allow subsequent mount to
function as normal without having to access the mount point
directory to make them work again.

Most of your machines have kernels with this patch so, when
you have some time, could you check whether the original
problem still exists please.
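
If it helps, a quick way to survey which clients are already at or past that
kernel (the host names below are just examples from this report; assumes ssh
access to each machine):

for h in earth hawk cynosure; do
    printf '%s: ' "$h"
    ssh "$h" uname -r
done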

Ian
 
Comment 40 Ian Kent 2006-05-11 21:42:51 EDT
(In reply to comment #38)
> (In reply to comment #37)
> > (In reply to comment #20)
> > > Is there any way I can up the RPC timeout or ask it to retry a couple of times?
> > > The man pages don't seem to have any information about this.
> > 
> > Do these mounts use UDP or TCP for the transport?
> 
> TCP

Is it actually likely that the server isn't able to complete
a response within 60 seconds, given that in this case the
server may be bogged down?




Comment 41 Orion Poplawski 2006-05-12 12:23:38 EDT
> Is it actually likely that the server isn't able to complete
> a response within 60 seconds, given that in this case the
> server may be bogged down?

I first started seeing this when I started using rsync to back up our
machines to an NFS directory.  This was started by cron.daily, and was one of
the first things run, so every machine kicked off at the same time.  So the server
would get something like 30-40 mount requests followed by heavy disk I/O.  I
eventually added a random sleep before the backup ran to reduce load and rarely
see this problem anymore.  But it would still be nice to have a retry option.
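
The stagger is just something like this at the top of the cron.daily backup
script (the paths and the 30-minute window here are only illustrative):

#!/bin/bash
# sleep a random 0-1800 seconds so the clients don't all hit the
# server (and automount) at the same moment
sleep $(( RANDOM % 1800 ))
rsync -a /etc /home /data/backup/$(hostname)/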

Comment 42 Ian Kent 2006-05-12 21:40:29 EDT
(In reply to comment #41)
> > Is it actually likely that the server isn't able to complete
> > a response within 60 seconds, given that in this case the
> > server may be bogged down?
> 
> I first started seeing this when I started using rsync to back up our
> machines to an NFS directory.  This was started by cron.daily, and was one of
> the first things run, so every machine kicked off at the same time.  So the server
> would get something like 30-40 mount requests followed by heavy disk I/O.  I
> eventually added a random sleep before the backup ran to reduce load and rarely
> see this problem anymore.  But it would still be nice to have a retry option.
> 

The problem with that is that it would cause failures for other mounts
that time out waiting for the lock file.  So you could get several failures
instead of just one.

If we did this, then autofs mount locking would need to be disabled by
default, which it's not at the moment.  I'll give it some more thought.

Comment 43 Orion Poplawski 2006-05-15 18:40:07 EDT
(In reply to comment #39)
> 
> However, reviewing the history I see that there is another
> patch that has gone into recent FC4 kernels that should
> partially address this issue.
> 
> FC4 kernels from 2.6.14_1.1633_FC4 have this patch.
> It won't address the problem of the lookup continuing
> after a mount fail (which was what the above patch would
> have resolved) but it should allow subsequent mount to
> function as normal without having to access the mount point
> directory to make them work again.
> 
> Most of your machines have kernels with this patch so, when
> you have some time, could you check whether the original
> problem still exists please.

Well, ran a test over the weekend that failed to produce a failure, so perhaps
things are more robust now.  I'll start re-enabling ghosting on some of our
machines and see how it goes.
Comment 44 Ian Kent 2006-05-29 22:38:03 EDT
(In reply to comment #43)
> (In reply to comment #39)
> > 
> > However, reviewing the history I see that there is another
> > patch that has gone into recent FC4 kernels that should
> > partially address this issue.
> > 
> > FC4 kernels from 2.6.14_1.1633_FC4 have this patch.
> > It won't address the problem of the lookup continuing
> > after a mount fail (which was what the above patch would
> > have resolved) but it should allow subsequent mount to
> > function as normal without having to access the mount point
> > directory to make them work again.
> > 
> > Most of your machines have kernels with this patch so, when
> > you have some time, could you check whether the original
> > problem still exists please.
> 
> Well, ran a test over the weekend that failed to produce a failure, so perhaps
> things are more robust now.  I'll start re-enabling ghosting on some of our
> machines and see how it goes.


Orion,

Have you been able to check this?

Ian
Comment 45 Orion Poplawski 2006-05-30 11:29:30 EDT
(In reply to comment #44)
> 
> Have you been able to check this?

Still no problems so far.  I'll try turning ghosting back on in more places.
Comment 46 Ian Kent 2006-06-21 21:44:21 EDT
(In reply to comment #45)
> (In reply to comment #44)
> > 
> > Have you been able to check this?
> 
> Still no problems so far.  I'll try turning ghosting back on in more places.
> 

Orion, did you end up getting time to test this further?
Comment 47 Orion Poplawski 2006-06-30 18:34:55 EDT
So far so good.  I'm going to turn --ghost back on on all my systems and see
what happens.
Comment 49 Christian Iseli 2007-01-22 05:46:07 EST
This report targets the FC3 or FC4 products, which have now been EOL'd.

Could you please check that it still applies to a current Fedora release, and
either update the target product or close it?

Thanks.
Comment 50 Orion Poplawski 2007-01-22 12:27:17 EST
Appears resolved in FC5+
