Red Hat Bugzilla – Bug 162636
autofs-4.1.4-non-replicated-ping.patch backported to 4.1.3?
Last modified: 2007-11-30 17:07:07 EST
Description of problem:
autofs-mounted NFS file systems are sometimes timing out quickly with a "file
error. This appears to have been introduced between
autofs-4.1.3-47 and auotfs-4.1.3-130
Version-Release number of selected component (if applicable):
Steps to Reproduce:
This looks like the bug fixed by autofs-4.1.4-non-replicated-ping.patch.
Is there a backport to autofs-4.1.3 for RHEL3 (autofs-4.1.3-non-replicated-ping
looks like it's solving a different problem)? Or should I use autofs-4.1.4
on RHEL3? If so, what's the minimum kernel release of 2.4.21 that's usable
Can you specify which version of autofs you are running?
Can you also attach your automount maps and entries from /var/log/messages when
you encountered this problem?
Created attachment 116947 [details]
auto.master for autofs bug
Created attachment 116948 [details]
auto.home for autofs bug
Created attachment 116949 [details]
auto.net for autofs bug
As detailed in my initial report, the problem doesn't seem to exhibit itself in
autofs-4.1.3-47 but _does_ occur when using auotfs-4.1.3-130.
I've attached our automount maps.
I can't find any relevant autofs messages in /var/log/messages that correspond
to the failure to find files. There's a slight chance I'm missing the messages
(it failed >week ago and on random machines in our compute pool) but I'm 90%
sure that nothing was reported when the "file not found" occurred.
There was a reasonable possibility of high network traffic or server load
when the failure occurred.
Also note that the server was running Solaris 2.8.
We've had an additional report of the problem, but this time with
autofs-4.1.3-47. We did get an automount failure message. The error
occurred druing a compile, with libraries on NFS-mounted partitions not
being found and then, later, reappearing (probably when network/server load
Jul 20 14:40:00 lentil CROND: (root) CMD (/usr/lib/sa/sa1 1 1)
Jul 20 14:50:00 lentil CROND: (root) CMD (/usr/lib/sa/sa1 1 1)
Jul 20 14:57:32 lentil automount: failed to mount /u/drspeech
Jul 20 15:00:00 lentil CROND: (root) CMD (/usr/lib/sa/sa1 1 1)
Jul 20 15:01:00 lentil CROND: (root) CMD (run-parts /etc/cron.hourly)
Jul 20 15:01:01 lentil ntpdate: adjust time server 188.8.131.52 offset 0
Jul 20 15:01:03 lentil automount: failed to mount /u/drspeech
Jul 20 15:07:23 lentil automount: failed to mount /u/drspeech
Jul 20 15:08:23 lentil automount: failed to mount /u/drspeech
Jul 20 15:10:00 lentil CROND: (root) CMD (/usr/lib/sa/sa1 1 1)
Could the priority of this bug be raised to "high"? We were assuming this
was due to the 4.1.3-47->4.1.3-130 upgrade and so we had a downgrade
as a workaround. Now
we have no workaround. We have 10's of people who rely on NFS for their work
and having random files disappear, if only occasionally, is obviously a serious
issue. If autofs
turns out to be unreliable we may be forced to move back to Solaris 10 for
our compute cluster.
Can you enable debugging on autofs? (This should help me track down exactly
where the problem is occurring.)
Add the following line to your /etc/syslog.conf:
Then add --debug to all of the maps in /etc/auto.master.
Then run 'service syslog restart'.
Then run 'service autofs restart'.
Once you reproduce the problem can you post the information into the bug report.
We're working on this.
Note that we have ~100 production machines, each with nearly 10 maps,
so pushing out the maps and restarting the services isn't trivial, even
with our push infrastructure.
Many of the hosts have multiple services on them, some of which are critical.
A "*.* /var/log/debug" on every machine could well be a great way of making
hosts crash with a full /var (depending on exactly what they're running).
We'll try and get something out to a representative enough subset to stand a
chance of seeing the problem. However, something less all-inclusive than
"*.*" in /etc/syslog.conf would be preferrable...
Unfortunately autofs puts data into syslog at several different priorities and
having the syslog outputing everything will also allow me to see if something
else is interacting with autofs causing those mounts to fail.
If you can find a way to consistently replicate the problem it would make it
easier for me to debug.
I'm changing this bug to NEEDINFO as I need either debug logs or a method to
replicate the problem.
I think we have managed to reproduce this problem. The way we did it was
to connect a test machine to our main network via a box that effectively
generated 10% packet loss. We then had a second machine try and randomly
log into the
first machine and execute a trivial script from an autofs-mounted directory.
The script ran for a few hours but then everything locked up. I'll attach
- the script that periodically logged in
- the output from that script around the time of failure
- the debug log output around the time of failure
The maps are as attached above.
Let me know if you want the full log output as it's 90MB.
Created attachment 117283 [details]
Test script to trigger autofs bug
Created attachment 117284 [details]
Example output from test script
Created attachment 117286 [details]
section of syslog for failed automount
After some broswing of what happened in the logs, it appears things went
00:44:33 attempted to mount
00:46:44 attempted to mount
00:47:51 attempted to mount
00:47:51 failed to mount
Note that autofs-4.1.3-47 fails in a similar manner.
Reducing the max. time waited from 300 to 60, with the --timeout to autofs set to
30, results in a failure within in few minutes (assuming 10% packet loss).
I'm on vacation until Thursday 4th August so won't be able to provide any
more info until then.
Is anything happening with this? We were hoping to include a fix in our
next kernel patch application, which is likely to happen very soon.
Engineering is requesting you compress and attach the full log from the debug
bzip2 -c /var/log/debug > /tmp/debug.log.bz2
Also please send a new sysreport from the same failing client you take the debug
Hmmm - it's taken nearly two weeks to go from "would you like the full debug
output" to "yes we would".
I'll do the sysreport and get a compressed log and attach it below.
I'm 80% sure that this bug will hit any RHEL3 autofs setup if packets are lost.
I'm 70% sure it's independent of the type of NFS server (Solaris in this case).
- seting a system with an out-of-the-box RHEL3 client and out-of-the-box RHEL3
NFS server, fully patched with latest patch set and no local customizations
other than minor test-related changes
- connecting it to a packet dropper and checking to see if I can crash the
client with a 5 line test script
This will prove that there are many thousands of RHEL3 autofs clients out there
that will eventually crash when the network is loaded.
If I do this, providing all of the information necessary to reproduce the
test (including how to build a simple packet dropper with free software), will
this help get this problem solved sooner and result in a usable autofs package?
Created attachment 117629 [details]
sysreport output for failing client system
Created attachment 117630 [details]
full syslog output for autofs failure
After a bunch more testing, it appears that this bug is apparent with both
RHEL3 and Solaris 8 NFS servers (although it may occur less often with
RHEL3). On the client side I've only managed to
reproduce it with maps using -tcp, although I'm gonna run a longer test using
UDP over the weekend. However, a fully-patched, out-of-the box, uncustomized
RHEL3U5 install exhibits the problem with both types of server.
The only change was to drop the
unmount timout in /etc/sysconfig/autofs, enable the misc map and add a misc
test -tcp fig:/scratch/test
After doing some testing myself it looks like the problems happens when the
mount request fails (which explains the "File not found" or "Command not found"
error). However it should fail so fast from the attempting mount. It looks
like it's trying to mount a directory that isn't directly exported by the nfs
server for some reason.
What do your maps look like on the test system? The map that you've posted
earlier show a /u/davidj map, but it seems like you're trying to use a
My original tests were on one of our production systems. However, on
Friday I had it failing on a much less customized setup. Below I'll attach:
- the new (simplified) test script
/etc/auto.master is unchanged except for enabling the auto.misc map.
Created attachment 117769 [details]
Created attachment 117771 [details]
second test script
Can you also post the /var/log/debug output and output from the 2nd test script?
Just to make sure - you did see from my way-back posting that the error
seems to happen when there's a timed-out unmount colliding with a mount?
Also note that the problem only seems to occur with -tcp mounts. I had it running
all weekend with UDP and it didn't fail.
I'll switch back to tcp and get some log output for you.
Created attachment 117819 [details]
*.* syslog output from second autofs test script
Here's some more log output, from the second test script. Both client
and server are RHEL3. The failure is right near the end of the file at
17:44:47 or thereabouts.
Looking through the logs it looks like the reason that the automount fails is
because the mount fails (because of RPC timeouts). If you try to run mount on a
server that doesn't respond or is too busy to respond autofs can't mount and
will report a failure.
I don't think there is anything that can be done differently with autofs as we
just run 'mount' and if it works we continue, and if it doesn't then we generate
I'm planning on closing this bug as NOTABUG unless you think there could be
other problems causing this.
- on our production system, our Solaris clients work fine but our RHEL3 clients
occasionally lose file systems, resulting in jobs crashing.
- on our test setup, using the udp mount option we've never seen a failure
of the sort causing problmes but
using the tcp mount option results in mounts failing on a regular basis.
- the failure only happens when the timeout-triggered unmount happens at
about the same time as the mount. I've never seen it fail to mount
except when the two are within a second or so of each other despite the fact
that the majority of the test cases are many seconds apart.
- a regular mount timeout takes many, many seconds. However, when it fails
to do the mount incorrectly, as detailed in the logs, the mount times
out in a fraction of a second.
Maybe I've spent too much time doing sysadmin and not enough time programming,
but this sounds very much like an umount/mount race condition when using NFS
over tcp, possibly autofs-triggered or possibly in the kernel or mount/umount
programs. To me it doesn't sound much like NOTABUG.
Fortunately switching back to NFS over UDP looks like it might be a short-term
workaround. I will also produce a google-indexed web page explaining our
diagnosis as others seem to be affected by this and they may disagree with the
NOTABUG conclusion too.
From the the debug log in comment 31, I see where the mount failed twice.
Once at 17:45:25 & once at 17:46:07.
Both times the mount was attempted ~20seconds before it actually failed due to
an RPC timeout. If autofs were failing the mount due to a race then I would
expect it to return immediately and would not have an error from mount (RPC:
So, it doesn't appear that the 2 debug log contains any issues. I've looked at
the first debug log you've posted and there may be a possible bug in there. I'm
not sure why autofs is trying to mount /u/davidj/src, that's not exported
directory anywhere, is it? I only see /u/davidj in the maps.
You're right - the two failures are different. Indeed, on reflection, the
symptoms are different - with the first type of failure the autofs system
appears to become a mess and the machine often becomes unusable, but in my
simplified testing it seemed to carry on fine after the failure. The first type
are often seen on our production system too - clients often become unusable and
need a reboot.
You're right - /u/davidj/src/ isn't exported directly - so the automounter
is doing something pretty weird. I've also had my account with a home of
/u/davidj2/ try and mount my home from /u/davidj/, with resultant permission
issues, although I've never been able to subsequently reproduce this. This
seems consistent with some type of corruption of the mount path with other
string info. Is it possible the unmount is called from a signal handler
and is using e.g. non-rentrant routines that are also used as part of mounts?
OK, just jumping into this now.
For the debug logs, you can configure syslog to only log daemon.*. That is the
facility to which autofs logs.
There is a kernel patch which addresses the mounting of a subdirectory instead
of the autofs "key" when ghosting is enabled. However, it doesn't appear that
ghosting is enabled in this environment. I can't verify that, since the updated
auto.master was never posted, but "stock" doesn't have --ghsot enabled. Also
note that /etc/sysconfig/autofs would be a helpful file to include in the bug
The unmount is not called from a signal handler. There are calls to ioctl,
which are unsafe, but I don't believe this is the code path you are hitting.
There are *also* calls to syslog in a signal handler, which are very bad. We've
seen these cause autofs to hang, and not perform any more mounts or unmounts. A
fix for this problem exists in the RHEL3 U7 beta.
As for a mount/umount race, I don't believe there is a problem in the actual
mount/umount commands themselves. Any race here would be an autofs problem.
Since you have a reproducer, would you be willing to test a new kernel? I'm not
certain it will make a difference, but I'd like to give it a try, anyway.
The kernel can be found here:
I've included the smp kernel. If you would like a different kernel, please let
me know. I also put the U7 beta version of autofs there for you to try.
NOTE: There is a chance that this will fix one of your problems: autofs
attempting to mount a non-exported path, that appears to be a directory within
an automounted directory.
I haven't looked into the other problem in enough detail to comment at this point.
Thanks for all of your help in diagnosing this issue.
Just a note, that kernel is part of the RHEL3U6 beta (not RHEL3U7).
Changing to NEEDINFO_REPORTER from NEEDINFO.
Were you able to try a newer kernel, and did it solve your problem?