Bug 162636 - autofs-4.1.4-non-replicated-ping.patch backported to 4.1.3?
autofs-4.1.4-non-replicated-ping.patch backported to 4.1.3?
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: autofs (Show other bugs)
3.0
i386 Linux
high Severity medium
: ---
: ---
Assigned To: Jeff Moyer
Brock Organ
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2005-07-06 20:04 EDT by David Johnson
Modified: 2007-11-30 17:07 EST (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2006-05-23 13:45:17 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
auto.master for autofs bug (237 bytes, text/plain)
2005-07-19 17:23 EDT, David Johnson
no flags Details
auto.home for autofs bug (12.61 KB, text/plain)
2005-07-19 17:24 EDT, David Johnson
no flags Details
auto.net for autofs bug (167 bytes, text/plain)
2005-07-19 17:24 EDT, David Johnson
no flags Details
Test script to trigger autofs bug (239 bytes, text/plain)
2005-07-29 14:48 EDT, David Johnson
no flags Details
Example output from test script (1.29 KB, text/plain)
2005-07-29 14:51 EDT, David Johnson
no flags Details
section of syslog for failed automount (156.30 KB, text/plain)
2005-07-29 15:18 EDT, David Johnson
no flags Details
sysreport output for failing client system (627.82 KB, application/octet-stream)
2005-08-10 20:39 EDT, David Johnson
no flags Details
full syslog output for autofs failure (3.20 MB, application/octet-stream)
2005-08-10 20:40 EDT, David Johnson
no flags Details
auto.misc (605 bytes, text/plain)
2005-08-15 16:24 EDT, David Johnson
no flags Details
second test script (169 bytes, text/plain)
2005-08-15 16:26 EDT, David Johnson
no flags Details
*.* syslog output from second autofs test script (21.36 KB, application/octet-stream)
2005-08-16 20:51 EDT, David Johnson
no flags Details

  None (edit)
Description David Johnson 2005-07-06 20:04:21 EDT
Description of problem:

autofs-mounted NFS file systems are sometimes timing out quickly with a "file
not found"
error.  This appears to have been introduced between
autofs-4.1.3-47 and auotfs-4.1.3-130

Version-Release number of selected component (if applicable):

autofs-4.1.3-130

How reproducible:

Intermittent


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

This looks like the bug fixed by autofs-4.1.4-non-replicated-ping.patch.
Is there a backport to autofs-4.1.3 for RHEL3 (autofs-4.1.3-non-replicated-ping
looks like it's solving a different problem)?  Or should I use autofs-4.1.4
on RHEL3?  If so, what's the minimum kernel release of 2.4.21 that's usable
with autofs-4.1.4?
Comment 2 Chris Feist 2005-07-18 14:50:32 EDT
Can you specify which version of autofs you are running?

Can you also attach your automount maps and entries from /var/log/messages when
you encountered this problem?
Comment 3 David Johnson 2005-07-19 17:23:34 EDT
Created attachment 116947 [details]
auto.master for autofs bug
Comment 4 David Johnson 2005-07-19 17:24:13 EDT
Created attachment 116948 [details]
auto.home for autofs bug
Comment 5 David Johnson 2005-07-19 17:24:47 EDT
Created attachment 116949 [details]
auto.net for autofs bug
Comment 6 David Johnson 2005-07-19 17:30:41 EDT
As detailed in my initial report, the problem doesn't seem to exhibit itself in
autofs-4.1.3-47 but _does_ occur when using auotfs-4.1.3-130.

I've attached our automount maps.

I can't find any relevant autofs messages in /var/log/messages that correspond
to the failure to find files.  There's a slight chance I'm missing the messages
(it failed >week ago and on random machines in our compute pool) but I'm 90%
sure that nothing was reported when the "file not found" occurred.

There was a reasonable possibility of high network traffic or server load
when the failure occurred.

Comment 7 David Johnson 2005-07-19 17:31:49 EDT
Also note that the server was running Solaris 2.8.
Comment 8 David Johnson 2005-07-21 17:40:51 EDT
We've had an additional report of the problem, but this time with
autofs-4.1.3-47.  We did get an automount failure message.  The error
occurred druing a compile, with libraries on NFS-mounted partitions not
being found and then, later, reappearing (probably when network/server load
changed).

Jul 20 14:40:00 lentil CROND[13347]: (root) CMD (/usr/lib/sa/sa1 1 1) 
Jul 20 14:50:00 lentil CROND[13709]: (root) CMD (/usr/lib/sa/sa1 1 1) 
Jul 20 14:57:32 lentil automount[13966]: failed to mount /u/drspeech
Jul 20 15:00:00 lentil CROND[14071]: (root) CMD (/usr/lib/sa/sa1 1 1) 
Jul 20 15:01:00 lentil CROND[14106]: (root) CMD (run-parts /etc/cron.hourly) 
Jul 20 15:01:01 lentil ntpdate[14109]: adjust time server 192.150.186.8 offset 0
.085569 sec
Jul 20 15:01:03 lentil automount[14117]: failed to mount /u/drspeech
Jul 20 15:07:23 lentil automount[14368]: failed to mount /u/drspeech
Jul 20 15:08:23 lentil automount[14412]: failed to mount /u/drspeech
Jul 20 15:10:00 lentil CROND[14492]: (root) CMD (/usr/lib/sa/sa1 1 1) 
Comment 9 David Johnson 2005-07-21 17:47:48 EDT
Could the priority of this bug be raised to "high"?  We were assuming this
was due to the 4.1.3-47->4.1.3-130 upgrade and so we had a downgrade
as a workaround.  Now
we have no workaround.  We have 10's of people who rely on NFS for their work
and having random files disappear, if only occasionally, is obviously a serious
issue.  If autofs
turns out to be unreliable we may be forced to move back to Solaris 10 for
our compute cluster.

Thanks.
Comment 10 Chris Feist 2005-07-21 19:00:47 EDT
Can you enable debugging on autofs?  (This should help me track down exactly
where the problem is occurring.)

Add the following line to your /etc/syslog.conf:
*.*			/var/log/debug

Then add --debug to all of the maps in /etc/auto.master.

Then run 'service syslog restart'.
Then run 'service autofs restart'.

Once you reproduce the problem can you post the information into the bug report.
Comment 11 David Johnson 2005-07-25 17:01:40 EDT
We're working on this.

Note that we have ~100 production machines, each with nearly 10 maps,
so pushing out the maps and restarting the services isn't trivial, even
with our push infrastructure.

Many of the hosts have multiple services on them, some of which are critical.
A "*.* /var/log/debug" on every machine could well be a great way of making
hosts crash with a full /var (depending on exactly what they're running).

We'll try and get something out to a representative enough subset to stand a
chance of seeing the problem.  However, something less all-inclusive than
"*.*" in /etc/syslog.conf would be preferrable...

Thanks!
Comment 12 Chris Feist 2005-07-26 13:55:30 EDT
Unfortunately autofs puts data into syslog at several different priorities and
having the syslog outputing everything will also allow me to see if something
else is interacting with autofs causing those mounts to fail.

If you can find a way to consistently replicate the problem it would make it
easier for me to debug.

I'm changing this bug to NEEDINFO as I need either debug logs or a method to
replicate the problem.

Comment 13 David Johnson 2005-07-29 14:46:50 EDT
I think we have managed to reproduce this problem.  The way we did it was
to connect a test machine to our main network via a box that effectively
generated 10% packet loss.  We then had a second machine try and randomly
log into the
first machine and execute a trivial script from an autofs-mounted directory.
The script ran for a few hours but then everything locked up.  I'll attach
the following:

 - the script that periodically logged in
 - the output from that script around the time of failure
 - the debug log output around the time of failure

The maps are as attached above.

Let me know if you want the full log output as it's 90MB.
Comment 14 David Johnson 2005-07-29 14:48:52 EDT
Created attachment 117283 [details]
Test script to trigger autofs bug
Comment 15 David Johnson 2005-07-29 14:51:44 EDT
Created attachment 117284 [details]
Example output from test script
Comment 16 David Johnson 2005-07-29 15:18:25 EDT
Created attachment 117286 [details]
section of syslog for failed automount
Comment 17 David Johnson 2005-07-29 15:20:46 EDT
After some broswing of what happened in the logs, it appears things went
something like:

 00:42:20 expired
 00:44:33 attempted to mount
 00:45:15 expired
 00:46:44 attempted to mount
 00:47:51 expired
 00:47:51 attempted to mount
 00:47:51 failed to mount
Comment 18 David Johnson 2005-07-29 18:19:29 EDT
Note that autofs-4.1.3-47 fails in a similar manner.

Reducing the max. time waited from 300 to 60, with the --timeout to autofs set to
30, results in a failure within in few minutes (assuming 10% packet loss).

I'm on vacation until Thursday 4th August so won't be able to provide any
more info until then.
Comment 19 David Johnson 2005-08-05 17:15:26 EDT
Is anything happening with this?  We were hoping to include a fix in our
next kernel patch application, which is likely to happen very soon.
Comment 20 Chris Kloiber 2005-08-10 14:31:15 EDT
Engineering is requesting you compress and attach the full log from the debug
runs, eg:

   bzip2 -c /var/log/debug > /tmp/debug.log.bz2

Also please send a new sysreport from the same failing client you take the debug
log from.
Comment 21 David Johnson 2005-08-10 15:21:38 EDT
Hmmm - it's taken nearly two weeks to go from "would you like the full debug
output" to "yes we would".

I'll do the sysreport and get a compressed log and attach it below.

I'm 80% sure that this bug will hit any RHEL3 autofs setup if packets are lost.
I'm 70% sure it's independent of the type of NFS server (Solaris in this case).
I'm contemplating:
 - seting a system with an out-of-the-box RHEL3 client and out-of-the-box RHEL3
   NFS server, fully patched with latest patch set and no local customizations
   other than minor test-related changes
 - connecting it to a packet dropper and checking to see if I can crash the
   client with a 5 line test script
This will prove that there are many thousands of RHEL3 autofs clients out there
that will eventually crash when the network is loaded.

If I do this, providing all of the information necessary to reproduce the
test (including how to build a simple packet dropper with free software), will
this help get this problem solved sooner and result in a usable autofs package?
Comment 22 David Johnson 2005-08-10 20:39:42 EDT
Created attachment 117629 [details]
sysreport output for failing client system
Comment 23 David Johnson 2005-08-10 20:40:51 EDT
Created attachment 117630 [details]
full syslog output for autofs failure
Comment 24 David Johnson 2005-08-12 19:28:46 EDT
After a bunch more testing, it appears that this bug is apparent with both
RHEL3 and Solaris 8 NFS servers (although it may occur less often with
RHEL3).  On the client side I've only managed to
reproduce it with maps using -tcp, although I'm gonna run a longer test using
UDP over the weekend.  However, a fully-patched, out-of-the box, uncustomized
RHEL3U5 install exhibits the problem with both types of server.
The only change was to drop the
unmount timout in /etc/sysconfig/autofs, enable the misc map and add a misc
map line:

test            -tcp fig:/scratch/test
Comment 25 Chris Feist 2005-08-15 15:58:05 EDT
After doing some testing myself it looks like the problems happens when the
mount request fails (which explains the "File not found" or "Command not found"
error).      However it should fail so fast from the attempting mount.  It looks
like it's trying to mount a directory that isn't directly exported by the nfs
server for some reason.

What do your maps look like on the test system?  The map that you've posted
earlier show a /u/davidj map, but it seems like you're trying to use a
/u/davidj/src map.
Comment 26 David Johnson 2005-08-15 16:19:26 EDT
My original tests were on one of our production systems.  However, on
Friday I had it failing on a much less customized setup.  Below I'll attach:

 - /etc/auto.misc
 - the new (simplified) test script

/etc/auto.master is unchanged except for enabling the auto.misc map.
Comment 27 David Johnson 2005-08-15 16:24:53 EDT
Created attachment 117769 [details]
auto.misc
Comment 28 David Johnson 2005-08-15 16:26:39 EDT
Created attachment 117771 [details]
second test script
Comment 29 Chris Feist 2005-08-15 16:31:42 EDT
Can you also post the /var/log/debug output and output from the 2nd test script?
Comment 30 David Johnson 2005-08-15 16:35:03 EDT
Just to make sure - you did see from my way-back posting that the error
seems to happen when there's a timed-out unmount colliding with a mount?

Also note that the problem only seems to occur with -tcp mounts.  I had it running
all weekend with UDP and it didn't fail.

I'll switch back to tcp and get some log output for you.
Comment 31 David Johnson 2005-08-16 20:51:40 EDT
Created attachment 117819 [details]
*.* syslog output from second autofs test script

Here's some more log output, from the second test script.  Both client
and server are RHEL3.  The failure is right near the end of the file at
17:44:47 or thereabouts.
Comment 32 Chris Feist 2005-09-01 12:32:33 EDT
Looking through the logs it looks like the reason that the automount fails is
because the mount fails (because of RPC timeouts).  If you try to run mount on a
server that doesn't respond or is too busy to respond autofs can't mount and
will report a failure.

I don't think there is anything that can be done differently with autofs as we
just run 'mount' and if it works we continue, and if it doesn't then we generate
an error.

I'm planning on closing this bug as NOTABUG unless you think there could be
other problems causing this.
Comment 33 David Johnson 2005-09-01 18:07:14 EDT
Hmmmm.

 - on our production system, our Solaris clients work fine but our RHEL3 clients
   occasionally lose file systems, resulting in jobs crashing.
 - on our test setup, using the udp mount option we've never seen a failure
   of the sort causing problmes but
   using the tcp mount option results in mounts failing on a regular basis.
 - the failure only happens when the timeout-triggered unmount happens at
   about the same time as the mount.  I've never seen it fail to mount
   except when the two are within a second or so of each other despite the fact
   that the majority of the test cases are many seconds apart.
 - a regular mount timeout takes many, many seconds.  However, when it fails
   to do the mount incorrectly, as detailed in the logs, the mount times
   out in a fraction of a second.

Maybe I've spent too much time doing sysadmin and not enough time programming,
but this sounds very much like an umount/mount race condition when using NFS
over tcp, possibly autofs-triggered or possibly in the kernel or mount/umount
programs.  To me it doesn't sound much like NOTABUG.

Fortunately switching back to NFS over UDP looks like it might be a short-term
workaround.  I will also produce a google-indexed web page explaining our
diagnosis as others seem to be affected by this and they may disagree with the
NOTABUG conclusion too.
Comment 34 Chris Feist 2005-09-01 18:57:05 EDT
From the the debug log in comment 31, I see where the mount failed twice.

Once at 17:45:25 & once at 17:46:07.

Both times the mount was attempted ~20seconds before it actually failed due to
an RPC timeout.  If autofs were failing the mount due to a race then I would
expect it to return immediately and would not have an error from mount (RPC:
Timed out).

So, it doesn't appear that the 2 debug log contains any issues.  I've looked at
the first debug log you've posted and there may be a possible bug in there.  I'm
not sure why autofs is trying to mount /u/davidj/src, that's not exported
directory anywhere, is it?  I only see /u/davidj in the maps.
Comment 35 David Johnson 2005-09-01 19:52:12 EDT
You're right - the two failures are different.  Indeed, on reflection, the
symptoms are different - with the first type of failure the autofs system
appears to become a mess and the machine often becomes unusable, but in my
simplified testing it seemed to carry on fine after the failure.  The first type
of symptoms
are often seen on our production system too - clients often become unusable and
need a reboot.

You're right - /u/davidj/src/ isn't exported directly - so the automounter
is doing something pretty weird.  I've also had my account with a home of
/u/davidj2/ try and mount my home from /u/davidj/, with resultant permission
issues, although I've never been able to subsequently reproduce this.  This
seems consistent with some type of corruption of the mount path with other
string info.  Is it possible the unmount is called from a signal handler
and is using e.g. non-rentrant routines that are also used as part of mounts?
Comment 36 Jeff Moyer 2005-09-02 10:59:18 EDT
OK, just jumping into this now.

For the debug logs, you can configure syslog to only log daemon.*.  That is the
facility to which autofs logs.

There is a kernel patch which addresses the mounting of a subdirectory instead
of the autofs "key" when ghosting is enabled.  However, it doesn't appear that
ghosting is enabled in this environment.  I can't verify that, since the updated
auto.master was never posted, but "stock" doesn't have --ghsot enabled.  Also
note that /etc/sysconfig/autofs would be a helpful file to include in the bug
report.

The unmount is not called from a signal handler.  There are calls to ioctl,
which are unsafe, but I don't believe this is the code path you are hitting. 
There are *also* calls to syslog in a signal handler, which are very bad.  We've
seen these cause autofs to hang, and not perform any more mounts or unmounts.  A
fix for this problem exists in the RHEL3 U7 beta.

As for a mount/umount race, I don't believe there is a problem in the actual
mount/umount commands themselves.  Any race here would be an autofs problem.
Since you have a reproducer, would you be willing to test a new kernel?  I'm not
certain it will make a difference, but I'd like to give it a try, anyway.

The kernel can be found here:

  http://people.redhat.com/jmoyer/.bz162636/

I've included the smp kernel.  If you would like a different kernel, please let
me know.  I also put the U7 beta version of autofs there for you to try.

NOTE: There is a chance that this will fix one of your problems: autofs
attempting to mount a non-exported path, that appears to be a directory within
an automounted directory.

I haven't looked into the other problem in enough detail to comment at this point.

Thanks for all of your help in diagnosing this issue.
Comment 37 Chris Feist 2005-09-02 11:02:31 EDT
Just a note, that kernel is part of the RHEL3U6 beta (not RHEL3U7).
Comment 38 Jeff Moyer 2006-03-07 17:06:56 EST
Changing to NEEDINFO_REPORTER from NEEDINFO.
Comment 39 Jeff Moyer 2006-05-23 13:17:37 EDT
Were you able to try a newer kernel, and did it solve your problem?

Note You need to log in before you can comment on or make changes to this bug.