From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Description of problem:
After an NFS file server outage, AMD intermittently leaves stale links to non-existent mount points, which prevents AMD from correctly remounting them. On our busy web server, which uses automounted NFS heavily, about 1 in 40 server outages results in stale links. We use AMD to mount user home directories into /home using a NIS map on our busiest mail and web servers, and they all exhibit the problem.

Version-Release number of selected component (if applicable):
am-utils-6.0.9-2.4

How reproducible:
Always

Steps to Reproduce:
Run AMD on a busy server, mounting user home directories from a busy NFS server. If you regularly see "nfs: server [...] not responding, still trying" messages in the logs, the servers are busy enough.

Actual Results:
AMD attempts to flush mounts from the down NFS server. About 1 time in 40, it does not flush correctly, leaving stale links. The easiest way to spot the broken links is to run "ls -l --color=always /home" and look for the red links.

Expected Results:
AMD should not leave stale links in an automounted directory.

Additional info:
Although we've filed this bug against RHEL3, we've also seen it on Red Hat 7.3 running am-utils-6.0.7-4 and on RHEL4 running am-utils-6.0.9-10. We've chosen our RHEL3 machine because it is our busiest and the easiest on which to reproduce the problem. The stale links can be removed either by unlinking them with "rm" or by restarting AMD. Until the links are removed, AMD will not remount them correctly, which in our web server's case means that affected user home directories no longer appear on the web.

Attached is amd.conf. Also attached is a portion of /var/log/messages showing a server outage and an AMD flush, with AMD logging turned up high. Also attached is the output of a script we have watching for stale links (by tailing /var/log/messages).
It shows the links found to be stale. In this example, all the stale links point into the same filesystem, which resides on ecsxlv1 (this is typical). There is overlap between the flushed mounts and the stale links, although it isn't perfect. Note that gaia, ecsxlv1 and ecsxlv2 are all the same server -- it has several IP aliases.
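For reference, the check at the heart of our watcher script amounts to a scan for dangling symlinks. A minimal sketch (the attached script actually works by tailing /var/log/messages, and the function name here is ours, for illustration only):

```shell
#!/bin/sh
# Sketch of a dangling-symlink scan: report each symlink in a directory
# whose target no longer exists.
find_stale() {
    for link in "$1"/*; do
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            printf 'stale: %s -> %s\n' "$link" "$(readlink "$link")"
        fi
    done
}

# e.g. run periodically against the automount directory:
#   find_stale /home
```

This is equivalent to eyeballing the red entries from "ls -l --color=always /home", but scriptable.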
Created attachment 117733 [details] amd.conf
Created attachment 117735 [details] /var/log/messages showing an outage and flushes
Created attachment 117736 [details] Stale links found after the server outage
Dear Red Hat, version 6.0 of am-utils is rather old. I urge you to upgrade to version 6.1.1, which has undergone two years of development and extensive testing; it fixes many bugs and introduces additional useful features that improve performance, reliability, and security. See www.am-utils.org or contact me (ezk) for more details. Thanks.
I'm happy to report that we have been running the am-utils 6.1.1 amd without issue for more than twenty-four hours now, during which time we would previously have seen several sets of stale links. Any chance we could have an RPM of am-utils 6.1.1 to test, please?
Unfortunately, shortly after posting the last message, amd 6.1.1 failed. In this case, all the links from /home to a particular mount started returning I/O errors. As before, the failed links appeared to be associated with a recent server "outage". kill -TERM failed to stop the amd process, which needed a kill -KILL. amd restarted successfully and is working again now.
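For the record, the forced restart boiled down to TERM, wait, then escalate to KILL. A generic sketch (nothing amd-specific; the pidof lookup and init-script name in the usage line are assumptions about a typical RHEL setup):

```shell
#!/bin/sh
# Sketch: send SIGTERM, wait briefly, then SIGKILL if the process survives.
stop_hard() {
    pid=$1
    kill -TERM "$pid" 2>/dev/null
    sleep 2
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"
    fi
}

# usage (lookup and restart assumed):
#   stop_hard "$(pidof amd)" && service amd start
```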
Jon, there's a new feature in 6.1.1 which I hesitate to ask people to use unless they see problems like those you've seen. You can put forced_unmounts=yes in your amd.conf, and Amd will use umount2() with MNT_FORCE or MNT_DETACH when it faces serious remote-server problems that may otherwise cause EIO/ESTALE errors. Note that forced unmounts should not be taken lightly, and they work best with the most recent Linux 2.6 kernels (older 2.4/2.6 kernels have had bugs in their implementations of MNT_FORCE/MNT_DETACH). Please read the manual carefully about this option. There is one site I know of which uses this on nearly 300 hosts, and so far they're happy with it. I think it's worth a try on a couple of your hosts, if you can. I'd appreciate it if you let me know how well it works.
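For anyone else trying this: the option goes in the [ global ] section of amd.conf. A minimal fragment (the rest of the section is site-specific and omitted here); roughly speaking, MNT_FORCE corresponds to "umount -f" and MNT_DETACH to a lazy "umount -l":

```ini
[ global ]
# use umount2(2) with MNT_FORCE/MNT_DETACH when a remote server
# misbehaves; works best on recent 2.6 kernels (see above)
forced_unmounts = yes
```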
I'm not sure that forced_unmounts will help, I'm afraid. The mounts themselves were working; it was just that some of the entries in /home were giving I/O errors. Anyway, we gave it a go: amd started but then didn't do anything. /home appeared empty and nothing was automatically mounted. I had to kill -TERM it and restart without the forced_unmounts option. Our kernel is 2.4.21, which may explain the problems if forced unmounts aren't reliable until 2.6.

More bad news, by the way. We also found a stale link while running 6.1.1, which implies that the original problem still exists. We also had another I/O-error-type failure with 6.1.1 last night, which forced us to reboot the machine. It also turns out that our old workaround script (which unlinks any stale links in /home) didn't work on the I/O error failures, although the script did successfully spot the problem as it happened.

As a consequence we've reverted to am-utils 6.0.9, which provides a reliable service with the workaround script in place. Note that the I/O errors in 6.1.1 seem to appear under the same conditions as the stale links in 6.0.9: after our NFS servers fail to respond and AMD flushes the servers' dependencies.

Any ideas? Where would you like us to take it from here? What would cause 6.0.9 and 6.1.1 to fail in different ways but under the same conditions?
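For reference, the cleanup side of our workaround script is essentially just unlinking the dangling symlinks. A minimal sketch (our real script is driven by tailing /var/log/messages, and the helper name here is ours, for illustration only):

```shell
#!/bin/sh
# Sketch of the stale-link cleanup: remove each dangling symlink in a
# directory and report how many were unlinked.
clean_stale() {
    dir=$1
    removed=0
    for link in "$dir"/*; do
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            rm -f "$link" && removed=$((removed + 1))
        fi
    done
    echo "removed $removed stale links from $dir"
}

# e.g.:
#   clean_stale /home
# (restarting amd also clears the links, at the cost of a full remount)
```

This is enough for the stale-link failure mode, but as noted above it did not help with the 6.1.1 I/O errors.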
This bug is filed against RHEL 3, which is in its maintenance phase. During the maintenance phase, only security errata and select mission-critical bug fixes are released for enterprise products. Since this bug does not meet those criteria, it is now being closed. For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.