From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Description of problem:
After an NFS file server outage, AMD intermittently leaves stale links to non-existent mount points, which prevents AMD from correctly remounting them. On our busy web server, which uses automounted NFS heavily, about 1 in 40 server outages results in stale links. We use AMD to mount user home directories into /home using a NIS map on our busiest mail and web servers, and they all exhibit the problem.

Version-Release number of selected component (if applicable):
am-utils-6.0.9-2.4

How reproducible:
Always

Steps to Reproduce:
Run AMD on a busy server, mounting user home directories from a busy NFS server. If you regularly see "nfs: server [...] not responding, still trying" messages in the logs, the servers are busy enough.

Actual Results:
AMD attempts to flush mounts from the down NFS server. About 1 time in 40, it does not flush correctly, leaving stale links. The easiest way to spot the broken links is to run "ls -l --color=always /home" and look for the red links.

Expected Results:
AMD should not leave stale links in an automounted directory.

Additional info:
Although we've filed this bug against RHEL3, we've also seen it on Red Hat 7.3 running am-utils-6.0.7-4 and on RHEL4 running am-utils-6.0.9-10. We've chosen our RHEL3 machine because it is our busiest and the easiest on which to reproduce the problem. The stale links can be removed either by unlinking them with "rm" or by restarting AMD. Until the links are removed, AMD will not remount them correctly, which in our web server's case means that affected user home directories no longer appear on the web.

Attached is amd.conf. Also attached is a portion of /var/log/messages showing a server outage and an AMD flush, with AMD logging turned up high. Also attached is the output of a script we have watching for stale links (by tailing /var/log/messages).
It shows the links found to be stale. In this example, all the stale links point into the same filesystem, which resides on ecsxlv1 (this is typical). There is overlap between the flushed mounts and the stale links, although it isn't perfect. Note that gaia, ecsxlv1 and ecsxlv2 are all the same server -- it has several IP aliases.
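For reference, the check at the heart of our watcher script amounts to a scan for dangling symlinks. A minimal sketch (the attached script actually works by tailing /var/log/messages, and the function name here is ours, for illustration only):

```shell
#!/bin/sh
# Sketch of a dangling-symlink scan: report each symlink in a directory
# whose target no longer exists.
find_stale() {
    for link in "$1"/*; do
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            printf 'stale: %s -> %s\n' "$link" "$(readlink "$link")"
        fi
    done
}

# e.g. run periodically against the automount directory:
#   find_stale /home
```

This is equivalent to eyeballing the red entries from "ls -l --color=always /home", but scriptable.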
Created attachment 117733 [details] amd.conf
Created attachment 117735 [details] /var/log/messages showing an outage and flushes
Created attachment 117736 [details] Stale links found after the server outage
Dear Red Hat, version 6.0 of am-utils is rather old. I urge you to upgrade to version 6.1.1, which has undergone two years of development and extensive testing; it fixes many bugs and introduces additional useful features that improve performance, reliability, and security. See www.am-utils.org or contact me (ezk) for more details. Thanks.
I'm happy to report that we have been running the am-utils 6.1.1 amd without issue for more than twenty-four hours now, during which time we would previously have seen several sets of stale links. Any chance we could have an RPM of am-utils 6.1.1 to test, please?
Unfortunately, shortly after posting the last message, amd 6.1.1 failed. In this case, all the links from /home to a particular mount started returning I/O errors. As before, the failed links appeared to be associated with a recent server "outage". kill -TERM failed to stop the amd process, which needed a kill -KILL. amd restarted successfully and is working again now.
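For the record, the forced restart boiled down to TERM, wait, then escalate to KILL. A generic sketch (nothing amd-specific; the pidof lookup and init-script name in the usage line are assumptions about a typical RHEL setup):

```shell
#!/bin/sh
# Sketch: send SIGTERM, wait briefly, then SIGKILL if the process survives.
stop_hard() {
    pid=$1
    kill -TERM "$pid" 2>/dev/null
    sleep 2
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"
    fi
}

# usage (lookup and restart assumed):
#   stop_hard "$(pidof amd)" && service amd start
```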
Jon, there's a new feature in 6.1.1 which I hesitate to ask people to use unless they see problems like those you've seen. You can put forced_unmounts=yes in your amd.conf, and Amd will use umount2() with MNT_FORCE or MNT_DETACH when it faces serious remote-server problems that may otherwise cause EIO/ESTALE errors. Note that forced unmounts should not be taken lightly, and they work best with the most recent Linux 2.6 kernels (older 2.4/2.6 kernels have had bugs in their implementations of MNT_FORCE/MNT_DETACH). Please read the manual carefully about this option. There is one site I know of which uses this on nearly 300 hosts, and so far they're happy with it. I think it's worth a try on a couple of your hosts, if you can. I'd appreciate it if you let me know how well it works.
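For anyone else trying this: the option goes in the [ global ] section of amd.conf. A minimal fragment (the rest of the section is site-specific and omitted here); roughly speaking, MNT_FORCE corresponds to "umount -f" and MNT_DETACH to a lazy "umount -l":

```ini
[ global ]
# use umount2(2) with MNT_FORCE/MNT_DETACH when a remote server
# misbehaves; works best on recent 2.6 kernels (see above)
forced_unmounts = yes
```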
I'm not sure that forced_unmounts will help, I'm afraid. The mounts themselves were working; it was just that some of the entries in /home were giving I/O errors. Anyway, we gave it a go: amd started but then didn't do anything. /home appeared empty and nothing was automatically mounted. I had to kill -TERM it and restart without the forced_unmounts option. Our kernel is 2.4.21, which may explain the problems if forced unmounts aren't reliable until 2.6.

More bad news, by the way. We also found a stale link while running 6.1.1, which implies that the original problem still exists. We also had another I/O-error-type failure with 6.1.1 last night, which forced us to reboot the machine. It also turns out that our old workaround script (which unlinks any stale links in /home) didn't work on the I/O error failures, although the script did successfully spot the problem as it happened.

As a consequence we've reverted to am-utils 6.0.9, which provides a reliable service with the workaround script in place. Note that the I/O errors in 6.1.1 seem to appear under the same conditions as the stale links in 6.0.9: after our NFS servers fail to respond and AMD flushes the servers' dependencies.

Any ideas? Where would you like us to take it from here? What would cause 6.0.9 and 6.1.1 to fail in different ways but under the same conditions?
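For reference, the cleanup side of our workaround script is essentially just unlinking the dangling symlinks. A minimal sketch (our real script is driven by tailing /var/log/messages, and the helper name here is ours, for illustration only):

```shell
#!/bin/sh
# Sketch of the stale-link cleanup: remove each dangling symlink in a
# directory and report how many were unlinked.
clean_stale() {
    dir=$1
    removed=0
    for link in "$dir"/*; do
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            rm -f "$link" && removed=$((removed + 1))
        fi
    done
    echo "removed $removed stale links from $dir"
}

# e.g.:
#   clean_stale /home
# (restarting amd also clears the links, at the cost of a full remount)
```

This is enough for the stale-link failure mode, but as noted above it did not help with the 6.1.1 I/O errors.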
This bug is filed against RHEL 3, which is in its maintenance phase. During the maintenance phase, only security errata and select mission-critical bug fixes are released for enterprise products. Since this bug does not meet those criteria, it is now being closed. For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.