Bug 144729

Summary: automount stops responding and fails to mount
Product: Red Hat Enterprise Linux 3
Reporter: Jeremy Rosengren <jeremy>
Component: autofs
Assignee: Jeff Moyer <jmoyer>
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
Version: 3.0
CC: jeremy, raju.singh
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard: RHEL3U7NAK
Doc Type: Bug Fix
Last Closed: 2006-08-03 12:41:06 UTC

Description Jeremy Rosengren 2005-01-10 23:35:48 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3)
Gecko/20040922

Description of problem:
An unknown event triggers the automount process for one of our
automount maps to stop trying to perform automounts.

After a certain amount of runtime, the following starts occurring on
one of our automounts:

[root@breeze8 root]# ls /tools/local/bin
ls: /tools/local/bin: No such file or directory

Here's some relevant info:

/etc/sysconfig/autofs:
LOCALOPTIONS="-DOSREL=2.4_tls"
DAEMONOPTIONS="--ghost --timeout=60"

#  UNDERSCORETODOT changes auto_home to auto.home and auto_mnt to auto.mnt
UNDERSCORETODOT=0
DISABLE_DIRECT=1
# MASTERMAP names the root autofs map in NIS (usually either auto.master
# or auto_master...)
MASTERMAP=auto_master

/etc/auto_master:
/home           yp:auto.home    rsize=32768,wsize=32768,tcp
/devl           yp:auto.devl    rsize=32768,wsize=32768,tcp
/build          yp:auto.build   rsize=32768,wsize=32768,tcp
/group          yp:auto.group   rsize=32768,wsize=32768,tcp
/mis            yp:auto.mis     rsize=32768,wsize=32768,tcp
/products       yp:auto.products        rsize=32768,wsize=32768,tcp
/hardwire       yp:auto.hardwire        rsize=32768,wsize=32768,tcp
/regres         yp:auto.regres  rsize=32768,wsize=32768,tcp
/public         yp:auto.public  rsize=32768,wsize=32768,tcp
/proj           yp:auto.proj    rsize=32768,wsize=32768,tcp
/web            yp:auto.web     rsize=32768,wsize=32768,tcp
/wrk            yp:auto.wrk     rsize=32768,wsize=32768,tcp
/usr/local      yp:auto_usr_local       rsize=32768,wsize=32768,tcp
/tools          yp:auto.tools   rsize=32768,wsize=32768,tcp
/setup          yp:auto.setup   rsize=32768,wsize=32768,tcp

YP auto.tools (not all of it, but the map we're having problems with):
local icdev3:/vol/vol7/tools/&


Version-Release number of selected component (if applicable):
autofs-4.1.3-47

How reproducible:
Always

Steps to Reproduce:
1. The steps to reproduce are unknown, since it is hard to observe what
causes the problem.  automount for the affected map just goes out to
lunch.

Additional info:

Comment 1 Jeff Moyer 2005-01-11 15:27:04 UTC
What do the logs look like when the failure occurs?

It also appears you have applied patches to autofs.  What patches have you applied?

Comment 2 Jeremy Rosengren 2005-01-11 16:09:30 UTC
The default logging shows nothing when the error occurs, which makes it hard for
us to determine exactly what is causing it.  We put "--debug" in the
DAEMONOPTIONS to try to get more information, but the last time it failed there
was nothing in the logs that referenced /tools at all, beyond previous
successful mount attempts.  After the problem starts occurring (ie, "No such
file or directory" when attempting to mount), nothing gets written to the
messages file for that automount.

We haven't patched autofs ourselves, we're using the autofs package from Update
4.  We did get that package a bit earlier than RHN got it, due to another bug
with replicated host mounting, but that shouldn't matter.

Comment 3 Jeremy Rosengren 2005-01-31 21:16:47 UTC
We've determined that the --ghost option in DAEMONOPTIONS has
something to do with the cause of this behavior.  We were able to
confirm this by removing --ghost and restarting:  The problem has not
resurfaced since.  We also found machines that never had the --ghost
option enabled and have never seen the problem.

We were never able to catch the problem as it was happening, so we
were never able to come up with a set of steps to replicate the problem.
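
A quick way to tell which machines have ghosting enabled is to inspect the
DAEMONOPTIONS line of /etc/sysconfig/autofs. The following is only a sketch:
the function name has_ghost is made up for illustration, and it assumes the
option is set on a single DAEMONOPTIONS= line as in the configuration quoted
in the description.

```shell
# Hypothetical helper: report whether the DAEMONOPTIONS line of an autofs
# sysconfig file contains --ghost.
# Exit status is 0 if --ghost is present, non-zero otherwise.
has_ghost() {
    grep '^[[:space:]]*DAEMONOPTIONS=' "$1" | grep -q -- '--ghost'
}

# Example:
#   has_ghost /etc/sysconfig/autofs && echo "ghosting enabled"
```

Commented-out DAEMONOPTIONS lines are ignored because the match is anchored
at the start of the line.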

Comment 4 Jeff Moyer 2005-03-31 18:05:49 UTC
Please add the --debug option to your /tools entry in auto.master.  Then,
configure syslog to send all messages to a debug log, by adding a line like the
following to your /etc/syslog.conf.

*.*    /var/log/debug

Restart syslogd and the automount (please be sure the automounter did actually
stop and start again).

When the problem next occurs, attach the logs to this bugzilla.
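
The syslog part of these steps could be scripted roughly as follows. This is
a sketch only: the helper name add_debug_rule is hypothetical, and the restart
commands assume the stock init scripts named syslog and autofs. They are left
as comments because they need root and should be run by hand.

```shell
# Hypothetical helper: append the catch-all rule to a syslog.conf-style file
# unless a rule sending *.* to the same log file is already present.
add_debug_rule() {
    conf="$1"
    log="${2:-/var/log/debug}"
    if ! grep -q "^\*\.\*[[:space:]][[:space:]]*${log}\$" "$conf" 2>/dev/null; then
        printf '*.*\t%s\n' "$log" >> "$conf"
    fi
}

# Intended use on the affected client (as root):
#   add_debug_rule /etc/syslog.conf
#   service syslog restart
#   service autofs stop
#   service autofs start
#   service autofs status    # check that the daemon really restarted
```

Calling the helper twice leaves a single rule in the file, so it is safe to
re-run.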

Comment 5 Jeff Moyer 2005-04-11 22:11:00 UTC
There was a patch committed upstream recently which claims to fix problems such
as this.  I'll create a new kernel rpm for testing and post it when ready.

Comment 8 Raju Singh Mahala 2005-05-31 06:03:59 UTC
Hi Jeffrey,

We are facing the same problem. We are using Red Hat Enterprise Linux 3
Update 4. We also upgraded to Update 5, but the behavior is the same.
Jeremy reported that the problem went away after removing the --ghost switch
from /etc/sysconfig/autofs. We never enabled the --ghost option, and automount
runs without it on all our machines, yet we still see the problem.

Our configuration is as follows:

1) /etc/sysconfig/autofs

------  starts ---------
#
LOCALOPTIONS=""
DAEMONOPTIONS="--timeout=60"

#  UNDERSCORETODOT changes auto_home to auto.home and auto_mnt to auto.mnt
UNDERSCORETODOT=1
DISABLE_DIRECT=1

-------- END ------------

2) /etc/nsswitch.conf 

   automount:  files nis

3) /etc/auto.master

   +auto.master

4) output of % ypcat -k auto.master

------ starts -------

/design    auto.design     tcp,rsize=32768,wsize=32768
/proj      auto.proj       tcp,rsize=32768,wsize=32768
/home      auto.home       tcp,rsize=32768,wsize=32768
/data      auto.data       tcp,rsize=32768,wsize=32768
/sw        auto.sw         tcp,rsize=32768,wsize=32768

--------- END ----------

The NFS servers for these file systems are mainly NetApp, HP cluster, and
Solaris enterprise machines.

5) Error log with automount in debug mode :

-------- Starts ---------------

May 26 18:36:57 del11frd automount[20523]: attempting to mount entry /home/piyushj
May 26 18:36:57 del11frd automount[15681]: mount(nfs): no host elected
May 26 18:36:57 del11frd automount[15681]: failed to mount /home/piyushj

May 26 12:10:22 del15frd automount[2122]: attempting to mount entry /data/rd192a
May 26 12:10:22 del15frd automount[5349]: mount(nfs): no host elected
May 26 12:10:22 del15frd automount[5349]: failed to mount /data/rd192a

------  END --------
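
Because the failures are intermittent, it may help to summarize how often each
path fails. A rough sketch, assuming the log lines have the format shown in
the excerpt above; the function name count_failed_mounts is hypothetical.

```shell
# Hypothetical helper: count automount "failed to mount" lines per path in a
# syslog file, most frequent first.
# Assumes the message format shown in the log excerpt above.
count_failed_mounts() {
    grep 'automount\[[0-9]*]: failed to mount ' "$1" |
        sed 's/.*failed to mount //' |
        sort | uniq -c | sort -rn
}

# Example:
#   count_failed_mounts /var/log/messages
```

Each output line is a count followed by the path that failed to mount.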

We carried out the following tests:

1) We used a local /etc/auto.master file as shown below:

--------- starts ---------

/design    auto.design   --timeout=60     tcp
/proj      auto.proj     --timeout=60     tcp
/home      auto.home     --timeout=300    tcp
/data      auto.data     --timeout==300   tcp
/sw        auto.sw       --timeout=0      tcp

-------- ENDS -----------------

In this case the problem did not resurface for the auto.sw map, but there was
no change for any of the other maps.

2) We made the machine an NIS slave instead of an NIS client, but it made no
difference.

3) The same problem exists in RHEL 3.0 Update 5 and RHEL 4.
    However, the problem seems to occur less often in RHEL 3.0 Update 5 than
in RHEL 3.0 Update 4.

4) We are now testing with the --ghost option and will report the results.

Comment 9 Jeff Moyer 2005-05-31 13:25:21 UTC
Raju,

Your problem description is difficult to follow.  The bug you describe seems
similar to that reported in bz #150690.  Please look at the following URL:
  http://people.redhat.com/jmoyer
and update bug #150690 with all of the information requested under the "Filing
bug reports" section.

Comment 10 Raju Singh Mahala 2005-06-01 06:11:54 UTC
Jeffrey,

In bug report #150690 the problem appeared after an upgrade, and it occurs
when the reporter does cd /home/<mountdir>; that is not what happens in our
case.
The "failed to mount" problem does not occur every time. Sometimes, as Jeremy
said, "An unknown event triggers the automount process for one of our automount
maps to stop trying to perform automounts": automount fails to mount the
requested directory, and if the same request comes again some time later, it
works. So it is an intermittent problem. We were facing this problem in RHEL
3.0 Update 4, so we upgraded to Update 5, but the problem still exists. We also
installed RHEL 4 on some machines, with no change.
If you think this problem is different from bug #144729, please suggest what I
should do: either open a new bug or update bug #150690, which seems to me to be
somewhat different from our problem.

Here are some extra details:
-------------------------------------
% rpm -qa autofs
autofs-4.1.3-130
% uname -r
2.4.21-32.ELsmp

If you need some more details then let me know.

Thanks & Regards,
Raju

Comment 11 Raju Singh Mahala 2005-06-03 05:00:29 UTC
Hi Jeffrey,

This is an update to comment #8, in which I wrote that I enabled the "--ghost"
option for the automount daemon on some machines. With "--ghost" enabled, we
observed that the error in the log file is reported one level deeper, as shown
below:

------------- starts ---------------

Jun  2 12:53:23 del16frd automount[3466]: failed to mount /data/rdnetifl/LIBIO

--------- ENDS ----------

PS: here the mount point is /data and the key is rdnetifl


Apart from this, I am also noticing some "kernel:  NFS" messages in the log
file, but I am not sure whether they are linked to this problem. They are given
below:

------------ Starts ---------

Jun  2 14:21:16 del16frd kernel: NFS: Buggy server - nlink == 0!
Jun  2 14:21:16 del16frd kernel: __nfs_fhget: iget failed
Jun  2 14:21:27 del16frd kernel: NFS: Buggy server - nlink == 0!
Jun  2 14:21:27 del16frd kernel: __nfs_fhget: iget failed
Jun  2 14:21:32 del16frd kernel: NFS: Buggy server - nlink == 0!
Jun  2 14:21:32 del16frd kernel: __nfs_fhget: iget failed

---------------- ENDs ------------

We are continuously working to resolve this and to get more details on the
problem, and will update you if we find anything related.



Comment 12 Jeff Moyer 2005-06-03 13:36:40 UTC
There is a patch for the problem you are seeing with the --ghost option.  Please
try the kernel patch available here:
  https://bugzilla.redhat.com/bugzilla//attachment.cgi?id=114991

Comment 15 Raju Singh Mahala 2005-06-13 05:31:26 UTC
Actually, we are not using "--ghost" mode, so we are not applying this patch.
We enabled the "--ghost" option on two or three machines only for testing, to
see how the error behaves.
We use automount without the --ghost option, and the error on our side is
unchanged.
Is there any update from your side?

Comment 16 Jeff Moyer 2005-06-13 11:37:54 UTC
I have been able to reproduce the problem without ghosting enabled, and the
patch corrects it.  Please try the patch.

Comment 17 Raju Singh Mahala 2005-06-14 05:03:47 UTC
Hi Jeffrey,

This seems to be a kernel patch, so please let me know whether I also have to
rebuild the kernel.
It would be more helpful for us if you could provide the patch in rpm form.

Comment 18 Raju Singh Mahala 2005-06-17 05:45:34 UTC
Hi Jeffrey,
Are there any updates on comment #17?
I am waiting for your reply regarding the patch in rpm form.

Comment 23 Raju Singh Mahala 2005-07-05 14:09:21 UTC
Hello Jeffrey,

I compiled the module with this patch and kept it under observation for a week,
but the problem is still there.
Please let us know what to do.

Thanks & Regards,
Raju Singh

Comment 25 Jeff Moyer 2005-09-19 21:40:50 UTC
I'm putting together the next round of patches for submission.  Raju, I'll post
a test kernel for you when that is complete.

Comment 28 Jeff Moyer 2005-10-21 13:27:16 UTC
Hi, Jeremy and Raju,

Sorry for the delay on this.  There is a kernel rpm available on my people page
which contains all recent autofs patches:
  http://people.redhat.com/jmoyer/.bz144729/

Please give it a try and let me know whether it solves your problems.

Thanks!

Comment 29 Jeremy Rosengren 2005-10-21 17:53:30 UTC
When we initially found that disabling the --ghost feature prevented this
problem, we disabled --ghost on all our production machines.  It may not be
possible to do this testing in the same environment where I was seeing the
issue, but I'll see what I can do.

Thanks Jeff

Comment 30 Jeff Moyer 2006-02-27 18:35:22 UTC
Still waiting for testing feedback from interested parties.