Bug 833535

Summary: autofs-5.0.6-19.fc17.x86_64 doesn't work with mock
Product: Fedora
Component: autofs
Version: 17
Reporter: H.J. Lu <hongjiu.lu>
Assignee: Ian Kent <ikent>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Status: CLOSED WONTFIX
Keywords: Reopened
Type: Bug
Last Closed: 2013-08-01 17:07:37 UTC
CC: andrejoh, Bert.Deknuydt, bill, dax, elliott.forney, fedoraproject, gepeng1983, habicht, holm, igeorgex, ikent, imc, info, irlapati, jehan.procaccia, jlayton, Marcin.Dulak, mcarter, mkfischer, saguryev.gnu, vendor-redhat
Attachments:
  The complete autofs log
  automount logs from when the error occurs
  new syslog output for autofs-5.0.7-2
  automount logs from /home
  Patch - fix reset pending flag on mount fail
  Patch - check nfs automount flag on root
  Patch - use simple_empty() for empty directory check
  Patch - dont clear DCACHE_NEED_AUTOMOUNT on rootless mount
  Patch - use simple_empty() for empty directory check
  strace from ls -la
  Patch - Fix sparse warning: context imbalance in autofs4_d_automount() different lock contexts for basic block
  Patch - don't do blind d_drop() in nfs_prime_dcache()

Description H.J. Lu 2012-06-19 17:15:30 UTC
I have

[hjl@gnu-6 glibc-x32]$ showmount --exports gnu-4  
Export list for gnu-4:
/export/server gnu*
/export        gnu*
[hjl@gnu-6 glibc-x32]$ 

With autofs-5.0.6-19.fc17.x86_64, I got

[hjl@gnu-32 glibc]$ ls /net/gnu-4/export
build  gnu  home  intel  linux  lost+found  redhat  server  spec  suse
[hjl@gnu-32 glibc]$ grep gnu-4 /proc/mounts
-hosts /net/gnu-4/export autofs rw,relatime,fd=13,pgrp=1452,timeout=300,minproto=5,maxproto=5,offset 0 0
gnu-4:/export/ /net/gnu-4/export nfs4 rw,nosuid,nodev,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.3.194.135,minorversion=0,local_lock=none,addr=10.3.194.54 0 0
-hosts /net/gnu-4/export/server autofs rw,relatime,fd=13,pgrp=1452,timeout=300,minproto=5,maxproto=5,offset 0 0
[hjl@gnu-32 glibc]$ 

But /net/gnu-4/export isn't unmounted cleanly.  I got

[hjl@gnu-32 mock]$ ls /net/gnu-4
export
[hjl@gnu-32 mock]$ ls /net/gnu-4/export
ls: cannot open directory /net/gnu-4/export: Too many levels of symbolic links
[hjl@gnu-32 mock]$ 

Restarting autofs fixed the problem:

[root@gnu-32 hjl]# service autofs restart
Redirecting to /bin/systemctl  restart autofs.service
[root@gnu-32 hjl]# ls /net/gnu-4/export
build  gnu  home  intel  linux  lost+found  redhat  server  spec  suse
[root@gnu-32 hjl]#

Comment 1 Ian Kent 2012-06-20 05:41:06 UTC
But this doesn't usually happen, right?

We'll need to get a debug log of it happening and see if
there's anything in that which gives a clue as to what is
happening.

You can get a debug log by setting LOGGING="debug" in the
configuration file, /etc/sysconfig/autofs, and ensuring that
all messages are being logged for facility daemon. I have
the line "daemon.*           /var/log/debug" in my syslog
configuration for this.
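
For example, something like this should do it (a minimal sketch
assuming rsyslog; the drop-in file name and log path are only
illustrations):

  # In /etc/sysconfig/autofs set:
  LOGGING="debug"
  # Send facility daemon to a file and restart both services:
  echo 'daemon.*    /var/log/debug' > /etc/rsyslog.d/autofs-debug.conf
  systemctl restart rsyslog.service autofs.service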

Comment 2 H.J. Lu 2012-06-20 16:07:29 UTC
Jun 20 09:03:38 gnu-32 automount[23401]: umount_multi: path /net/gnu-4 incl 1
Jun 20 09:03:38 gnu-32 automount[23401]: umount_multi_triggers: umount offset /net/gnu-4/export/server
Jun 20 09:03:38 gnu-32 automount[23401]: umount_autofs_offset: offset /net/gnu-4/export/server not mounted
Jun 20 09:03:38 gnu-32 automount[23401]: umount_multi_triggers: umount offset /net/gnu-4/export
Jun 20 09:03:38 gnu-32 automount[23401]: umounted offset mount /net/gnu-4/export
Jun 20 09:03:38 gnu-32 automount[23401]: failed to remove dir /net/gnu-4/export: Device or resource busy
Jun 20 09:03:38 gnu-32 automount[23401]: cache_delete_offset_list: deleting offset key /net/gnu-4/export
Jun 20 09:03:38 gnu-32 automount[23401]: cache_delete_offset_list: deleting offset key /net/gnu-4/export/server
Jun 20 09:03:38 gnu-32 automount[23401]: rm_unwanted_fn: removing directory /net/gnu-4/export
Jun 20 09:03:38 gnu-32 automount[23401]: unable to remove directory /net/gnu-4/export: Device or resource busy
Jun 20 09:03:38 gnu-32 automount[23401]: rm_unwanted_fn: removing directory /net/gnu-4
Jun 20 09:03:38 gnu-32 automount[23401]: unable to remove directory /net/gnu-4: Directory not empty
Jun 20 09:03:38 gnu-32 automount[23401]: expired /net/gnu-4
Jun 20 09:03:38 gnu-32 automount[23401]: dev_ioctl_send_ready: token = 35
Jun 20 09:03:38 gnu-32 automount[23401]: expire_cleanup: got thid 140737353955072 path /net stat 0
Jun 20 09:03:38 gnu-32 automount[23401]: expire_cleanup: sigchld: exp 140737353955072 finished, switching from 2 to 1
Jun 20 09:03:38 gnu-32 automount[23401]: st_ready: st_ready(): state = 2 path /net

Comment 3 Ian Kent 2012-06-21 02:59:59 UTC
This log relates to the symptom you've observed, but it
might not be the whole story.

Please post the log from the startup of the daemon to after
the problem occurs.

Also, some more "grep gnu-4 /proc/mounts" outputs for the
procedure you described in comment #1 might be useful.

Comment 4 H.J. Lu 2012-06-21 13:11:20 UTC
Created attachment 593444 [details]
The complete autofs log

Comment 5 H.J. Lu 2012-06-21 13:14:19 UTC
(In reply to comment #3)
> Please post the log from the startup of the daemon to after
> the problem occurs.

Done.

> Also some more "grep gnu-4 /proc/mounts" outputs for the
> proceedure you described in comment #1 might be useful.

When it happened, I got

[hjl@gnu-32 ~]$ grep gnu-4 /proc/mounts 
[hjl@gnu-32 ~]$

Comment 6 Ian Kent 2012-06-22 02:12:22 UTC
(In reply to comment #0)
> I have
> 
> [hjl@gnu-6 glibc-x32]$ showmount --exports gnu-4  
> Export list for gnu-4:
> /export/server gnu*
> /export        gnu*
> [hjl@gnu-6 glibc-x32]$ 
> 
> With autofs-5.0.6-19.fc17.x86_64, I got
> 
> [hjl@gnu-32 glibc]$ ls /net/gnu-4/export
> build  gnu  home  intel  linux  lost+found  redhat  server  spec  suse
> [hjl@gnu-32 glibc]$ grep gnu-4 /proc/mounts
> -hosts /net/gnu-4/export autofs
> rw,relatime,fd=13,pgrp=1452,timeout=300,minproto=5,maxproto=5,offset 0 0
> gnu-4:/export/ /net/gnu-4/export nfs4
> rw,nosuid,nodev,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,
> proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.3.194.135,
> minorversion=0,local_lock=none,addr=10.3.194.54 0 0
> -hosts /net/gnu-4/export/server autofs
> rw,relatime,fd=13,pgrp=1452,timeout=300,minproto=5,maxproto=5,offset 0 0
> [hjl@gnu-32 glibc]$ 
> 
> But /net/gnu-4/export isn't unmounted cleanly.  I got

According to the log the cause of that is a failure to umount
the autofs trigger at /net/gnu-4/export/server. The mount list
looks OK at this point so I'm not sure what caused umount(8) to
think it wasn't present.

> 
> [hjl@gnu-32 mock]$ ls /net/gnu-4
> export
> [hjl@gnu-32 mock]$ ls /net/gnu-4/export
> ls: cannot open directory /net/gnu-4/export: Too many levels of symbolic
> links

Anything else following the umount failure is not reliable.

I'm also not sure how to get more information about it unless
I can reproduce it, which I can't so far, sorry.

Comment 7 H.J. Lu 2012-06-22 02:40:14 UTC
It happened when I was doing

# mock -r fedora-17-i386 --rebuild glibc-2.15-48.fc17.src.rpm

on 64-bit Fedora 17/Core i7 965.

Comment 8 H.J. Lu 2012-06-22 02:42:28 UTC
(In reply to comment #6)
> According to the log the cause of that is a failure to umount
> the autofs trigger at /net/gnu-4/export/server. The mount list
> looks OK at this point so I'm not sure what caused umount(8) to
> think it wasn't present.

I saw

Jun 20 08:59:53 gnu-32 automount[23401]: umount_multi_triggers: umount offset /net/gnu-4/export/server
Jun 20 08:59:53 gnu-32 automount[23401]: umounted offset mount /net/gnu-4/export/server
Jun 20 08:59:53 gnu-32 automount[23401]: umount_subtree_mounts: unmounting dir = /net/gnu-4/export
Jun 20 08:59:53 gnu-32 automount[23401]: spawn_umount: mtab link detected, passing -n to mount 

Why did spawn_umount call "mount" instead of "umount"?

Comment 9 Ian Kent 2012-06-22 06:48:38 UTC
(In reply to comment #8)
> (In reply to comment #6)
> > According to the log the cause of that is a failure to umount
> > the autofs trigger at /net/gnu-4/export/server. The mount list
> > looks OK at this point so I'm not sure what caused umount(8) to
> > think it wasn't present.
> 
> I saw
> 
> Jun 20 08:59:53 gnu-32 automount[23401]: umount_multi_triggers: umount
> offset /net/gnu-4/export/server
> Jun 20 08:59:53 gnu-32 automount[23401]: umounted offset mount
> /net/gnu-4/export/server
> Jun 20 08:59:53 gnu-32 automount[23401]: umount_subtree_mounts: unmounting
> dir = /net/gnu-4/export
> Jun 20 08:59:53 gnu-32 automount[23401]: spawn_umount: mtab link detected,
> passing -n to mount 
> 
> Why did spawn_umount call "mount" instead of "umount"?

It didn't, that's a mistake in the message.

Comment 10 H.J. Lu 2012-06-22 16:54:18 UTC
I tried the current autofs git repo. It has a different problem.
After umounting /net/gnu-4/export/server, automount complained:

Jun 22 09:44:20 gnu-32 automount[12972]: st_expire: state 1 path /net
Jun 22 09:44:20 gnu-32 automount[12972]: expire_proc: exp_proc = 140737353955072 path /net
Jun 22 09:44:20 gnu-32 automount[12972]: expire_proc_indirect: expire /net/gnu-4/export/server
Jun 22 09:44:20 gnu-32 automount[12972]: handle_packet: type = 6
Jun 22 09:44:20 gnu-32 automount[12972]: handle_packet_expire_direct: token 54, name /net/gnu-4/export/server
Jun 22 09:44:20 gnu-32 automount[12972]: expiring path /net/gnu-4/export/server
Jun 22 09:44:20 gnu-32 automount[12972]: umount_multi: path /net/gnu-4/export/server incl 1
Jun 22 09:44:20 gnu-32 automount[12972]: umount_subtree_mounts: unmounting dir = /net/gnu-4/export/server
Jun 22 09:44:20 gnu-32 automount[12972]: spawn_umount: mtab link detected, passing -n to umount
Jun 22 09:44:20 gnu-32 automount[12972]: expired /net/gnu-4/export/server
Jun 22 09:44:20 gnu-32 automount[12972]: dev_ioctl_send_ready: token = 54
Jun 22 09:44:20 gnu-32 automount[12972]: expire_proc_indirect: expire /net/gnu-4/export
Jun 22 09:44:20 gnu-32 automount[12972]: 1 remaining in /net

and never umounted /net/gnu-4/export.  I got

# ls /net/gnu-4/export
build  gnu  home  intel  linux  lost+found  redhat  server  spec  suse
# ls /net/gnu-4/export/server
ls: cannot open directory /net/gnu-4/export/server: Too many levels of symbolic links

Comment 11 H.J. Lu 2012-06-22 17:47:42 UTC
With the same nfs-utils-1.2.4-3.fc15.x86_64

gnu-4 has kernel 2.6.43.8-1.fc15.x86_64
gnu-1 has kernel 2.6.43.7-3.fc15.x86_64

df shows

gnu-4:/export/       721094656 520767488 200323072  73% /net/gnu-4/export
gnu-4:/export/server 721094656 520767488 200323072  73% /net/gnu-4/export/server
gnu-1:/export/       236515328 180578304  43726848  81% /net/gnu-1/export
gnu-1:/export/server/
                     961478656 358574080 602888192  38% /net/gnu-1/export/server

Comment 12 H.J. Lu 2012-06-22 17:52:59 UTC
Never mind.  gnu-4 exports the same partition twice.

Comment 13 H.J. Lu 2012-06-22 20:47:36 UTC
It still happens after I fixed my nfs server.

Comment 14 H.J. Lu 2012-06-22 21:52:20 UTC
It could be a mock issue; mock may be holding /net/gnu-4/export open.

Comment 15 H.J. Lu 2012-06-22 22:53:31 UTC
Jun 22 15:46:00 gnu-32 automount[7950]: handle_packet_expire_indirect: token 17, name gnu-4
Jun 22 15:46:00 gnu-32 automount[7950]: expiring path /net/gnu-4
Jun 22 15:46:00 gnu-32 automount[7950]: umount_multi: path /net/gnu-4 incl 1
Jun 22 15:46:00 gnu-32 automount[7950]: umount_multi_triggers: umount offset /net/gnu-4/export
Jun 22 15:46:00 gnu-32 automount[7950]: umounted offset mount /net/gnu-4/export
Jun 22 15:46:00 gnu-32 automount[7950]: umount_autofs_offset: failed to remove dir /net/gnu-4/export: Device or resource busy
Jun 22 15:46:00 gnu-32 automount[7950]: cache_delete_offset_list: deleting offset key /net/gnu-4/export
Jun 22 15:46:00 gnu-32 automount[7950]: rm_unwanted_fn: removing directory /net/gnu-4/export
Jun 22 15:46:00 gnu-32 automount[7950]: rm_unwanted_fn: unable to remove directory /net/gnu-4/export: Device or resource busy
Jun 22 15:46:00 gnu-32 automount[7950]: rm_unwanted_fn: removing directory /net/gnu-4
Jun 22 15:46:00 gnu-32 automount[7950]: rm_unwanted_fn: unable to remove directory /net/gnu-4: Directory not empty
Jun 22 15:46:00 gnu-32 automount[7950]: umount_multi: path /net/gnu-4 left 0
Jun 22 15:46:00 gnu-32 automount[7950]: expired /net/gnu-4

Were 2 threads trying to remove /net/gnu-4/export at the same time?

Comment 16 H.J. Lu 2012-06-23 01:01:31 UTC
spawn_umount has

               ret = do_spawn(logopt, wait, options, prog, (const char **) argv);
                if (ret & MTAB_NOTUPDATED) {

Is this the correct way to check return value from /bin/umount?

Comment 17 Ian Kent 2012-06-23 01:44:33 UTC
(In reply to comment #16)
> spawn_umount has
> 
>                ret = do_spawn(logopt, wait, options, prog, (const char **)
> argv);
>                 if (ret & MTAB_NOTUPDATED) {
> 
> Is this the correct way to check return value from /bin/umount?

That's been like that for a long time and has worked OK.
But it may be worth checking since /etc/mtab is now a symlink
to /proc/mounts.
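
A quick way to confirm the symlink (illustrative output; depending on
the util-linux version the target may be /proc/mounts or
/proc/self/mounts):

  $ readlink /etc/mtab
  /proc/self/mounts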

Comment 18 Ian Kent 2012-06-23 01:58:22 UTC
(In reply to comment #13)
> It still happens after I fixed my nfs server.

There are a couple of patches that haven't been committed
to the git repo, one for a similar problem, but what you've
described so far doesn't look like the problem it fixes.

Remember that when autofs gets confused like this it can
remain confused over a restart because it tries to re-establish
the mount tree as it was at last shutdown when there are mounts
that cannot be umounted. That's a good thing most of the time
but can be a pain when the mount tree is in a state where it
can't be re-constructed correctly.

Have you stopped autofs and checked that all mounts related
to autofs are gone? Make sure automount is not running and,
if any autofs-related mounts remain, try to umount them
manually; if they can't be umounted you will have to reboot
the machine.
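
For example (a sketch; the mount point is the one from this report,
substitute your own):

  # systemctl stop autofs.service
  # ps -e | grep automount                # should print nothing
  # awk '$3 == "autofs"' /proc/mounts     # list leftover autofs mounts
  # umount /net/gnu-4/export              # umount leftovers, deepest first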

The point here is that, when you have problems like this
that won't go away, you really need to umount the base of
the automount tree to be sure you have a clean slate when
you start autofs.

Ian

Comment 19 H.J. Lu 2012-06-23 14:05:43 UTC
I started with these entries in /proc/mounts:

-hosts /net autofs rw,relatime,fd=13,pgrp=14770,timeout=300,minproto=5,maxproto=5,indirect 0 0
-hosts /net/gnu-4/export autofs rw,relatime,fd=13,pgrp=14770,timeout=300,minproto=5,maxproto=5,offset 0 0
gnu-4:/export/ /net/gnu-4/export nfs4 rw,nosuid,nodev,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.3.194.135,minorversion=0,local_lock=none,addr=10.3.194.54 0 0

Then /net/gnu-4/export was umounted and I got:

-hosts /net autofs rw,relatime,fd=13,pgrp=12856,timeout=300,minproto=5,maxproto=5,indirect 0 0
-hosts /net/gnu-4/export autofs rw,relatime,fd=13,pgrp=12856,timeout=300,minproto=5,maxproto=5,offset 0 0

Before automount could remove /net/gnu-4/export, /net/gnu-4/export was
used by another program.  When automount tried to remove /net/gnu-4/export,
it got:

Jun 22 15:46:00 gnu-32 automount[7950]: umount_autofs_offset: failed to remove dir /net/gnu-4/export: Device or resource busy
Jun 22 15:46:00 gnu-32 automount[7950]: cache_delete_offset_list: deleting offset key /net/gnu-4/export
Jun 22 15:46:00 gnu-32 automount[7950]: rm_unwanted_fn: removing directory /net/gnu-4/export
Jun 22 15:46:00 gnu-32 automount[7950]: rm_unwanted_fn: unable to remove directory /net/gnu-4/export: Device or resource busy

So automount removed /net/gnu-4/export from its key list while
it was still mounted.  Does this make sense?

Comment 20 H.J. Lu 2012-06-23 14:16:41 UTC
umount_autofs_offset has

        if (!rv && me->flags & MOUNT_FLAG_DIR_CREATED) {
                if  (rmdir(me->key) == -1) {
                        char *estr = strerror_r(errno, buf, MAX_ERR_BUF);
                        debug(ap->logopt, "failed to remove dir %s: %s",
                             me->key, estr);
                }
        }
        return rv;

When it failed to remove dir, shouldn't it set rv to 1?

Comment 21 H.J. Lu 2012-06-23 15:47:04 UTC
Even if /net/gnu-4/export isn't used by normal root,
it can still be used by chroot from mock.  How does
automount deal with rmdir failure?

Comment 22 Ian Kent 2012-06-24 03:48:58 UTC
(In reply to comment #19)
> 
> Before automount could remove /net/gnu-4/export, /net/gnu-4/export was
> used by another program.  When automount tried to remove /net/gnu-4/export,
> it got:
> 
> Jun 22 15:46:00 gnu-32 automount[7950]: umount_autofs_offset: failed to
> remove dir /net/gnu-4/export: Device or resource busy
> Jun 22 15:46:00 gnu-32 automount[7950]: cache_delete_offset_list: deleting
> offset key /net/gnu-4/export
> Jun 22 15:46:00 gnu-32 automount[7950]: rm_unwanted_fn: removing directory
> /net/gnu-4/export
> Jun 22 15:46:00 gnu-32 automount[7950]: rm_unwanted_fn: unable to remove
> directory /net/gnu-4/export: Device or resource busy
> 
> So automount removed /net/gnu-4/export from its key list while
> it was still mounted.  Does this make sense?

That is similar to the bug I mentioned above.

When something like this happens autofs is meant to re-construct
the tree so it continues to function but there are cases where
that fails to happen correctly and they need to be found and fixed.

I'll have a look and see what's happening here.

Comment 23 Ian Kent 2012-06-24 04:02:08 UTC
(In reply to comment #21)
> Even if /net/gnu-4/export isn't in use under the normal root,
> it can still be in use in a chroot from mock.  How does
> automount deal with rmdir failure?

The mount (tree) shouldn't be expired if it's in use, meaning
an open file or working directory is present within the tree.
Once it is selected for expire, path walks into the tree should
be blocked within the kernel until the expire is complete.

The expire check is done within the kernel, so usage within
a chroot should be seen the same as any other usage.

There is a case where the kernel check for busyness of a
tree is racy following (relatively) recent kernel changes
(not autofs changes though), and I'm trying to work out a
way to fix that. The case we're looking at here is not that
case, so that should be OK.

Comment 24 Ian Kent 2012-06-24 04:07:45 UTC
(In reply to comment #20)
> umount_autofs_offset has
> 
>         if (!rv && me->flags & MOUNT_FLAG_DIR_CREATED) {
>                 if  (rmdir(me->key) == -1) {
>                         char *estr = strerror_r(errno, buf, MAX_ERR_BUF);
>                         debug(ap->logopt, "failed to remove dir %s: %s",
>                              me->key, estr);
>                 }
>         }
>         return rv;
> 
> When it failed to remove dir, shouldn't it set rv to 1?

Maybe, I'll have a look.

Comment 25 Ian Kent 2012-06-24 04:18:56 UTC
(In reply to comment #19)
> 
> Before automount could remove /net/gnu-4/export, /net/gnu-4/export was
> used by another program.  When automount tried to remove /net/gnu-4/export,
> it got:

Right, I don't think that should happen; the process's path walk
should be blocked at /net/gnu-4 during the expire and then
trigger a re-mount of the tree.

I'll have a look at that too.

Comment 26 Ian Kent 2012-06-27 00:56:52 UTC
(In reply to comment #24)
> (In reply to comment #20)
> > umount_autofs_offset has
> > 
> >         if (!rv && me->flags & MOUNT_FLAG_DIR_CREATED) {
> >                 if  (rmdir(me->key) == -1) {
> >                         char *estr = strerror_r(errno, buf, MAX_ERR_BUF);
> >                         debug(ap->logopt, "failed to remove dir %s: %s",
> >                              me->key, estr);
> >                 }
> >         }
> >         return rv;
> > 
> > When it failed to remove dir, shouldn't it set rv to 1?
> 
> Maybe, I'll have a look.

Yeah, a failed directory removal is a problem.

The problem is not so much the return value: if the removal fails,
the offset trigger should be mounted back and then a fail returned.
Maybe the directory removal should be removed from the function
and handled in a callback. I'm looking at doing that now.

Of course, once the mount tree becomes broken like this for
some reason there's no telling what will happen when we try
to mount it back.

Ian

Comment 27 Ian Kent 2012-08-14 02:09:35 UTC
It would be worth trying the Rawhide version of autofs
since there are some recent changes that might help with
this problem.

Grab the source rpm from Rawhide or F18 and build it
against your system. You will need autofs-5.0.6-23 or
later to get what I'd like to test.
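
For example (a sketch, assuming the src.rpm has been downloaded from
Koji; the exact version-release will differ):

  $ rpmbuild --rebuild autofs-5.0.6-23.fc18.src.rpm
  # rpm -Uvh ~/rpmbuild/RPMS/$(uname -m)/autofs-5.0.6-23*.rpm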

Comment 28 H.J. Lu 2012-08-14 19:03:51 UTC
I tried autofs-5.0.7-1 and it didn't solve the problem.

Comment 29 Ian Kent 2012-08-15 02:55:28 UTC
(In reply to comment #28)
> I tried autofs-5.0.7-1 and it didn't solve the problem.

How about a debug log from 5.0.7-1, please?

Comment 30 Michael 2012-08-16 22:43:51 UTC
I am seeing a very similar issue where autofs will start giving the messages about 'Too many levels of symbolic links'.

Restarting autofs will clear up the issue.  I have seen it on both my NFS server and on various clients.

I have enabled syslog debugging as mentioned above and will post logs once it happens again.  It is sporadic - roughly every 2 weeks.

Comment 31 Ian Kent 2012-08-17 02:30:33 UTC
(In reply to comment #30)
> I am seeing a very similar issue where autofs will start giving the messages
> about 'Too many levels of symbolic links'.
> 
> Restarting autofs will clear up the issue.  I have seen it on both my NFS
> server and on various clients.
> 
> I have enabled syslog debugging as mentioned above and will post logs once
> it happens again.  It is sporadic - roughly every 2 weeks.

And /proc/mounts and kernel version.

Comment 32 Michael 2012-09-02 17:35:53 UTC
Finally happened this morning.


/proc/mounts:
rootfs / rootfs rw 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
devtmpfs /dev devtmpfs rw,seclabel,nosuid,size=2007944k,nr_inodes=501986,mode=755 0 0
devpts /dev/pts devpts rw,seclabel,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /dev/shm tmpfs rw,seclabel,nosuid,nodev 0 0
tmpfs /run tmpfs rw,seclabel,nosuid,nodev,mode=755 0 0
/dev/mapper/vg_bobafett-lv_root / ext4 rw,seclabel,relatime,data=ordered 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
selinuxfs /sys/fs/selinux selinuxfs rw,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs rw,seclabel,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=27,pgrp=1,timeout=300,minproto=5,maxproto=5,direct 0 0
tmpfs /media tmpfs rw,seclabel,nosuid,nodev,noexec,relatime,mode=755 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,seclabel,relatime 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
/dev/sda2 /boot ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/vg_bobafett-lv_vmware /VirtualBox ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/Backup-Backup /backup ext4 rw,seclabel,relatime,data=ordered 0 0
auto.direct /services autofs rw,relatime,fd=7,pgrp=929,timeout=3600,minproto=5,maxproto=5,direct 0 0
auto.direct /usr/local autofs rw,relatime,fd=7,pgrp=929,timeout=3600,minproto=5,maxproto=5,direct 0 0
auto.direct /mythtv autofs rw,relatime,fd=7,pgrp=929,timeout=3600,minproto=5,maxproto=5,direct 0 0
auto.direct /home autofs rw,relatime,fd=7,pgrp=929,timeout=3600,minproto=5,maxproto=5,direct 0 0


uname -a
Linux bobafett 3.5.2-3.fc17.x86_64 #1 SMP Tue Aug 21 19:06:52 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
rpm -qa | grep autofs
autofs-5.0.6-22.fc17.x86_64

cd /usr/local/bin
bash: cd: /usr/local/bin: Too many levels of symbolic links

ypcat -k auto.master
/-              auto.direct --timeout=3600 


ypcat -k auto.direct
/home -rw       deathstar:/export/users
/mythtv -rw     deathstar:/export/services/mythtv/
/usr/local -rw  deathstar:/export/services/local/x86_64/local
/services -rw   deathstar:/export/services

Comment 33 Michael 2012-09-02 17:36:41 UTC
Created attachment 609140 [details]
automount logs from when the error occurs

Comment 34 Michael 2012-09-02 17:38:17 UTC
I should add that this has only ever occurred for the /usr/local/bin mount point.  Never for the home directory or other mount points.  I have no idea what makes that mount point more problematic than the others.

Comment 35 H.J. Lu 2012-09-02 17:38:59 UTC
I am using autofs-5.0.7-1. The problem still happens, but much
less often.

Comment 36 Ian Kent 2012-09-03 01:01:27 UTC
(In reply to comment #34)
> I should add that this has only ever occurred for the /usr/local/bin mount
> point.  Never for the home directory or other mount points.  I have no idea
> what makes that mount point more problematic than the others.

You mean /usr/local, right?

Comment 37 Ian Kent 2012-09-03 01:49:03 UTC
(In reply to comment #33)
> Created attachment 609140 [details]
> automount logs from when the error occurs

Yes, these are logs from when you see the error, but if there is
anything to be learned from the log, it would have had to happen
before the time range shown here.

Comment 38 Ian Kent 2012-09-03 02:18:15 UTC
(In reply to comment #35)
> I am using autofs-5.0.7-1. The problem still happens, but much
> less often.

And that's even more puzzling.

It looks like one way for this to happen is when the kernel
thinks that /usr/local has submounts present within it. But
the mount table says there are no mounts under it. Given
that /proc/mounts is what the kernel thinks is mounted that
doesn't appear to be the case.

Another way for this to happen would be a race clearing the
flag that causes the automounting to occur (that is the flag
not being cleared when it should). But in Michael's setup
that flag does not need to be cleared at mount and set at
umount so there cannot be a race.

I don't think either of these cases can be caused by user
space, except possibly automount doing a lazy umount and
some sort of race between detaching the mount and a lookup
triggering a mount. What's more, that shouldn't produce a
persistent condition, which doesn't match what we see here.

Perhaps I could put something into a kernel to return
ENOTEMPTY in the case where the kernel thinks there are
submounts present to see if that is actually where this
problem happens. Would either of you be able to install
and test with such a kernel?

Comment 39 Ian Kent 2012-09-03 02:21:27 UTC
(In reply to comment #36)
> (In reply to comment #34)
> > I should add that this has only over occurred for the /usr/local/bin mount
> > point.  Never for the home directory or other mount points.  I have no idea
> > what makes that mount point more problematic than the others
> 
> You mean /usr/local, right?

Does the problem also occur with other directories under
/usr/local?

Comment 40 H.J. Lu 2012-09-03 02:24:14 UTC
(In reply to comment #38)
> (In reply to comment #35)
> > I am using autofs-5.0.7-1. The problem still happens, but much
> > less often.
> 
> Perhaps I could put something into a kernel to return
> ENOTEMPTY in the case where the kernel thinks there are
> submounts present to see if that is actually where this
> problem happens. Would either of you be able to install
> and test with such a kernel?

If you put the kernel patch here, I can give it a try.

Comment 41 Michael 2012-09-03 02:33:01 UTC
Yes, I mean /usr/local, not /usr/local/bin.

I have noticed the symlink message in two different circumstances: one when trying to run something in /usr/local/bin, and the other when trying to read man pages, which access /usr/local/man.

Assuming the kernel patches yield a kernel rpm, I would be happy to install it on one of the boxes.  Should I also try to install the 5.0.7 autofs?

Comment 42 Ian Kent 2012-09-03 04:35:25 UTC
(In reply to comment #41)
> 
> Assuming the kernel patches yield a kernel rpm, I would be happy to
> install it on one of the boxes.  Should I also try to install the 5.0.7
> autofs?

That's probably not useful since H.J. Lu has already done that
without it yielding extra information.

I'm not sure why the problem is perceived to occur less often
with the latest version so, in that sense, you may want to build
and install the Rawhide or F18 rpm. Your choice.

Comment 43 Michael 2012-09-07 17:49:31 UTC
Created attachment 610823 [details]
new syslog output for autofs-5.0.7-2

Comment 44 Michael 2012-09-07 17:51:38 UTC
Luckily, the issue happened again rather quickly, so we have full logs from startup until the issue occurred.

I am running autofs-5.0.7-2.fc17.x86_64 rebuilt from Rawhide with kernel-3.5.2-3.fc17.x86_64

Comment 45 Ian Kent 2012-09-08 03:41:39 UTC
(In reply to comment #44)
> Luckily, the issue happened again rather quickly, so we have full logs from
> startup until the issue occurred.
> 
> I am running autofs-5.0.7-2.fc17.x86_64 rebuilt from Rawhide with
> kernel-3.5.2-3.fc17.x86_64

Just to be absolutely clear.

In this case, a simple direct mount /usr/local is automounted
once and after expiring the NFS mount on it at 18:38:57 the
daemon never gets a callback again and user space processes
that should mount it get an ELOOP error. And, the direct mount
trigger itself, the autofs fs mount, is still present in
/proc/mounts.

Is that accurate?

Ian

Comment 46 Michael 2012-09-08 15:37:53 UTC
Yes, that is how I am reading it.


cat /proc/mounts  | grep /usr/local
auto.direct /usr/local autofs rw,relatime,fd=19,pgrp=964,timeout=3600,minproto=5,maxproto=5,direct 0 0


Michael

Comment 47 Michael 2012-09-12 04:59:45 UTC
Created attachment 611990 [details]
automount logs from /home

Comment 48 Michael 2012-09-12 05:04:57 UTC
Today, for the first time, I saw the error on a filesystem besides /usr/local.  

All previously supplied logs were from an NFS/NIS client mounting from the server.  

Today, however, the problem filesystem was /home.  This filesystem is mounted from /export/users and in this case, it happened on the server itself.

ypcat -k auto.direct | grep home
/home -rw    deathstar:/export/users

This system is running 3.5.3-1.fc17.x86_64 and autofs-5.0.6-22.fc17.x86_64

Looking at the logs, the symptom/issue looks the same.

Michael

Comment 49 Ian Kent 2012-09-12 06:47:36 UTC
(In reply to comment #48)
> Today, for the first time, I saw the error on a filesystem besides
> /usr/local.  
> 
> All previously supplied logs were from an NFS/NIS client mounting from the
> server.  
> 
> Today, however, the problem filesystem was /home.  This filesystem is
> mounted from /export/users and in this case, it happened on the server
> itself.
> 
> ypcat -k auto.direct | grep home
> /home -rw    deathstar:/export/users
> 
> This system is running 3.5.3-1.fc17.x86_64 and autofs-5.0.6-22.fc17.x86_64
> 
> Looking at the logs, the symptom/issue looks the same.

It has to be a race that crept into the kernel at some point.

I'm still thinking about where to put the printks and
what to print to get some information (in fact I need
to get back to it). Problem is that this could print
lots of useless info if I don't use some checks to make
it more specific.

Comment 50 procaccia 2012-10-05 20:46:56 UTC
I have the same problem:
[root@d012-05 ~]# ls /mci/inf
ls: cannot open directory /mci/inf: Too many levels of symbolic links

/mci/inf comes from an NFS server, mounted via autofs through LDAP automount maps.
Other maps work fine:
[root@d012-05 ~]# ls /mci/eph
abib_gh boudi denohe ....

[root@d012-05 ~]# automount --dumpmaps
...
Mount point: /-
 type: ldap
 map: ldap:ou=direct,ou=automount,dc=int-evry,dc=fr
...
/mci/inf | -rw,intr,soft gizey:/disk19/inf
/mci/eph | -rw,intr,soft gizey:/disk19/eph
...

The automount of /mci/inf does work on some other stations, though; on the same station it can work at some times and not at others.
A workaround is to restart autofs:

[root@d012-05 ~]# systemctl restart autofs.service
then
[root@d012-05 ~]# ls /mci/inf
abid_zue  berge_re  shoman ...

autofs debug logs for the successful automount above:

Oct  5 22:38:53 d012-05 automount[1655]: handle_packet_missing_direct: token 19, name /mci/inf, request pid 1667
Oct  5 22:38:53 d012-05 automount[1655]: attempting to mount entry /mci/inf
Oct  5 22:38:53 d012-05 automount[1655]: lookup_mount: lookup(ldap): /mci/inf -> -rw,intr,soft gizey:/disk19/inf
Oct  5 22:38:53 d012-05 automount[1655]: mount_mount: mount(nfs): root=/mci/inf name=/mci/inf what=gizey:/disk19/inf, fstype=nfs, options=rw,intr,soft
Oct  5 22:38:53 d012-05 automount[1655]: mount_mount: mount(nfs): calling mkdir_path /mci/inf
Oct  5 22:38:53 d012-05 automount[1655]: mount_mount: mount(nfs): calling mount -t nfs -s -o rw,intr,soft gizey:/disk19/inf /mci/inf
Oct  5 22:38:53 d012-05 automount[1655]: spawn_mount: mtab link detected, passing -n to mount
Oct  5 22:38:53 d012-05 automount[1655]: mount_mount: mount(nfs): mounted gizey:/disk19/inf on /mci/inf
Oct  5 22:38:53 d012-05 automount[1655]: dev_ioctl_send_ready: token = 19
Oct  5 22:38:53 d012-05 automount[1655]: mounted /mci/inf

I am running:
Fedora 17, autofs-5.0.6-22.fc17.i686, kernel-PAE-3.5.4-1.fc17.i686

Any advice is greatly appreciated, as we have dozens of Fedora 17 stations serving hundreds of users.
Thanks.

Comment 51 procaccia 2012-10-10 15:40:25 UTC
as suggested by a sysadmin, in one of our lab (12 computers), I changed the way automount is fetching ldap maps
in /etc/nsswitch I now have:
automount:  files sss
instead of files ldap
+ adapted sssd.conf as described in https://fedoraproject.org/wiki/Features/SSSDAutoFSSupport
maybe it as nothing to do with my pb, but as I am at the point to add a crontab that restarts autofs.service every 15mn (on an another 12 computers lab), I am trying different workarounds .

I don't know yet if things will get better, as the problem is sporadic, but for now (after 3 hours on the 12 stations) all autofs mounts still work fine.
I also blindly (as I don't really understand what it does) did a "mount --make-private /" on 2 stations to check whether it has any effect.

Thanks for any other clues or suggestions that could help with this sporadic "Too many levels of symbolic links".

Comment 52 Ian Kent 2012-10-11 00:19:18 UTC
(In reply to comment #51)
> As suggested by a sysadmin, in one of our labs (12 computers) I changed
> the way automount fetches the LDAP maps.
> In /etc/nsswitch.conf I now have:
> automount:  files sss
> instead of files ldap,
> and adapted sssd.conf as described in
> https://fedoraproject.org/wiki/Features/SSSDAutoFSSupport
> Maybe it has nothing to do with my problem, but as I was at the point of
> adding a crontab to restart autofs.service every 15 minutes (on another
> 12-computer lab), I am trying different workarounds.
> 
> I don't know yet if things will get better, as the problem is sporadic, but
> for now (after 3 hours on the 12 stations) all autofs mounts still work fine.
> I also blindly (as I don't really understand what it does) did a "mount
> --make-private /" on 2 stations to check whether it has any effect.

See Documentation/filesystems/sharedsubtree.txt for an explanation.

Basically, a shared subtree mount replicates mounts made within a
tree to replicas, such as those made by binding a mount to another
location.

It turns out that systemd making "/" shared causes a problem with
autofs expiration of indirect mounts due to the mount point dentry
reference count being elevated (so it appears busy) even though
the autofs mount is not usable in another mount tree.

I'm not sure if the sharedness has any relation to the problem
we are seeing here.
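
A small demonstration of that replication (a sketch; the paths and
device are hypothetical):

  # mkdir -p /mnt/copy /srv/data
  # mount --bind / /mnt/copy      # with "/" shared, the copy joins its peer group
  # mount /dev/sdb1 /srv/data     # a later mount under "/" ...
  # ls /mnt/copy/srv/data         # ... also appears inside the copy
  # mount --make-private /        # a private "/" stops this propagation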

Comment 53 Ian Kent 2012-10-11 00:31:22 UTC
Created attachment 625260 [details]
Patch - fix reset pending flag on mount fail

I can't see how this patch would fix the problem we are seeing
here, but it is a newly discovered bug and should be checked
in case it does.

If someone can build a kernel with this patch and test it that
would be very much appreciated.

Comment 54 Michael 2012-10-11 01:11:04 UTC
Hi Ian,

Unfortunately, that bug is restricted and I can't view it.  Can you attach the patch or add me to that ticket?

I haven't built my own kernel in a LONG time.  I can patch it, and if you point me in the right direction for building my own kernel I will be happy to give it a try.  Instructions that generate a custom kernel rpm are preferred.

Comment 55 procaccia 2012-10-15 09:10:39 UTC
The sporadic error appeared today on one machine

[root@d012-04 ~]# ls /mci/inf
ls: cannot open directory /mci/inf: Too many levels of symbolic links

is there anything I can check while the problem is happening ? 

here's what I have 
[root@d012-04 ~]# uname -a
Linux d012-04.int-evry.fr 3.5.4-1.fc17.i686.PAE #1 SMP Mon Sep 17 15:19:42 UTC 2012 i686 i686 i386 GNU/Linux
[root@d012-04 ~]# uptime
 11:05:00 up 4 days, 18:11,  2 users,  load average: 0.00, 0.01, 0.05

[root@d012-04 ~]# ps auwx | grep auto
root     12092  0.0  0.0   4732   508 pts/0    S+   11:05   0:00 grep --color=auto auto
root     17424  0.0  0.0  15184  2828 ?        S    Oct10   0:04 /usr/libexec/sssd/sssd_autofs --debug-to-files
root     17433  0.0  0.1  49976  5248 ?        Ssl  Oct10   0:26 /usr/sbin/automount --pid-file /run/autofs.pid

Although I read above in the bug report that the problem could arise because the mount point wasn't correctly umounted, I ran these 2 commands:

[root@d012-04 ~]# df | grep /mci/inf
[root@d012-04 ~]# lsof /mci/inf

But they returned nothing.

Let me know if there are more relevant things I could check while the problem is here.

Thanks .

Comment 56 Ian Kent 2012-10-15 10:22:44 UTC
(In reply to comment #55)
> The sporadic error appeared today on one machine
> 
> [root@d012-04 ~]# ls /mci/inf
> ls: cannot open directory /mci/inf: Too many levels of symbolic links
> 
> is there anything I can check while the problem is happening ? 
> 
> here's what I have 
> [root@d012-04 ~]# uname -a
> Linux d012-04.int-evry.fr 3.5.4-1.fc17.i686.PAE #1 SMP Mon Sep 17 15:19:42
> UTC 2012 i686 i686 i386 GNU/Linux
> [root@d012-04 ~]# uptime
>  11:05:00 up 4 days, 18:11,  2 users,  load average: 0.00, 0.01, 0.05
> 
> [root@d012-04 ~]# ps auwx | grep auto
> root     12092  0.0  0.0   4732   508 pts/0    S+   11:05   0:00 grep
> --color=auto auto
> root     17424  0.0  0.0  15184  2828 ?        S    Oct10   0:04
> /usr/libexec/sssd/sssd_autofs --debug-to-files
> root     17433  0.0  0.1  49976  5248 ?        Ssl  Oct10   0:26
> /usr/sbin/automount --pid-file /run/autofs.pid
> 
> Although I read above in the bug report that the problem could arise because
> the mount point wasn't correctly umounted, I ran these 2 commands:
> 
> [root@d012-04 ~]# df | grep /mci/inf
> [root@d012-04 ~]# lsof /mci/inf
> 
> But they returned nothing .

Well, that may be evidence that the patch in comment #53 will
fix the problem. The fact that the mount pending flag remains
set after the call to ->d_automount() means that it will just
try again, and again... but only if there's a success return
when the mount is attempted, even though it didn't mount.

But I don't yet understand how we can get a success when the
mount fails, which really must be the case. Not only that, it
would only happen to certain types of mounts, such as
multi-mounts (which are used by the internal hosts map, for
example).

Ian

Comment 57 procaccia 2012-10-15 21:30:40 UTC
Following http://fedoraproject.org/wiki/Building_a_custom_kernel I rebuilt a kernel including the patch from comment #53.

I added in kernel.spec:
Patch833535: autofs4-bug833535.patch

Then, after a real 153m59.916s of building...

[root@d012-04 SPECS]# rpm -Uvh /root/rpmbuild/RPMS/i686/kernel-PAE-3.6.1-1.fc17.i686.rpm

I cannot tell right away if autofs will fail again with that sporadic automount failure, but I will keep an eye on it.

For now, after a fresh reboot on the newly patched kernel, it apparently works fine; let's give it some time though...

[root@d012-04 ~]# uname -a
Linux d012-04.int-evry.fr 3.6.1-1.fc17.i686.PAE #1 SMP Mon Oct 15 20:34:39 CEST 2012 i686 i686 i386 GNU/Linux
[root@d012-04 ~]# ls /mci/inf
abid_zaz  ben_ghue ...

We have dozens of computer labs of 12 stations each; only the 2 labs installed with Fedora 17 showed the problem, and none of the ~100 stations in other labs equipped with Fedora 16 have. Is there something relevant that changed from Fedora 16 to Fedora 17 regarding autofs?

my fedora16 runs
kernel-PAE-3.4.11-1.fc16.i686
autofs-5.0.6-8.fc16.i686

Comment 58 Ian Kent 2012-10-16 01:45:50 UTC
(In reply to comment #57)
> 
> We have dozens of computer labs of 12 stations each; only the 2 labs
> installed with Fedora 17 showed the problem, and none of the ~100 stations
> in other labs equipped with Fedora 16 have. Is there something relevant
> that changed from Fedora 16 to Fedora 17 regarding autofs?
> 
> my fedora16 runs
> kernel-PAE-3.4.11-1.fc16.i686
> autofs-5.0.6-8.fc16.i686

From my changelog entries it looks like the main difference was
allowing for changes in mount.nfs, which now passes options
directly to the kernel. This introduced a change to the behavior
of rpc for servers that are not available at mount time, and we
saw large mount waits in that case.

This also means that autofs now always probes server availability
before attempting a mount, which is different, and the rpc code
tries to detect servers that aren't available early in this
process to avoid the lengthy delays.

One thing that you could do to partly avoid calling the changed
code is to set MOUNT_WAIT to some sensible value for your
site, say 15-30 seconds, and see if that makes a difference.
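
For example, in /etc/sysconfig/autofs (the value is site-specific,
per the suggestion above):

  MOUNT_WAIT=30

then restart the autofs service.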

Ian

Comment 59 Ian Kent 2012-10-16 03:14:47 UTC
(In reply to comment #57)
> 
> my fedora16 runs
> kernel-PAE-3.4.11-1.fc16.i686
> autofs-5.0.6-8.fc16.i686

The other thing you could do is grab the autofs source rpm
for f16 and build and install it on f17. That should tell
us if we actually should be looking to user space rather
than kernel space.

Comment 60 procaccia 2012-10-16 08:42:59 UTC
I recompiled SRPMS/autofs-5.0.6-2.fc16.src.rpm on F17 so that my F17 stations run the F16 version.
(To facilitate installing with rpm -Uvh, I renamed autofs-5.0.6-2.fc16 after the recompile on F17 to autofs-5.0.6-23_2fc16_itsp.fc17, since F17 already has an autofs-5.0.6-22!)
Unfortunately this version of autofs doesn't seem to support LDAP automount map lookups via sssd;
on a systemctl restart autofs.service I get in debug.log:

Oct 16 10:36:34 d012-07 automount[9801]: ignored unsupported autofs nsswitch source "sss"

Regarding MOUNT_WAIT:

in /etc/sysconfig/autofs
# MOUNT_WAIT - time to wait for a response from mount(8).
#              Setting this timeout can cause problems when
#              mount would otherwise wait for a server that
#              is temporarily unavailable, such as when it's
#              restarting. The default of waiting for mount(8)
#              usually results in a wait of around 3 minutes.
#
#MOUNT_WAIT=-1
#
# UMOUNT_WAIT - time to wait for a response from umount(8).
#
#UMOUNT_WAIT=12

You proposed to set it to 30s, but the default seems to be 3 minutes! Do you confirm I should test with 30s, and when the value is commented out as above, does it default to 3 minutes?

Thanks .

Comment 61 Ian Kent 2012-10-16 09:25:10 UTC
(In reply to comment #60)
> 
> You proposed to set it to 30s, but the default seems to be 3 minutes! Do you
> confirm I should test with 30s, and when the value is commented out as above,
> does it default to 3 minutes?

I said, set mount wait to something sensible (for your site); between
15 and 30 seconds is most likely best. The default setting is -1, which
says wait until mount returns, which usually means waiting for network
protocol timeouts to expire, normally about 3 minutes.

Comment 62 procaccia 2012-10-16 15:59:17 UTC
Unfortunately the kernel patch didn't work.
Today I found a machine with the kernel patch (comment 53) having the problem:

[root@d012-04 ~]# ls /mci/inf
ls: cannot open directory /mci/inf: Too many levels of symbolic links
[root@d012-04 ~]# uname -a
Linux d012-04.int-evry.fr 3.6.1-1.fc17.i686.PAE #1 SMP Mon Oct 15 20:34:39 CEST 2012 i686 i686 i386 GNU/Linux
[root@d012-04 ~]# uptime
 17:57:01 up 18:36,  2 users,  load average: 1.00, 1.01, 1.05

I will continue my tests with MOUNT_WAIT=30 on the other half of the computer lab (6 machines).

Comment 63 procaccia 2012-10-25 18:23:46 UTC
Regarding the other half of my stations, running with this tuning in /etc/sysconfig/autofs:
#MOUNT_WAIT=-1
MOUNT_WAIT=30

Today a station ran into the problem again:

[root@d012-12 ~]# ls /mci/inf
ls: cannot open directory /mci/inf: Too many levels of symbolic links
[root@d012-12 ~]# ls /mci/eph
ls: cannot open directory /mci/eph: Too many levels of symbolic links

Other autofs filesystems do work on the same station, though:

[root@d012-12 ~]# ls /mci/lor
abreu_re  bernadal  cavallo

[root@d012-12 ~]# ps auwx | grep autofs
root       663  0.0  0.0  24524  1888 ?        S    Oct23   0:01 /usr/libexec/sssd/sssd_autofs --debug-to-files
root       790  0.0  0.0  47952  2532 ?        Ssl  Oct23   0:11 /usr/sbin/automount --pid-file /run/autofs.pid

So neither "MOUNT_WAIT=30" nor the home-built kernel-PAE-3.6.1-1.fc17.i686.rpm patched with the patch from comment 53 solves the problem.

Do you have another idea?

What about /etc/sysconfig/autofs:

# UMOUNT_WAIT - time to wait for a response from umount(8).
#
#UMOUNT_WAIT=12

Could it be tuned to something else, longer or shorter?

Thanks.

Comment 64 Ian Kent 2012-10-26 09:34:52 UTC
(In reply to comment #63)
> 
> Do you have another idea?

Tell me again, what happened with the machines where you set "/"
private?

I know there's something weird going on there because of
a bizarre expire bug, but I don't know why or how it happens.

In that case the reference count of the mountpoint dentry
became elevated so it would never expire, like there was
some hidden kernel user, but the actual umount worked fine
and the reference count returned to normal afterward. Setting
"/" as private made the problem go away completely.

The expire fix simply removed the troublesome check since it
was an optimization, and a later check using the kernel mount
struct always returned accurate information.

Being sure about "/" being marked private and whether the
problem persists is important.

I don't think it should make a difference but, when you
make "/" private, stop autofs, check that there are no
autofs-related mounts, and then start it up.

> 
> What about /etc/sysconfig/autofs:
> 
> # UMOUNT_WAIT - time to wait for a response from umount(8).
> #
> #UMOUNT_WAIT=12
> 
> Could it be tuned to something else, longer or shorter?

I doubt it's related.

Comment 65 procaccia 2012-10-26 10:15:33 UTC
I didn't generalize the "mount --make-private /"; I just did it once on 2 stations, so I cannot tell for now, as it didn't last.
I'll be glad to generalize it to all the lab stations, but I cannot find how to tell the system at boot time to mount / as private.
Is there an /etc/fstab option to do that?
For now I have:
/dev/mapper/vg-db_vol   /                       ext4    defaults        1 1

I can run it manually, but as the lab stations reboot regularly I need a way to automate it.
How can I check that / is mounted 'private'?

Thanks.

Comment 66 habicht 2012-10-26 10:32:03 UTC
We have the same problems here:

~ (habicht@pxe-122) 101 $ ls /opt/envmodules/
ls: cannot access /opt/envmodules/: Too many levels of symbolic links
~ (habicht@pxe-122) 102 $ ls /opt/eod/
Exceed_onDemand_Client_7  Exceed_onDemand_Client_8
~ (habicht@pxe-122) 103 $ 

All dirs in /opt are mounted via autofs.

After boot everything is OK. It started around 10 minutes after
using the dirs for the first time (but this is only true for some dirs).

We are using F17 and autofs-5.0.6-22.fc17.x86_64.

Comment 67 Ian Kent 2012-10-26 10:33:43 UTC
(In reply to comment #65)
> I didn't generalize the "mount --make-private /"; I just did it once on 2
> stations, so I cannot tell for now, as it didn't last.
> I'll be glad to generalize it to all the lab stations, but I cannot find how
> to tell the system at boot time to mount / as private.
> Is there an /etc/fstab option to do that?

I don't think that would help even if there were one, since systemd
does this after / is mounted, I believe.

> for now I have
> /dev/mapper/vg-db_vol   /                       ext4    defaults        1 1
> 
> I can run it manually, but as the lab stations reboot regularly I need a way
> to automate it.

Good question. I don't know, since this is controlled by systemd
and I don't know of an rc.local equivalent.

> How can I check that / is mounted 'private'?

Look in /proc/self/mountinfo
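
For example (illustrative output; a "shared:N" tag in the optional
fields marks shared propagation, and its absence means the mount is
private):

  # awk '$5 == "/"' /proc/self/mountinfo
  21 1 253:0 / / rw,relatime shared:1 - ext4 /dev/mapper/vg-root rw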

Ian

Comment 68 Ian Kent 2012-10-26 10:40:37 UTC
(In reply to comment #67)
> (In reply to comment #65)
> > I didn't generalize the "mount --make-private /"; I just did it once on 2
> > stations, so I cannot tell for now, as it didn't last.
> > I'll be glad to generalize it to all the lab stations, but I cannot find how
> > to tell the system at boot time to mount / as private.
> > Is there an /etc/fstab option to do that?
> 
> I don't think that would help even if there were one, since systemd
> does this after / is mounted, I believe.
> 
> > for now I have
> > /dev/mapper/vg-db_vol   /                       ext4    defaults        1 1
> > 
> > I can run it manually, but as the lab stations reboot regularly I need a way
> > to automate it.
> 
> Good question. I don't know, since this is controlled by systemd
> and I don't know of an rc.local equivalent.

Actually, this is all a bit too hard.
I should be able to change autofs to set its mounts to private
so we don't need to worry about systemd.

I'll get onto that.

Comment 69 Ian Kent 2012-10-26 11:19:55 UTC
How about giving this build a try:

http://people.redhat.com/~ikent/autofs-5.0.6-22.fc17.bz833535.1

It should mount the autofs mounts as private.

Comment 70 procaccia 2012-10-26 14:25:20 UTC
OK, I was on

[root@d012-10 ~]# rpm -qa | grep autofs
libsss_autofs-1.8.4-14.fc17.i686
autofs-5.0.6-22.fc17.i686

I updated to your package:
[root@d012-10 ~]# wget http://people.redhat.com/~ikent/autofs-5.0.6-22.fc17.bz833535.1/autofs-5.0.6-22.fc17.bz833535.1.i686.rpm
[root@d012-10 ~]# rpm -Uvh autofs-5.0.6-22.fc17.bz833535.1.i686.rpm
Preparing...                ########################################### [100%]
   1:autofs                 ########################################### [100%]
[root@d012-10 ~]# rpm -qa | grep autofs
autofs-5.0.6-22.fc17.bz833535.1.i686


I restarted my stations and will let them run for a while to see if it gets better now.
Is there a way to "see" that an autofs mount is mounted private?
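(One way to check, as a sketch: the autofs lines in /proc/self/mountinfo
should no longer carry a "shared:N" tag.)

  # grep autofs /proc/self/mountinfo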

Thanks for the package.

Comment 71 info@kobaltwit.be 2012-10-27 09:59:20 UTC
My recent experiences seem to suggest at least part of the problem starts in kernel land.

Fedora 16 recently rebased to kernel 3.6.2. The very first time I booted F16 into the new kernel I got mount errors. When I boot into the previous 3.4 kernel I don't get this. I have waited a couple of days to report this just to be sure. But it is consistent: each time I boot into 3.6.2, I see automount issues, which I never see when booting in 3.4. The autofs version hasn't changed.

At the same time, I'm running some experiments on two F17 machines. On both machines I see the mount issues with kernel 3.6.2, but on one machine I also still have kernel 3.5.5. With that kernel I haven't been able to reproduce any mount issue so far. The other machine still has a 3.5.6 kernel. While playing with this kernel I have had one mount issue over the course of several days, but I can't remember for sure if this was when booted into 3.6.2 or 3.5.6. I'll keep monitoring this system for further information.

Maybe this information can help to narrow the search range for potential issues.

Comment 72 info@kobaltwit.be 2012-10-27 10:33:38 UTC
More info on the machine running 3.5.6. I can reproduce the mount issues easily on this version. I got confused before, because I was working with a simplified automount map, which seemed not to hit the issue.

The mount map that works without issues:
vialila  -rw,soft,intr  files.kobaltwit.lan:/home/vialila

The mount map that fails:
vialila  -rw,soft,intr  / files.kobaltwit.lan:/home/vialila \
                        Pictures files.kobaltwit.lan:/home/common/pictures

Does this say something regarding the suggestion to mount the root as private?

Comment 73 info@kobaltwit.be 2012-10-27 10:54:55 UTC
Perhaps also worth mentioning that while the symptoms on F16 and F17 are exactly the same with kernel 3.6.2, the actual error message on the console is different.

F16:
"Device or resource is busy"

F17:
"Too many levels of symbolic links"

Comment 74 Michael Fischer 2012-10-29 16:47:28 UTC
Adding an "Us Too!" comment.

F17
autofs-5.0.6-22.fc17.bz833535.1.x86_64
libsss_autofs-1.8.5-2.fc17.x86_64
kernel-3.6.2-4.fc17.x86_64

nsswitch.conf -> automount: files sss

We are also getting the "Too many levels of symbolic links" error.  It is inconsistent, with events happening sometimes once a day, sometimes every six or seven minutes.

Comment 75 Elliott Forney 2012-10-29 18:27:09 UTC
We are also seeing this in both F16 and F17 now.  We only see the problem with the 3.6.2 kernel though and rolling back to the previous kernel seems to have stopped the problems.  I wonder if there are multiple issues going on here?

Comment 76 Ian Collier 2012-10-29 23:46:15 UTC
The autofs-5.0.6-22.fc17.bz833535.1.x86_64 build does not seem to have made any difference to the problem here.

Comment 77 habicht 2012-10-30 12:37:02 UTC
The problem also disappeared when going back to the initial kernel of F17,
but with the rest kept updated:

kernel-3.3.4-5.fc17.x86_64
autofs-5.0.6-22.fc17.x86_64

Comment 78 procaccia 2012-10-31 09:16:30 UTC
I must report that the package autofs-5.0.6-22.fc17.bz833535.1.i686, which mounts the autofs mounts as private (cf. comment 69), apparently fails:

[root@d012-04 ~]# uname -a
Linux d012-04.int-evry.fr 3.6.1-1.fc17.i686.PAE #1 SMP Mon Oct 15 20:34:39 CEST 2012 i686 i686 i386 GNU/Linux
[root@d012-04 ~]# rpm -q autofs
autofs-5.0.6-22.fc17.bz833535.1.i686
[root@d012-04 ~]# ls /mci/inf
ls: cannot open directory /mci/inf: Too many levels of symbolic links
[root@d012-04 ~]# date
Wed Oct 31 09:53:04 CET 2012

Now I will try downgrading kernels to 3.3.4-5.fc17.i686.PAE as suggested above.
For the record (so that I remember it myself): stations d012-06, 07 and 08 were downgraded to kernel 3.3.4-5.
I am now waiting for potential "Too many levels of symbolic links" errors on those stations.
To be continued.

Comment 79 Ian Kent 2012-10-31 09:47:19 UTC
Thanks for doing the testing, it is very helpful; it will
narrow the search but, unfortunately, the area of code to
consider is still quite large.

I've been thinking that, given the symptoms, NFS is setting
the automount-needed flag on the root dentry of mounts (I
really can't see why that only happens sometimes). I had a
look at that today but it isn't straightforward to see
what's happening.

I'll run some limited tests and, if it seems OK, I'll post
a patch for testing. That will only establish if that really
is the problem and won't be a fix. If it is the problem it
will be harder to write a patch for NFS to resolve it.

Comment 80 Ian Kent 2012-11-01 04:16:55 UTC
Created attachment 636456 [details]
Patch - check nfs automount flag on root

Can someone build a kernel with this patch to see if there's
a problem with the setting of flags on the root of the NFS
mount?

Comment 81 procaccia 2012-11-01 18:02:10 UTC
OK, I am building with the patch from comment 80 just above.
As compiling a kernel takes quite some time ...
I wonder if I still need to apply the patch from comment 53 at the same time?
By the way, I realized that the kernel I built in comment 57 might not have included the patch (c-53). Indeed, although I declared the patch in the kernel.spec, I forgot to apply it!

Now for this build I double-checked that fs/namei.c contains the changes before building.

Kernel.spec:
...
Patch22071: 3.6.2-stable-queue.patch
Patch22535: autofs4-bug833535-c80.patch
...
ApplyPatch 3.6.2-stable-queue.patch
#autofs NFS ian Kent  https://bugzilla.redhat.com/show_bug.cgi?id=833535
ApplyPatch autofs4-bug833535-c80.patch
...

Let me know if I should apply that latest patch (c-80) only, or also the one from comment 53?

Thanks .

Comment 82 procaccia 2012-11-01 20:39:55 UTC
I applied the patch (c-80) and deployed the corresponding kernel to 4 stations (d012-01 ... d012-04, for the record ...)

[root@d012-02 ~]# uname -a
Linux d012-02.int-evry.fr 3.6.1-1.fc17.i686.PAE #1 SMP Thu Nov 1 18:15:14 CET 2012 i686 i686 i386 GNU/Linux

Let's give them some time running to see if it gets better.

Comment 83 Ian Kent 2012-11-02 00:34:50 UTC
(In reply to comment #81)
> OK, I am building with the patch from comment 80 just above.
> As compiling a kernel takes quite some time ...
> I wonder if I still need to apply the patch from comment 53 at the same time?
> By the way, I realized that the kernel I built in comment 57 might not have
> included the patch (c-53). Indeed, although I declared the patch in the
> kernel.spec, I forgot to apply it!
> 
> Now for this build I double-checked that fs/namei.c contains the changes
> before building.
> 
> Kernel.spec:
> ...
> Patch22071: 3.6.2-stable-queue.patch
> Patch22535: autofs4-bug833535-c80.patch
> ...
> ApplyPatch 3.6.2-stable-queue.patch
> #autofs NFS ian Kent  https://bugzilla.redhat.com/show_bug.cgi?id=833535
> ApplyPatch autofs4-bug833535-c80.patch
> ...
> 
> Let me know if I should apply that latest patch (c-80) only, or also the
> one from comment 53?

It's probably better not to also add the comment 53 patch since,
if there is a difference, we won't know which patch was responsible.

The comment 80 patch is just to find out if the cause is even what
I think it is (it might not be).

Also, be aware that it will probably break NFSv4 referrals
if you are using them.

Comment 84 procaccia 2012-11-05 14:43:22 UTC
I am afraid that the patch from comment 80 fails; today one of my 4 patched machines has the "Too many levels of symbolic links" error:

[root@d012-01 ~]# ls /mci/inf
ls: cannot open directory /mci/inf: Too many levels of symbolic links
[root@d012-01 ~]# uname -a 
Linux d012-01.int-evry.fr 3.6.1-1.fc17.i686.PAE #1 SMP Thu Nov 1 18:15:14 CET 2012 i686 i686 i386 GNU/Linux

That kernel had only patch "c-80". Would it be useful to create a kernel with the patch from comment 53 only, or do you have better plans?

Thanks .

Comment 85 Ian Kent 2012-11-06 02:26:09 UTC
(In reply to comment #84)
> I am afraid that the patch from comment 80 fails; today one of my 4 patched
> machines has the "Too many levels of symbolic links" error:
> 
> [root@d012-01 ~]# ls /mci/inf
> ls: cannot open directory /mci/inf: Too many levels of symbolic links
> [root@d012-01 ~]# uname -a 
> Linux d012-01.int-evry.fr 3.6.1-1.fc17.i686.PAE #1 SMP Thu Nov 1 18:15:14
> CET 2012 i686 i686 i386 GNU/Linux
> 
> That kernel had only patch "c-80". Would it be useful to create a kernel
> with the patch from comment 53 only, or do you have better plans?

It's possible this error isn't actually coming from the automount
code at all.

Comment 86 marcindulak 2012-11-06 14:03:39 UTC
I'm joining the thread too.
I see that a manual mount works:
# mount -o rsize=8192,wsize=8192,ro,tcp,vers=3 server:/home/test /test1
while the corresponding automount has problems with "Too many levels of symbolic links".
This happens with up-to-date Fedora 17 clients and different RHEL or proprietary IBM servers.

Comment 87 JM 2012-11-06 17:01:53 UTC
Same problem here with 

kernel-3.6.5-1.fc17.x86_64
autofs-5.0.6-22.fc17.x86_64

To me it looks like the problem starts with kernel-3.6.2; before that it was okay.

Comment 88 Ian Kent 2012-11-10 01:13:07 UTC
Created attachment 641908 [details]
Patch - use simple_empty() for empty directory check

Can someone give this patch a try please.
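
The diff itself is attached rather than inlined; inferred from the
title only (so an assumption, not the actual change), the shape of
the fix would be:

	/* Before: an open-coded test such as
	 *     list_empty(&dentry->d_subdirs)
	 * which can be fooled by stale negative dentries.
	 * After: let simple_empty() decide; it takes the dentry
	 * locks itself and ignores negative entries. */
	if (simple_empty(dentry))
		do_mount_decision();	/* hypothetical helper name */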

Comment 89 procaccia 2012-11-10 16:53:53 UTC
OK, I've rebuilt the kernel with the patch from comment 88 (only, not the other patches from comments 80 & 53).
I now have 4 machines running that newly patched kernel:

[root@d012-04 ~]# uptime
 17:32:05 up 1 min,  2 users,  load average: 1.52, 0.53, 0.19
[root@d012-04 ~]# uname -a
Linux d012-04.int-evry.fr 3.6.6-1.fc17.i686.PAE #1 SMP Sat Nov 10 13:30:32 CET 2012 i686 i686 i386 GNU/Linux
[root@d012-04 ~]# date
sam. nov. 10 17:32:32 CET 2012
[root@d012-04 ~]# ls /mci/inf
 abid_zue  berge_re  shoman ...
For now that works fine.

For the record, stations d012-01, 02, 03, and 04 are running the patched kernel-PAE-3.6.6-1.fc17.i686.rpm.

Thanks for the patch;
let's wait a while and see ...

Comment 90 Elliott Forney 2012-11-11 08:22:58 UTC
Could bug 851131 be related to some of these problems?

Comment 91 Ian Kent 2012-11-12 01:15:30 UTC
(In reply to comment #90)
> Could bug 851131 be related to some of these problems?

Probably not, since making / private has been tested and
a patched autofs that makes its mounts private has also
been tested.

Comment 92 procaccia 2012-11-15 09:12:45 UTC
Bad news for the latest patch (c-88):
the patch I applied in comment 89 seems to fail :-(
at least on one station.

[root@d012-12 ~]# uptime
 10:08:36 up 35 min,  3 users,  load average: 0.41, 0.15, 0.11
[root@d012-12 ~]# date
Thu Nov 15 10:08:38 CET 2012
[root@d012-12 ~]# uname -a 
Linux d012-12.int-evry.fr 3.6.6-1.fc17.i686.PAE #1 SMP Sat Nov 10 13:30:32 CET 2012 i686 i686 i386 GNU/Linux
[root@d012-12 ~]# su - testtsp
su: warning: cannot change directory to /mci/ei1215/testtsp: Too many levels of symbolic links

I still have other stations running this patched kernel.

Comment 93 Ian Kent 2012-11-15 09:25:53 UTC
(In reply to comment #92)
> Bad news for the latest patch (c-88):
> the patch I applied in comment 89 seems to fail :-(
> at least on one station.
> 
> [root@d012-12 ~]# uptime
>  10:08:36 up 35 min,  3 users,  load average: 0.41, 0.15, 0.11
> [root@d012-12 ~]# date
> Thu Nov 15 10:08:38 CET 2012
> [root@d012-12 ~]# uname -a 
> Linux d012-12.int-evry.fr 3.6.6-1.fc17.i686.PAE #1 SMP Sat Nov 10 13:30:32
> CET 2012 i686 i686 i386 GNU/Linux
> [root@d012-12 ~]# su - testtsp
> su: warning: cannot change directory to /mci/ei1215/testtsp: Too many levels
> of symbolic links

What is the map entry corresponding to this path and the
corresponding server export?

Comment 94 Ian Kent 2012-11-19 11:23:15 UTC
Created attachment 647703 [details]
Patch - dont clear DCACHE_NEED_AUTOMOUNT on rootless mount

Comment 95 Ian Kent 2012-11-19 11:24:03 UTC
Created attachment 647705 [details]
Patch - use simple_empty() for empty directory check

Comment 96 Ian Kent 2012-11-19 11:28:55 UTC
I still can't reproduce this problem so all I can do is try
and find a problem with the code.

The patch in comment #94 is a result of this, while the patch
in comment #95 is the result of work on another bug.

The patch of comment #53 is already included in kernel-3.6.6-1.

Can someone test a kernel with these patches please.
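
If you follow the kernel.spec approach from comment 81, the two
attachments are declared and applied the same way; the patch file
names below are placeholders, not the attachment names:

Patch22536: autofs4-bug833535-c94.patch
Patch22537: autofs4-bug833535-c95.patch
...
ApplyPatch autofs4-bug833535-c94.patch
ApplyPatch autofs4-bug833535-c95.patch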

Ian

Comment 97 procaccia 2012-11-21 17:01:42 UTC
I've compiled a new kernel including the patches from c-94 and c-95.
Unfortunately, after only a few minutes of running, the problem arises:

[root@d012-05 ~]# uptime
 17:58:40 up 26 min,  3 users,  load average: 0.00, 0.04, 0.13
[root@d012-05 ~]# uname -a
Linux d012-05.int-evry.fr 3.6.6-1.fc17.i686.PAE #1 SMP Wed Nov 21 12:02:50 CET 2012 i686 i686 i386 GNU/Linux
[root@d012-05 ~]# ls /mci/inf
ls: cannot open directory /mci/inf: Too many levels of symbolic links

Comment 98 Ian Kent 2012-11-22 00:48:23 UTC
(In reply to comment #97)
> I've compiled a new kernel including the patches from c-94 and c-95.
> Unfortunately, after only a few minutes of running, the problem arises:
> 
> [root@d012-05 ~]# uptime
>  17:58:40 up 26 min,  3 users,  load average: 0.00, 0.04, 0.13
> [root@d012-05 ~]# uname -a
> Linux d012-05.int-evry.fr 3.6.6-1.fc17.i686.PAE #1 SMP Wed Nov 21 12:02:50
> CET 2012 i686 i686 i386 GNU/Linux
> [root@d012-05 ~]# ls /mci/inf
> ls: cannot open directory /mci/inf: Too many levels of symbolic links

There are two things that we haven't done.

One is to try and work out why I can't reproduce the problem.
There must be something different between my environment and the ones
where this occurs. So what are those environments: server export
parameters, server OS and version, and anything else you can think
of?

Second, an strace of the ls command might help, so I can identify
exactly which system call returns the ELOOP.
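
For anyone capturing that, something along these lines is enough (the
path is just the example from earlier in this thread):

# on an affected client, against a currently broken automount point
strace -f -o /tmp/ls-eloop.strace ls -la /mci/inf
grep ELOOP /tmp/ls-eloop.strace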

Comment 99 JM 2012-11-22 01:18:14 UTC
Created attachment 649476 [details]
strace from ls -la

This is a strace of the ls command (Too many levels of symbolic links)

Comment 100 JM 2012-11-22 01:24:10 UTC
Server OS:
Scientific Linux release 6.3 (Carbon), Kernel: 2.6.32-279.14.1.el6.x86_64

Client OS:
Fedora 17, Kernel: 3.6.6-1.fc17.x86_64

Export parameters:
/export        *(rw,root_squash,no_subtree_check,async,insecure,fsid=0)
/export/home   *(rw,root_squash,no_subtree_check,async,insecure,nohide)

Mount parameters:
* -fstype=nfs,nfsvers=4,rw,nosuid,intr,noatime,acdirmin=2 nfs:/home/&

Hope this helps…

Comment 101 JM 2012-11-22 01:29:16 UTC
The mount parameters given above are not correct; here are the correct parameters:

* -fstype=nfs4,port=2049,rw,nosuid,intr,relatime,acdirmin=2 nfs:/home/&

Comment 102 Ian Kent 2012-11-22 01:36:04 UTC
Is anyone seeing anything in syslog like "VFS: ...." at all?

Comment 103 Ian Kent 2012-11-22 01:38:51 UTC
(In reply to comment #100)
> Server OS:
> Scientific Linux release 6.3 (Carbon), Kernel: 2.6.32-279.14.1.el6.x86_64
> 
> Client OS:
> Fedora 17, Kernel: 3.6.6-1.fc17.x86_64
> 
> Export parameters:
> /export        *(rw,root_squash,no_subtree_check,async,insecure,fsid=0)
> /export/home   *(rw,root_squash,no_subtree_check,async,insecure,nohide)

Do you still see this if you don't use the nfs4 global root?
i.e. remove the fsid=0 from /export and the nohide from /export/home
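
Concretely, that would reduce the comment 100 exports to something
like this (an illustration of the suggested test, not a
recommendation):

/export        *(rw,root_squash,no_subtree_check,async,insecure)
/export/home   *(rw,root_squash,no_subtree_check,async,insecure)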

> 
> Mount parameters:
> * -fstype=nfs,nfsvers=4,rw,nosuid,intr,noatime,acdirmin=2 nfs:/home/&
> 
> Hope this helps…

Comment 104 JM 2012-11-22 01:40:56 UTC
(In reply to comment #102)
> Is anyone seeing anything in syslog like "VFS: ...." at all?

Nov 13 16:10:31 foo kernel: [    4.178389] VFS: Disk quotas dquot_6.5.2

Comment 105 Ian Kent 2012-11-22 01:43:08 UTC
(In reply to comment #103)
> (In reply to comment #100)
> > Server OS:
> > Scientific Linux release 6.3 (Carbon), Kernel: 2.6.32-279.14.1.el6.x86_64
> > 
> > Client OS:
> > Fedora 17, Kernel: 3.6.6-1.fc17.x86_64
> > 
> > Export parameters:
> > /export        *(rw,root_squash,no_subtree_check,async,insecure,fsid=0)
> > /export/home   *(rw,root_squash,no_subtree_check,async,insecure,nohide)
> 
> Do you still see this if you don't use the nfs4 global root?
> i.e. remove the fsid=0 from /export and the nohide from /export/home

Oh, hang on: how does what looks like an automount map entry
below relate to /export on the server at all?

> 
> > 
> > Mount parameters:
> > * -fstype=nfs,nfsvers=4,rw,nosuid,intr,noatime,acdirmin=2 nfs:/home/&
> > 
> > Hope this helps…

Comment 106 JM 2012-11-22 02:17:26 UTC
(In reply to comment #105)
> (In reply to comment #103)
> > (In reply to comment #100)
> > > Server OS:
> > > Scientific Linux release 6.3 (Carbon), Kernel: 2.6.32-279.14.1.el6.x86_64
> > > 
> > > Client OS:
> > > Fedora 17, Kernel: 3.6.6-1.fc17.x86_64
> > > 
> > > Export parameters:
> > > /export        *(rw,root_squash,no_subtree_check,async,insecure,fsid=0)
> > > /export/home   *(rw,root_squash,no_subtree_check,async,insecure,nohide)
> > 
> > Do you still see this if you don't use the nfs4 global root?
> > i.e. remove the fsid=0 from /export and the nohide from /export/home
> 
> Oh, hang on: how does what looks like an automount map entry
> below relate to /export on the server at all?

Hmmm… I don't think I understand the question; the automount map entry comes straight from /etc/auto.home

* -fstype=nfs4,port=2049,rw,nosuid,intr,relatime,acdirmin=2	nfs:/home/&

and for /home/jm (for example) it looks then like this

jm -fstype=nfs4,port=2049,rw,nosuid,intr,relatime,acdirmin=2	nfs:/home/jm

mount shows then this:
nfs:/home/jm on /home/jm type nfs4 (rw,nosuid,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,acdirmin=2,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.140.38,local_lock=none,addr=192.168.140.78)

> > 
> > > 
> > > Mount parameters:
> > > * -fstype=nfs,nfsvers=4,rw,nosuid,intr,noatime,acdirmin=2 nfs:/home/&
> > > 
> > > Hope this helps…

A second server has this as export:
/srv/export			gss/krb5p(rw,root_squash,no_subtree_check,async,insecure,fsid=0)
/srv/export/home		gss/krb5p(rw,root_squash,no_subtree_check,async,insecure,nohide)

and this is the automount map entry for the client:
* -fstype=nfs4,port=2049,sec=krb5p,rw,nosuid,intr,relatime,acdirmin=2 nfs:/home/&

I have the same problem (Too many levels of symbolic links) on the second server as well.

Comment 107 Ian Kent 2012-11-22 02:36:27 UTC
(In reply to comment #106)
> (In reply to comment #105)
> > (In reply to comment #103)
> > > (In reply to comment #100)
> > > > Server OS:
> > > > Scientific Linux release 6.3 (Carbon), Kernel: 2.6.32-279.14.1.el6.x86_64
> > > > 
> > > > Client OS:
> > > > Fedora 17, Kernel: 3.6.6-1.fc17.x86_64
> > > > 
> > > > Export parameters:
> > > > /export        *(rw,root_squash,no_subtree_check,async,insecure,fsid=0)
> > > > /export/home   *(rw,root_squash,no_subtree_check,async,insecure,nohide)
> > > 
> > > Do you still see this if you don't use the nfs4 global root?
> > > i.e. remove the fsid=0 from /export and the nohide from /export/home
> > 
> > Oh, hang on: how does what looks like an automount map entry
> > below relate to /export on the server at all?
> 
> Hmmm… I don't think I understand the question; the automount map entry
> comes straight from /etc/auto.home
> 
> * -fstype=nfs4,port=2049,rw,nosuid,intr,relatime,acdirmin=2	nfs:/home/&
> 
> and for /home/jm (for example) it looks then like this
> 
> jm -fstype=nfs4,port=2049,rw,nosuid,intr,relatime,acdirmin=2	nfs:/home/jm
> 
> mount shows then this:
> nfs:/home/jm on /home/jm type nfs4
> (rw,nosuid,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,acdirmin=2,
> hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.140.38,
> local_lock=none,addr=192.168.140.78)
> 
> > > 
> > > > 
> > > > Mount parameters:
> > > > * -fstype=nfs,nfsvers=4,rw,nosuid,intr,noatime,acdirmin=2 nfs:/home/&
> > > > 
> > > > Hope this helps…
> 
> A second server has this as export:
> /srv/export		
> gss/krb5p(rw,root_squash,no_subtree_check,async,insecure,fsid=0)
> /srv/export/home	
> gss/krb5p(rw,root_squash,no_subtree_check,async,insecure,nohide)
> 
> and this is the automount map entry for the client:
> * -fstype=nfs4,port=2049,sec=krb5p,rw,nosuid,intr,relatime,acdirmin=2
> nfs:/home/&
> 
> I have the same problem (Too many levels of symbolic links) on the second
> server as well.

The automount map entries don't appear to mount any of the exports
from the servers you describe.

Comment 108 JM 2012-11-22 03:06:03 UTC
(In reply to comment #107)

> The automount map entries don't appear to mount any of the exports
> from the servers you describe.

It mounts the entries below /export/home, e.g. /export/home/jm, /export/home/gm, etc. It is not necessary to export every single user directory below /export/home; it is sufficient to export only the main directory /export/home.

Comment 109 Ian Kent 2012-11-22 03:43:50 UTC
(In reply to comment #108)
> (In reply to comment #107)
> 
> > The automount map entries don't appear to mount any of the exports
> > from the servers you describe.
> 
> It mounts the entries below /export/home, e.g. /export/home/jm,
> /export/home/gm, etc. It is not necessary to export every single
> user directory below /export/home; it is sufficient to export only
> the main directory /export/home.

That is an obvious assumption I've made.

How does the setup map from /home to /export/home?

There is no way for me to even begin to try and duplicate this
if I don't know how it is setup!

Comment 110 procaccia 2012-11-22 09:46:44 UTC
Regarding my autofs map (from LDAP), here it is (for /mci/inf):

# mci_inf#076, direct, automount, int-evry.fr
dn: cn=mci_inf#076,ou=direct,ou=automount,dc=int-evry,dc=fr
automountInformation: -fstype=autofs ldap:ou=direct.mci_inf#076,ou=direct,ou=a
 utomount,dc=int-evry,dc=fr
cn: mci_inf#076
objectClass: top
objectClass: automount


# /mci/inf#076, direct.mci_inf#076, direct, automount, int-evry.fr
dn: description=/mci/inf#076,ou=direct.mci_inf#076,ou=direct,ou=automount,dc=i
 nt-evry,dc=fr
automountInformation: -rw,intr,soft gizeh:/disk19/inf
cn: /mci/inf
description: /mci/inf#076
objectClass: top
objectClass: automount
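
(In flat-file terms those two entries amount to the direct map entry

/mci/inf  -rw,intr,soft  gizeh:/disk19/inf

i.e. the key /mci/inf maps straight to gizeh:/disk19/inf.)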

[root@d012-05 ~]# ls /mci/inf
ls: cannot open directory /mci/inf: Too many levels of symbolic links

Regarding the VFS string in the logs:

[root@d012-05 log]# grep VFS messages
Nov 19 13:38:18 d012-05 kernel: [    0.498067] VFS: Disk quotas dquot_6.5.2
Nov 20 09:18:09 d012-05 kernel: [    0.499029] VFS: Disk quotas dquot_6.5.2
Nov 20 09:21:13 d012-05 kernel: [    0.497768] VFS: Disk quotas dquot_6.5.2
Nov 21 10:15:43 d012-05 kernel: [    0.498080] VFS: Disk quotas dquot_6.5.2
Nov 21 17:32:14 d012-05 kernel: [    0.506066] VFS: Disk quotas dquot_6.5.2

And finally, the strace on the faulty /mci/inf, quite long ...
hope this helps?

Thanks

[root@d012-05 ~]# strace ls -la /mci/inf 
execve("/bin/ls", ["ls", "-la", "/mci/inf"], [/* 31 vars */]) = 0
brk(0)                                  = 0x85e0000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7768000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=112294, ...}) = 0
mmap2(NULL, 112294, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb774c000
close(3)                                = 0
open("/lib/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\340(SM4\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=130708, ...}) = 0
mmap2(0x4d52e000, 138376, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4d52e000
mmap2(0x4d54d000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e) = 0x4d54d000
mmap2(0x4d54f000, 3208, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4d54f000
close(3)                                = 0
open("/lib/librt.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0 \311PM4\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=42224, ...}) = 0
mmap2(0x4d50b000, 33324, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4d50b000
mmap2(0x4d512000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6) = 0x4d512000
close(3)                                = 0
open("/lib/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0@\216[N4\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=16900, ...}) = 0
mmap2(0x4e5b8000, 18096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4e5b8000
mmap2(0x4e5bc000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3) = 0x4e5bc000
close(3)                                = 0
open("/lib/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\3206\371N4\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=36124, ...}) = 0
mmap2(0x4ef92000, 33116, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4ef92000
mmap2(0x4ef99000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7) = 0x4ef99000
close(3)                                = 0
open("/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\220\0072M4\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=2011672, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb774b000
mmap2(0x4d307000, 1776316, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4d307000
mprotect(0x4d4b2000, 4096, PROT_NONE)   = 0
mmap2(0x4d4b3000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ab) = 0x4d4b3000
mmap2(0x4d4b6000, 10940, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4d4b6000
close(3)                                = 0
open("/lib/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320JPM4\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=19780, ...}) = 0
mmap2(0x4d504000, 16496, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4d504000
mmap2(0x4d507000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2) = 0x4d507000
close(3)                                = 0
open("/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0P\331NM4\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=131156, ...}) = 0
mmap2(0x4d4e8000, 102908, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4d4e8000
mmap2(0x4d4fe000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x15) = 0x4d4fe000
mmap2(0x4d500000, 4604, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4d500000
close(3)                                = 0
open("/lib/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0p~>N4\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=19460, ...}) = 0
mmap2(0x4e3e7000, 20652, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4e3e7000
mmap2(0x4e3eb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3) = 0x4e3eb000
close(3)                                = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb774a000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7749000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7749740, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0x8064000, 4096, PROT_READ)    = 0
mprotect(0x4d54d000, 4096, PROT_READ)   = 0
mprotect(0x4d512000, 4096, PROT_READ)   = 0
mprotect(0x4ef99000, 4096, PROT_READ)   = 0
mprotect(0x4d4b3000, 8192, PROT_READ)   = 0
mprotect(0x4d507000, 4096, PROT_READ)   = 0
mprotect(0x4d303000, 4096, PROT_READ)   = 0
mprotect(0x4d4fe000, 4096, PROT_READ)   = 0
mprotect(0x4e3eb000, 4096, PROT_READ)   = 0
munmap(0xb774c000, 112294)              = 0
set_tid_address(0xb77497a8)             = 15515
set_robust_list(0xb77497b0, 12)         = 0
rt_sigaction(SIGRTMIN, {0x4d4ed3f0, [], SA_SIGINFO}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x4d4ed480, [], SA_RESTART|SA_SIGINFO}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 0
uname({sys="Linux", node="d012-05.int-evry.fr", ...}) = 0
statfs64("/sys/fs/selinux", 84, 0xbfbc47fc) = -1 ENOENT (No such file or directory)
statfs64("/selinux", 84, 0xbfbc47fc)    = -1 ENOENT (No such file or directory)
brk(0)                                  = 0x85e0000
brk(0x8601000)                          = 0x8601000
open("/proc/filesystems", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7767000
read(3, "nodev\tsysfs\nnodev\trootfs\nnodev\tb"..., 1024) = 370
read(3, "", 1024)                       = 0
close(3)                                = 0
munmap(0xb7767000, 4096)                = 0
open("/usr/lib/locale/locale-archive", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=105038208, ...}) = 0
mmap2(NULL, 2097152, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7549000
close(3)                                = 0
open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=2512, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7767000
read(3, "# Locale name alias data base.\n#"..., 4096) = 2512
read(3, "", 4096)                       = 0
close(3)                                = 0
munmap(0xb7767000, 4096)                = 0
open("/usr/lib/locale/US/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, TIOCGWINSZ, {ws_row=41, ws_col=131, ws_xpixel=0, ws_ypixel=0}) = 0
lstat64("/mci/inf", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lgetxattr("/mci/inf", "security.selinux", 0x85e3ea8, 255) = -1 EOPNOTSUPP (Operation not supported)
getxattr("/mci/inf", "system.posix_acl_access", 0x0, 0) = -1 EOPNOTSUPP (Operation not supported)
socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_FILE, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(3)                                = 0
socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_FILE, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(3)                                = 0
open("/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=1717, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7767000
read(3, "#\n# /etc/nsswitch.conf\n#\n# An ex"..., 4096) = 1717
read(3, "", 4096)                       = 0
close(3)                                = 0
munmap(0xb7767000, 4096)                = 0
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=112294, ...}) = 0
mmap2(NULL, 112294, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb774c000
close(3)                                = 0
open("/lib/libnss_files.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0@\32\0\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=55108, ...}) = 0
mmap2(NULL, 50144, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb753c000
mmap2(0xb7547000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xa) = 0xb7547000
close(3)                                = 0
mprotect(0xb7547000, 4096, PROT_READ)   = 0
munmap(0xb774c000, 112294)              = 0
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=2072, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7767000
read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 2072
close(3)                                = 0
munmap(0xb7767000, 4096)                = 0
socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_FILE, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(3)                                = 0
socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_FILE, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(3)                                = 0
open("/etc/group", O_RDONLY|O_CLOEXEC)  = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=757, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7767000
read(3, "root:x:0:\nbin:x:1:\ndaemon:x:2:\ns"..., 4096) = 757
close(3)                                = 0
munmap(0xb7767000, 4096)                = 0
openat(AT_FDCWD, "/mci/inf", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = -1 ELOOP (Too many levels of symbolic links)
write(2, "ls: ", 4ls: )                     = 4
write(2, "cannot open directory /mci/inf", 30cannot open directory /mci/inf) = 30
write(2, ": Too many levels of symbolic li"..., 35: Too many levels of symbolic links) = 35
write(2, "\n", 1
)                       = 1
close(1)                                = 0
close(2)                                = 0
exit_group(2)                           = ?
+++ exited with 2 +++

Comment 111 JM 2012-11-22 11:48:36 UTC
(In reply to comment #109)
> (In reply to comment #108)
> > (In reply to comment #107)
> > 
> > > The automount map entries don't appear to mount any of the exports
> > > from the servers you describe.
> > 
> > It mounts the entries below /export/home, e.g. /export/home/jm,
> > /export/home/gm, etc. It is not necessary to export every single
> > user directory below /export/home; it is sufficient to export only
> > the main directory /export/home.
> 
> That is an obvious assumption I've made.
> 
> How does the setup map from /home to /export/home?

It's a simple bind

/mnt/home			/export/home	none	bind	0 0
 
> There is no way for me to even begin to try and duplicate this
> if I don't know how it is setup!

I think the problem is in the kernel on the client system (the problem disappears when I switch back to the initial kernel of F17)… Maybe it's not a problem with mounting the directory; maybe autofs fails to unmount the directory correctly? I know, I'm just guessing; I have no clue what exactly triggers the error. I tried a lot of different mount options, with the same result each time. So far the only workaround is to switch back to an old kernel version.

Comment 112 Ian Kent 2012-11-22 11:55:58 UTC
(In reply to comment #110)
> Regarding my autofs map (from LDAP), here it is (for /mci/inf):
> 
> # mci_inf#076, direct, automount, int-evry.fr
> dn: cn=mci_inf#076,ou=direct,ou=automount,dc=int-evry,dc=fr
> automountInformation: -fstype=autofs
> ldap:ou=direct.mci_inf#076,ou=direct,ou=a
>  utomount,dc=int-evry,dc=fr
> cn: mci_inf#076
> objectClass: top
> objectClass: automount
> 
> 
> # /mci/inf#076, direct.mci_inf#076, direct, automount, int-evry.fr
> dn:
> description=/mci/inf#076,ou=direct.mci_inf#076,ou=direct,ou=automount,dc=i
>  nt-evry,dc=fr
> automountInformation: -rw,intr,soft gizeh:/disk19/inf
> cn: /mci/inf
> description: /mci/inf#076
> objectClass: top
> objectClass: automount

I'll assume the #076 isn't significant; the key /mci/inf does
get a lookup hit after all.

> 
> [root@d012-05 ~]# ls /mci/inf
> ls: cannot open directory /mci/inf: Too many levels of symbolic links
> 
> Regarding the VFS string in the logs:
> 
> [root@d012-05 log]# grep VFS messages
> Nov 19 13:38:18 d012-05 kernel: [    0.498067] VFS: Disk quotas dquot_6.5.2
> Nov 20 09:18:09 d012-05 kernel: [    0.499029] VFS: Disk quotas dquot_6.5.2
> Nov 20 09:21:13 d012-05 kernel: [    0.497768] VFS: Disk quotas dquot_6.5.2
> Nov 21 10:15:43 d012-05 kernel: [    0.498080] VFS: Disk quotas dquot_6.5.2
> Nov 21 17:32:14 d012-05 kernel: [    0.506066] VFS: Disk quotas dquot_6.5.2

Right, so the spot I was wondering about isn't throwing a
kernel warning, thanks for that.

> 
> And finally, the strace on the faulty /mci/inf, quite long ...
> hope this helps?

This is what I wanted to confirm.
I've been looking at the open call.
Like I said the problem may not be in the automounting code
but somewhere else, triggered by the act of automounting.

> openat(AT_FDCWD, "/mci/inf",
> O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = -1 ELOOP (Too many
> levels of symbolic links)

Thanks for that too.

One other thing: does the server export your nfs mounts under a
global root? Can I see the server export entries (again, please,
if you've posted them before)?

And the last thing: the last patch(es) I posted seemed to fail
much more quickly than without them, but they shouldn't have. Is
that the case, or have you not done any further testing?

Ian

Comment 113 Ian Kent 2012-11-22 12:03:17 UTC
(In reply to comment #111)
> (In reply to comment #109)
> > (In reply to comment #108)
> > > (In reply to comment #107)
> > > 
> > > > The automount map entries don't appear to mount any of the exports
> > > > from the servers you describe.
> > > 
> > > It mounts the entries below /export/home, e.g. /export/home/jm,
> > > /export/home/gm, etc. It is not necessary to export every single
> > > user directory below /export/home; it is sufficient to export only
> > > the main directory /export/home.
> > 
> > That is an obvious assumption I've made.
> > 
> > How does the setup map from /home to /export/home?
> 
> It's a simple bind
> 
> /mnt/home			/export/home	none	bind	0 0

I still can't see how this fits together, i.e. what the /home
of nfs:/home is on the server.

How is your automounting setup meant to work?

>  
> > There is no way for me to even begin to try and duplicate this
> > if I don't know how it is setup!
> 
> I think the problem is in the kernel on the client system (the problem
> disappears when I switch back to the initial kernel of F17)… Maybe it's not
> a problem with mounting the directory; maybe autofs fails to unmount the
> directory correctly? I know, I'm just guessing; I have no clue what exactly
> triggers the error. I tried a lot of different mount options, with the same
> result each time. So far the only workaround is to switch back to an old
> kernel version.

All that really says is that it probably isn't the kernel or
autofs mounting code, since that has seen very little change
since the very early 3.0 kernels. The path walking code in the
VFS, OTOH, has been virtually rewritten.

Comment 114 JM 2012-11-22 13:03:18 UTC
(In reply to comment #113)
 
> I still can't see how this fits together, i.e. what the /home
> of nfs:/home is on the server.

You can use autofs on the server as well (with the nosymlink option of autofs), or a bind mount, or nothing at all. The users don't log onto the server, so a /home/<user> is not really necessary there.

Btw. /mnt/home is "/dev/mapper/vg1-lvhome /mnt/home ext4 defaults 1 2"

That's everything you need if you really want to recreate the config.
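
Assembled from comments 100 and 111 plus the line above (illustrative
only), the server side is therefore roughly:

# /etc/fstab
/dev/mapper/vg1-lvhome  /mnt/home     ext4  defaults  1 2
/mnt/home               /export/home  none  bind      0 0

# /etc/exports
/export        *(rw,root_squash,no_subtree_check,async,insecure,fsid=0)
/export/home   *(rw,root_squash,no_subtree_check,async,insecure,nohide)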
 
> How is your automounting setup meant to work?

It mounts the home directory of the user when he or she logs in. It has worked for many, many years without any problems, just trust me :-). You are obsessed with my configuration :-), so please take the config from procaccia.

> All that really says is that it probably isn't the kernel or
> autofs mounting code, since that has seen very little change
> since the very early 3.0 kernels. The path walking code in the
> VFS, OTOH, has been virtually rewritten.

Something changed between kernel 3.3.4-5.fc17.x86_64 and 3.6.6-1.fc17.x86_64. I changed nothing on the server side, and it looks like it is not a problem with autofs itself, because the same autofs version that creates problems with kernel 3.6.6-1.fc17.x86_64 works with kernel 3.3.4-5.fc17.x86_64. I really want to help, but I have no clue what exactly fails: the kernel code, autofs, or whatever… I can't trigger the error on purpose (I tried :-)), so I can't create a configuration which triggers the error 100% of the time. It worked for a while without problems, but yesterday I had the problem again; that's the reason I answered your question from comment #98.

Comment 115 procaccia 2012-11-22 17:38:35 UTC
my /etc/exports file regarding /mci/inf => /disk19 on the NFS server is:

/disk19 @s2ia(rw,async) @serveur(rw,async) @stpfix(rw,async)

[root@gizeh /etc]
$ grep disk19 /proc/fs/nfs/exports
/disk19	@stpfix(rw,root_squash,async,wdelay,no_subtree_check)
/disk19	@serveur(rw,root_squash,async,wdelay,no_subtree_check)


On the client side, on station d012-12 (kernel from comment 88, not comment 97!),
I also have the problem:

[root@d012-12 ~]# uptime
 18:18:16 up 1 day,  3:50,  2 users,  load average: 0.16, 0.12, 0.38
[root@d012-12 ~]# uname -a 
Linux d012-12.int-evry.fr 3.6.6-1.fc17.i686.PAE #1 SMP Sat Nov 10 13:30:32 CET 2012 i686 i686 i386 GNU/Linux
[root@d012-12 ~]# ls /mci/inf
ls: cannot open directory /mci/inf: Too many levels of symbolic links

What is interesting is that, from the same server export (/disk19), it works fine for another subdirectory (another sub-map):

[root@d012-12 ~]# ls /mci/eph
abib_ghi .....

[root@d012-12 ~]# df -H /mci/eph
Filesystem          Size  Used Avail Use% Mounted on
gizeh:/disk19/eph/  212G   87G  115G  44% /mci/eph

Concerning the fact that it failed more quickly on d012-05 (latest kernel with patches c-94 & c-95,
Linux d012-05.int-evry.fr 3.6.6-1.fc17.i686.PAE #1 SMP Wed Nov 21 12:02:50 CET 2012):

I cannot tell; it has happened just once so far, and only on that machine.

Comment 116 Philippe Dax 2012-11-22 17:49:19 UTC
Of course, I have had the same problem as everybody here for approximately one month: same release F17, same kernels updated by yum.

When I tried to access "/infres/s3", an automount point, I observed this:

/var/log/messages
Nov 22 18:15:48 mesa automount[2780]: umount_autofs_indirect: ask umount returned busy /stud
Nov 22 18:15:50 mesa automount[2780]: umount_autofs_indirect: ask umount returned busy /infres
Nov 22 18:15:52 mesa automount[5265]: do_reconnect: lookup(ldap): failed to find available server

# these 4 mounting points are not unmounted
[root@mesa: 61] ls /infres
bd/  ic2/  s3/  stag/

# and are inaccessible
[root@mesa: 62] ls /infres/s3
/bin/ls: cannot open directory /infres/s3: Too many levels of symbolic links
[root@mesa: 63] ls /infres/bd
/bin/ls: cannot open directory /infres/bd: Too many levels of symbolic links
[root@mesa: 64] ls /infres/ic2
/bin/ls: cannot open directory /infres/ic2: Too many levels of symbolic links
[root@mesa: 65] ls /infres/stag
/bin/ls: cannot open directory /infres/stag: Too many levels of symbolic links

# access to a new mounting point works
[root@mesa: 66] ls /infres/sr
ahmed/     diamanti/  jouguet/   makiou/   natouri/   spina/     zhioua/
alleaume/  famulari/  kaplan/    marin/    pappa/     ttnguyen/  zwang/
aranda/    fotue/     kumarps/   markham/  qin/       urien/
benchaib/  hamdane/   labiod/    moalla/   riguidel/  vdang/
dau/       hecker/    leneutre/  msahli/   sohbi/     zhao/

# and we see it
[root@mesa: 67] ls /infres
bd/  ic2/  s3/  sr/  stag/

Maybe this can help?

Philippe

Comment 117 Philippe Dax 2012-11-22 21:05:29 UTC
Sorry, I forgot to mention that the logs in /var/log/messages come from restarting the automount daemon (the pid changed) with systemctl restart autofs.service, but without any effect.

Comment 118 Sergey 2012-11-24 17:52:36 UTC
(In reply to comment #116)
> Of course, I have had the same problem as everybody here for approximately
> one month: same release F17, same kernels updated by yum.
> 
> When I tried to access "/infres/s3", an automount point, I observed this:
> 
> /var/log/messages
> Nov 22 18:15:48 mesa automount[2780]: umount_autofs_indirect: ask umount
> returned busy /stud
> Nov 22 18:15:50 mesa automount[2780]: umount_autofs_indirect: ask umount
> returned busy /infres
> Nov 22 18:15:52 mesa automount[5265]: do_reconnect: lookup(ldap): failed to
> find available server
> 

Trying to find a workaround for this problem, I've set TIMEOUT=0 in /etc/sysconfig/autofs, and this seems to help. Just as a temporary remedy.
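
For anyone else trying this, it is one line in /etc/sysconfig/autofs
followed by an autofs restart; TIMEOUT=0 disables expiry, so automounts
are simply never unmounted:

# /etc/sysconfig/autofs
TIMEOUT=0

# then
systemctl restart autofs.service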

- Sergey

Comment 119 Philippe Dax 2012-11-27 11:03:25 UTC
(In reply to comment #118)

> Trying to find a workaround for this problem, I've set TIMEOUT=0 in
> /etc/sysconfig/autofs, and this seems to help. Just as a temporary remedy.

I have applied your tuning, and after two days it seems effective; everything works fine without any errors.  Thanks for your help.

Philippe
--

Comment 120 William H. Haller 2012-11-28 04:46:45 UTC
Adding an additional user: multiple F17 clients and an F17 server which serves up an NFS /home, automounted for any client as their /home. I just started getting hit with this when I completed the upgrades this last weekend. I don't believe I was seeing it with an F17 client against an F16 server.

A user can apparently only log in once per system. If they try to log in again they get the symbolic link error. If they go to a system where they haven't logged in, they can do so, just not twice on the same system. Restarting autofs clears the issue, whether the underlying problem turns out to be in NFS or the kernel.

Server:
/home                   x.x.x.x/25(rw,async,insecure,no_root_squash,no_subtree_check,anonuid=65534,anongid=65534,fsid=0)

Clients auto.home:
*  -fstype=nfs4,rw,rsize=1048576,wsize=1048576,intr,soft,proto=tcp,port=2049,noatime x.x.x.x:/&

Trying TIMEOUT=0 as a workaround; although it isn't ideal, it is better than being locked out. Hope it works here too.

Comment 121 Jürgen Holm 2012-11-28 06:57:21 UTC
(In reply to comment #120)
> Adding an additional user: multiple F17 clients and an F17 server which
> serves up an NFS /home, automounted for any client as their /home. I just
> started getting hit with this when I completed the upgrades this last weekend.
> I don't believe I was seeing it with an F17 client against an F16 server.

We see this with F17 clients against a CentOS 5 server, kernel 2.6.18-308.16.1.el5.

Comment 122 marcindulak 2012-12-04 11:47:13 UTC
The TIMEOUT=0 in /etc/sysconfig/autofs trick does not work for me

Comment 123 Michael Fischer 2012-12-04 11:57:20 UTC
(In reply to comment #122)
> The TIMEOUT=0 in /etc/sysconfig/autofs trick does not work for me

I had to make the mounts browsable as well as revert to the 3.3.4 kernel before we were able to function normally.

Comment 124 Jürgen Holm 2012-12-04 11:58:34 UTC
(In reply to comment #122)
> The TIMEOUT=0 in /etc/sysconfig/autofs trick does not work for me

Confirmed for 3.6.8-2.fc17.x86_64

Comment 125 Jürgen Holm 2013-01-03 10:36:14 UTC
Hi!

Reverting to 3.3.4, setting TIMEOUT=0, and making mounts browsable didn't fix this completely.
If you umount an automounted directory and then cd into that directory, you get the "Too many..." message.

@H.J. Lu: I think the bug priority should be set to URGENT!

Comment 126 Ian Kent 2013-01-04 00:19:48 UTC
(In reply to comment #125)
> Hi!
> 
> Reverting to 3.3.4, setting TIMEOUT=0, and making mounts browsable didn't
> fix this completely.
> If you umount an automounted directory and then cd into that directory, you
> get the "Too many..." message.

According to the testing that's been done here, that's a different
bug, and fixing it doesn't fix the original problem. It's fixed by
the patches in comments #94 and #95. As I saw upstream today, there's
also a correction needed, which I'll post here as well. Manual
umounting isn't supported, especially for offset mounts like these,
but I will at least try to fix problems that arise because of it, as
is the case with these patches.

There is an NFS patch which may be related, I'll also post that
for those that wish to test it.

Ian

Comment 127 Ian Kent 2013-01-04 00:32:03 UTC
Created attachment 672268 [details]
Patch - Fix sparse warning: context imbalance in autofs4_d_automount() different lock contexts for basic block

This patch is a correction to the patch of comment #95.

Comment 128 Ian Kent 2013-01-04 00:41:50 UTC
Created attachment 672270 [details]
Patch - don't do blind d_drop() in nfs_prime_dcache()

Another possibly related patch for testing.
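
The diff is attached rather than inlined; inferred from the title
alone (an assumption, not the actual patch), the shape of the change
in nfs_prime_dcache() would be:

	/* Don't unconditionally unhash a dentry the readdir data
	 * says is stale; it may still be a mountpoint (for example
	 * an autofs-triggered NFS mount), so check first. */
	if (!d_mountpoint(dentry))
		d_drop(dentry);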

Comment 129 info@kobaltwit.be 2013-02-17 16:35:59 UTC
Meanwhile I upgraded one of my systems to Fedora 18
kernel-PAE-3.7.7-201.fc18.i686
autofs-5.0.7-10.fc18.i686

Autofs is still not working properly. I can access my directories straight after startup, but after some time they are not found anymore.

I have switched back to kernel 3.5.5.fc17 for now.

Comment 130 Ian Collier 2013-02-22 17:29:06 UTC
This may be tempting fate, but I don't think I have seen this problem occur on any of our lab machines since they applied kernel 3.7.3-101.fc17.x86_64 from Fedora a couple of weeks ago.  (In fact, most of them are now running 3.7.6-102.fc17.x86_64.)

Comment 131 info@kobaltwit.be 2013-02-28 12:52:32 UTC
Since my last comment I have set TIMEOUT=0 in /etc/sysconfig/autofs. I haven't experienced any timeout problems since.

Current kernel is 3.7.9-201.fc18.i686.PAE. Running on 3 PC's now.

With TIMEOUT=300 (the default on Fedora), I frequently got a hanging system. So bad, even, that it DoS'ed my file server (CentOS 5, kernel 2.6.18-308.16.1.el5xen): the server was running at 100% with no way to access it, and only a forced reboot helped (it's a virtual server running on Xen). This may be a totally unrelated problem, though. If you want, I can file a separate bug report for it.

Comment 132 Fedora End Of Life 2013-07-04 05:51:35 UTC
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter:  Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 17 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to change the 
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 133 Fedora End Of Life 2013-08-01 17:07:47 UTC
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.