Bug 868985

Summary: "too many symbolic links" error appears on mounted filesystems
Product: [Fedora] Fedora
Reporter: Art Werschulz <agw>
Component: kernel
Assignee: nfs-maint
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 17
CC: Bert.Deknuydt, gansalmon, ikent, irlapati, itamar, jforbes, jlayton, jonathan, j, jtrutwin, kernel-maint, madhu.chinakonda, mauricio.esguerra, mkfischer, moniot, nneul, w3euu
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-11 20:51:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description: NFS server exports
  Flags: none

Description Art Werschulz 2012-10-22 16:26:34 UTC
Description of problem:
After the system has been running for a while, I get a "too many levels of symbolic links" error msg, mainly on automounted NFS shares

Version-Release number of selected component (if applicable):
3.6 (all)

How reproducible:
See below.

Steps to Reproduce:
1. Start the system in a 3.6 kernel
2. Wait a while
3. Try to access (e.g., cd to) an automounted NFS share
4. See the "can't access: too many levels of symbolic links" error msg
  
Actual results:
See above.

Expected results:
Would expect to be able to cd (or ls or whatever) into directories in said filesystems.

Additional info:

Comment 1 Josh Trutwin 2012-11-26 16:31:43 UTC
I am experiencing this problem as well - some more info:

/etc/auto.net:

people		-fstype=nfs,rsize=8192,wsize=8192,nfsvers=3,hard,intr,nodev,nosuid elm:/usr/people
home		-fstype=nfs,rsize=8192,wsize=8192,nfsvers=3,hard,intr,nodev,nosuid elm:/home
mail		-fstype=nfs,rsize=8192,wsize=8192,nfsvers=3,hard,intr,nodev,nosuid elm:/var/mail
physics		-fstype=nfs,rsize=8192,wsize=8192,nfsvers=3,hard,intr,nodev,nosuid elm:/apps/linux/physics
fortran		-fstype=nfs,rsize=8192,wsize=8192,nfsvers=3,hard,intr,nodev,nosuid elm:/apps/linux/fortran
plasma		-fstype=nfs,rsize=8192,wsize=8192,nfsvers=3,hard,intr,nodev,nosuid elm:/plasma
ccd		-fstype=nfs,rsize=8192,wsize=8192,nfsvers=3,hard,intr,nodev,nosuid elm:/ccd

/usr/people is symlinked to /net/people, /usr/local/physics is symlinked to /net/physics, /home is symlinked to /net/home
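The map entries above all follow the same three-field shape (key, dash-prefixed options, server:path). As a rough illustration of how such a line breaks down, here is a minimal Python sketch. The names `MapEntry`, `parse_map_line` and `mount_point` are hypothetical helpers, not part of autofs; the `/net` base directory is inferred from the symlink targets mentioned in this comment; and the parser only handles the simple single-location form shown here, not the full autofs map syntax.

```python
from typing import NamedTuple


class MapEntry(NamedTuple):
    key: str        # directory name under the autofs mount point
    options: dict   # mount options, e.g. {"nfsvers": "3", "hard": True}
    location: str   # server:path to mount


def parse_map_line(line: str) -> MapEntry:
    """Parse one 'key -options location' line (simple form only)."""
    key, opts, location = line.split()
    if not opts.startswith("-"):
        raise ValueError("expected a dash-prefixed option field: %r" % opts)
    options = {}
    for item in opts[1:].split(","):
        name, _, value = item.partition("=")
        # Flags without a value (hard, intr, ...) are stored as True.
        options[name] = value or True
    return MapEntry(key, options, location)


def mount_point(entry: MapEntry, base: str = "/net") -> str:
    """Path where the entry would appear; '/net' is assumed from the comment."""
    return base + "/" + entry.key


if __name__ == "__main__":
    line = ("people\t\t-fstype=nfs,rsize=8192,wsize=8192,nfsvers=3,"
            "hard,intr,nodev,nosuid elm:/usr/people")
    entry = parse_map_line(line)
    print(mount_point(entry), "<-", entry.location, entry.options)
```

With this layout, the symlink /usr/people -> /net/people means a lookup of /usr/people first follows the symlink and then triggers the automount of elm:/usr/people on /net/people.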

When the problem happens it seems like /usr/people and /usr/local/physics always fail, but /home and /var/spool/mail are ok.  The problem happened after doing a full upgrade on my Fedora 17 clients this weekend.  Autofs is still version 5.0.6-22 but the kernel is now 3.6.7-4.fc17.x86_64.  Sometimes restarting autofs fixes it, but lately it does not and a full reboot is needed.

I've tried:

service autofs stop
automount -d -v -f

But whenever I list /usr/people, nothing is displayed by the automount command; the ls just fails with "Too many levels of symbolic links".

The NFS server is RHEL 6.3, and neither its configuration nor its kernel version changed around the time this started happening.

I'll attach my /etc/exports to the ticket.  Please let me know what additional information is helpful.

Comment 2 Josh Trutwin 2012-11-26 17:13:39 UTC
Created attachment 652128 [details]
NFS server exports

This is the RHEL 6.3 NFS server exports.  There is a mix of v3 and v4 exports due to issues with the idmapper forcing me to return to NFS3 on the Fedora clients.  This has been unaltered for months, though.

Comment 3 Josh Trutwin 2012-11-26 20:33:04 UTC
One thing I noticed, likely it's just a symptom - when I ls -al /net on my system, the ones with too many symlinks have different perms than the mounts that still work:

# ls -al /net
dr-xr-xr-x   2 root root    0 Nov 25 01:48 fortran
drwxr-xr-x  36 root root 4096 Sep 13 08:05 home
dr-xr-xr-x   2 root root    0 Nov 24 18:36 mail
drwxr-xr-x  17 root root 4096 Apr 24  2012 people
dr-xr-xr-x   2 root root    0 Nov 25 01:48 physics
dr-xr-xr-x   2 root root    0 Nov 24 21:00 plasma

In this case, all the 555 ones throw the error, the 755 ones are fine (home/people).
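To apply the same check across several clients, the observation above can be turned into a small filter over `ls -al` output. This is only a sketch of the reporter's heuristic (mode dr-xr-xr-x with size 0 is suspect, anything else is treated as fine); it is not a confirmed diagnostic, the function name `classify_listing` is made up for illustration, and entry names containing spaces are not handled.

```python
def classify_listing(ls_output: str) -> dict:
    """Map each directory name in 'ls -al' output to 'suspect' or 'ok'.

    Heuristic taken from the comment above: directories with mode
    dr-xr-xr-x and size 0 were the ones that failed.
    """
    result = {}
    for line in ls_output.splitlines():
        fields = line.split()
        # mode, links, owner, group, size, month, day, time/year, name
        if len(fields) < 9 or not fields[0].startswith("d"):
            continue
        mode, size, name = fields[0], fields[4], fields[-1]
        if name in (".", ".."):
            continue
        suspect = mode.startswith("dr-xr-xr-x") and size == "0"
        result[name] = "suspect" if suspect else "ok"
    return result


if __name__ == "__main__":
    listing = """\
dr-xr-xr-x   2 root root    0 Nov 25 01:48 fortran
drwxr-xr-x  36 root root 4096 Sep 13 08:05 home
dr-xr-xr-x   2 root root    0 Nov 24 18:36 mail
"""
    for name, state in classify_listing(listing).items():
        print(name, state)
```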

Also, I noticed that nfs-utils was updated to version 1.2.6-5.fc17.x86_64 over the weekend, not sure if it's to blame.

What is strange is that if I manually mount instead of using the automounter it works just fine:

# mount -t nfs -s -o rsize=8192,wsize=8192,nfsvers=3,hard,intr,nodev,nosuid elm:/var/mail /net/mail

I'm going to try to revert to kernel 3.6.6-1 until this is fixed...

Comment 4 Jeff Layton 2012-12-05 13:33:44 UTC
Ian, might this be a duplicate of 833535?

Josh, you may want to try pulling in commit 696199f8ccf from upstream kernels and see if it helps.

Comment 5 Ian Kent 2012-12-05 15:52:01 UTC
(In reply to comment #4)
> Ian, might this be a duplicate of 833535?
> 
> Josh, you may want to try pulling in commit 696199f8ccf from upstream
> kernels and see if it helps.

This may be a problem that has been around since 3.5, maybe
earlier. I haven't seen this exact scenario either, it's never
been reported when using only indirect mounts like this but
perhaps the symlink following is a new clue.

I've only ever seen it myself once, and could not reproduce
it after changing exports on the server and then changing
them back again.

The real problem is I can't reproduce it.

I've thought that bug 833535 might be related but haven't
checked the timeline closely for when that NFS change was
made.

We definitely need to check if the upstream commit fixes
the problem seen here, although it may not solve my existing
problem.

See also bug 833535.

Ian

Comment 6 Ian Kent 2012-12-05 16:00:57 UTC
(In reply to comment #3)
> 
> In this case, all the 555 ones throw the error, the 755 ones are fine
> (home/people).

Yeah, maybe a further clue, not sure.

> 
> Also, I noticed that nfs-utils was updated to version 1.2.6-5.fc17.x86_64
> over the weekend, not sure if it's to blame.

I doubt that is related. I think it's a kernel issue.

> 
> What is strange is that if I manually mount instead of using the automounter
> it works just fine:
> 
> # mount -t nfs -s -o rsize=8192,wsize=8192,nfsvers=3,hard,intr,nodev,nosuid
> elm:/var/mail /net/mail

Yeah, tell me about it, I've gone over the autofs and vfs code
in detail many times looking for this and I just don't see a
problem. Assuming of course this is my existing problem .....

At this point I believe the issue is an unexpected interaction
between the NFS client and server, like bug 833535, but I can't
nail down what leads to it.

> 
> I'm going to try to revert to kernel 3.6.6-1 until this is fixed...

That will be interesting because that kernel definitely has the
problem I'm struggling with, although it hasn't been seen with
indirect mounts before.

Ian

Comment 7 Josh Trutwin 2012-12-05 16:25:36 UTC
(In reply to comment #5)

> The real problem is I can't reproduce it.

I can get it to happen fairly consistently here, anything I can do to help?  I've since switched all systems to manual NFS mounts in /etc/fstab, it's only a problem when using the automounter.

Josh

Comment 8 Ian Kent 2012-12-05 16:55:41 UTC
(In reply to comment #7)
> (In reply to comment #5)
> 
> > The real problem is I can't reproduce it.
> 
> I can get it to happen fairly consistently here, anything I can do to help? 

I wish, I really need to work out what is different between my
systems and those of people who are seeing the problem so I can
reproduce it in order to do a bisect. Doing a bisect involves using upstream
sources and multiple kernel builds to identify the commit that
started the problem.
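For readers unfamiliar with bisecting, the idea is a binary search over the commit history between a known-good and a known-bad kernel. The Python sketch below is a simplified model of that search, not the actual `git bisect` tooling: it assumes a linear, oldest-to-newest list of commits and a reliable test, and the names `first_bad` and `is_bad` are hypothetical. In practice each call to `is_bad` stands for building and booting a kernel and waiting to see whether the error appears, which is why a reliable reproducer matters so much.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes commits are ordered oldest to newest, commits[0] is known
    good, commits[-1] is known bad, and every commit after the first
    bad one is also bad.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid      # the culprit is at mid or earlier
        else:
            good = mid     # the culprit is after mid
    return commits[bad]


if __name__ == "__main__":
    history = ["c%d" % i for i in range(16)]
    # Pretend the regression was introduced in c11.
    print(first_bad(history, lambda c: int(c[1:]) >= 11))
```

The number of test builds grows with the logarithm of the number of commits, so even a range of thousands of commits needs only a dozen or so kernels, provided each one can be judged good or bad with confidence.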

Right now it's most important to find out if the upstream patch
Jeff mentioned makes a difference.

Ian

Comment 9 Ian Kent 2012-12-06 01:25:45 UTC
(In reply to comment #5)
> 
> I've thought that bug 833535 might be related but haven't
> checked the timeline closely for when that NFS change was
> made.

Umm .. that doesn't make sense.
That should be "bug 874372 might be related" and the possible
duplicate being bug 833535.

Ian

Comment 10 Ian Kent 2012-12-06 07:27:21 UTC
Here is a scratch build of the current F17 kernel which includes
the patch referred to in comment #4.

https://koji.fedoraproject.org/koji/taskinfo?taskID=4761802

Please check to see if it makes a difference to the problem.

Comment 11 Lee H. 2013-01-04 21:59:49 UTC
I have experienced this bug as well.  I have had it happen regularly, but not in any predictable manner, for the past several months -- not sure how long, but at least 3 or 4 -- on each of 4 separate, but largely identical systems, all running FC17 with quite current kernels.  Current kernel on all 4 systems is 3.6.10-2.fc17.i686. It always happens on the automounts, I have never seen it on a manual mount.  Automounter is autofs-5.0.6-23.fc17.i686.

The mounts that are failing are from a data pull that occurs every five minutes so the directories get remounted at 5 minute intervals.  They time out in 60 seconds.

I am able to remediate the failures with the following procedure:
  1.  Kill the automounter (systemctl stop autofs)
  2.  The mounts are in /misc/.  I check to make sure that /etc/auto.misc has been unmounted.  Do mount | grep auto.misc.  If it is there do umount -l.
  3.  Then wait a few minutes for the "stuck" directory to unmount -- it seems to have to "time out".
  4.  Restart autofs and all is well.
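The procedure above could be scripted roughly as follows. This Python sketch only builds the list of commands for steps 2 to 4 and does not execute anything; step 1 (`systemctl stop autofs`) is assumed to have been run already, with the output of `mount` captured afterwards. The helper names, the 180-second wait (standing in for "a few minutes") and the sample mount line are all illustrative assumptions, and mount points containing spaces are not handled.

```python
def stale_autofs_mounts(mount_output: str, map_name: str = "auto.misc") -> list:
    """Return mount points whose 'mount' line mentions the given map.

    Mirrors 'mount | grep auto.misc'. Lines are expected in the form
    '<source> on <mount point> type <fstype> (<options>)'.
    """
    points = []
    for line in mount_output.splitlines():
        parts = line.split()
        if map_name in line and len(parts) >= 5 \
                and parts[1] == "on" and parts[3] == "type":
            points.append(parts[2])
    return points


def cleanup_commands(mount_output_after_stop: str,
                     map_name: str = "auto.misc",
                     wait_seconds: int = 180) -> list:
    """Build the commands for steps 2-4; nothing is executed here."""
    commands = []
    for point in stale_autofs_mounts(mount_output_after_stop, map_name):
        commands.append(["umount", "-l", point])        # step 2
    commands.append(["sleep", str(wait_seconds)])       # step 3
    commands.append(["systemctl", "start", "autofs"])   # step 4
    return commands


if __name__ == "__main__":
    # Illustrative line only, not captured from an affected system.
    sample = "/etc/auto.misc on /misc type autofs (rw,relatime)\n"
    for cmd in cleanup_commands(sample):
        print(" ".join(cmd))
```

Returning the commands instead of running them keeps the sketch safe to try on a live machine; each command could be passed to `subprocess.run` as root once the plan looks right.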

At least there is no need to reboot.

However, I cannot reproduce the problem other than waiting for it to recur.

Let me know if I can provide further data.

Comment 12 Robert K. Moniot 2013-01-07 19:46:37 UTC
(In reply to comment #10)
> Here is a scratch build of the current F17 kernel which includes
> the patch referred to in comment #4.
> 
> https://koji.fedoraproject.org/koji/taskinfo?taskID=4761802
> 
> Please check to see if it makes a difference to the problem.

I wanted to test this but when I clicked on the link I couldn't find anything to download.  Has it expired?  If you provide the kernel I will try it on a system that has been exhibiting the problem.

Comment 13 Robert K. Moniot 2013-02-11 15:06:02 UTC
This bug appears to be fixed as of the 3.7 kernel.  Machines that showed the problem with the 3.6 kernel have been running kernel-3.7.3-101.fc17.x86_64 for more than a week with no automount issues.  The problem always manifested within a week so I believe it is cured.
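For anyone auditing a fleet of clients against this report, a kernel release string can be compared with 3.7 as sketched below in Python. The 3.7 threshold is only this comment's observation (3.7.3 ran clean for over a week), not a confirmed fix boundary, and `kernel_version` and `at_least` are hypothetical helper names. On a live system the release string would typically come from `platform.release()`.

```python
import re


def kernel_version(release: str) -> tuple:
    """Extract (major, minor, patch) from a string like '3.7.3-101.fc17.x86_64'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("no version number in %r" % release)
    # A missing patch level is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def at_least(release: str, minimum: tuple = (3, 7, 0)) -> bool:
    """True if the release is at or above the minimum (3.7 assumed from this report)."""
    return kernel_version(release) >= minimum


if __name__ == "__main__":
    for rel in ("3.6.7-4.fc17.x86_64", "kernel-3.7.3-101.fc17.x86_64"):
        print(rel, at_least(rel))
```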

To whoever fixed this -- thanks!

Comment 14 Justin M. Forbes 2013-02-11 20:51:50 UTC
Thanks for the update!