Bug 980088

Summary: Unable to umount stale mountpoints
Product: [Fedora] Fedora
Reporter: Tom Horsley <horsley1953>
Component: kernel
Assignee: Jeff Layton <jlayton>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 18
CC: bfields, gansalmon, itamar, jlayton, jonathan, kernel-maint, madhu.chinakonda, marmalodak, orion, steved
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Clone Of:
: 980172
Last Closed: 2013-09-11 11:18:21 UTC
Type: Bug
Bug Blocks: 1007607
Attachments: strace of the umount -f command (flags: none)

Description Tom Horsley 2013-07-01 11:55:09 UTC
Created attachment 767353
strace of the umount -f command

Description of problem:

Regression back to the same behavior as bug 866303: trying to umount -l or umount -f an NFS filesystem when the server is down (or at least not talking very well) fails with a stale NFS filehandle message, and the mountpoint still exists.


Version-Release number of selected component (if applicable):
util-linux-2.22.2-6.fc18.x86_64
nfs-utils-1.2.7-6.fc18.x86_64


How reproducible:
Every time I've tried this morning

Steps to Reproduce:
1. Wait for NFS server to crash for some reason
2. Get timeouts on server
3. Try to umount -l or umount -f

Actual results:
umount never works, gives stale NFS filehandle error, mountpoint is still there

Expected results:
umount the annoying NFS filesystem that is causing incessant timeouts.

Additional info:

I ran this command, I'll attach the strace listing:

[root@tomh ~]# strace -t -f -o umount.trace umount -f /userland
umount.nfs: /userland: Stale file handle

I see it timing out trying to talk to 10.134.30.17 which is the bloody system that is down. That's why I used the -f option :-).

Comment 1 Jeff Layton 2013-07-01 12:00:30 UTC
Almost certainly a kernel problem, and one I happen to be working on at the moment. Unmounting stale NFS mountpoints is currently problematic. What kernel are you running on the client here?

> 
> Steps to Reproduce:
> 1.wait for NFS server to crash for some reason
> 2.get timeouts on server
> 3.try to umount -l or umount -f
> 

In this situation is the server coming back up? A server reboot due to crash or other issue should not cause stale NFS filehandles. What sort of server is this?

Comment 2 Tom Horsley 2013-07-01 12:50:35 UTC
The server was totally down (no power) during the strace above (the UPS it was on died when the power failed, which seems to be all UPSs are good for :-).

The server is back up now. It is a moderately old Red Hat system running a custom kernel:

Linux userland 2.6.18.8-RedHawk-4.2-trace #1 SMP PREEMPT Tue Apr 3 10:36:40 EDT 2007 i686 i686 i386 GNU/Linux

Red Hat Enterprise Linux WS release 4 (Nahant Update 4)

The client is Fedora 18:

Linux tomh 3.9.6-200.fc18.x86_64 #1 SMP Thu Jun 13 18:56:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Fedora release 18 (Spherical Cow)

The server is so old it does not support TCP NFS mounts and runs version 3 NFS, so the mount options on my client look like:

userland:/drbd_dir10    /drbd_dir10     nfs  noauto,proto=udp,rw,bg,soft,intr,rsize=8192,wsize=8192 0 0

(Forgetting about support for old NFS servers seems to be a rampant problem with NFS development :-).

Comment 3 Jeff Layton 2013-07-01 13:01:56 UTC
We actually take quite a bit of care when it comes to old servers. That one is likely just broken though, since it appears to be changing filehandles on a reboot. This is a major no-no for any sort of NFS server. When that happens, there's little the client can do to recover.

That said, there *is* a regression in current mainline kernels, but it doesn't really have anything to do with old vs. new NFS servers. The problem is that we're failing the pathwalk when the root of an NFS filesystem goes stale. When the pathwalk fails, we can't unmount the filesystem.

I have a patch in the works that should fix this the right way and will probably post it upstream in a day or two. Until then, I'm afraid there's little you can do but reboot the box.
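For readers following along, the dependency described above (the path has to be looked up before the mount can be detached) can be illustrated with a toy model. This is not kernel code: `toy_umount`, the `mounts` set and the `revalidate_root` callback are all hypothetical names invented for the illustration, and the real lookup logic is far more involved.

```python
import errno


def toy_umount(path, mounts, revalidate_root):
    """Toy model of an unmount that must resolve its path first.

    mounts is a set of mounted paths; revalidate_root is a callback
    that returns False when the root of the mount looks stale.
    Both are stand-ins invented for this sketch.
    """
    if path not in mounts:
        raise OSError(errno.EINVAL, "not mounted", path)
    # Lookup step: if revalidating the root fails, we never get as far
    # as detaching the mount, so it stays in the table.
    if not revalidate_root(path):
        raise OSError(errno.ESTALE, "Stale file handle", path)
    mounts.remove(path)


if __name__ == "__main__":
    mounts = {"/userland"}
    try:
        toy_umount("/userland", mounts, lambda p: False)
    except OSError as exc:
        print("umount failed:", exc.strerror)
    print("still mounted:", "/userland" in mounts)
```

In this model the unmount itself is never attempted when the lookup fails, which matches the symptom in the report: an ESTALE error and a mountpoint that is still there.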

Comment 4 Tom Horsley 2013-07-01 14:03:56 UTC
I don't know if the server would have changed filehandles on a reboot or not. Like I said, it was powered off the entire time I was trying to do the umount.

Whenever I've had stale NFS filehandles before (which, in fact, seems to be the single most common NFS error), I've always been able to recover by doing a umount -l then a remount (when umount -l was working, that is). If I can do that manually on the client, I don't know why the client couldn't do it all by itself, so I wouldn't say there is little the client can do to recover.
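As a rough illustration of what the two options request from the kernel: on Linux, umount -f corresponds to the MNT_FORCE flag of umount2(2) and umount -l to MNT_DETACH. The sketch below calls umount2 through ctypes; the helper names `umount_flags` and `umount` are made up for this example, and it assumes a Linux system with glibc and root privileges for the actual call.

```python
import ctypes
import os

# Flag values from <sys/mount.h> on Linux.
MNT_FORCE = 1   # requested by "umount -f"
MNT_DETACH = 2  # requested by "umount -l" (lazy unmount)


def umount_flags(force=False, lazy=False):
    """Combine the umount2(2) flags for the requested behaviour."""
    flags = 0
    if force:
        flags |= MNT_FORCE
    if lazy:
        flags |= MNT_DETACH
    return flags


def umount(target, force=False, lazy=False):
    """Call umount2(2) directly; raises OSError if the call fails."""
    # Load libc lazily so importing this sketch has no side effects.
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.umount2(os.fsencode(target), umount_flags(force, lazy))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)
```

A lazy unmount of the mountpoint from this report would be `umount("/userland", lazy=True)`. On an affected kernel this would presumably fail the same way the command-line tool does, since the error comes from the kernel's path lookup rather than from the umount program.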

Comment 5 Jeff Layton 2013-07-01 14:13:56 UTC
It shouldn't matter if you power down the box for a year. When it comes back up, it should be serving the *same* filesystem that it was before. I notice you're using drbd here, so perhaps you did something to the filesystem between shutting it down and restarting it?

In any case, the client can't recover in this situation all by itself because doing an unmount/remount cycle turns this into an entirely different mount as far as the kernel is concerned. You're redoing the lookup of the root.

When the filehandle for an inode changes, the filesystem has no way to know what the inode actually *is* anymore. The lookup of the root of the mount is *long* since done. We have a filehandle but not necessarily any name that we can attach to it anymore.

So, while we can make it easier to unmount an NFS mount that has a stale root filehandle, we can't do anything to automatically work around servers that suddenly decide to start throwing ESTALE errors on the root of the mount.

Initial patch posted here:

    http://marc.info/?l=linux-fsdevel&m=137268484100869&w=2
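From userspace, an ESTALE on the root of a mount shows up as an OSError whose errno is ESTALE. A minimal sketch of a check for that condition follows; `is_stale` is a hypothetical helper, and the `stat` parameter exists only so the check can be exercised without a real stale mount.

```python
import errno
import os


def is_stale(path, stat=os.stat):
    """Return True if stat() on path fails with ESTALE.

    Any other outcome, including other errors, returns False.
    """
    try:
        stat(path)
    except OSError as exc:
        return exc.errno == errno.ESTALE
    return False
```

Note that against an unreachable server the stat call itself may block until the NFS client gives up, so a check like this is not instantaneous.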

Comment 6 Tom Horsley 2013-07-01 14:23:40 UTC
You keep talking about the server doing something wrong when it comes back up, and I keep saying the server was down the whole time. That stale NFS file handle could not possibly be from the server telling me it was stale because the server wasn't talking at any point in this process. Whatever was saying stale filehandle was entirely the client's idea of what to report.

Comment 7 Jeff Layton 2013-07-01 14:34:23 UTC
Ok, I misunderstood then. You have these mount options:

    "proto=udp,rw,bg,soft,intr,rsize=8192,wsize=8192"

The "soft" is what's causing that to occur then. The client is issuing a GETATTR against the root of the filesystem. That eventually times out and returns an error. At that point, the revalidation of the dentry fails and the lookup returns -ESTALE.

The patch I've proposed should fix this situation as well.
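To check whether a given mount is exposed to this, one can look for "soft" in its option string. The sketch below splits an fstab-style option string into a dictionary; `parse_mount_options` is a hypothetical helper written for this illustration, and it deliberately ignores the precedence rules that apply when conflicting options such as "soft" and "hard" both appear.

```python
def parse_mount_options(opts):
    """Split a comma-separated mount option string into a dict.

    Options with a value ("proto=udp") map to that value as a string;
    bare flags ("soft") map to True.
    """
    parsed = {}
    for item in opts.split(","):
        if not item:
            continue  # tolerate stray commas
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


if __name__ == "__main__":
    opts = "proto=udp,rw,bg,soft,intr,rsize=8192,wsize=8192"
    parsed = parse_mount_options(opts)
    print("soft mount:", parsed.get("soft", False))
    print("protocol:", parsed.get("proto"))
```

With the options quoted above, the parsed dictionary contains "soft" as a bare flag and "proto" set to "udp".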

Comment 8 Jeff Layton 2013-09-11 11:18:21 UTC
Patch has made its way into mainline and should be in rawhide kernels soon. I'll go ahead and close this with a resolution of RAWHIDE. If f18 eventually gets 3.12 kernels, it should get this as well.

Comment 9 Jeff Layton 2013-09-16 13:03:35 UTC
*** Bug 1007745 has been marked as a duplicate of this bug. ***