Red Hat Bugzilla – Bug 624131
First attempt at nfs mounting an ext3/ext4/xfs filesystem always fails with stale NFS handle
Last modified: 2011-09-14 08:23:20 EDT
Created attachment 438741
Ethereal capture of the failed mount attempt. The command was: mount -t nfs -o nfsvers=3 sfss1:/sfs1 /mnt
Description of problem:
I've been running into a stale NFS file handle issue during client mounts since the beginning of RHEL6 testing, but now the failure seems to be happening more often and is affecting certain testing.
When I do my NFS server testing from test to test, the nfs service is stopped, the file systems are unmounted, they are then recreated (mkfs) and remounted, the networks supporting NFS are restarted and finally nfs is started. This is the way I have been doing it for years.
The problem is that for ext3, ext4, and xfs I get a stale NFS handle on the first mount attempt from a client. ext2 and gfs2 do not fail. If my test harness doesn't try this mount first, the benchmark innards will fail.
My RHEL6 server has been updated to SNAP 10 and running the -59 kernel.
This issue never happened with RHEL5 server. The clients were all running an old version of RHEL4. In fact they were at 2.6.9-27.ELsmp. I brought them up to 2.6.9-89.ELsmp yet the problem persists.
With steved's help, I captured an ethereal trace of the mount attempt. It is attached.
The reason I'm concerned so much now is I have been unable to test one of those specific file systems successfully because of "Stale NFS errors" shortly after the benchmark tries to start.
Version-Release number of selected component (if applicable):
RHEL6 - SNAP 10 -59 kernel
Steps to Reproduce:
1. Running SPECsfs on the BIGI testbed
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.
** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Thank you for your bug report. This issue was evaluated for inclusion
in the current release of Red Hat Enterprise Linux. Unfortunately, we
are unable to address this request in the current release. Because we
are in the final stage of Red Hat Enterprise Linux 6 development, only
significant, release-blocking issues involving serious regressions and
data corruption can be considered.
If you believe this issue meets the release blocking criteria as
defined and communicated to you by your Red Hat Support representative,
please ask your representative to file this issue as a blocker for the
current release. Otherwise, ask that it be evaluated for inclusion in
the next minor release of Red Hat Enterprise Linux.
While not formally bz'ed, this issue may be related to the problems we're running into when executing the SPECsfs benchmark. XFS filesystems presented by the NFS server return stale NFS handle errors to the clients within minutes (sometimes seconds) after starting. This is the only presented filesystem type that does this ...
The xfs issue you ran into on SPECsfs, and the fix for it, were entirely xfs-specific; if you're seeing this problem across multiple filesystems I doubt that it's related to Dave's patch for bug #624860.
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Bruce, can you see if this still is an issue? If so, can we fix it for 6.1 or is this a 6.2 issue?
The attached trace shows:
client sends MNT for /sfs1
server replies with filehandle
client sends FSINFO with that filehandle
server replies with NFS3ERR_STALE
So, clearly a server bug.
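The reasoning behind that conclusion can be sketched as a small check over the call/reply pairs in a capture: if the server returns NFS3ERR_STALE for a filehandle that it handed out itself in an earlier MNT reply, the client did nothing wrong. This is only an illustration; the `Exchange` type and the `fh-1` handle value are hypothetical stand-ins, not data taken from the attached trace.

```python
from typing import List, NamedTuple, Optional


class Exchange(NamedTuple):
    """One call/reply pair from a capture (hypothetical representation)."""
    procedure: str  # e.g. "MNT" or "FSINFO"
    argument: str   # export path for MNT, filehandle for NFS calls
    reply: str      # filehandle for MNT, status string for NFS calls


def server_rejects_own_handle(exchanges: List[Exchange]) -> Optional[Exchange]:
    """Return the first exchange where a handle issued by MNT comes back stale."""
    issued = set()
    for ex in exchanges:
        if ex.procedure == "MNT":
            issued.add(ex.reply)  # remember every handle the server gave out
        elif ex.argument in issued and ex.reply == "NFS3ERR_STALE":
            return ex             # the server rejected its own handle
    return None


# Placeholder data mirroring the four steps listed above.
trace = [
    Exchange("MNT", "/sfs1", "fh-1"),
    Exchange("FSINFO", "fh-1", "NFS3ERR_STALE"),
]
bad = server_rejects_own_handle(trace)
```

Here `bad` is the FSINFO exchange, which matches the sequence seen in the trace.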
I tried running
mkfs.xfs -f /dev/vdb
mount /dev/vdb /exports
service nfs start
exportfs -orw '*:/exports'
mount -o nfsvers=3 localhost:/exports /mnt/
a few times in a loop on an rhel6 guest and didn't see any failures.
So I'm stuck for now.
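For anyone who wants to repeat that loop, here is a rough sketch of how it could be scripted. The setup commands are the ones quoted above, with the mount option written as `-o nfsvers=3` as in the original report's mount command. The teardown steps between passes are an assumption, since the thread does not say how each pass was cleaned up, and the function names are made up for illustration. It defaults to a dry run that only prints the commands, because a real run reformats the device.

```python
import shlex
import subprocess


def setup_commands(device="/dev/vdb", export_dir="/exports", mount_point="/mnt"):
    """One pass of the commands quoted above, as argv lists."""
    return [
        ["mkfs.xfs", "-f", device],
        ["mount", device, export_dir],
        ["service", "nfs", "start"],
        ["exportfs", "-o", "rw", f"*:{export_dir}"],
        # The NFS mount under test; kept last so run_loop can find it.
        ["mount", "-o", "nfsvers=3", f"localhost:{export_dir}", mount_point],
    ]


def teardown_commands(export_dir="/exports", mount_point="/mnt"):
    """Assumed cleanup between passes; not spelled out in the thread."""
    return [
        ["umount", mount_point],
        ["service", "nfs", "stop"],
        ["umount", export_dir],
    ]


def run_loop(iterations=3, dry_run=True):
    """Return how many passes failed at the NFS mount step."""
    failures = 0
    for _ in range(iterations):
        setup = setup_commands()
        for argv in setup + teardown_commands():
            if dry_run:
                print(shlex.join(argv))  # show the command, run nothing
                continue
            status = subprocess.run(argv).returncode
            if argv is setup[-1] and status != 0:
                failures += 1  # only the NFS mount counts as a failure
    return failures
```

Calling `run_loop()` prints the command sequence three times; `run_loop(dry_run=False)` needs root and destroys whatever is on the device.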
Barry, are you still seeing this?
I'm still seeing this ... at least with the -71 kernel.
I noticed that, unlike you, I had not re-run exportfs after building the filesystems and bringing NFS online. Doing so had no effect, nor did a showmount -e from the client just before the mount attempt.
Is this still an issue with the latest 6.1 code?
While I should consider upgrading the client side kernel, I've locked it down for years for way back testing ...
As of now, a 2.6.9-89.ELsmp client trying to mount from a 2.6.32-105.el6.x86_64 server still fails.
Could I get a look at the exact scripts that are doing the mkfs, nfsd start, etc.? I just want to make sure it's not doing anything unusual.
This request was erroneously denied for the current release of
Red Hat Enterprise Linux. The error has been fixed and this
request has been re-proposed for the current release.
Looking at /proc/net/rpc/nfsd.fh/content after a failed mount, it looks like mountd is failing to resolve the uuid; I wonder if this is the same problem as http://www.spinics.net/lists/linux-nfs/msg00876.html (or something similar).
I found a similar problem on an RHEL6 test machine: if I shut down nfs, unmount /dev/vdb (which holds my exported filesystem), re-mkfs /dev/vdb, remount it, restart nfs, and try to mount it, the mount succeeds--but, interestingly, comparing 'blkid /dev/vdb' with the export cache (/proc/net/rpc/nfsd.export/content) shows that mountd is still using the uuid of the *old* filesystem.
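The comparison described above can be automated along these lines. This is a sketch under two assumptions: that the export cache prints the uuid as colon-separated groups of hex digits after `uuid=`, and that the blkid value is passed in as a bare UUID string (for example from `blkid -s UUID -o value /dev/vdb`). The function names and the sample values are made up for illustration.

```python
import re


def normalize_uuid(text):
    """Reduce a bare UUID to lowercase hex digits, dropping '-' and ':'."""
    return re.sub(r"[^0-9a-f]", "", text.lower())


def export_cache_uuids(content):
    """Map export path -> normalized uuid from export cache text."""
    uuids = {}
    for line in content.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header comment
        path = line.split()[0]
        match = re.search(r"uuid=([0-9a-fA-F:]+)", line)
        if match:
            uuids[path] = normalize_uuid(match.group(1))
    return uuids


def cache_uuid_is_stale(blkid_uuid, cache_content, export_path):
    """True if the cached uuid for export_path differs from the device's."""
    cached = export_cache_uuids(cache_content).get(export_path)
    return cached is not None and cached != normalize_uuid(blkid_uuid)


# Made-up sample in the assumed format, not output from the affected server.
sample = (
    "#path domain(flags)\n"
    "/exports\t*(rw,sync,uuid=11111111:22223333:44445555:55555555)\n"
)
same = cache_uuid_is_stale("11111111-2222-3333-4444-555555555555", sample, "/exports")
changed = cache_uuid_is_stale("aaaaaaaa-2222-3333-4444-555555555555", sample, "/exports")
```

With the sample data, `same` is False and `changed` is True; a True result after a re-mkfs would point at mountd still serving the old filesystem's uuid.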
However, if instead of doing "service nfs stop" and "service nfs start" to stop and start nfs, I *just* stop and start rpc.mountd by hand, then mountd gets updated information.
Stripping out code from /etc/init.d/nfs, I eventually replaced the "start" and "stop" cases by exactly the commands I was using to start and stop rpc.mountd by hand, and still saw the difference in behavior.
My only remaining idea was that it could be some selinux rule; and indeed: looking at straces of rpc.mountd in both cases, I see that in one an open of /dev/vdb fails, and in the other it succeeds; and after "setenforce 0", everything works. So in my case selinux appears to be preventing libblkid from getting a current uuid. Perhaps it is in your case too.
Could you try turning off selinux and seeing if the problem is still reliably reproducible?
selinux is disabled in the clients via /etc/selinux/config
the server has selinux=0 on the boot line
Looks like this is too late for 6.1...
Sorry, I was never able to duplicate this or work out what's going on here; are you still seeing the problem?
If not, let's close this BZ until we see it again....