Bug 624131 - First attempt at nfs mounting an ext3/ext4/xfs filesystem always fails with stale NFS handle
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: J. Bruce Fields
QA Contact: Filesystem QE
Whiteboard: RHELNAK
Depends On:
Blocks:
Reported: 2010-08-13 14:52 EDT by Barry Marson
Modified: 2011-09-14 08:23 EDT
CC List: 7 users

Doc Type: Bug Fix
Last Closed: 2011-09-14 08:23:20 EDT


Attachments
ethereal file of failed mount attempt. cmd was: mount -t nfs -o nfsvers=3 sfss1:/sfs1 /mnt (5.59 KB, text/x-log)
2010-08-13 14:52 EDT, Barry Marson

Description Barry Marson 2010-08-13 14:52:34 EDT
Created attachment 438741 [details]
ethereal file of failed mount attempt. cmd was: mount -t nfs -o nfsvers=3 sfss1:/sfs1 /mnt

Description of problem:

I've been running into a stale NFS file handle issue during client mounts since the beginning of RHEL6 testing, but the failure now seems to be happening more often and is affecting certain testing.

Between tests in my NFS server testing, the nfs service is stopped, the file systems are unmounted, then recreated (mkfs) and remounted, the networks supporting NFS are restarted, and finally nfs is started again.  This is the way I have been doing it for years.
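
Roughly, the per-test reset on the server looks like this (a sketch only; the device name and server-side mount point are assumptions, not taken from the actual harness scripts):

  # stop NFS and tear down the exported filesystem
  service nfs stop
  umount /sfs1
  # recreate the filesystem under test (ext2/ext3/ext4/xfs/gfs2) and remount it
  mkfs -t ext4 /dev/sdb1        # device name is a placeholder
  mount /dev/sdb1 /sfs1
  # restart the networks carrying NFS traffic, then bring NFS back up
  service network restart
  service nfs start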

The problem is, for ext3, ext4, and xfs I get a stale NFS file handle on the first mount attempt from a client.  ext2 and gfs2 do not fail.  If my test harness doesn't try this initially, the benchmark innards will fail.
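
If the symptom is strictly "first attempt fails, later attempts work", a retry wrapper along these lines would mask it (hypothetical; the mount command is the one from the attached capture):

  # the first mount attempt comes back with a stale NFS file handle,
  # so try once more before giving up
  if ! mount -t nfs -o nfsvers=3 sfss1:/sfs1 /mnt; then
      sleep 1
      mount -t nfs -o nfsvers=3 sfss1:/sfs1 /mnt
  fi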

My RHEL6 server has been updated to SNAP 10 and is running the -59 kernel.

This issue never happened with a RHEL5 server.  The clients were all running an old version of RHEL4; in fact they were at 2.6.9-27.ELsmp.  I brought them up to 2.6.9-89.ELsmp, yet the problem persists.

With steved's help, I captured an ethereal log of the failed mount attempt.  It is attached.
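
A capture like this can be taken on the server with tcpdump and then opened in ethereal/wireshark (the interface name and client address are placeholders):

  # capture all traffic to/from the client during the failing mount attempt
  tcpdump -i eth0 -s 0 -w /tmp/mount-fail.pcap host CLIENT-IP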

The reason I'm so concerned now is that I have been unable to test one of those specific file systems successfully because of "Stale NFS" errors shortly after the benchmark tries to start.

Version-Release number of selected component (if applicable):
RHEL6 - SNAP 10  -59 kernel

How reproducible:
every time

Steps to Reproduce:
1. Running SPECsfs on the BIGI testbed
  
Actual results:


Expected results:


Additional info:
Comment 2 RHEL Product and Program Management 2010-08-13 15:17:50 EDT
This issue has been proposed at a time when only blocker issues are being
considered for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Comment 3 RHEL Product and Program Management 2010-08-18 17:24:00 EDT
Thank you for your bug report. This issue was evaluated for inclusion
in the current release of Red Hat Enterprise Linux. Unfortunately, we
are unable to address this request in the current release. Because we
are in the final stage of Red Hat Enterprise Linux 6 development, only
significant, release-blocking issues involving serious regressions and
data corruption can be considered.

If you believe this issue meets the release blocking criteria as
defined and communicated to you by your Red Hat Support representative,
please ask your representative to file this issue as a blocker for the
current release. Otherwise, ask that it be evaluated for inclusion in
the next minor release of Red Hat Enterprise Linux.
Comment 4 Barry Marson 2010-08-19 08:41:17 EDT
While not formally bz'ed, this issue may be related to the problems we're running into when executing the SPECsfs benchmark.  XFS filesystems presented by the NFS server return stale NFS handles to the clients within minutes (sometimes seconds) after starting.  This is the only presented filesystem type that does this ...

Barry
Comment 5 Eric Sandeen 2010-08-19 16:39:53 EDT
The xfs issue you ran into on SPECsfs, and the fix for it, were entirely xfs-specific; if you're seeing this problem across multiple filesystems I doubt that it's related to Dave's patch for bug #624860.

-Eric
Comment 6 RHEL Product and Program Management 2011-01-06 23:26:57 EST
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Comment 7 Ric Wheeler 2011-01-07 12:55:42 EST
Bruce, can you see if this still is an issue? If so, can we fix it for 6.1 or is this a 6.2 issue?
Comment 8 J. Bruce Fields 2011-01-07 19:58:33 EST
The attached trace shows:

  client sends MNT for /sfs1
  server replies with filehandle
    01:00:06:00:00:00:08:00:00:00:00:00:00:00:00:00:00:00:00:00
  client sends FSINFO with that filehandle
  server replies with NFS3ERR_STALE

So, clearly a server bug.

I tried running

  mkfs.xfs -f /dev/vdb
  mount /dev/vdb /exports
  service nfs start
  exportfs -orw '*:/exports'
  mount -o nfsvers=3 localhost:/exports /mnt/
  umount /mnt/
  umount /exports

a few times in a loop on a RHEL6 guest and didn't see any failures.
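
The loop was roughly as follows (a reconstruction, not the exact script; an unexport and nfs stop are added between iterations so /exports can be unmounted cleanly):

  for i in $(seq 1 10); do
      mkfs.xfs -f /dev/vdb
      mount /dev/vdb /exports
      service nfs start
      exportfs -o rw '*:/exports'
      mount -o nfsvers=3 localhost:/exports /mnt/ || echo "iteration $i: mount failed"
      umount /mnt/
      exportfs -u '*:/exports'
      service nfs stop
      umount /exports
  done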

So I'm stuck for now.

Barry, are you still seeing this?
Comment 9 Barry Marson 2011-01-08 11:25:43 EST
Bruce,

I'm still seeing this ... at least with the -71 kernel.

I noticed that I had not re-run exportfs after building the filesystems, as you did after bringing them online for NFS.  Doing so had no effect, nor did a showmount -e from the client just before the mount attempt.
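
Concretely, the extra checks were along these lines (a sketch, not the exact commands):

  # on the server, after the filesystems are rebuilt and mounted
  exportfs -ra      # re-export everything in /etc/exports
  exportfs -v       # confirm the exports are active

  # on the client, just before the mount attempt
  showmount -e sfss1
  mount -t nfs -o nfsvers=3 sfss1:/sfs1 /mnt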

Barry
Comment 10 RHEL Product and Program Management 2011-02-01 01:04:57 EST
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Comment 11 Ric Wheeler 2011-02-01 07:50:19 EST
Is this still an issue with the latest 6.1 code?

Thanks!
Comment 12 Barry Marson 2011-02-01 10:47:58 EST
While I should consider upgrading the client-side kernel, I've locked it down for years for way-back testing ...

As of now, a 2.6.9-89.ELsmp client trying to mount from a 2.6.32-105.el6.x86_64 server still fails.

Barry
Comment 13 J. Bruce Fields 2011-02-01 12:22:25 EST
Could I get a look at the exact scripts that are doing the mkfs, nfsd start, etc.?  I just want to make sure they're not doing anything unusual.
Comment 14 RHEL Product and Program Management 2011-02-01 13:33:31 EST
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.
Comment 15 J. Bruce Fields 2011-02-01 16:55:43 EST
Looking at /proc/net/rpc/nfsd.fh/content after a failed mount, it looks like mountd is failing to resolve the uuid; I wonder if this is the same problem as http://www.spinics.net/lists/linux-nfs/msg00876.html (or something similar).
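
For reference, those caches can be dumped and compared against the filesystem's current UUID roughly like this (device name is an example):

  # dump the filehandle and export caches that mountd fills in
  cat /proc/net/rpc/nfsd.fh/content
  cat /proc/net/rpc/nfsd.export/content
  # compare any uuid= entries there with the filesystem's current UUID
  blkid /dev/vdb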
Comment 16 J. Bruce Fields 2011-02-02 22:35:29 EST
I found a similar problem on an RHEL6 test machine: if I shut down nfs, unmount /dev/vdb (which holds my exported filesystem), re-mkfs /dev/vdb, remount it, restart nfs, and try to mount it, the mount succeeds--but, interestingly, comparing 'blkid /dev/vdb' with the export cache (/proc/net/rpc/nfsd.export/content) shows that mountd is still using the uuid of the *old* filesystem.

However, if instead of doing "service nfs stop" and "service nfs start" to stop and start nfs, I *just* stop and start rpc.mountd by hand, then mountd gets updated information.

Stripping out code from /etc/init.d/nfs, I eventually replaced the "start" and "stop" cases by exactly the commands I was using to start and stop rpc.mountd by hand, and still saw the difference in behavior.

My only remaining idea was that it could be some selinux rule; and indeed: looking at straces of rpc.mountd in both cases, I see that in one an open of /dev/vdb fails, and in the other it succeeds; and after "setenforce 0", everything works.  So in my case selinux appears to be preventing libblkid from getting a current uuid.  Perhaps it is in your case too.
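
A sketch of checks along these lines (not the exact commands used):

  # attach to the running rpc.mountd and watch its open() calls while a
  # client mount is attempted from another shell
  strace -f -e trace=open -p $(pidof rpc.mountd) -o /tmp/mountd.strace
  grep vdb /tmp/mountd.strace    # a failed open (EACCES) of the device suggests an selinux denial

  # quick test with enforcement turned off
  setenforce 0
  service nfs restart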

Could you try turning off selinux and seeing if the problem is still reliably reproducible?
Comment 17 Barry Marson 2011-02-03 00:19:43 EST
selinux is disabled in the clients via /etc/selinux/config

the server has selinux=0 on the boot line

Barry
Comment 18 Ric Wheeler 2011-03-17 15:07:59 EDT
Looks like this is too late for 6.1...
Comment 19 J. Bruce Fields 2011-09-13 17:25:17 EDT
Sorry, I was never able to duplicate this or work out what's going on here; are you still seeing the problem?
Comment 20 Ric Wheeler 2011-09-13 19:26:53 EDT
If not, let's close this BZ until we see it again....
