| Summary: | NFS ESTALE from mkdir/rmdir | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Paul Victor Novarese <pvn> |
| Component: | kernel | Assignee: | Jeff Layton <jlayton> |
| Status: | CLOSED NOTABUG | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 5.7 | CC: | bfields, bryce.petrini, dhowells, dwysocha, jlayton, rwheeler, sprabhu, steved, wnix |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-10-21 17:02:15 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Paul Victor Novarese
2011-10-18 21:03:38 UTC
Created attachment 528889 [details]
network trace during reproduction
Well, now that is interesting... There's not a single NFSERR_STALE error in either capture. Typically when we see these sorts of problems, it means the server returned that error because it didn't recognize the filehandle. There are, however, some places in the NFS and generic VFS code that return that error for other reasons. I'll need to look over the code and the capture and see if there is any reason this would occur.

Question: when you run this test, do you ever see any kernel messages pop up in dmesg?

I ran the reproducer above for about 30000 passes against a Linux NFS server and never saw it fail. My suspicion at this point is that the server they're testing against is significant here.
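The reproducer script itself is not attached to this report. As a rough sketch of what such a test presumably does, assuming it simply loops over mkdir/rmdir on the NFS mount (the `target`, `passes`, and `test_dir.<hostname>` naming are guesses based on the debug log quoted later, which shows `test_dir.gram0142`):

```python
#!/usr/bin/env python
"""Hedged sketch of an mkdir/rmdir reproducer; the real script is not shown
in this report. Defaults to a throwaway local directory so the sketch is
runnable as-is; point TARGET at the NFS mount to actually exercise the bug."""
import os
import socket
import sys
import tempfile

def run(target=None, passes=100):
    # target and passes are illustrative assumptions, not values from the report
    if target is None:
        target = tempfile.mkdtemp()
    test_dir = os.path.join(target, "test_dir." + socket.gethostname())
    for i in range(passes):
        try:
            os.mkdir(test_dir)   # an ESTALE here would surface as OSError
            os.rmdir(test_dir)
        except OSError as e:
            print("failed on pass %d: %s" % (i, e), file=sys.stderr)
            return 1
    print("completed %d passes" % passes)
    return 0

if __name__ == "__main__":
    sys.exit(run(target=sys.argv[1] if len(sys.argv) > 1 else None))
```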
What we may need to do is get them to run the same reproducer with NFS debugging turned up.
On their test client machine, have them run this to turn up NFS debugging:
# rpcdebug -m nfs -s all
...then run the test. Once it fails, immediately run:
# rpcdebug -m nfs -c all
...to turn it off again. Then, run dmesg and collect the output. With luck we may see the source of the ESTALE problem. Ideally you'll want to do this on a client that's only running this test at the time. Any other NFS activity will just add extra chaff to the logs.
Ok, that log helps a lot:
-----------------[snip]-------------------
Oct 19 16:03:59 s_kernel@gram0142 kernel: NFS: nfs_lookup_revalidate(npavey/test_dir.gram0142) is valid
Oct 19 16:03:59 s_kernel@gram0142 kernel: --> nfs_follow_mountpoint()
Oct 19 16:03:59 s_kernel@gram0142 kernel: nfs_follow_mountpoint: enter
Oct 19 16:03:59 s_kernel@gram0142 kernel: NFS call lookup test_dir.gram0142
Oct 19 16:03:59 s_kernel@gram0142 kernel: NFS: nfs_update_inode(0:26/14035749 ct=1 info=0x6)
Oct 19 16:03:59 s_kernel@gram0142 kernel: NFS reply lookup: 0
Oct 19 16:03:59 s_kernel@gram0142 kernel: --> nfs_do_submount()
Oct 19 16:03:59 s_kernel@gram0142 kernel: nfs_do_submount: submounting on npavey/test_dir.gram0142
Oct 19 16:03:59 s_kernel@gram0142 kernel: --> nfs_xdev_get_sb()
Oct 19 16:03:59 s_kernel@gram0142 kernel: --> nfs_clone_server(,23475a4a:0,)
Oct 19 16:03:59 s_kernel@gram0142 kernel: --> nfs_probe_fsinfo()
Oct 19 16:03:59 s_kernel@gram0142 kernel: NFS call fsinfo
Oct 19 16:03:59 s_kernel@gram0142 kernel: NFS reply fsinfo: 0
Oct 19 16:03:59 s_kernel@gram0142 kernel: NFS call pathconf
Oct 19 16:03:59 s_kernel@gram0142 kernel: NFS reply pathconf: 0
Oct 19 16:03:59 s_kernel@gram0142 kernel: <-- nfs_probe_fsinfo() = 0
Oct 19 16:03:59 s_kernel@gram0142 kernel: Cloned FSID: 23475a4a:0
Oct 19 16:03:59 s_kernel@gram0142 kernel: <-- nfs_clone_server() = ffff810b8d847000
Oct 19 16:03:59 s_kernel@gram0142 kernel: --> nfs_free_server()
Oct 19 16:03:59 s_kernel@gram0142 kernel: --> nfs_put_client({2})
Oct 19 16:03:59 s_kernel@gram0142 kernel: <-- nfs_free_server()
Oct 19 16:03:59 s_kernel@gram0142 kernel: do_proc_get_root: call fsinfo
Oct 19 16:03:59 s_kernel@gram0142 kernel: do_proc_get_root: reply fsinfo: 0
Oct 19 16:03:59 s_kernel@gram0142 kernel: NFS: nfs_update_inode(0:26/14035757 ct=2 info=0x6)
Oct 19 16:03:59 s_kernel@gram0142 kernel: NFS: nfs_fhget(0:26/14035757 ct=2)
Oct 19 16:03:59 s_kernel@gram0142 kernel: <-- nfs_xdev_get_sb() = -116 [splat]
Oct 19 16:03:59 s_kernel@gram0142 kernel: nfs_do_submount: done
Oct 19 16:03:59 s_kernel@gram0142 kernel: <-- nfs_do_submount() = ffffffffffffff8c
Oct 19 16:03:59 s_kernel@gram0142 kernel: nfs_follow_mountpoint: done, returned -116
Oct 19 16:03:59 s_kernel@gram0142 kernel: <-- nfs_follow_mountpoint() = -116
-----------------[snip]-------------------
-116 is ESTALE (kernel functions return errors as negative errno values)...
So the reply from the server is causing the client to attempt a new submount, and that submount attempt is failing.
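For reference, ESTALE is errno 116 on Linux, and since the kernel passes errors around as negative errno values, the same -116 shows up in the pointer-formatted log line as the 64-bit two's-complement value ffffffffffffff8c. A quick check (Python used purely for the arithmetic here):

```python
import errno
import os

# ESTALE is errno 116 on Linux, matching the -116 in the debug log above.
print(errno.ESTALE)               # 116
print(os.strerror(errno.ESTALE))  # e.g. "Stale file handle" (text varies by libc)

# Kernel functions return errors as negative errno values; rendered as an
# unsigned 64-bit quantity, -116 becomes ffffffffffffff8c, as in the log:
print(hex(-errno.ESTALE & 0xFFFFFFFFFFFFFFFF))   # 0xffffffffffffff8c
```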
I don't have a capture from this new reproducer, but in looking at the older one, almost all of the mkdir calls have this in both the new object attributes and the dir_wcc info:
fsid: 0x000000000d9c6279 (228352633)
...but I see this in frame 239 (just prior to the failure) in the dir_wcc info:
fsid: 0x000000000d9c61c9 (228352457)
...the new object attributes show the original fsid.
An fsid change implies that you've crossed a mountpoint on the server. RHEL5 in this situation will attempt to mount the new filesystem as a submount; RHEL4 does not, which explains why they're only seeing this now.
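The client decides it has crossed a mountpoint by comparing fsids; any mismatch between the parent directory's attributes and the new object's attributes is enough to trigger the submount path. A trivial sanity check of the two values quoted from the capture (Python, for illustration only):

```python
# fsid values quoted from the capture above, in their hex and decimal forms.
fsid_dir_wcc    = 0x0D9C61C9   # from the dir_wcc info in frame 239
fsid_new_object = 0x0D9C6279   # from the new object attributes

# Confirm the decimal renderings shown in the capture match the hex values:
assert fsid_dir_wcc == 228352457
assert fsid_new_object == 228352633

# The client only cares whether they match; a mismatch is treated as a
# crossing into a different filesystem on the server, hence the submount.
print(fsid_dir_wcc == fsid_new_object)   # False
```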
Since you're only creating a directory in this situation, this is almost certainly a server bug. I'd contact NetApp and ask them whether this is a known problem and whether they have a fix.
...and since the client returns an ESTALE error when it attempts the submount, the information the server is providing is clearly bogus.

Thanks, Paul. Going ahead and closing this as NOTABUG. Please reopen if you wish to discuss it further.