Bug 1148619
| Summary: | moving NFSv4 export causes client mounts to fail | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | David Vossel <dvossel> |
| Component: | nfs-utils | Assignee: | J. Bruce Fields <bfields> |
| Status: | CLOSED WONTFIX | QA Contact: | JianHong Yin <jiyin> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.1 | CC: | bcodding, bfields, dvossel, eguan, fdinitto, jruemker, mnovacek, ovasik, sbradley, swhiteho |
| Target Milestone: | rc | Keywords: | TestBlocker |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-01-07 20:14:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
The filehandle has a part that identifies the filesystem and a part that identifies the particular file within that filesystem.
The usual explanation for this kind of failure is that the part identifying the filesystem changed between servers. Using the 'fsid=' option consistently on the exports is one way to avoid this (though on a modern system it should default to identifying the filesystem by uuid, which should also remain the same across servers).
On each server, after a mount attempt (successful or unsuccessful), could you collect the output of:
exportfs -v
cat /proc/net/rpc/*/content
That should help figure out how the servers are identifying the exported filesystems.
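To make the 'fsid=' suggestion concrete, here is a sketch of an exports file that pins the filesystem identifiers explicitly, kept identical on every server in the HA pair. The paths, client range, and fsid values below are hypothetical illustrations, not taken from this cluster:

```
# /etc/exports -- kept byte-identical on both HA servers (values hypothetical)
/mnt/exports           192.168.122.0/24(rw,sync,fsid=0,crossmnt)
/mnt/exports/export1   192.168.122.0/24(rw,sync,fsid=1)
/mnt/exports/export2   192.168.122.0/24(rw,sync,fsid=2)
```

With explicit fsid= values like these, the filesystem part of the filehandle no longer depends on the underlying device, which can differ between servers.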
(In reply to J. Bruce Fields from comment #2)
> The filehandle has a part that identifies the filesystem and a part that
> identifies the particular file within that filesystem.
>
> The usual explanation for this kind of failure is that the part identifying
> the filesystem changed between servers. Using the 'fsid=' option

here's a thought: the exports that are movable between servers exist on shared storage. Their fsid values and their physical storage device never change. The root export (fsid=0) will always have the same fsid, but its physical storage device will be different on every server. The root filesystem for the exports is unique to every server, whereas the non-root filesystems are completely consistent between servers.

> consistently on the exports is one way to avoid this (though on a modern
> system it should default to identifying the filesystem by uuid, which should
> also remain the same across servers).

Looking at the man page for exports, the fsid should uniquely identify the export, not the UUID of the filesystem. Perhaps there is a problem here?

I have another data point that seems to indicate this is where the problem is. When I place the root export on shared storage and migrate the root export with the other shared exports, everything is fine... So, if the root export is physically the same device, we have no problem after the move. If the root export's device is unique between servers, clients fail after the shared exports move.

Again, this same scenario that fails in rhel7 works in rhel6.

> On each server, after a mount attempt (successful or unsuccessful), could
> you collect the output of:

The mounts are always successful.

> exportfs -v
> cat /proc/net/rpc/*/content
>
> That should help figure out how the servers are identifying the exported
> filesystems.

Yep, I can gather this.

-- David

"I have another data point that seems to indicate this is where the problem is. When I place the root export on shared storage and migrate the root export with the other shared exports, everything is fine... So, if the root export is physically the same device, we have no problem after the move. If the root export's device is unique between servers, clients fail after the shared exports move."

That's interesting! In that case maybe comment #2 is wrong: it may not be the filesystem part of the filehandle that's the problem, but instead the part that identifies the actual file.

It'd still be useful to get that export information just to confirm there's no problem there.

"again, this same scenario that fails in rhel7 works in rhel6."

But I don't have an explanation for this.

Are you setting up the pseudofilesystem explicitly with an fsid=0 export in all of these tests? If so, I wonder if there's something different about how you're creating that filesystem in the rhel6 and rhel7 cases.

What are the inode numbers of the directories in the "pseudofilesystem"? Is it possible that in the rhel6 case they were created in some way that happened to give the corresponding directories the same inode numbers, but in the rhel7 case they weren't?

But if it's the pseudofilesystem that's the issue, then there's one more thing I don't understand: the client normally only uses the pseudofilesystem at mount time. Is the client actually mounting / itself?

There might also be a race condition here where the mount might fail if the failover happened while a mount was in progress--I'll look into that--but I'd expect that to be harder to reproduce, so not what you're seeing here.

(In reply to J. Bruce Fields from comment #5)
> Are you setting up the pseudofilesystem explicitly with an fsid=0 export in
> all of these tests?

yes, the setup is identical. This test is set up using a template scenario file that runs the exact same way on both rhel7 and rhel6. https://github.com/davidvossel/phd/blob/master/scenarios/nfs-active-active.scenario

> If so, I wonder if there's something different about how you're creating
> that filesystem in the rhel6 and rhel7 cases.
>
> What are the inode numbers of the directories in the "pseudofilesystem"? Is
> it possible that in the rhel6 case they were created in some way that
> happened to give the corresponding directories the same inode numbers, but
> in the rhel7 case they weren't?

you are onto something here. For rhel6 the fsid=0 directory does have the same inode on both nodes. For rhel7 they are different.

[root@rhel6-auto1 ~]# ls -i /mnt/
134506 exports

[root@rhel6-auto2 ~]# ls -i /mnt/
134506 exports

[root@rhel7-auto1 ~]# ls -i /mnt/
1248784 exports  17323771 gfs2share

[root@rhel7-auto2 ~]# ls -i /mnt/
1257550 exports  18886413 gfs2share

This is concerning. If this is in fact the problem, what options do we have to make this reliable?

-- David

(In reply to J. Bruce Fields from comment #6)
> But if it's the pseudofilesystem that's the issue, then there's one more
> thing I don't understand: the client normally only uses the pseudofilesystem
> at mount time.
>
> Is the client actually mounting / itself?

no, the client is external to all the nfs servers.

> There might also be a race condition here where the mount might fail if the
> failover happened while a mount was in progress--I'll look into that--but
> I'd expect that to be harder to reproduce, so not what you're seeing here.

good to look into, but that's not happening here

Created attachment 943572 [details]
exportfs debug
I've uploaded the debug you asked for.
cat /proc/net/rpc/*/content
exportfs -v
The txt file shows the state of the system before exports move between servers and after. I've included output from both rhel6 and rhel7.
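As background for why the differing inode numbers matter, the filehandle layout described in comment #2 (a filesystem part plus a file part) can be pictured with a toy model. This is a deliberate simplification, not the real knfsd filehandle encoding; only the inode numbers are taken from the `ls -i` output earlier in this bug.

```python
# Toy model of an NFS filehandle as (filesystem part, file part).
# NOT the real knfsd encoding -- just an illustration of why handles
# minted on one server stop resolving on a server whose fsid=0
# directory has a different inode number.

def make_handle(fsid, inode):
    """A filehandle identifies a filesystem and a file within it."""
    return (fsid, inode)

def lookup(server_inodes, handle):
    """Resolve a handle to a path, or 'ESTALE' if it no longer resolves."""
    path = server_inodes.get(handle)
    return path if path is not None else "ESTALE"

# Inode numbers observed in this bug: the fsid=0 directory differs
# between the two rhel7 nodes (1248784 vs. 1257550).
rhel7_node1 = {(0, 1248784): "/mnt/exports"}
rhel7_node2 = {(0, 1257550): "/mnt/exports"}

handle = make_handle(0, 1248784)    # minted on node1 before failover
print(lookup(rhel7_node1, handle))  # /mnt/exports
print(lookup(rhel7_node2, handle))  # ESTALE after failover to node2
```

In the rhel6 setup the fsid=0 directory happened to get inode 134506 on both nodes, so the same lookup keeps working after failover, which is why the regression only shows up on rhel7.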
(In reply to J. Bruce Fields from comment #6)
> There might also be a race condition here where the mount might fail if the
> failover happened while a mount was in progress--I'll look into that--but
> I'd expect that to be harder to reproduce, so not what you're seeing here.

(In reply to David Vossel from comment #8)
> (In reply to J. Bruce Fields from comment #6)
> > Is the client actually mounting / itself?
>
> no, the client is external to all the nfs servers.

What I meant was: is it doing something like

mount server:/ /mnt/

? I don't completely understand this snippet from the file you linked:

pcs -f $tmpfile resource create nfs-client-v4 Filesystem device=${PHD_ENV_floating_ips1}:/ directory=/nfsclientv4 fstype=nfs4

but it looks like it might be? I think I do understand why that wouldn't always work.

> > There might also be a race condition here where the mount might fail if
> > the failover happened while a mount was in progress--I'll look into
> > that--but I'd expect that to be harder to reproduce, so not what you're
> > seeing here.
>
> good to look into, but that's not happening here

I talked to Trond; unfortunately I think it's likely we do have a problem here that we'll have to figure out how to fix.

So I think the situation for now is:

- we should tell people doing HA to mount individual exports, not /.
- and we might need to warn them that there could be temporary mount failures if the mount happens at the same time as failover.

Thanks for the export information. I haven't had a chance to look that over yet, but I think we may understand the issue now.

In both rhel6 and rhel7 cases, on the target server after the failover /proc/net/rpc/nfsd.fh/content has only:

192.168.122.0/255.255.255.0 1 0x00000000 /mnt/exports

which means that's the only export the client attempts to access after the failover. That looks consistent with the client mounting only floating-ip:/ (== /mnt/exports).

So, I think this is expected, and we need to either tell people to mount only the "real" exports underneath, or we need to arrange for the filesystems on the two nodes to be identical clones of each other.

The drawback of the first approach is that the problem can still happen when a mount happens at the same time as failover.

The second approach is more complicated. You can create a small filesystem to use as the pseudofilesystem and copy the filesystem image to both servers. The extra step is annoying, but perhaps we could provide some help. Another problem is that using fsid=0 causes v3 and v4 clients to see different paths (e.g. v3 clients have to mount floating-ip:/mnt/exports/export1 to get the same filesystem that v4 clients mount using floating-ip:/export1). We could fix that with some changes to mountd.

You can leave out the fsid=0 export entirely, and recent rhel6 and rhel7 will construct a pseudofilesystem automatically, giving consistent paths with v3 and v4 (so both could mount /mnt/exports/export1 in the above example). However, the pseudofilesystem in that case is just a tightly restricted export of the server's actual /, so for example filehandles are constructed using inode numbers from the exported filesystem. Arranging for the root filesystem on both servers to be identical sounds like more of a pain. Perhaps we could find a different way to construct that pseudofilesystem.

(In reply to J. Bruce Fields from comment #10)
. . .
> I talked to Trond; unfortunately I think it's likely we do have a problem
> here that we'll have to figure out how to fix.
>
> So I think the situation for now is:
>
> - we should tell people doing HA to mount individual exports, not /.

This definitely gets us closer. I tested mounting using the individual exports, and failover does not cause a stale file handle issue on the client.

Now that the client can maintain access to the mount after the failover, I've encountered another issue I can't explain. Part of our regression tests is for the client to hold a file lock on the nfs mount, and then to perform a server-side export failover while the lock is still being held. For the rhel7 test setup, the lock reclaim fails for the NFSv4 mount:

kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!

I performed the same test using rhel6, and the lock reclaim happens after the failover as expected. On both rhel7 and rhel6 I'm mounting the client using the specific export and not the root export, as we discussed:

# mount -v -o "vers=4" 192.168.122.200:/export1 /nfsclientv4

From there I grab a lock on a file within the export using the fcntl lock util I've created here, https://raw.githubusercontent.com/davidvossel/phd/master/misc/fnctl_locker.c

# fnctl_locker -f /nfsclientv4/clientsharefile

After the lock is held, I fail over export1 from one server to another. rhel6 works, rhel7 doesn't. Any ideas?

-- David

(In reply to David Vossel from comment #12)
> After the lock is held, I fail over export1 from one server to another.
> rhel6 works, rhel7 doesn't. Any ideas?

I guess the most likely reasons are that the target wasn't in a grace period, or that the reboot-recovery information didn't get moved over, so the server doesn't think this client is allowed to reclaim.

Oh, wait, I bet that's it: do the rhel7 servers have a /var/lib/nfs/nfsdcltrack/main.sqlite file? We switched to a different system for tracking the reboot recovery information, and I can't remember if that change made it to rhel7--probably so. So I bet we just need to copy over that nfsdcltrack directory on failover.
(I'd recommend also continuing to copy /var/lib/nfs/nfsdcltrack/, just in case it's needed for backwards compatibility, but it will probably normally be empty (or even absent).)

(In reply to J. Bruce Fields from comment #13)
> Oh, wait, I bet that's it: do the rhel7 servers have a
> /var/lib/nfs/nfsdcltrack/main.sqlite file?

yes, the file is present on the rhel7 cluster. The file has content in it as well, indicating that it is used.

> We switched to a different system for tracking the reboot recovery
> information, and I can't remember if that change made it to rhel7--probably
> so. So I bet we just need to copy over that nfsdcltrack directory on
> failover.

I'm thinking it likely isn't that simple. We need to dynamically move exports between multiple active servers. An export might move from server A to server B. That doesn't mean that all the exports moved from server A, though. That also doesn't mean that server B wasn't already hosting exports before the move occurred. If the main.sqlite file is required to track which clients can reclaim which locks, we actually have to merge the contents of one server's main.sqlite file with the contents of another. Do you have any thoughts on this?

> (I'd recommend also continuing to copy /var/lib/nfs/nfsdcltrack/, just in
> case it's needed for backwards compatibility, but it will probably normally
> be empty (or even absent).)

This bug has gotten a bit complicated. I believe there are three independent issues remaining:

- clients mounting the server's "/" can get ESTALE after failover. Workaround is not to mount "/".
- clients mounting could get ESTALE if they happen to mount at exactly the moment failover is in progress. Workaround is probably to retry the mount. I believe there is actually some logic in the VFS to retry at least one ESTALE, so this may be hard to hit.
- we depend on being able to merge the list of clients on failover.

The first two issues really have the same root cause. My impression is that they're lower priority, but that the third is still a real obstacle?

The third, I think, can be fixed for now in the 4.0-only case by a simple program that merges the two tables, throwing out any duplicates (possibly with a warning, since I don't believe there should be any duplicates in the 4.0 case).

NFSv4.0 clients use a different client identifier for each distinct server IP address. Clients using version >=4.1 use the same client identifier to different servers. I think that may cause clients to be permitted to reclaim locks in some cases where they shouldn't, but I need to write out the case to be sure. For now we could disable >=4.1 on the server side. I think that's doable for 7.1.

I'd just like to make sure we're in agreement on comment 16 & 17: so for now are you OK with:

- mounting only "real" exports, not "/"? (And maybe dealing with the first two issues from comment 16 in a separate lower-priority bug.)
- possibly restricting to 4.0 and disabling NFS versions >=4.1?

(In reply to J. Bruce Fields from comment #17)
> NFSv4.0 clients use a different client identifier for each distinct server
> IP address.
>
> Clients using version >=4.1 use the same client identifier to different
> servers.

I forgot that 4.0 clients may behave this way too, though it shouldn't be the default; see the "migration/nomigration" options in the nfs man page (which warns that "Some server features misbehave" if it's changed from the default), and see the (rather verbose) https://tools.ietf.org/html/draft-ietf-nfsv4-rfc3530-migration-update-05.

> I think that may cause clients to be permitted to reclaim locks in some
> cases where they shouldn't, but I need to write out the case to be sure.

Somewhat contrived, but:

- Client A mounts exports from servers B and C.
- B and C are rebooted, and simultaneously A loses contact with B.
- Client A successfully reclaims locks on C, but B's grace period expires without A being able to reclaim locks on B.
- We move exports from C to B. Simultaneously, the network between A and B comes back up.

At this point, the merged view of B and C's client records makes it appear that A should be able to reclaim any locks it wants to. But it should really only be permitted to reclaim locks on the exports that migrated from C.

(In reply to J. Bruce Fields from comment #18)
> I'd just like to make sure we're in agreement on comment 16 & 17: so for
> now are you OK with:
> - mounting only "real" exports, not "/"? (And maybe dealing with the
>   first two issues from comment 16 in a separate lower-priority bug.)

yes. Based on the discussion in this issue, not mounting "/" in this scenario is reasonable.

> - possibly restricting to 4.0 and disabling NFS versions >=4.1?

Let's track the NFSv4 client state merge discussion in a separate issue. I'm not confident we'll be able to solve that for 7.1.

(In reply to David Vossel from comment #20)
> Let's track the NFSv4 client state merge discussion in a separate issue.
> I'm not confident we'll be able to solve that for 7.1.
OK, based on this I'm guessing we should close this bug, and later open some more specific set of bugs for the more specific issues we've identified (at a minimum, for the client-record merging).
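The "simple program that merges the two tables, throwing out any duplicates" discussed above might look roughly like the following sketch. The table name and column layout of nfsdcltrack's main.sqlite are assumptions for illustration here (a "clients" table keyed by client id) and should be checked against the real schema before using anything like this in a failover agent.

```python
# Sketch (NOT the shipped nfsdcltrack tooling): merge one server's
# client-tracking records into another's on failover, skipping
# duplicates. The schema is an assumption for illustration.
import sqlite3

def merge_client_records(dst_db, src_db):
    """Copy client records from src_db into dst_db; return final row count."""
    con = sqlite3.connect(dst_db)
    try:
        con.execute("ATTACH DATABASE ? AS src", (src_db,))
        before = con.execute("SELECT COUNT(*) FROM clients").fetchone()[0]
        # INSERT OR IGNORE drops rows whose primary key already exists,
        # i.e. duplicates are thrown out, as comment 17 suggests.
        con.execute("INSERT OR IGNORE INTO clients SELECT * FROM src.clients")
        con.commit()
        after = con.execute("SELECT COUNT(*) FROM clients").fetchone()[0]
        src_total = con.execute("SELECT COUNT(*) FROM src.clients").fetchone()[0]
        dupes = src_total - (after - before)
        if dupes:
            # Comment 17 notes duplicates shouldn't occur in the 4.0 case,
            # so surface them rather than merging silently.
            print("warning: %d duplicate client record(s) skipped" % dupes)
        return after
    finally:
        con.close()
```

As comments 17-19 point out, a merge like this is only safe for NFSv4.0 clients (which use a distinct client identifier per server IP); for >=4.1 clients the merged view can wrongly permit reclaims, which is why that discussion moved to a separate issue.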
Created attachment 943207 [details]
Active/Active HA-NFS use-case presentation

Description of problem:

We need to be able to support the HA NFSv4 Active/Active use-case, where NFSv4 exports move between two active servers with limited interruption on the client side. I have attached a PDF slide presentation that outlines exactly how this use-case works.

Right now the Active/Active use-case appears to work as expected on rhel6. On rhel7, however, we are seeing errors that cause the clients to experience 'stale filehandle' errors after an export relocates. This indicates a regression between rhel6 and rhel7.

In /var/log/messages on the client, there is an error that occurs after the server export moves that is likely related to this issue:

NFS: server 192.168.122.200 error: fileid changed

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
This setup requires floating IPs and multiple NFS servers using pacemaker.
1. Set up the Active/Active HA-NFS use-case as described by the attached presentation slides.
2. On a node outside of the cluster, mount one of the nfs export groups using the floating ip address associated with that group.
3. Place the node exporting the filesystem the client has mounted in standby using 'pcs cluster standby <nodename>'.

Actual results:
The nfs mount on the client is no longer available after the export moves. Performing an ls on the mount directory presents us with the 'stale filehandle' error. The only way to restore the client is to umount and then mount the nfs export again.

Expected results:
The client maintains access to the nfs mount after the export migrates between active NFS servers.