In order to migrate clients using a filesystem from server A to server B, we need to first unmount the filesystem from server A. But that unmount can fail with -EBUSY, for reasons including:

- An in-progress rpc is using the filesystem.
- An rpc recently used the filesystem, and A's export caches still hold a reference.
- A client holds a lock, or (for NFSv4 clients) an open or delegation, or (for NFSv4.1 and higher clients) a pNFS layout.

We could handle the first two issues by temporarily stopping rpc.mountd, flushing the export caches (with exportfs -f), unmounting, then restarting rpc.mountd. But that's insufficient in the presence of NLM or NFSv4 state.

The way we currently deal with this is roughly the following (a rough shell sketch appears below):

- Configure all clients accessing the filesystem to mount it through a single floating IP address.
- Shut down nfsd on both A and B.
- Unmount the filesystem from A and mount it on B.
- Move the floating IP address.
- Restart nfsd on both A and B.
- Allow the clients to reclaim their state during B's grace period.

Shutting down the server on A is what makes the unmount work reliably. One problem with this procedure is that all the other clients (including clients accessing different exports through different server IP addresses) are forced to wait for the server restart. We'd like to fix that.
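For concreteness, here is a hedged sketch of the current procedure. The hostnames (serverA, serverB), the export path (/export/data), the shared block device (/dev/datavg/data), the floating IP (192.0.2.10/24 on eth0), and the use of systemd units and ssh are all illustrative assumptions, not what the existing HA agents literally do:

    #!/bin/sh
    set -e

    # Stop nfsd on both servers so no rpc or lock can pin the filesystem on A.
    ssh serverA systemctl stop nfs-server
    ssh serverB systemctl stop nfs-server

    # With the server down, the unmount on A should no longer return -EBUSY.
    # Assumes the filesystem lives on shared storage visible to both servers.
    ssh serverA umount /export/data
    ssh serverB mount /dev/datavg/data /export/data

    # Move the floating IP that all clients of this filesystem use.
    ssh serverA ip addr del 192.0.2.10/24 dev eth0
    ssh serverB ip addr add 192.0.2.10/24 dev eth0

    # Restart nfsd; clients reconnect to the floating IP and reclaim their
    # locks, opens, and delegations during B's grace period.
    ssh serverA systemctl start nfs-server
    ssh serverB systemctl start nfs-server

The key point the sketch makes visible is the blast radius: every client of either server sits through the nfsd restarts and B's grace period, not just the clients of the filesystem being migrated.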
Alternatives include:

- Extend nfsd's unlock_filesystem/unlock_ip interfaces. They provide a way to forcefully drop NLM locks, but don't handle v4 state. This may be the simplest fix. (A sketch of how the existing interfaces are driven today appears after this list.)
- Teach umount to forcibly remove nfs locks; I believe that's essentially what https://bugzilla.redhat.com/show_bug.cgi?id=749044#c10 is suggesting. This may violate user expectations in some cases and shouldn't be the default behavior. (Client applications will likely crash, rather than hanging while waiting for the filesystem to return as they would across a server reboot.) However, we could provide an allow_force_umount export option. Implementation is somewhat tricky. Kinglong Mee has tried to do something similar just for the references held by the export caches, and this could build on that work (which builds in turn on Al Viro's mount pin work). This might also have other uses, e.g. for people who want to decommission some filesystem badly enough that they don't mind crashing client applications.
- Finish nfsd containerization and run a separate container for each floating IP: this allows independent shutdown and startup of nfs servers for each floating IP without affecting other floating IPs on the same machine. That also prevents unnecessary grace-period delays for other clients of server B (the target of the migration). It should also make client configuration more foolproof, since only exports meant to be used over a given floating IP will be visible over that IP. I believe the last missing piece here is containerization of usermode helpers. Ian Kent has done some work on that; I'm not sure where it stands.

The first two solutions force us to shut down the floating IP before unexporting and unmounting (to prevent the client from seeing spurious ESTALE errors), and I believe both leave us vulnerable to the ACK storm problems described in https://bugzilla.redhat.com/show_bug.cgi?id=1161795.
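For reference, the first alternative would build on interfaces that already exist; a hedged sketch of driving them today follows. The export path (/export/data) and floating IP (192.0.2.10) are assumptions for illustration, and note that these files currently release only NLM state — the proposal is to extend them (or add similar controls) to also drop NFSv4 opens, delegations, and layouts:

    #!/bin/sh
    # Drop all NLM locks held by clients of a particular server (floating) IP:
    echo 192.0.2.10 > /proc/fs/nfsd/unlock_ip

    # Or drop all NLM locks on a particular exported filesystem:
    echo /export/data > /proc/fs/nfsd/unlock_filesystem

    # Combined with stopping rpc.mountd and flushing the export caches
    # (exportfs -f), this improves the odds that the unmount succeeds --
    # but any v4 opens, delegations, or layouts will still pin the mount.
    umount /export/data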
There's an existing solution similar to containerization -- use a virtual machine for each floating IP or export. That gives you all the features of the containerized setup, it works today, and creates a server architecture that is pNFS-ready (for some layouts). We rarely hear about this HA NFS architecture because the people that use it are not having any of these problems, so I think we tend to forget about it.
(In reply to Benjamin Coddington from comment #3)
> There's an existing solution similar to containerization -- use a virtual
> machine for each floating IP or export. That gives you all the features of
> the containerized setup, it works today, and creates a server architecture
> that is pNFS-ready (for some layouts). We rarely hear about this HA NFS
> architecture because the people that use it are not having any of these
> problems, so I think we tend to forget about it.

I'm all for it. Can we figure out whether we have users for whom that isn't an adequate replacement for the NFS HA stuff? If so, can we fix any problems with the VM-based approach and deprecate the existing NFS HA agents?
*** Bug 1145930 has been marked as a duplicate of this bug. ***
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.