Description of problem:
After moving a ganesha instance (with its VIP) to another protocol node, our wire trace shows that the client sent RECLAIM_COMPLETE, but log messages from the nfs-ganesha server show that it never processed / accounted for the RECLAIM_COMPLETE and a full grace period is enforced.

For our test scenario, we configured a ganesha cluster with 3 nfs-ganesha instances over a set of 6 protocol nodes, and then we placed 1 node into maintenance mode, which changed the placement of one ganesha instance to another protocol node. The ganesha instance had only a single client, and it was a 4.1 client.

The problem reproduced 100% of the time using the rados_cluster recovery backend, but was less common (about 50%) using the rados_ng and rados_kv recovery backends. With rados_ng, the first time the instance was moved, RECLAIM_COMPLETE processing usually worked as expected, but if the grace period exited early the first time, a full grace period would usually be enforced when repeating the experiment to return the instance to its original node.

When an nfs-ganesha instance is simply restarted without changing nodes, RECLAIM_COMPLETE is processed as expected, and the grace period exits early after the client sends it.

I configured nfs-ganesha with additional logging, and these messages popped up from the rados_cluster and rados_ng backends:
https://github.com/nfs-ganesha/nfs-ganesha/blob/c9ff03bb11397d525e8b768772a2a26b84628796/src/SAL/recovery/recovery_rados_cluster.c#L164
https://github.com/nfs-ganesha/nfs-ganesha/blob/c9ff03bb11397d525e8b768772a2a26b84628796/src/SAL/recovery/recovery_rados_ng.c#L312

Version-Release number of selected component (if applicable):

How reproducible:
100% on the rados_cluster backend, about 50% with rados_ng

Steps to Reproduce:
1. Configure an nfs-ganesha cluster for NFS 4.1+ with 1 export accessible to 1 client.
2. Set up the NFS 4.1+ client, start a tcpdump capture, and mount the export.
3. Open a file from the 4.1+ mount and start doing IO.
4. Move the nfs-ganesha instance (with its VIP) to another node.
5. Observe that client IO is blocked because the server enforces a full grace period, even though the client sent RECLAIM_COMPLETE and it was the only client.

Actual results:
After relocating the nfs-ganesha instance to another node, the client completed recovery and sent RECLAIM_COMPLETE. It was the only client of the server, yet the server enforced a full grace period as if the client had never sent RECLAIM_COMPLETE.

Expected results:
The nfs-ganesha server should have exited grace early after its only client sent RECLAIM_COMPLETE, instead of enforcing a full grace period waiting for it.

Additional info:
These experiments were performed on an Acadia Storage cluster, and the mechanism we were using to move the VIP used by the ganesha instance is unique to Acadia. The mechanics of initiating HA-NFS failover in your lab environment will be different, but that's fine. This bz is about the rados_cluster and rados_ng recovery backends emitting the log messages linked above as they bail out early from reading in persisted lease info from their RADOS objects after restarting on another node. After these messages are logged, the nfs-ganesha instance doesn't do RECLAIM_COMPLETE accounting / process RECLAIM_COMPLETE correctly, and a full grace period is enforced. The really strange part that I don't understand at all is that it isn't 100% reproducible for rados_ng: typically the first failover works, but the failback then reproduces it.
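For reference, a minimal sketch of the recovery-backend portion of a ganesha.conf for this kind of setup (the pool, namespace, and nodeid values below are placeholders, not our actual Acadia-generated config; RecoveryBackend was switched between rados_cluster, rados_ng, and rados_kv for the different runs):

    NFSv4 {
        # Recovery backend under test; rados_ng and rados_kv were also tried
        RecoveryBackend = rados_cluster;
        Minor_Versions = 1, 2;
    }

    RADOS_KV {
        # Pool/namespace holding the persisted client recovery records
        pool = "nfs-ganesha";
        namespace = "grace";
        # Per-instance identifier; the recovery records are keyed by nodeid,
        # so the relocated instance is expected to come up with the same nodeid
        nodeid = "nfs1";
    }

The sketch is only meant to show which knobs were in play; the failure itself is in how the backend reads those persisted records back after the instance restarts on the new node.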
Please specify the severity of this bug. Severity is defined here: https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.
Severity set to medium. When the nfs-ganesha server does not process RECLAIM_COMPLETE correctly after moving to another node, it enforces a full grace period. Until the nfs-ganesha server exits its grace period, client workloads that create or destroy protocol state are blocked, so the NFS client impact is a temporary workload disruption until the grace period ends.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Ceph Storage 8.1 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2025:9775