We have a two-node NFS cluster backed by a GFS2 filesystem. We've noticed that RHEL NFS clients that request locks always hang, unless they access the "passive" node of the cluster directly.

The problem appears to stem from the fact that when the NLM on the primary node transmits its "GRANT" response to the client, it does so via an asynchronous callback, meaning that a new connection is established to the client. It appears that this connection is initiated from the machine's primary IP, and not from the "cluster" IP to which the client originally sent its lock request. The client, rightly, rejects this response and continues blocking forever. As an aside, it seems that Solaris 10 NFS clients are not as "secure" and happily accept a GRANT from any IP under the sun (no pun intended).

This post[1] to linux-nfs seems to indicate there is a kernel patch to address this. I have been unable to find the kernel commit, but am curious whether it has been backported to RHEL5's kernel. This is a show-stopper for us and I will be filing an SR as well. It sounds like this is a known (and already resolved) issue, but I can attach a packet dump and steps to reproduce the problem if needed.

[1] http://markmail.org/message/nd4lvfpiv6gkacio
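To make the suspected failure mode concrete, here is a toy model of the client-side check described above. It is only a sketch of the behaviour as I understand it, not the actual lockd code: the names `PendingLock` and `accept_grant`, the `cookie` field, the `strict` flag, and the 192.0.2.x addresses are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingLock:
    """A blocked lock request, as remembered by the client (hypothetical model)."""
    server_ip: str  # address the client sent its lock request to
    cookie: int     # hypothetical identifier tying a GRANT back to the request


def accept_grant(pending: PendingLock, source_ip: str, cookie: int,
                 strict: bool = True) -> bool:
    """Decide whether an incoming GRANT callback satisfies a pending lock.

    strict=True models the behaviour we see from RHEL clients: the GRANT
    must come from the same address the request was sent to.
    strict=False models the behaviour we see from Solaris 10 clients:
    any source address is accepted.
    """
    if cookie != pending.cookie:
        return False
    if strict:
        return source_ip == pending.server_ip
    return True


# Assumed addresses (documentation range), not our real ones.
CLUSTER_IP = "192.0.2.10"  # floating service address the client mounted
PRIMARY_IP = "192.0.2.11"  # the active node's own address

pending = PendingLock(server_ip=CLUSTER_IP, cookie=42)

# GRANT arrives from the node's primary IP: a strict client ignores it
# and keeps blocking.
print(accept_grant(pending, PRIMARY_IP, 42))
# GRANT arrives from the cluster IP: accepted.
print(accept_grant(pending, CLUSTER_IP, 42))
# A lenient client accepts the GRANT regardless of source address.
print(accept_grant(pending, PRIMARY_IP, 42, strict=False))
```

If this model is right, the fix belongs on the server side: the callback has to originate from the address the client originally contacted.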
I should note the following: Servers are running RHEL 5.4, kernel 2.6.18-164.6.1.el5, with nfs-utils-1.0.9-42.el5. Clients are RHEL 5.4 as well, fully patched and on the latest kernels. I know our server kernel isn't the latest; we just haven't rebooted in a while.
Opened SR #1988432 for this issue.
I believe this is a duplicate of bug 500653. Closing as such. Please reopen if I've misunderstood the problem you're having. *** This bug has been marked as a duplicate of bug 500653 ***