While trying to reproduce the problem for bug 240976, I've noticed that the RHEL5 kernel doesn't seem to attempt to reclaim locks when it gets a SM_NOTIFY. Looking at a packet capture, I see the SM_NOTIFY request come in, and the reply go out, but I don't see any NLM traffic at all. Checking /proc/locks on the server shows that the lock isn't reclaimed well after the grace period. I've done a binary search and have determined that the problem exists at least as far back as -19.el5, but haven't checked anything earlier since brew doesn't seem to have any earlier builds. -8.1.9.el5 seems to work correctly though, so the problem was likely introduced somewhere in between those two revs.
nlm_debug from -8.1.9.el5 kernel when SM_NOTIFY comes in: Nov 7 08:20:36 dhcp231-179 kernel: lockd: request from 7f000001 Nov 7 08:20:36 dhcp231-179 kernel: lockd: SM_NOTIFY called Nov 7 08:20:36 dhcp231-179 kernel: lockd: nlm_lookup_host(ac103b81, p=6, v=4) Nov 7 08:20:36 dhcp231-179 kernel: lockd: get host 172.16.59.129 Nov 7 08:20:36 dhcp231-179 kernel: lockd: get host 172.16.59.129 Nov 7 08:20:36 dhcp231-179 kernel: lockd: release host 172.16.59.129 Nov 7 08:20:36 dhcp231-179 kernel: nlmsvc_retry_blocked(0000000000000000, when=0) Nov 7 08:20:36 dhcp231-179 kernel: nlmsvc_retry_blocked(0000000000000000, when=0) Nov 7 08:20:36 dhcp231-179 kernel: lockd: rebind host 172.16.59.129 Nov 7 08:20:36 dhcp231-179 kernel: NLM: reclaiming locks for host 172.16.59.129lockd: call procedure 2 on 172.16.59.129 Nov 7 08:20:36 dhcp231-179 kernel: lockd: nlm_bind_host(ac103b81) Nov 7 08:20:36 dhcp231-179 kernel: lockd: server returns status 0 Nov 7 08:20:36 dhcp231-179 kernel: NLM: done reclaiming locks for host 172.16.59.129lockd: release host 172.16.59.129 ...nlm_debug from same event on -19.el5 kernel shows no messages from the kernel, but this shows up in the logs: Nov 7 08:28:59 dhcp231-179 rpc.statd[2030]: recv_rply: [10.11.231.179] service 100021 not registered
Running statd in debug mode, I see this: 11/07/2007 08:37:44 rpc.statd[1995]: Received SM_NOTIFY from dhcp59-129.rdu.redhat.com, state: 725 11/07/2007 08:37:44 rpc.statd[1995]: Waiting for reply... (timeo 5) 11/07/2007 08:37:44 rpc.statd[1995]: address mismatch: expected 10.11.231.179, got 127.0.0.1 11/07/2007 08:37:44 rpc.statd[1995]: recv_rply: [10.11.231.179] service 100021 not registered 11/07/2007 08:37:44 rpc.statd[1995]: Waiting for client connections.
When run on a 8.1.9.el5 kernel, it looks like this: 11/07/2007 08:44:59 rpc.statd[1956]: Version 1.0.9 Starting 11/07/2007 08:44:59 rpc.statd[1956]: Flags: No-Daemon Log-STDERR 11/07/2007 08:44:59 rpc.statd[1956]: New state: 237 11/07/2007 08:44:59 rpc.statd[1956]: Waiting for client connections. 11/07/2007 08:45:20 rpc.statd[1956]: Waiting for client connections. 11/07/2007 08:45:20 rpc.statd[1956]: MONITORING 172.16.59.129 for dhcp231-179.rdu.redhat.com 11/07/2007 08:45:20 rpc.statd[1956]: Waiting for client connections. 11/07/2007 08:45:37 rpc.statd[1956]: Received SM_NOTIFY from dhcp59-129.rdu.redhat.com, state: 727 11/07/2007 08:45:37 rpc.statd[1956]: Waiting for reply... (timeo 5) 11/07/2007 08:45:37 rpc.statd[1956]: address mismatch: expected 10.11.231.179, got 127.0.0.1 11/07/2007 08:45:37 rpc.statd[1956]: Waiting for reply... (timeo 5) 11/07/2007 08:45:37 rpc.statd[1956]: address mismatch: expected 10.11.231.179, got 127.0.0.1 11/07/2007 08:45:37 rpc.statd[1956]: Callback to dhcp231-179.rdu.redhat.com (for 1871380944) succeeded. 11/07/2007 08:45:37 rpc.statd[1956]: Waiting for client connections.
I think I see the problem...on more recent kernels, lockd is not registering a UDP port, and statd seems to expect one: 2.6.18-19.el5xen: # rpcinfo -p | grep nlockmgr 100021 1 tcp 45684 nlockmgr 100021 3 tcp 45684 nlockmgr 100021 4 tcp 45684 nlockmgr 2.6.18-8.1.9.el5xen: # rpcinfo -p | grep nlock 100021 1 udp 32770 nlockmgr 100021 3 udp 32770 nlockmgr 100021 4 udp 32770 nlockmgr 100021 1 tcp 47795 nlockmgr 100021 3 tcp 47795 nlockmgr 100021 4 tcp 47795 nlockmgr
Looks like upstream is likely broken as well... After talking with Steve, I think the right approach will be to fix statd and allow it to talk to lockd via TCP if UDP fails. That's going to take a bit of rework for statd though. It's got UDP concepts hardcoded in a lot of places. At the moment, I'm trying to do a temporary patch to statd to make it talk TCP to the kernel as a test to see whether that fixes this. A real fix will probably take a bit...
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
There is a workaround for this: If you set lockd to have a static UDP port then it will open a UDP socket even for TCP mounts.
...though this does not work if the patch for bug 313671 is in place (since it changes nfslock to use sysctls instead of module parameters). We may want to hold off on applying that patch to nfs-utils until this problem is resolved to allow for the easy workaround.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Bikash, if you'd like to test this, pick up any kernel from the link in Comment #14 newer than Package: kernel-2.6.18-63.el5
*** Bug 444741 has been marked as a duplicate of this bug. ***
Proposing for 5.1.z EUS (see the bug #444741 for explanation)
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html