Description of problem: The default chkconfig line in the nfslock init script starts it long before lockd is up. This causes clients to try to recover their locks too early. Here's a sample network trace showing PROGRAM_NOT_AVAILABLE after the client attempted to recover locks on a reboot:

163.999543 172.16.57.30 -> 172.16.57.138 Portmap V2 GETPORT Call STAT(100024) V:1 UDP
164.000380 172.16.57.138 -> 172.16.57.30 Portmap V2 GETPORT Reply (Call In 73) Port:32768
164.000832 172.16.57.30 -> 172.16.57.138 STAT V1 NOTIFY Call
164.003574 172.16.57.138 -> 172.16.57.30 STAT V1 NOTIFY Reply (Call In 75)
164.004416 172.16.57.138 -> 172.16.57.30 TCP 32866 > sunrpc [SYN] Seq=0 Len=0 MSS=1460 TSV=407443 TSER=0 WS=0
164.004700 172.16.57.30 -> 172.16.57.138 TCP sunrpc > 32866 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=4294700213 TSER=407443 WS=2
164.004765 172.16.57.138 -> 172.16.57.30 TCP 32866 > sunrpc [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=407443 TSER=4294700213
164.004947 172.16.57.138 -> 172.16.57.30 Portmap V2 GETPORT Call NLM(100021) V:1 TCP
164.005222 172.16.57.30 -> 172.16.57.138 TCP sunrpc > 32866 [ACK] Seq=1 Ack=61 Win=5792 Len=0 TSV=4294700214 TSER=407443
164.005832 172.16.57.30 -> 172.16.57.138 Portmap V2 GETPORT Reply (Call In 80) PROGRAM_NOT_AVAILABLE

Changing the nfslock.init chkconfig line to this:

# chkconfig: 345 61 19

seems to fix the problem. Opening this for RHEL4, since that's where I originally noticed the problem, but it looks like FC has the same issue.
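For anyone unfamiliar with the header, here's a minimal POSIX shell sketch of what the chkconfig line encodes, using the value proposed above (the S/K link names are illustrative):

```shell
#!/bin/sh
# Sketch: the "# chkconfig:" header has three fields: the runlevels in
# which the service is enabled, the start priority, and the stop
# priority. rc executes S* symlinks in ascending numeric order, so a
# start priority of 61 makes S61nfslock run after S60nfs.

header="# chkconfig: 345 61 19"

# Split the header on whitespace; fields 3-5 carry the values.
set -- $header
levels=$3 start=$4 stop=$5

echo "enabled in runlevels: $levels"
echo "start link: S${start}nfslock (runs after any lower-numbered S links)"
echo "stop link:  K${stop}nfslock"
```

This is just to show why bumping the start number past the nfs script's changes the boot ordering; the real work is done by chkconfig when it (re)creates the symlinks in /etc/rc.d.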
Created attachment 132827 [details]
trivial patch to init script

Trivial patch that seems to fix the problem.
What confuses me is that nfslock no longer brings lockd up (or down). The kernel does that when the server is started or the client mounts a filesystem, so I'm not sure how or why this fix works...
Right -- lockd is now started by the 'nfs' script, so the only thing the nfslock script now does is start statd. The fix here is just to make sure that nfslock runs after the nfs script at boot time. Another (maybe better?) fix might be to do away with the nfslock script altogether and just have statd started by the 'nfs' script. Let me know if you think that's the way to go.
Created attachment 132873 [details]
patch to move statd startup and shutdown into nfs.init

Something like this patch might actually be a better way to go (though I've not tested this patch as of yet). This moves the rpc.statd startup and shutdown into nfs.init. With something like this we can probably just remove nfslock.init from the package. Alternately, we may just want to do this in the devel and/or FC trees, and go with the chkconfig change for the existing RHEL releases. I'd be OK either way...
Moving the starting of rpc.statd into the nfs init script would mean the nfs server would have to be started every time the system booted (since statd is also needed by the client), which is not the right thing to do... imho... It seems to me that maybe nfslock should always bring up lockd so it's started at the same time rpc.statd is... Maybe doing a 'modprobe lockd' could cause the server to come up... Also, maybe this bug should be tied to https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=146773 since it may be related when it comes to failovers...
Good point, I hadn't considered the client-side use of nfslock. Would there be any harm to simply making nfslock start later here? It seems like that would take care of the server-side case. Also, I'm not clear on what the effect would be on the server in starting lockd up before mountd/exportfs, etc. I've not picked through the code enough to know if server-side lockd would allow the client to reclaim a lock on a filesystem that's not yet exported. All that said, 146773 does look like a thornier problem. I'll go ahead and make this BZ dependent on that one. Any fix for this would be affected by that case anyhow, and we can just try to be cognizant of this problem as well to make sure that it gets addressed.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
> Would there be any harm to simply making nfslock start later here?

I believe so... statd has to be up and running before the netfs initscript runs, otherwise the client-side locking would break...
After further review, this is not a server bug... If the client stops trying to recover its locks just because the server has not made it up (yet), then that is a client bug, because the client should *never* stop trying to recover its locks...