Bug 199586 - nfslock script starts statd before lockd is up so lock recovery fails
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: nfs-utils
Version: 4.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Steve Dickson
QA Contact: Ben Levenson
URL:
Whiteboard:
Depends On: 146773
Blocks: 198694
 
Reported: 2006-07-20 16:24 UTC by Jeff Layton
Modified: 2014-06-18 07:35 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-11-01 19:37:49 UTC
Target Upstream Version:
Embargoed:


Attachments
trivial patch to init script (460 bytes, patch)
2006-07-21 17:50 UTC, Jeff Layton
patch to move statd startup and shutdown into nfs.init (1.02 KB, patch)
2006-07-22 22:11 UTC, Jeff Layton

Description Jeff Layton 2006-07-20 16:24:07 UTC
Description of problem:

The default chkconfig line in the nfslock init script starts statd long before
lockd is up. This causes clients to try to recover their locks too early. Here's
a sample network trace showing PROGRAM_NOT_AVAILABLE after the client
attempted to recover locks on a reboot:

163.999543 172.16.57.30 -> 172.16.57.138 Portmap V2 GETPORT Call STAT(100024)
V:1 UDP
164.000380 172.16.57.138 -> 172.16.57.30 Portmap V2 GETPORT Reply (Call In 73)
Port:32768
164.000832 172.16.57.30 -> 172.16.57.138 STAT V1 NOTIFY Call
164.003574 172.16.57.138 -> 172.16.57.30 STAT V1 NOTIFY Reply (Call In 75)
164.004416 172.16.57.138 -> 172.16.57.30 TCP 32866 > sunrpc [SYN] Seq=0 Len=0
MSS=1460 TSV=407443 TSER=0 WS=0
164.004700 172.16.57.30 -> 172.16.57.138 TCP sunrpc > 32866 [SYN, ACK] Seq=0
Ack=1 Win=5792 Len=0 MSS=1460 TSV=4294700213 TSER=407443 WS=2
164.004765 172.16.57.138 -> 172.16.57.30 TCP 32866 > sunrpc [ACK] Seq=1 Ack=1
Win=5840 Len=0 TSV=407443 TSER=4294700213
164.004947 172.16.57.138 -> 172.16.57.30 Portmap V2 GETPORT Call NLM(100021) V:1 TCP
164.005222 172.16.57.30 -> 172.16.57.138 TCP sunrpc > 32866 [ACK] Seq=1 Ack=61
Win=5792 Len=0 TSV=4294700214 TSER=407443
164.005832 172.16.57.30 -> 172.16.57.138 Portmap V2 GETPORT Reply (Call In 80)
PROGRAM_NOT_AVAILABLE

Changing nfslock.init chkconfig line to this:

# chkconfig: 345 61 19

seems to fix the problem. Opening this for RHEL4, since that's where I
originally noticed the problem, but it looks like FC has the same issue.
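As a rough sketch (not the attached patch itself) of how that one-line change would be applied: the start/stop numbers 14/86 below are illustrative stock priorities, and the demo edits a local copy of the header rather than the real /etc/rc.d/init.d/nfslock.

```shell
#!/bin/sh
# Sketch only: apply the chkconfig-line change from this report to a
# local copy of the init script header. On a real system you would edit
# /etc/rc.d/init.d/nfslock and then re-register the service so the
# rc.d symlinks pick up the new priorities.
script=./nfslock.demo
printf '#!/bin/sh\n# chkconfig: 345 14 86\n# description: NFS file locking\n' > "$script"

# Bump the start priority past nfs (which starts at 60) and pull the
# stop priority in ahead of it.
sed -i 's/^# chkconfig: 345 [0-9]* [0-9]*$/# chkconfig: 345 61 19/' "$script"
grep '^# chkconfig:' "$script"

# On the real system the symlinks would then be rebuilt with:
#   chkconfig --del nfslock && chkconfig --add nfslock
```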

Comment 2 Jeff Layton 2006-07-21 17:50:13 UTC
Created attachment 132827 [details]
trivial patch to init script

Trivial patch that seems to fix the problem.

Comment 3 Steve Dickson 2006-07-22 11:34:18 UTC
What confuses me is that nfslock no longer brings lockd up (or down). The kernel
does that when the server is started or the client mounts a filesystem, so I'm
not sure how or why this fix works...

Comment 4 Jeff Layton 2006-07-22 15:12:56 UTC
Right -- lockd is now started by the 'nfs' script, so the only thing the nfslock
script now does is start statd. The fix here is just to make sure that nfslock
runs after the nfs script at boot time.

Another (maybe better?) fix might be to do away with the nfslock script
altogether and just have statd started by the 'nfs' script. Let me know if you
think that's the way to go.


Comment 5 Jeff Layton 2006-07-22 22:11:50 UTC
Created attachment 132873 [details]
patch to move statd startup and shutdown into nfs.init

Something like this patch might actually be a better way to go (though I've
not tested it yet).

This moves the rpc.statd startup and shutdown into nfs.init. With something
like this we can probably just remove nfslock.init from the package.
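A minimal conceptual sketch of the idea (not the attachment itself; `daemon` and `killproc` normally come from /etc/init.d/functions on RHEL and are stubbed here so the fragment runs standalone):

```shell
#!/bin/sh
# Conceptual sketch: statd start/stop moved into nfs.init.
# 'daemon' and 'killproc' come from /etc/init.d/functions on a real
# system; stubbed here so this runs outside an init environment.
daemon()   { echo "starting: $*"; }
killproc() { echo "stopping: $*"; }

start() {
    # ...existing nfs.init startup (exportfs, mountd, nfsd) would run here...
    daemon rpc.statd    # moved in from nfslock.init
}

stop() {
    killproc rpc.statd  # moved in from nfslock.init
    # ...existing nfs.init shutdown continues...
}

start
stop
```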

Alternately, we may just want to do this in the devel and/or FC trees, and just
go with the chkconfig change for the existing RHEL releases.

I'd be OK either way...

Comment 6 Steve Dickson 2006-07-24 11:41:18 UTC
Moving the starting of rpc.statd into the nfs init script would mean
the nfs server would have to be started every time the system
booted (since statd is also needed by the client), which
is not the right thing to do... imho...

It seems to me that maybe nfslock should always bring up
lockd so it's started at the same time rpc.statd is... Maybe
doing a 'modprobe lockd' could cause the server to come
up....
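A sketch of that suggestion, with `modprobe` stubbed so the fragment is runnable outside a real init environment (the real script would call /sbin/modprobe):

```shell
#!/bin/sh
# Sketch of comment 6's idea: nfslock.init loads the lockd module itself
# so lockd is up by the time rpc.statd sends out its reboot notifications.
# modprobe is stubbed here; a real init script would use /sbin/modprobe.
modprobe() { echo "loaded module: $1"; }

start() {
    modprobe lockd             # kernel lockd starts when the module loads
    echo "starting: rpc.statd" # then statd can safely notify clients
}

start
```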

Also, maybe the bug could be tied with 
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=146773
since it may be related when it comes to failovers...


Comment 7 Jeff Layton 2006-07-24 12:21:11 UTC
Good point, I hadn't considered the client-side use of nfslock.

Would there be any harm to simply making nfslock start later here? It seems like
that would take care of the server-side case.

Also, I'm not clear on what the effect would be on the server in starting lockd
up before mountd/exportfs, etc. I've not picked through the code enough to know
if server-side lockd would allow the client to reclaim a lock on a filesystem
that's not yet exported.

All that said, 146773 does look like a thornier problem. I'll go ahead and make
this BZ dependent on that one. Any fix for this would be affected by that case
anyhow, and we can just try to be cognizant of this problem as well to make sure
that it gets addressed.


Comment 8 RHEL Program Management 2006-08-18 15:03:08 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Steve Dickson 2006-09-07 08:16:57 UTC
> Would there be any harm to simply making nfslock start later here?
I believe so... statd has to be up and running before the netfs
initscript runs, otherwise the client locking side would break...
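The ordering conflict can be seen from the rc.d link names themselves. A mock illustration (the priorities S14/S25/S60 are the usual RHEL4 defaults, shown here against a scratch directory rather than the real /etc/rc.d/rc3.d):

```shell
#!/bin/sh
# Mock of /etc/rc.d/rc3.d to illustrate comment 9's constraint.
# Link names reflect typical RHEL4 defaults: nfslock S14, netfs S25, nfs S60.
mkdir -p rc3.demo
touch rc3.demo/S14nfslock rc3.demo/S25netfs rc3.demo/S60nfs

# init runs S* links in lexical order, so statd (nfslock) starts before
# netfs, as client-side locking requires -- but moving nfslock past S60
# for the server case would also push it after netfs, breaking clients.
ls rc3.demo
```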

Comment 11 Steve Dickson 2006-11-01 19:37:49 UTC
After further review, this is not a server bug... If the client stops trying to
recover its locks just because the server has not made it up (yet), then
that is a client bug, because the client should *never* stop trying to
recover its locks...

