Bug 492671

Summary: NFS lock recovery on server reboot fails
Product: Red Hat Enterprise Linux 4 Reporter: Fabio Olive Leite <fleite>
Component: nfs-utilsAssignee: Steve Dickson <steved>
Status: CLOSED WONTFIX QA Contact: yanfu,wang <yanwang>
Severity: high Docs Contact:
Priority: high    
Version: 4.8CC: bfields, coughlan, jlayton, rwheeler
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 492669 Environment:
Last Closed: 2010-10-27 16:05:29 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Fabio Olive Leite 2009-03-27 22:44:56 UTC
+++ This bug was initially created as a clone of Bug #492669 +++

Description of problem:

When an NFS server boots, it starts the rpc.statd service before the kernel lockd service is up and running. If clients held locks when the server rebooted, they will try to recover their locks while the server lockd service is not yet available and will fail.

This problem is in fact in both client and server.
- Client should keep retrying until service is available;
- Server should not advertise its status change before NLM service is up.

This bugzilla is about the second part. This problem can easily be worked around by starting the nfslock service in rc.local. This bugzilla is a request for a change in service start ordering on boot so that nfs (which loads lockd module) starts before nfslock (which starts rpc.statd).

Version-Release number of selected component (if applicable):

4.8 packages.

How reproducible:

Always.

Steps to Reproduce:
1. Set up NFS client and server
2. Start tcpdump capture in client
3. Run program that locks file
4. Reboot server
5. After server comes up, stop capture and check that after NSM NOTIFY call the client tried to re-issue the lock and received a PROGRAM NOT AVAILABLE error from rpcbind at the server.

Actual results:

Client can't recover lock.

Expected results:

Client recovers lock.

Additional info:

NSM and NLM specifications are vague, and yes, NFS clients SHOULD retry the lock operation a few times with some wait period in between retries (vague eh?), but still, ensuring NSM is started _after_ NLM at the server leads to faster and more reliable lock recovery, regardless of what the clients do.