Bug 492669

Summary: NFS lock recovery on server reboot fails
Product: Red Hat Enterprise Linux 5
Version: 5.3
Component: nfs-utils
Status: CLOSED WONTFIX
Severity: high
Priority: high
Assignee: Steve Dickson <steved>
Reporter: Fabio Olive Leite <fleite>
QA Contact: yanfu,wang <yanwang>
CC: chuck.lever, jlayton, jwest, mcermak, moshiro, rwheeler, stijn, yanwang
Hardware: All
OS: Linux
Target Milestone: rc
Fixed In Version: nfs-utils-1.0.9-40.el5
Doc Type: Bug Fix
Clones: 492671
Last Closed: 2011-08-17 14:25:09 UTC

Description Fabio Olive Leite 2009-03-27 22:39:27 UTC
Description of problem:

When an NFS server boots, it starts the rpc.statd service before the kernel lockd service is up and running. If clients held locks when the server rebooted, they try to recover those locks while the server's lockd service is not yet available, and the recovery fails.

This problem is in fact in both client and server.
- Client should keep retrying until service is available;
- Server should not advertise its status change before NLM service is up.

This bugzilla is about the second part. The problem can easily be worked around by starting the nfslock service from rc.local (a sketch follows below). This bugzilla requests a change in the boot-time service start ordering so that nfs (which loads the lockd module) starts before nfslock (which starts rpc.statd).
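For illustration, a minimal sketch of the rc.local workaround mentioned above, plus a way to inspect the current boot ordering; runlevel 3 and the S-number link naming are assumptions about a stock SysV setup, not details taken from this report:

    # workaround: keep nfslock out of its normal boot slot and start it last
    chkconfig nfslock off
    echo "service nfslock start" >> /etc/rc.local

    # inspect the current start ordering (a lower S-number starts first)
    chkconfig --list | grep -E '^nfs(lock)?[[:space:]]'
    ls /etc/rc3.d/ | grep -iE 'S[0-9]+nfs(lock)?$'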

Version-Release number of selected component (if applicable):

5.3 packages.

How reproducible:

Always.

Steps to Reproduce:
1. Set up NFS client and server
2. Start tcpdump capture in client
3. Run program that locks file
4. Reboot server
5. After the server comes back up, stop the capture and check that, after the NSM NOTIFY call, the client tried to re-issue the lock and received a PROGRAM_NOT_AVAILABLE error from the server's portmapper.
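For illustration, a rough sketch of steps 2 and 3; the mount point, interface, server name, and the Python one-liner standing in for the locking program are assumptions for the sketch, not details from the original report:

    # step 2: capture the portmap/NLM/NSM traffic on the client (assumes eth0 and host "srv")
    tcpdump -i eth0 -s 0 -w /tmp/lock-recovery.pcap host srv &

    # step 3: take and hold a POSIX (fcntl) lock across the server reboot,
    # so the request actually goes to the server's NLM service
    python -c 'import fcntl,time; f=open("/mnt/nfs/lockfile","w"); fcntl.lockf(f, fcntl.LOCK_EX); time.sleep(600)' &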

Actual results:

Client can't recover lock.

Expected results:

Client recovers lock.

Additional info:

The NSM and NLM specifications are vague, and yes, NFS clients SHOULD retry the lock operation a few times with some wait period between retries (vague, eh?), but still, ensuring NSM starts _after_ NLM on the server leads to faster and more reliable lock recovery, regardless of what the clients do.

Comment 3 Chuck Lever 2009-04-03 22:03:15 UTC
Today, statd is started in /etc/init.d/nfslock, and it runs sm-notify automatically.  However, on Linux, lockd doesn't start NLM listeners until an NFS file system is mounted, or rpc.nfsd is run.

The NSM service is already split into rpc.statd and sm-notify in recent nfs-utils releases.  It might be sufficient to start statd with "--no-notify" in /etc/init.d/nfslock, and then run /usr/sbin/sm-notify in /etc/init.d/netfs and /etc/init.d/nfs.  It's OK to run it in both places as sm-notify checks for a pid file before notifying.  This means either script may get there first, but it will notify only once.
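A minimal sketch of that split, using the option and paths named above; this is illustrative only, not the shipped init scripts:

    # /etc/init.d/nfslock: start statd without sending reboot notifications
    # (daemon() comes from /etc/init.d/functions)
    daemon rpc.statd --no-notify

    # /etc/init.d/netfs and /etc/init.d/nfs: send the notifications once mounts
    # and/or nfsd are up; sm-notify's own pid-file check keeps the second
    # invocation from notifying again
    [ -x /usr/sbin/sm-notify ] && /usr/sbin/sm-notify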

Note that a system that has no NFS mounts in /etc/fstab (for example, a client that uses the automounter to handle NFS mounting) would still send SM_NOTIFY requests without NLM listeners.

Additionally, the mount.nfs command checks for the presence of statd before allowing mounts to proceed.  Mounts will fail (and lockd listeners will not be started) if statd is not already running when the first NFS mount is attempted.

statd and sm-notify share the same directory for storing monitored hosts.  If sm-notify runs so much later that newly monitored hosts are included in the notify list, we end up notifying fresh hosts of our last reboot.  This is why statd and sm-notify are run at the same time at system boot.

Finally, when sm-notify runs today, it attempts to pass the current NSM state (stored in /var/lib/nfs/statd/state) to lockd by writing the state number to /proc/sys/fs/nfs/nsm_local_state.  This nearly always fails today, though there's no error message generated, because lockd hasn't populated /proc/sys/fs/nfs when statd first comes up.  So client-side lockd probably sends an incorrect state number in its requests to servers.
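A quick way to observe the window described here (illustrative only): the proc entry appears only once the lockd module is loaded, so a check such as the following shows nothing useful until then:

    # /proc/sys/fs/nfs/ is populated by lockd; before that the handoff has nowhere to go
    if [ -e /proc/sys/fs/nfs/nsm_local_state ]; then
        cat /proc/sys/fs/nfs/nsm_local_state
    else
        echo "lockd not loaded yet: NSM state cannot be handed to the kernel"
    fi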

I think straightening this out will require us to start lockd early in the boot process (as part of /etc/init.d/nfslock for example) and ensure that it's listening for requests before statd starts.  The kernel's lockd should be changed so that starting listeners is part of init_lockd(), and not part of lockd_up().  If lockd is started first, statd can set the local NSM state (and maybe even the caller_name) appropriately, and run sm-notify.

It's worth noting that Solaris (the reference NFSv2/v3 implementation) sends SM_NOTIFY requests before its NLM service is up.  Clients that are stopped by an unregistered NLM service on the server should always retry in any event, and lock recovery is no exception.  This is what a 'hard' mount is, right?  If a client gives up because NLM isn't registered, I would argue that the client is broken, or doesn't care about reclaiming locks.

Comment 4 Jeff Layton 2009-04-06 19:12:04 UTC
My idea was to start statd early with --no-notify, and then run sm-notify in a separate init script after nfs mounts have been done and nfsd has been started.

Why a separate init script? A machine won't necessarily have any nfs mounts and won't necessarily be a NFS server...

That, however, does not solve the problem of the client racing in with a new lock on one of the servers and sm-notify then running, which tells the server to drop that lock (ugh).

I think this just plain isn't fixable w/o better kernel <-> statd communication.

Comment 5 Chuck Lever 2009-04-06 19:22:20 UTC
(In reply to comment #4)
> My idea was to start statd early with --no-notify, and then run sm-notify in a
> separate init script after nfs mounts have been done and nfsd has been started.

The problem is that statd and sm-notify use the same directory: /var/lib/nfs/statd/sm.  If sm-notify starts after lockd has asked for some host to be monitored, that host will be notified and deleted from /var/lib/nfs/statd.

And, when statd starts up, if there are hosts in /var/lib/nfs/statd/sm, it will reload them and assume that they are monitored.  When sm-notify runs later, it will erase those hosts, but statd will still have them loaded in-core.  Hopefully the @priv field will help statd distinguish a fresh SM_MON request from one of these left-over hosts, but it could be pretty easy to confuse statd.

I think we can't and shouldn't separate sm-notify too far from statd, at least in their current incarnation.

> Why a separate init script? A machine won't necessarily have any nfs mounts and
> won't necessarily be a NFS server...
> 
> That however does not solve the problem of the client racing in with a new lock
> on one of the servers and then sm-notify running, which tells the server to
> drop that lock (ugh).
> 
> I think just plain isn't fixable w/o better kernel <-> statd communications.  

Starting lockd in init_lockd shouldn't be too hard (famous last words).  I'm thinking about coding up something to see what issues we might hit.

Comment 6 Chuck Lever 2009-04-17 00:41:44 UTC
I've been clued in on the secret to fixing this.  Apparently Neil Brown provided some notes in nfs-utils somewhere that describe the ordering constraints for starting NFSD.

What needs to happen is that nfsd has to be started before statd.  nfsd will necessarily start lockd and start the grace period.  To ensure that lockd has the proper port settings (and whatever other parameters are specified in /etc/sysconfig/nfs), you need to copy the logic from /etc/init.d/nfslock to /etc/init.d/nfsd and make sure that /etc/init.d/nfsd runs before nfslock does.

This will guarantee that lockd is up and listening on the server before statd starts.
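An illustrative sketch of carrying the lockd port setup into the nfs server script before nfsd starts; the LOCKD_TCPPORT/LOCKD_UDPPORT variables and the lockd module parameters mirror what the stock nfslock script handles, and the exact placement is an assumption, not a committed patch:

    # in /etc/init.d/nfs, before rpc.nfsd is started
    [ -f /etc/sysconfig/nfs ] && . /etc/sysconfig/nfs
    LOCKDARG=""
    [ -n "$LOCKD_TCPPORT" ] && LOCKDARG="nlm_tcpport=$LOCKD_TCPPORT"
    [ -n "$LOCKD_UDPPORT" ] && LOCKDARG="$LOCKDARG nlm_udpport=$LOCKD_UDPPORT"
    /sbin/modprobe lockd $LOCKDARG   # nfsd startup then brings up the NLM listeners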

On the client, statd and sm-notify can run before lockd comes up without problems.

Comment 7 Steve Dickson 2009-10-13 17:32:52 UTC
I guess there could be some communication between the nfslock and nfs startup
scripts using chkconfig.... 

in nfslock:
    if chkconfig nfs | grep yes
       exit and let the nfs server bring things up

in nfs:
    if chkconfig nfslock | grep yes
       service nfslock serverstart

It's a little cheesy, but it could work...
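A runnable rendering of that idea, purely for illustration; the cross-check and the "serverstart" action are proposals rather than an existing interface, and chkconfig's exit status is used here instead of grepping its output:

    # in /etc/init.d/nfslock start():
    if chkconfig nfs; then
        # nfs is enabled for this runlevel -- let the nfs script bring lockd up first
        exit 0
    fi

    # in /etc/init.d/nfs start(), once nfsd (and therefore lockd) is running:
    if chkconfig nfslock; then
        service nfslock start
    fi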

Comment 17 Martin Cermak 2010-06-30 07:44:49 UTC
=> VERIFIED

Comment 23 RHEL Program Management 2010-08-09 18:09:28 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 24 RHEL Program Management 2011-01-11 19:55:44 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 26 RHEL Program Management 2011-01-11 23:10:40 UTC
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 29 yanfu,wang 2011-05-19 12:31:14 UTC
I could reproduce it on RHEL5.3 against nfs-utils-1.0.9-37.el5 using the testcase /CoreOS/nfs-utils/Multihost/bz492669-NFS-lock-recovery-on-server-reboot-fails, run manually. Please refer to the test result below:

client1:
07:56:00  cl1 write (185)
07:56:01  cl1 write (186)
07:56:02  cl1 write (187)
07:56:03  cl1 write (188)
07:56:04  cl1 write (189)
07:56:04  ### cl2 write ###
result_server not set, assuming developer mode.
Setting hp-xw6400-02.lab.bos.redhat.com to state DONE
:: [   PASS   ] :: Running 'rhts-sync-set -s DONE'
:: [   PASS   ] :: Running 'umount /bz492669import'
Running as user "root" and group "root". This could be dangerous.
tshark: "tmp_cl1.pcap" appears to have been cut short in the middle of a packet.
:: [   PASS   ] :: Running 'tshark -r tmp_cl1.pcap -N n -R "portmap or nlm or stat or tcp.flags.reset == 1" | sed "s/hp-xw6400-02.lab.bos.redhat.com/cl1/" | sed "s/ibm-ls21-04.rhts.eng.bos.redhat.com/srv/g" > netrafic.txt'
:: [07:56:05] ::  
 20   0.196125 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
 22   0.196353 srv -> cl1 Portmap V2 GETPORT Reply (Call In 20) Port:39361
 30   0.196593 cl1 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
 32   0.196802 srv -> cl1 NLM V4 LOCK Reply (Call In 30) NLM_DENIED_GRACE_PERIOD
 35   5.197100 cl1 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
 36   5.197309 srv -> cl1 NLM V4 LOCK Reply (Call In 35) NLM_DENIED_GRACE_PERIOD
 38  10.197133 cl1 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
 39  10.197337 srv -> cl1 NLM V4 LOCK Reply (Call In 38) NLM_DENIED_GRACE_PERIOD
 41  15.197200 cl1 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
 42  15.197397 srv -> cl1 NLM V4 LOCK Reply (Call In 41) NLM_DENIED_GRACE_PERIOD
 52  20.197264 cl1 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
 53  20.197458 srv -> cl1 NLM V4 LOCK Reply (Call In 52) NLM_DENIED_GRACE_PERIOD
 55  25.197327 cl1 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
 56  25.197624 srv -> cl1 NLM V4 LOCK Reply (Call In 55) NLM_DENIED_GRACE_PERIOD
 58  30.197391 cl1 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
 59  30.197674 srv -> cl1 NLM V4 LOCK Reply (Call In 58) NLM_DENIED_GRACE_PERIOD
 61  35.197456 cl1 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
 62  35.197663 srv -> cl1 NLM V4 LOCK Reply (Call In 61) NLM_DENIED_GRACE_PERIOD
 72  40.197524 cl1 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
 73  40.198031 srv -> cl1 NLM V4 LOCK Reply (Call In 72)
 99 251.177559 srv -> cl1 Portmap V2 GETPORT Call STAT(100024) V:1 UDP
100 251.177737 cl1 -> srv Portmap V2 GETPORT Reply (Call In 99) Port:986
101 251.178058 srv -> cl1 STAT V1 NOTIFY Call
102 251.178825 cl1 -> srv STAT V1 NOTIFY Reply (Call In 101)
106 251.179235 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
108 251.207229 srv -> cl1 Portmap V2 GETPORT Reply (Call In 106) PROGRAM_NOT_AVAILABLE
116 254.207600 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
118 254.207884 srv -> cl1 Portmap V2 GETPORT Reply (Call In 116) PROGRAM_NOT_AVAILABLE
126 257.207588 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
128 257.207892 srv -> cl1 Portmap V2 GETPORT Reply (Call In 126) PROGRAM_NOT_AVAILABLE
136 260.207615 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
138 260.207915 srv -> cl1 Portmap V2 GETPORT Reply (Call In 136) Port:52486
146 260.208192 cl1 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
148 260.209207 srv -> cl1 NLM V4 LOCK Reply (Call In 146)
179 421.672024 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
181 421.672301 srv -> cl1 Portmap V2 GETPORT Reply (Call In 179) Port:52486
184 421.672365 cl1 -> srv NLM V4 UNLOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
187 421.672563 srv -> cl1 NLM V4 UNLOCK Reply (Call In 184)
207 422.153916 cl1 -> srv Portmap V2 GETPORT Call MOUNT(100005) V:1 TCP
209 422.154125 srv -> cl1 Portmap V2 GETPORT Reply (Call In 207) Port:684

client2:
  6   0.001760 cl2 -> srv Portmap V2 GETPORT Call NFS(100003) V:3 TCP
  8   0.001978 srv -> cl2 Portmap V2 GETPORT Reply (Call In 6) Port:2049
 21   0.002841 cl2 -> srv Portmap V2 GETPORT Call MOUNT(100005) V:3 UDP
 24   0.003018 srv -> cl2 Portmap V2 GETPORT Reply (Call In 21) Port:741
 65 244.893427 srv -> cl2 Portmap V2 GETPORT Call STAT(100024) V:1 UDP
 66 244.893760 cl2 -> srv Portmap V2 GETPORT Reply (Call In 65) Port:681
 67 244.894230 srv -> cl2 STAT V1 NOTIFY Call
 68 244.894674 cl2 -> srv STAT V1 NOTIFY Reply (Call In 67)
103 302.568766 cl2 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
105 302.569042 srv -> cl2 Portmap V2 GETPORT Reply (Call In 103) Port:52486
113 302.569371 cl2 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
115 302.569537 srv -> cl2 NLM V4 LOCK Reply (Call In 113) NLM_DENIED_GRACE_PERIOD
118 307.568613 cl2 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
119 307.568793 srv -> cl2 NLM V4 LOCK Reply (Call In 118) NLM_DENIED_GRACE_PERIOD
121 312.568992 cl2 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
122 312.569899 srv -> cl2 NLM V4 LOCK Reply (Call In 121) NLM_BLOCKED
149 342.565274 cl2 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
150 342.565464 srv -> cl2 NLM V4 LOCK Reply (Call In 149) NLM_BLOCKED
161 372.561850 cl2 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
163 372.562098 srv -> cl2 Portmap V2 GETPORT Reply (Call In 161) Port:52486
166 372.562179 cl2 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
169 372.562352 srv -> cl2 NLM V4 LOCK Reply (Call In 166) NLM_BLOCKED
201 402.557803 cl2 -> srv NLM V4 LOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
202 402.558005 srv -> cl2 NLM V4 LOCK Reply (Call In 201) NLM_BLOCKED
213 415.388934 srv -> cl2 Portmap V2 GETPORT Call NLM(100021) V:4 TCP
215 415.389248 cl2 -> srv Portmap V2 GETPORT Reply (Call In 213) Port:58938
222 415.389564 srv -> cl2 NLM V4 GRANTED_MSG Call FH:0x6c3c3da2 svid:0 pos:0-0
226 415.389887 cl2 -> srv NLM V4 GRANTED_MSG Reply (Call In 222)
230 415.390077 cl2 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
232 415.390263 srv -> cl2 Portmap V2 GETPORT Reply (Call In 230) Port:52486
244 415.390649 cl2 -> srv NLM V4 GRANTED_RES Call
246 415.390790 srv -> cl2 NLM V4 GRANTED_RES Reply (Call In 244)
249 415.430681 cl2 -> srv NLM V4 UNLOCK Call FH:0x6c3c3da2 svid:0 pos:0-0
250 415.430859 srv -> cl2 NLM V4 UNLOCK Reply (Call In 249)
270 416.390728 cl2 -> srv Portmap V2 GETPORT Call MOUNT(100005) V:1 TCP
272 416.390944 srv -> cl2 Portmap V2 GETPORT Reply (Call In 270) Port:684
300 416.421760 cl2 -> srv Portmap V2 GETPORT Call MOUNT(100005) V:1 TCP
302 416.421948 srv -> cl2 Portmap V2 GETPORT Reply (Call In 300) Port:684

and according to https://tcms.engineering.redhat.com/case/38325/?from_plan=1334#document:
if the bug bz492669 is present, then - after server reboot - the cl1 process
    * will not get the lock in the Grace period - which is good
    * but gets the lock just after Grace period expires (even if cl1 is still performing its writes) - which is bad


Verified on RHEL5.7-Server-20110513.0 using automation; the test passed on x86_64, ppc64, and i386. Will test ia64 and s390x later.
x86_64: https://beaker.engineering.redhat.com/jobs/86897
ppc64: https://beaker.engineering.redhat.com/jobs/87088
i386: https://beaker.engineering.redhat.com/jobs/87084
client1:
:: [   LOG    ] :: 07:46:29  cl1 write (99)
:: [   LOG    ] :: 07:46:30  ### cl2 write ###
:: [   PASS   ] :: Running 'rhts-sync-set -s DONE'
:: [   PASS   ] :: Running 'umount /bz492669import'
:: [   LOG    ] :: 
:: [   PASS   ] :: Running 'tshark -r tmp_cl1.pcap -N n -R "portmap or nlm or stat or tcp.flags.reset == 1" | sed "s/ibm-js20-04.lab.bos.redhat.com/cl1/" | sed "s/ibm-js22-04.rhts.eng.bos.redhat.com/srv/g" > netrafic.txt'
:: [   LOG    ] :: \n  4   0.000287 cl1 -> srv Portmap V2 GETPORT Call NFS(100003) V:3 TCP
:: [   LOG    ] ::   6   0.010161 srv -> cl1 Portmap V2 GETPORT Reply (Call In 4) Port:2049
:: [   LOG    ] ::  21   0.011221 cl1 -> srv Portmap V2 GETPORT Call MOUNT(100005) V:3 UDP
:: [   LOG    ] ::  22   0.011455 srv -> cl1 Portmap V2 GETPORT Reply (Call In 21) Port:702
:: [   LOG    ] ::  63   0.664536 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] ::  65   0.664793 srv -> cl1 Portmap V2 GETPORT Reply (Call In 63) Port:48994
:: [   LOG    ] ::  73   0.665147 cl1 -> srv NLM V4 LOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] ::  75   0.665944 srv -> cl1 NLM V4 LOCK Reply (Call In 73)
:: [   LOG    ] :: 109 138.069134 srv -> cl1 Portmap V2 GETPORT Call STAT(100024) V:1 UDP
:: [   LOG    ] :: 110 138.083975 cl1 -> srv Portmap V2 GETPORT Reply (Call In 109) Port:752
:: [   LOG    ] :: 111 138.084263 srv -> cl1 STAT V1 NOTIFY Call
:: [   LOG    ] :: 112 138.086254 cl1 -> srv STAT V1 NOTIFY Reply (Call In 111)
:: [   LOG    ] :: 116 138.088121 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 118 138.105568 srv -> cl1 Portmap V2 GETPORT Reply (Call In 116) PROGRAM_NOT_AVAILABLE
:: [   LOG    ] :: 126 141.105789 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 128 141.106127 srv -> cl1 Portmap V2 GETPORT Reply (Call In 126) PROGRAM_NOT_AVAILABLE
:: [   LOG    ] :: 136 144.106786 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 138 144.107101 srv -> cl1 Portmap V2 GETPORT Reply (Call In 136) PROGRAM_NOT_AVAILABLE
:: [   LOG    ] :: 146 147.107770 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 148 147.108099 srv -> cl1 Portmap V2 GETPORT Reply (Call In 146) PROGRAM_NOT_AVAILABLE
:: [   LOG    ] :: 156 150.108808 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 158 150.109114 srv -> cl1 Portmap V2 GETPORT Reply (Call In 156) PROGRAM_NOT_AVAILABLE
:: [   LOG    ] :: 166 153.109884 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 168 153.110294 srv -> cl1 Portmap V2 GETPORT Reply (Call In 166) Port:43113
:: [   LOG    ] :: 176 153.110731 cl1 -> srv NLM V4 LOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] :: 178 153.112589 srv -> cl1 NLM V4 LOCK Reply (Call In 176)
:: [   LOG    ] :: 375 292.779228 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 377 292.779508 srv -> cl1 Portmap V2 GETPORT Reply (Call In 375) Port:43113
:: [   LOG    ] :: 380 292.779588 cl1 -> srv NLM V4 UNLOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] :: 383 292.779793 srv -> cl1 NLM V4 UNLOCK Reply (Call In 380)
:: [   LOG    ] :: 400 293.794630 cl1 -> srv Portmap V2 GETPORT Call MOUNT(100005) V:1 TCP
:: [   LOG    ] :: 402 293.794900 srv -> cl1 Portmap V2 GETPORT Reply (Call In 400) Port:823


client2 tries to take cl1's lock, but gets "NLM_DENIED_GRACE_PERIOD", and then it has to wait until cl1 unlocks (NLM_BLOCKED):
:: [   LOG    ] :: \n  4   0.000292 cl2 -> srv Portmap V2 GETPORT Call NFS(100003) V:3 TCP
:: [   LOG    ] ::   6   0.000638 srv -> cl2 Portmap V2 GETPORT Reply (Call In 4) Port:2049
:: [   LOG    ] ::  21   0.001704 cl2 -> srv Portmap V2 GETPORT Call MOUNT(100005) V:3 UDP
:: [   LOG    ] ::  22   0.001947 srv -> cl2 Portmap V2 GETPORT Reply (Call In 21) Port:820
:: [   LOG    ] ::  69   6.310415 cl2 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] ::  71   6.310706 srv -> cl2 Portmap V2 GETPORT Reply (Call In 69) Port:43113
:: [   LOG    ] ::  79   6.310993 cl2 -> srv NLM V4 LOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] ::  81   6.311192 srv -> cl2 NLM V4 LOCK Reply (Call In 79) NLM_DENIED_GRACE_PERIOD
:: [   LOG    ] ::  90  11.311652 cl2 -> srv NLM V4 LOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] ::  91  11.311905 srv -> cl2 NLM V4 LOCK Reply (Call In 90) NLM_DENIED_GRACE_PERIOD
:: [   LOG    ] ::  96  16.311647 cl2 -> srv NLM V4 LOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] ::  97  16.311911 srv -> cl2 NLM V4 LOCK Reply (Call In 96) NLM_DENIED_GRACE_PERIOD
:: [   LOG    ] :: 102  21.311652 cl2 -> srv NLM V4 LOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] :: 103  21.311893 srv -> cl2 NLM V4 LOCK Reply (Call In 102) NLM_DENIED_GRACE_PERIOD
:: [   LOG    ] :: 105  26.311648 cl2 -> srv NLM V4 LOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] :: 106  26.311921 srv -> cl2 NLM V4 LOCK Reply (Call In 105) NLM_DENIED_GRACE_PERIOD
:: [   LOG    ] :: 111  31.311648 cl2 -> srv NLM V4 LOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] :: 112  31.312710 srv -> cl2 NLM V4 LOCK Reply (Call In 111) NLM_BLOCKED
:: [   LOG    ] :: 126  61.312662 cl2 -> srv NLM V4 LOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] :: 127  61.312929 srv -> cl2 NLM V4 LOCK Reply (Call In 126) NLM_BLOCKED
:: [   LOG    ] :: 147  91.312989 cl2 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 149  91.313280 srv -> cl2 Portmap V2 GETPORT Reply (Call In 147) Port:43113
:: [   LOG    ] :: 152  91.313346 cl2 -> srv NLM V4 LOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] :: 155  91.313571 srv -> cl2 NLM V4 LOCK Reply (Call In 152) NLM_BLOCKED
:: [   LOG    ] :: 169 121.313648 cl2 -> srv NLM V4 LOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] :: 170 121.313982 srv -> cl2 NLM V4 LOCK Reply (Call In 169) NLM_BLOCKED
:: [   LOG    ] :: 175 121.875700 srv -> cl2 Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 177 121.895231 cl2 -> srv Portmap V2 GETPORT Reply (Call In 175) Port:54063
:: [   LOG    ] :: 184 121.895669 srv -> cl2 NLM V4 GRANTED_MSG Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] :: 188 121.895993 cl2 -> srv NLM V4 GRANTED_MSG Reply (Call In 184)
:: [   LOG    ] :: 192 121.896221 cl2 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 195 121.896467 srv -> cl2 Portmap V2 GETPORT Reply (Call In 192) Port:43113
:: [   LOG    ] :: 205 121.896750 cl2 -> srv NLM V4 GRANTED_RES Call
:: [   LOG    ] :: 208 121.896931 srv -> cl2 NLM V4 GRANTED_RES Reply (Call In 205)
:: [   LOG    ] :: 217 121.916430 cl2 -> srv NLM V4 UNLOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] :: 218 121.916636 srv -> cl2 NLM V4 UNLOCK Reply (Call In 217)
:: [   LOG    ] :: 235 123.422368 cl2 -> srv Portmap V2 GETPORT Call MOUNT(100005) V:1 TCP
:: [   LOG    ] :: 237 123.422640 srv -> cl2 Portmap V2 GETPORT Reply (Call In 235) Port:823

Comment 30 yanfu,wang 2011-05-19 12:48:29 UTC
Hi Steve,
I have two questions to confirm with you.
First, what's the name of the patch and where can I find it?

Second, I can see from client1 that the NLM service is not registered after the server reboots;
see, for example, the log below:
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/05/870/87088/178690/1939488/8850090///test_log-Client.log
<snip>
:: [   LOG    ] ::  63   0.664536 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] ::  65   0.664793 srv -> cl1 Portmap V2 GETPORT Reply (Call In 63) Port:48994
:: [   LOG    ] ::  73   0.665147 cl1 -> srv NLM V4 LOCK Call FH:0xb6ec9d70 svid:0 pos:0-0
:: [   LOG    ] ::  75   0.665944 srv -> cl1 NLM V4 LOCK Reply (Call In 73)
:: [   LOG    ] :: 109 138.069134 srv -> cl1 Portmap V2 GETPORT Call STAT(100024) V:1 UDP
:: [   LOG    ] :: 110 138.083975 cl1 -> srv Portmap V2 GETPORT Reply (Call In 109) Port:752
:: [   LOG    ] :: 111 138.084263 srv -> cl1 STAT V1 NOTIFY Call
:: [   LOG    ] :: 112 138.086254 cl1 -> srv STAT V1 NOTIFY Reply (Call In 111)
:: [   LOG    ] :: 116 138.088121 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 118 138.105568 srv -> cl1 Portmap V2 GETPORT Reply (Call In 116) PROGRAM_NOT_AVAILABLE
:: [   LOG    ] :: 126 141.105789 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 128 141.106127 srv -> cl1 Portmap V2 GETPORT Reply (Call In 126) PROGRAM_NOT_AVAILABLE
:: [   LOG    ] :: 136 144.106786 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 138 144.107101 srv -> cl1 Portmap V2 GETPORT Reply (Call In 136) PROGRAM_NOT_AVAILABLE
:: [   LOG    ] :: 146 147.107770 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 148 147.108099 srv -> cl1 Portmap V2 GETPORT Reply (Call In 146) PROGRAM_NOT_AVAILABLE
:: [   LOG    ] :: 156 150.108808 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 158 150.109114 srv -> cl1 Portmap V2 GETPORT Reply (Call In 156) PROGRAM_NOT_AVAILABLE
:: [   LOG    ] :: 166 153.109884 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 168 153.110294 srv -> cl1 Portmap V2 GETPORT Reply (Call In 166) Port:43113
</snip>


I also got related info from the Issue Tracker:
cl1 -> srv  Portmap V2 GETPORT Call NLM(100021) V:4 TCP
 srv -> cl1  Portmap V2 GETPORT Reply (Call In 60) PROGRAM_NOT_AVAILABLE

cl1 asks the server for the NLM port to re-issue its locks, and the server tells cl1 that the NLM service is not registered. We have to ensure lockd comes up and registers _before_ rpc.statd comes up!

The PROGRAM_NOT_AVAILABLE error does not occur if the problem is worked around by running nfslock from rc.local:
    * chkconfig nfslock off
    * echo "service nfslock start" >> /etc/rc.local
You can find the log from Martin Cermak's earlier test here:
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=12339202
<snip>
:: [   LOG    ] :: \n  9   0.001829 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] ::  11   0.002165 srv -> cl1 Portmap V2 GETPORT Reply (Call In 9) Port:42903
:: [   LOG    ] ::  18   0.002443 cl1 -> srv NLM V4 LOCK Call FH:0xf885616c svid:0 pos:0-0
:: [   LOG    ] ::  21   0.003299 srv -> cl1 NLM V4 LOCK Reply (Call In 18)
:: [   LOG    ] ::  62 137.799457 srv -> cl1 TCP 2049 > 1010 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
:: [   LOG    ] ::  80 145.334730 srv -> cl1 Portmap V2 GETPORT Call STAT(100024) V:1 UDP
:: [   LOG    ] ::  81 145.346426 cl1 -> srv Portmap V2 GETPORT Reply (Call In 80) Port:968
:: [   LOG    ] ::  82 145.346649 srv -> cl1 STAT V1 NOTIFY Call
:: [   LOG    ] ::  83 145.347898 cl1 -> srv STAT V1 NOTIFY Reply (Call In 82)
:: [   LOG    ] ::  87 145.349233 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] ::  89 145.452977 srv -> cl1 Portmap V2 GETPORT Reply (Call In 87) Port:56969
:: [   LOG    ] ::  97 145.453337 cl1 -> srv NLM V4 LOCK Call FH:0xf885616c svid:0 pos:0-0
:: [   LOG    ] ::  99 145.458867 srv -> cl1 NLM V4 LOCK Reply (Call In 97)
:: [   LOG    ] :: 137 349.857452 cl1 -> srv Portmap V2 GETPORT Call NLM(100021) V:4 TCP
:: [   LOG    ] :: 139 349.857770 srv -> cl1 Portmap V2 GETPORT Reply (Call In 137) Port:56969
:: [   LOG    ] :: 142 349.857844 cl1 -> srv NLM V4 UNLOCK Call FH:0xf885616c svid:0 pos:0-0
:: [   LOG    ] :: 145 349.858020 srv -> cl1 NLM V4 UNLOCK Reply (Call In 142)
:: [   LOG    ] :: 164 352.399896 cl1 -> srv Portmap V2 GETPORT Call MOUNT(100005) V:1 TCP
:: [   LOG    ] :: 166 352.400219 srv -> cl1 Portmap V2 GETPORT Reply (Call In 164) Port:898
</snip>

So is there still a problem with the nfs/nfslock service start ordering on boot? I think the verified result should match the result of the workaround, shouldn't it?

Comment 31 yanfu,wang 2011-05-23 05:14:43 UTC
Verified on s390x and ia64; pasting the remaining job links here to supplement comment #29:
s390x: https://beaker.engineering.redhat.com/jobs/87102
ia64: https://beaker.engineering.redhat.com/jobs/87086

Comment 32 Steve Dickson 2011-05-23 11:27:49 UTC
(In reply to comment #30)
> hi Steve,
> I have two questions that need to confirm with you.
> first, what's the name of the patch and where do I find it?
I don't think there is a patch, at least I'm not aware of one.

> 
> second, I could get the NLM service is not registered from client1 after the
> server gets rebooted
> for example the below log:
> http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/05/870/87088/178690/1939488/8850090///test_log-Client.log
> <snip>
> 
> So if there's still problem about nfs/nfslock service start ordering on boot
> since I think the verified result should be accordant with the work around,
> isn't it?
Yes, I believe so...

Comment 33 yanfu,wang 2011-05-24 00:54:30 UTC
(In reply to comment #32)
> (In reply to comment #30)
> > hi Steve,
> > I have two questions that need to confirm with you.
> > first, what's the name of the patch and where do I find it?
> I don't think there is a patch, at least I'm not aware of one.
> 
> > 
> > second, I could get the NLM service is not registered from client1 after the
> > server gets rebooted
> > for example the below log:
> > http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/05/870/87088/178690/1939488/8850090///test_log-Client.log
> > <snip>
> > 
> > So if there's still problem about nfs/nfslock service start ordering on boot
> > since I think the verified result should be accordant with the work around,
> > isn't it?
> Yes I believe so...

So I'm changing the status over to you; could you have a look at whether it will be fixed in this release?

Comment 34 yanfu,wang 2011-05-24 05:14:26 UTC
(In reply to comment #32)
> (In reply to comment #30)
> > hi Steve,
> > I have two questions that need to confirm with you.
> > first, what's the name of the patch and where do I find it?
> I don't think there is a patch, at least I'm not aware of one.
If so, why was this bug included and acked for fixing and testing in this release? Thanks in advance for your reply.

Comment 35 Ric Wheeler 2011-05-24 10:38:39 UTC
A devel_ack+ means that development is 90% certain that we *can* fix the issue in time for a release. It does not mean that we have the issue fixed.

Unfortunately, some of those issues are not resolved by the hoped-for deadline.

Comment 36 yanfu,wang 2011-05-25 08:30:37 UTC
(In reply to comment #35)
> A devel_ack+ means that development is 90% certain that we *can* fix the issue
> in time for a release. It does not mean that we have the issue fixed.
> 
> Unfortunately, some of those issues are not resolved by the hoped for deadline.

Got it, thanks for the clarification.

Comment 37 yanfu,wang 2011-05-25 08:32:30 UTC
(In reply to comment #36)
> (In reply to comment #35)
> > A devel_ack+ means that development is 90% certain that we *can* fix the issue
> > in time for a release. It does not mean that we have the issue fixed.
> > 
> > Unfortunately, some of those issues are not resolved by the hoped for deadline.
> 
> got it, thank for your clarification.

Hi Steve,
What's your plan for dealing with this bug? Do you want to fix it in the current release? Thanks.

Comment 39 RHEL Program Management 2011-08-17 14:25:09 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.

Comment 40 yanfu,wang 2011-08-18 02:16:09 UTC
Hi Steve,
Could I ask what the reason is?