Bug 493337

Summary: Problem with blocking locks on RHEL 5
Product: Red Hat Enterprise Linux 5 Reporter: Sachin Prabhu <sprabhu>
Component: kernel Assignee: Jeff Layton <jlayton>
Status: CLOSED DUPLICATE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: high    
Version: 5.3 CC: jlayton, staubach, steved, syeghiay, tao
Target Milestone: rc Keywords: Reopened
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2009-04-17 10:15:46 UTC Type: ---
Attachments:
Description Flags
Sequence in which the test programs need to be run. none
tcpdump taken when problem is detected. none

Description Sachin Prabhu 2009-04-01 13:56:53 UTC
On RHEL 5, when using blocking locks, we can end up with a lock on the file which is not owned by any client and cannot be released. I have tested this with kernel 2.6.18-133.el5 which contains the fix from bz 448929. This contains the patch 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5f50c0c6d644d6c8180d9079c13c5d9de3adeb34
which was expected to fix the issue on RHEL 5.

The test program works fine on 2.6.27.5-117.fc10.x86_64 kernel.

The problem appears to be similar to the case discussed in this thread:
http://marc.info/?l=linux-nfs&m=120663578712912&w=2

Steps to Reproduce:

To reproduce, please compile and use the attached programs. Two NFS clients mounting the same NFS share are needed.

The test programs have to be run on the two NFS clients over the same NFS share. The commands have to be run in the sequence shown in the attached file reproducer_steps. A file named dlvcan2.tab has to be created in the current working directory.
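The attached test programs are not reproduced in this report. Purely as an illustration of what "a blocking lock whose wait is cancelled by a signal" looks like from user space, here is a minimal Python sketch. It is not the attached reproducer: the names wait_for_lock and LockInterrupted are hypothetical, and using SIGALRM as the interrupting signal is an assumption made for the example.

```python
import fcntl
import os
import signal


class LockInterrupted(Exception):
    """Raised from the signal handler to abandon a blocking lock wait."""


def _raise_interrupted(signum, frame):
    raise LockInterrupted()


def wait_for_lock(fd, timeout_s):
    """Request an exclusive whole-file POSIX lock and block until granted.

    Returns True if the lock was granted, False if SIGALRM interrupted
    the wait after timeout_s seconds (an integer). Hypothetical helper.
    """
    previous = signal.signal(signal.SIGALRM, _raise_interrupted)
    signal.alarm(timeout_s)
    try:
        # LOCK_EX without LOCK_NB is a blocking request (F_SETLKW).
        fcntl.lockf(fd, fcntl.LOCK_EX)
        return True
    except LockInterrupted:
        return False
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)
```

If another process already holds the lock, the call waits until the alarm fires and then returns False. On the NFS mounts described here, that interrupted wait is the situation the rest of this report is about.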

At the end of the reproducer steps, the lockchk process can be cancelled. However, the lock on the file still exists and is never released. The locks held can be checked in /proc/locks on the NFS server. The stuck lock can be cleared on the NFS server by running the command 'service nfslock restart'.
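To make the /proc/locks check less manual, a small parser can list the entries for one file. The sketch below is only an illustration: the sample text is made up (it is not output from the affected server), and parse_proc_locks and locks_on_file are hypothetical helper names. It assumes the usual whitespace-separated layout of /proc/locks, where a "->" after the index marks a blocked waiter.

```python
from typing import List, NamedTuple


class LockEntry(NamedTuple):
    blocked: bool       # True for "->" lines (a waiter, not a holder)
    kind: str           # POSIX or FLOCK
    mode: str           # ADVISORY or MANDATORY
    access: str         # READ or WRITE
    pid: int
    device_inode: str   # "major:minor:inode"
    start: str
    end: str


def parse_proc_locks(text: str) -> List[LockEntry]:
    """Parse the text of /proc/locks into LockEntry records."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "->":
            blocked = True
            fields = fields[:1] + fields[2:]
        else:
            blocked = False
        if len(fields) < 8:
            continue  # skip blank or unexpected lines
        _, kind, mode, access, pid, dev_ino, start, end = fields[:8]
        entries.append(
            LockEntry(blocked, kind, mode, access, int(pid), dev_ino, start, end)
        )
    return entries


def locks_on_file(entries: List[LockEntry], device_inode: str) -> List[LockEntry]:
    """Return the entries that refer to one file, given as major:minor:inode."""
    return [e for e in entries if e.device_inode == device_inode]


# Made-up sample in the /proc/locks layout, for illustration only.
SAMPLE = """\
1: POSIX  ADVISORY  WRITE 2301 fd:00:131074 0 EOF
1: -> POSIX  ADVISORY  WRITE 2302 fd:00:131074 0 EOF
2: FLOCK  ADVISORY  WRITE 1790 fd:00:262155 0 EOF
"""

if __name__ == "__main__":
    for entry in locks_on_file(parse_proc_locks(SAMPLE), "fd:00:131074"):
        print(entry)
```

On a real server the text would come from reading /proc/locks, and the inode of dlvcan2.tab can be obtained with 'ls -i'.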

Comment 1 Sachin Prabhu 2009-04-01 13:58:22 UTC
Created attachment 337531 [details]
Sequence in which the test programs need to be run.

Comment 2 Sachin Prabhu 2009-04-01 14:00:52 UTC
Created attachment 337533 [details]
tcpdump taken when problem is detected.

vm21: 192.168.122.21
vm22: 192.168.122.22

The following frame numbers show the locking activity leading up to the problem.

335: vm21 to vm11 unlock svid 1
336: vm11 to vm21 unlock granted

368: vm22 to vm11 lock svid 3
370: vm11 to vm22 lock granted

374: vm21 to vm11 lock svid 2
375: vm11 to vm21 lock blocked (due to the other client (vm22) holding the lock)

510: vm22 to vm11 unlock svid 3
511: vm11 to vm22 unlock granted

522: vm21 to vm11 cancel lock svid 2
523: vm21 to vm11 lock svid 3
524: vm11 to vm21 cancel granted
525: vm11 to vm21 lock granted

534: vm21 to vm11 unlock svid 4  <-- In this case, we are not sure why it calls unlock for svid 4.
535: vm11 to vm21 unlock granted

543: vm21 to vm11 lock svid 5
544: vm11 to vm21 lock blocked ( not sure why )

Frames 543 and 544 are then repeated with increasing svid numbers.
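The frame list above can be replayed against a toy model of the server's lock state to see which owner is left holding the lock. This is a sketch under simplifying assumptions, not a model of lockd: it tracks a single exclusive whole-file lock, identifies an owner as (client, svid), and answers "granted" to an unlock from a non-holder because that is what frames 534 and 535 show. FRAMES and replay are names made up for this example.

```python
# Requests taken from the frame list above: (frame, client, operation, svid).
FRAMES = [
    (335, "vm21", "unlock", 1),
    (368, "vm22", "lock", 3),
    (374, "vm21", "lock", 2),
    (510, "vm22", "unlock", 3),
    (522, "vm21", "cancel", 2),
    (523, "vm21", "lock", 3),
    (534, "vm21", "unlock", 4),
    (543, "vm21", "lock", 5),
]


def replay(frames):
    """Replay the requests; return (final holder, [(frame, reply), ...])."""
    holder = None  # (client, svid) currently holding the lock, or None
    replies = []
    for frame, client, op, svid in frames:
        owner = (client, svid)
        if op == "lock":
            if holder is None or holder == owner:
                holder = owner
                reply = "granted"
            else:
                reply = "blocked"
        elif op == "unlock":
            # Only the holder's own unlock frees the lock, but the reply
            # is "granted" either way, as seen in frames 534/535.
            if holder == owner:
                holder = None
            reply = "granted"
        else:  # cancel
            reply = "granted"
        replies.append((frame, reply))
    return holder, replies


if __name__ == "__main__":
    holder, replies = replay(FRAMES)
    for frame, reply in replies:
        print(frame, reply)
    print("still held by:", holder)
```

In this model the replies match the trace, and the lock taken in frame 523 (vm21, svid 3) is still held at the end, because the unlock in frame 534 names svid 4. That is enough to explain the "blocked" reply in frame 544.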

Comment 4 Jeff Layton 2009-04-03 11:10:33 UTC
> 522: vm21 to vm11 cancel lock svid 2
> 523: vm21 to vm11 lock svid 3
> 524: vm11 to vm21 cancel granted
> 525: vm11 to vm21 lock granted
> 
> 534: vm21 to vm11 unlock svid 4  <-- In this case, we are not sure why it calls
> unlock for svid 4.
> 535: vm11 to vm21 unlock granted
> 
> 543: vm21 to vm11 lock svid 5
> 544: vm11 to vm21 lock blocked ( not sure why )
> 

The lock request is probably being blocked because svid 3 still holds the lock; it was never released.

Comment 7 RHEL Program Management 2009-04-03 17:35:40 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.

Comment 8 Jeremy West 2009-04-03 17:40:26 UTC
This got closed too soon.  This needs to be re-flagged for 5.5.

Comment 9 Issue Tracker 2009-04-15 21:35:04 UTC
When the client process receives a signal, nlmclnt_block(), which is waiting
for a response from the server, returns -ERESTARTSYS. This is propagated
all the way back to do_setlk, where an if condition causes a lock to be set
locally on the client even though the NFS lock is not set.

For subsequent lock/unlock requests, the unlock function matches the old
lock, and the unlock request sent is for this old lock. The server returns
success for the old lock, which the client interprets as a successful unlock
of the new lock. However, the new lock set on the server is never freed. We
thus get into a condition where the server holds a lock on a file which is
not claimed by any client. All subsequent lock requests to the server for
this file are blocked.
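The sequence described above can be sketched with a toy client and server. This is a simplified model, not the kernel code: Server, Client, run, and the owner labels "old", "new" and "other-client" are all made up for illustration, and the lock_if_signalled flag stands in for the if condition in do_setlk.

```python
class Server:
    """Toy lock server: one exclusive lock, identified by its owner."""

    def __init__(self):
        self.holder = None

    def lock(self, owner):
        if self.holder is None or self.holder == owner:
            self.holder = owner
            return "granted"
        return "blocked"

    def unlock(self, owner):
        if self.holder == owner:
            self.holder = None
        return "granted"  # success even if this owner held nothing

    def cancel(self, owner):
        return "granted"


class Client:
    def __init__(self, server, lock_if_signalled):
        self.server = server
        self.lock_if_signalled = lock_if_signalled
        self.local = []  # owners of locally recorded locks, oldest first

    def setlk(self, owner, signal_while_waiting=False):
        reply = self.server.lock(owner)
        if reply == "blocked" and signal_while_waiting:
            self.server.cancel(owner)
            if self.lock_if_signalled:
                # Local lock recorded although the server never granted it.
                self.local.append(owner)
            return "interrupted"
        if reply == "granted":
            self.local.append(owner)
        return reply

    def unlock(self):
        if not self.local:
            return "granted"
        owner = self.local[0]  # first matching local record
        reply = self.server.unlock(owner)
        if reply == "granted":
            self.local.clear()  # client now believes the file is unlocked
        return reply


def run(lock_if_signalled):
    """Return the owner the server still has recorded at the end."""
    server = Server()
    server.lock("other-client")
    client = Client(server, lock_if_signalled)
    client.setlk("old", signal_while_waiting=True)  # wait is interrupted
    server.unlock("other-client")
    client.setlk("new")  # granted by the server
    client.unlock()      # sends an unlock for the first local record
    return server.holder


if __name__ == "__main__":
    print("with lock-if-signalled:", run(True))
    print("without:", run(False))
```

With the flag set, the unlock is sent for "old" and the server keeps "new" on record, which nobody will ever release. With the flag cleared, the unlock is sent for "new" and the server ends up with no holder.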

This is fixed by upstream commit c4d7c402b788b73dc24f1e54a57f89d3dc5eb7b

This event sent from IssueTracker by sprabhu (issue 268852)

Comment 10 Sachin Prabhu 2009-04-15 21:38:57 UTC
Upstream commit c4d7c402b788b73dc24f1e54a57f89d3dc5eb7b has been backported to RHEL 5 kernel version 2.6.18-138.

* Fri Apr 03 2009 Don Zickus <dzickus> [2.6.18-138.el5]
- [nfs] remove bogus lock-if-signalled case (Bryn M. Reeves) [456288]

The reproducer provided was successfully tested against this kernel version.

Comment 12 Sachin Prabhu 2009-04-17 08:56:01 UTC
Reporter has confirmed that the latest kernel doesn't show the problem with the locks.

Comment 13 Sachin Prabhu 2009-04-17 10:15:46 UTC
Closing this as a duplicate of bug 456288.

Note that the issues reported in the two bugs are quite different; however, the same patch fixes both.

*** This bug has been marked as a duplicate of bug 456288 ***