Bug 471254

Summary: lockd: fix reference count leaks in async locking case (impacts GFS2)
Product: Red Hat Enterprise Linux 5 Reporter: Mike Snitzer <snitzer>
Component: kernelAssignee: Jeff Layton <jlayton>
Status: CLOSED ERRATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: medium    
Version: 5.2.zCC: adam.hough, dzickus, emcnabb, jpirko, staubach, steved, teigland
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-09-02 08:48:07 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 514700    
Attachments:
Description Flags
reproducer: lock and unlock a file in a tight loop none

Description Mike Snitzer 2008-11-12 17:11:07 UTC
Description of problem:
Upstream Linux has identified a very important fix to NFS lockd that enables async locking (with GFS2) to properly close open file descriptors (by updating ref counts as part of cleanup).  Without the fix proper cleanup is short-circuited.

Version-Release number of selected component (if applicable):
Affects 5.2.z (2.6.18-92.1.13.el5) and 5.3 beta (kernel-2.6.18-120.el5).  One would hope an errata kernel can be released for 5.2 and that this will get fixed in time for 5.3's release.

How reproducible:
Always

Steps to Reproduce:
1. use async locks with GFS2
  
Actual results:
Filesystem has open file descriptors associated with async locks.

Expected results:
Open file descriptors should be closed.

Additional info:
This issue has been fixed upstream since Linux 2.6.25, see commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b7e6b8694

Comment 1 Jeff Layton 2009-02-02 16:39:52 UTC
Gave a shot at reproducing this today, but ran across bug 483617. Once the GFS2 folks fix that I should be able to give this another shot.

Comment 2 Jeff Layton 2009-02-05 15:14:33 UTC
Looks like Abhi posted a patch for bug 483617 and it works, but I have yet to be able to reproduce this bug.

I assume this problem will manifest itself as being unable to unmount the fs. Can you describe a way to reliably reproduce this?

Comment 3 Mike Snitzer 2009-02-10 05:44:55 UTC
Truth of the matter is I reproduced this problem on IBRIX (simply ran a few async lock, F_SETLK, tests and then couldn't unmount).  The upstream fix I referenced in comment#0 was able to eliminate it for me.  Passing it off as a GFS2 issue was wrong but I figured I'd get it on your radar considering the upstream fix was specifically developed to address the same issue when using GFS2.

Comment 4 Jeff Layton 2009-02-12 18:43:19 UTC
Ok. I'll have to walk over the code and see if I can figure out a way to make this happen so that we can verify that the patch actually fixes it. If or someone else can provide a reliable reproducer, then that would be helpful.

Comment 5 Jeff Layton 2009-03-02 12:43:15 UTC
Patch makes sense now that I look over the code.

Looking at nlm4svc_proc_test codepath...

In order to reproduce this we'd need for nlmsvc_testlock to return nlm_drop_reply. It looks like it only does this when B_QUEUED is set, but B_TIMED_OUT and B_GOT_CALLBACK isn't set. So maybe this would occur if a lock or testlock raced in for this lock while we were waiting for the callback on the lock from the lower fs.

It may be hard to reproduce this reliably...we might just have to settle for review by inspection.

Comment 6 Jeff Layton 2009-03-03 15:57:55 UTC
I've added this patch to the RHEL5 test kernels on my people.redhat.com page:

http://people.redhat.com/jlayton

...if you could test them and verify whether they fix the problem for you, it would be helpful. Even more helpful would be if you had a way to reliably reproduce this problem so we could construct a regression test.

Regardless though, the refcount leak looks pretty obvious I think so in the absence of a regression test I'll still plan to propose this internally.

Comment 8 RHEL Program Management 2009-03-18 17:18:21 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Jeff Layton 2009-03-18 17:21:54 UTC
Created attachment 335737 [details]
reproducer: lock and unlock a file in a tight loop

Here's the program I used to reproduce this. It just sets a non-blocking lock and unlocks it in a tight loop. When run against a file on a GFS2 export, the filesystem becomes un-umountable.

Comment 10 Don Zickus 2009-04-16 18:38:55 UTC
in kernel-2.6.18-139.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 17 errata-xmlrpc 2009-09-02 08:48:07 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html