Description of problem: Upstream Linux has identified a very important fix to NFS lockd that enables async locking (with GFS2) to properly close open file descriptors (by updating ref counts as part of cleanup). Without the fix proper cleanup is short-circuited. Version-Release number of selected component (if applicable): Affects 5.2.z (2.6.18-92.1.13.el5) and 5.3 beta (kernel-2.6.18-120.el5). One would hope an errata kernel can be released for 5.2 and that this will get fixed in time for 5.3's release. How reproducible: Always Steps to Reproduce: 1. use async locks with GFS2 Actual results: Filesystem has open file descriptors associated with async locks. Expected results: Open file descriptors should be closed. Additional info: This issue has been fixed upstream since Linux 2.6.25, see commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b7e6b8694
Gave a shot at reproducing this today, but ran across bug 483617. Once the GFS2 folks fix that I should be able to give this another shot.
Looks like Abhi posted a patch for bug 483617 and it works, but I have yet to be able to reproduce this bug. I assume this problem will manifest itself as being unable to unmount the fs. Can you describe a way to reliably reproduce this?
Truth of the matter is I reproduced this problem on IBRIX (simply ran a few async lock, F_SETLK, tests and then couldn't unmount). The upstream fix I referenced in comment#0 was able to eliminate it for me. Passing it off as a GFS2 issue was wrong but I figured I'd get it on your radar considering the upstream fix was specifically developed to address the same issue when using GFS2.
Ok. I'll have to walk over the code and see if I can figure out a way to make this happen so that we can verify that the patch actually fixes it. If or someone else can provide a reliable reproducer, then that would be helpful.
Patch makes sense now that I look over the code. Looking at nlm4svc_proc_test codepath... In order to reproduce this we'd need for nlmsvc_testlock to return nlm_drop_reply. It looks like it only does this when B_QUEUED is set, but B_TIMED_OUT and B_GOT_CALLBACK isn't set. So maybe this would occur if a lock or testlock raced in for this lock while we were waiting for the callback on the lock from the lower fs. It may be hard to reproduce this reliably...we might just have to settle for review by inspection.
I've added this patch to the RHEL5 test kernels on my people.redhat.com page: http://people.redhat.com/jlayton ...if you could test them and verify whether they fix the problem for you, it would be helpful. Even more helpful would be if you had a way to reliably reproduce this problem so we could construct a regression test. Regardless though, the refcount leak looks pretty obvious I think so in the absence of a regression test I'll still plan to propose this internally.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 335737 [details] reproducer: lock and unlock a file in a tight loop Here's the program I used to reproduce this. It just sets a non-blocking lock and unlocks it in a tight loop. When run against a file on a GFS2 export, the filesystem becomes un-umountable.
in kernel-2.6.18-139.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html