Bug 612608
| Summary: | GFS2: kernel BUG at fs/gfs2/glock.c:173! running brawl w/flocks | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Nate Straz <nstraz> |
| Component: | kernel | Assignee: | Steve Whitehouse <swhiteho> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | low | | |
| Version: | 6.0 | CC: | adas, arozansk, bmarzins, ddumas, nstraz, rpeterso, rwheeler, swhiteho, syeghiay, teigland |
| Target Milestone: | rc | | |
| Target Release: | 6.0 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-2.6.32-160.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 604244 | Environment: | |
| Last Closed: | 2011-12-06 12:24:56 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 604244 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Description
Nate Straz
2010-07-08 15:35:15 UTC
I've tried some test programs and have not been able to recreate the problem. One program took flocks and tried to unlock them twice; I did variations on that theme. Another version forked hundreds of processes, each of which did an open, a non-blocking flock, then exit(0). The problem did not recreate for me. My next step is to use accordion or a simplified version of it. Note that I was previously able to run the entire brawl set on my cluster without failure.

I suspect that you'll need to do flock(fd); close(fd); rather than an explicit unlock in order to reproduce this issue, since that is the most likely cause of the problem. We could just drop a dq_wait into the do_unflock() function rather than the non-waiting one we have now. The only issue is that it would potentially slow down flock-using programs, but it might be a small enough delay that it won't matter. A faster solution (better, but more complicated) would be to not drop the ref to the flock glock once it has been touched once, except at close time. That would mean that there would still be a ref to the glock at close time, which could then be used to wait on the pending demote (if any). In other words, we'd only wait for the demote if we needed to close the fd, and not in the (faster) path of do_unflock. I'm not sure that the extra complexity is worth it.

I tried a wide variety of programs to recreate this problem again today, including many different parameter combinations in accordion, and nothing seems to recreate it. I tried Steve's suggestions in comment #3 and nothing seems to make a difference.

This issue has been proposed when we are only considering blocker issues in the current Red Hat Enterprise Linux release. It has been denied for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current release, ask your support representative to file it as a blocker on your behalf. Otherwise ask that it be considered for the next Red Hat Enterprise Linux release. **

Maybe this is a duplicate of bug #537010. I'll ping Dave T. to see if it could be the same thing, and if there's a fix for RHEL 6.0.

There are no basts (blocking callbacks) for flocks, so the other bug shouldn't be a factor. I still think that this is simply due to a race where we are closing an fd and not waiting for the reply from the DLM at any stage. Normally the time taken for this sequence of operations is long enough that we don't see a problem; in some cases, though, the glock has vanished first. One simple fix is just to use the waiting _dq function in the do_unflock function, as per comment #3. If that doesn't affect performance too much, then that should solve the problem. We should try to get this one in for RHEL 6.

Created attachment 432349 [details]
One possible fix (may have perf implications)

This is what I was thinking of in comment #3.
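As a rough illustration of the simpler option discussed above (switching the flock unlock path to the waiting dequeue), a sketch of the kind of change being talked about follows. This is not the contents of attachment 432349; the function and field names are recalled from the GFS2 code of that era and may not match the actual patch.

```c
/*
 * Hand-written sketch of the change discussed in comment #3: make the
 * flock unlock path wait for the demote/DLM reply instead of returning
 * as soon as the dequeue has been queued.  Names are assumptions based
 * on the 2.6.32-era GFS2 code, not the actual attachment.
 */
static void do_unflock(struct file *file, struct file_lock *fl)
{
	struct gfs2_file *fp = file->private_data;
	struct gfs2_holder *fl_gh = &fp->f_fl_gh;

	mutex_lock(&fp->f_fl_mutex);
	flock_lock_file_wait(file, fl);
	if (fl_gh->gh_gl) {
		/* was: gfs2_glock_dq(fl_gh); -- that returns before the
		 * demote completes, so a close() racing with the reply can
		 * find the glock already gone and hit the BUG in glock.c */
		gfs2_glock_dq_wait(fl_gh);
	}
	mutex_unlock(&fp->f_fl_mutex);
}
```

The trade-off noted in the comments is that every unlock now waits for the demote to complete, which is why the alternative of holding the glock reference until close time was also considered.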
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.

This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.

Clearing needinfo, since I can't see any questions which remain to be answered. The patch is not yet upstream, but there seems no reason not to push it upstream and include it in RHEL 6. Since we have no reproducer, I'd say that this isn't greatly urgent, though.

Moving out to 6.2. The patch is posted upstream for -nmw.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Created attachment 504479 [details]
RHEL6 version of patch
Notes for QE: Since this bug cannot apparently be reproduced, the only testing that we need to do is a check for regressions in flock.

Patch(es) available on kernel-2.6.32-160.el6.

Verified that the patch is included in kernel-2.6.32-178.el6. I have not hit this during regression runs thus far.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html
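For reference, the QE note above limits verification to a regression check for flock. A minimal userspace sketch of the pattern the earlier comments suspect as the trigger (take an flock, then close() the descriptor without an explicit LOCK_UN, from many short-lived processes) might look like the following. The mount path, process count, and flags are illustrative assumptions, not the actual brawl or accordion test code.

```c
/*
 * Minimal sketch of the suspected trigger pattern: flock then close()
 * with no explicit unlock, repeated across many short-lived processes.
 * Path and counts are assumptions for illustration only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>
#include <sys/wait.h>

int main(void)
{
	const char *path = "/mnt/gfs2/flock-test";   /* assumed GFS2 mount */
	int nprocs = 200;

	for (int i = 0; i < nprocs; i++) {
		pid_t pid = fork();
		if (pid < 0) {
			perror("fork");
			exit(1);
		}
		if (pid == 0) {
			int fd = open(path, O_CREAT | O_RDWR, 0644);
			if (fd < 0)
				_exit(1);
			/* non-blocking flock, as in the reporter's test programs */
			if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
				/* close without LOCK_UN: the flock(fd); close(fd);
				 * sequence comment #3 suggests is needed */
				close(fd);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}
```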