Bug 538484
| Summary: | gfs2 rename rgrp lock issue | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Allen Belletti <allen> |
| Component: | kernel | Assignee: | Steve Whitehouse <swhiteho> |
| Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 5.4 | CC: | adas, bmarzins, cward, dzickus, jtluka, lwang, rpeterso, swhiteho |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 547640 (view as bug list) | Environment: | |
| Last Closed: | 2010-03-30 07:46:32 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 526947, 547640 | | |
Description
Allen Belletti
2009-11-18 17:12:09 UTC
Reassigning to Steve Whitehouse, since he talked to you about it.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Conditions required to hit this bug:
1. The rename must result in the unlinking of an inode.
2. The rename must require the allocation of a block in order to satisfy the space requirement for adding the new directory entry.
3. The resource group holding the old inode and the resource group in which the new blocks are being allocated must be the same.

At that point we end up trying to get two resource group locks at the same time. The second condition is relatively unlikely, since not only will the initial unlink have created some space in the target directory, but we also only need to allocate new blocks occasionally as a directory grows. In fact, we only need to search the same resource group for the block allocation rather than actually select it, which does make it a bit more likely that this bug will trigger. On the other hand, any large filesystem will have a lot of resource groups, so the chances of this happening reduce with filesystem size.

We can drop the lock on the rgrp early with a very simple patch, and that will prevent us from hitting this bug again. However, that's not quite the whole story, as there is still an issue with the locking of the two resource groups and their relative ordering. That will need to be addressed in order to avoid distributed deadlock. Bearing in mind the complexity of that, and the likelihood of two nodes hitting this at the same time (considering that it's tricky to hit even on a single node), it might be better to do the simple fix first. That will no doubt cover the majority of cases.
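To make the conditions and the two proposed remedies above concrete, here is a minimal Python sketch. It is a toy model, not the GFS2 code or the attached patch: `RgrpLock`, `SelfDeadlock`, and every function name below are hypothetical, and it assumes the failure is a rename asking for a lock on a resource group it already holds. It models only a single node, so it says nothing about the distributed case beyond showing a consistent lock order.

```python
class SelfDeadlock(Exception):
    """A holder asked for a non-recursive lock it already holds."""


class RgrpLock:
    """Toy resource group lock; hypothetical, not the GFS2 glock API."""

    def __init__(self, addr):
        self.addr = addr      # block address of the resource group
        self.holder = None

    def acquire(self, who):
        if self.holder == who:
            raise SelfDeadlock(f"{who} already holds rgrp {self.addr}")
        assert self.holder is None, "contention is not modelled here"
        self.holder = who

    def release(self, who):
        assert self.holder == who
        self.holder = None


def rename_hits_bug(unlinks_target, needs_dir_block, unlink_rgrp, searched_rgrps):
    """The three conditions from the comment, written as a predicate."""
    return unlinks_target and needs_dir_block and unlink_rgrp in searched_rgrps


def naive_rename(locks, unlink_rgrp, alloc_rgrp, who="rename"):
    """Hold the unlink rgrp lock while taking the allocation rgrp lock."""
    locks[unlink_rgrp].acquire(who)
    try:
        locks[alloc_rgrp].acquire(who)   # raises if it is the same rgrp
        locks[alloc_rgrp].release(who)
    finally:
        locks[unlink_rgrp].release(who)


def early_drop_rename(locks, unlink_rgrp, alloc_rgrp, who="rename"):
    """The 'simple fix' idea: drop the unlink rgrp lock before allocating."""
    locks[unlink_rgrp].acquire(who)
    locks[unlink_rgrp].release(who)      # dropped before allocation starts
    locks[alloc_rgrp].acquire(who)
    locks[alloc_rgrp].release(who)


def ordered_rename(locks, unlink_rgrp, alloc_rgrp, who="rename"):
    """Lock each distinct rgrp once, in ascending address order."""
    order = sorted({unlink_rgrp, alloc_rgrp})
    for addr in order:
        locks[addr].acquire(who)
    for addr in reversed(order):
        locks[addr].release(who)
    return order
```

In this model, `naive_rename(locks, 100, 100)` raises `SelfDeadlock`, while `early_drop_rename` and `ordered_rename` both complete. Sorting by address is one common way to give every node the same acquisition order; whether the real fix orders resource groups this way is not stated in this report.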
The slightly odd thing about this bug is that the only time we need to add a new block to a directory is when there isn't enough space in it already. Given that the only time we unlink an inode is when there is a target inode directory entry with the same name as the source inode's directory entry, there should always be enough space (since the target inode's entry will have been removed). So there might be more to this issue than is immediately apparent.

Changing the name of this bug so that I don't confuse myself again. Also, I think I might have a fix for it now. I am just testing the upstream version, and a RHEL5 version will be on its way once I've done some testing upstream.

Created attachment 373130 [details]
Proposed patch
This is an upstream patch aimed at fixing the reported issue.
Created attachment 373134 [details]
RHEL5 version of patch
This is the RHEL5 version of the original patch.
Allen, if we supply you with a test kernel with the patch from comment #11, are you in a position to see if it fixes the bug?

Steve, I would be happy to. Of course, the issue is so relatively rare that it will be a bit difficult to know for sure. Thanks for all of the quick work on this!

In case this is useful, here are copies of the fsck logs that I generated over the weekend. You'll note that numerous errors were corrected, despite fsck having been run pretty recently. Perhaps these contributed to triggering this bug.

Created attachment 373253 [details]
fsck log of first filesystem
Created attachment 373254 [details]
fsck log of second filesystem
Created attachment 375402 [details]
Test kernel
Allen, please find attached a test kernel rpm. If you need the other bits and bobs (kernel headers, debug stuff, etc) then let me know and I'll attach that too. Let us know how you get on.
post2:/root # uname -a
Linux post2.isye.gatech.edu 2.6.18-175.gfs2abhi.001 #1 SMP Tue Dec 1 09:59:50 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

Both nodes are up and running on the test kernel. Nothing unusual so far. Since this is such a rare issue, it may be a while before I can confidently state that "the problem is gone", but nothing is grossly broken.

Thanks!
Allen

Allen, any more news? If you've not hit any further issues then I'm seriously considering pushing this patch into our next version...

Hi Steve, I've seen no further occurrences of the problem described in this bug, and the patched kernel hasn't added any new problems that I can see. Should be safe to go for it, thanks.

in kernel-2.6.18-180.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2010-0178.html