Bug 482756
| Summary: | GFS2: After gfs2_grow, new size is not seen immediately | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Robert Peterson <rpeterso> |
| Component: | kernel | Assignee: | Ben Marzinski <bmarzins> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.3 | CC: | bturner, ctatman, cward, dejohnso, dzickus, edamato, jtluka, qcai, rpeterso, swhiteho, tao, tdunnon |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 469773 | Environment: | |
| Last Closed: | 2010-03-30 07:10:32 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 526947, 533192 | | |
| Attachments: | | | |
Description
Robert Peterson
2009-01-27 21:39:59 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Created attachment 367528 [details]
Fix to reinitialize the resource group index after growing the filesystem
This problem actually exists on both single node and cluster setups.
The first problem, which caused the failure on cluster setups, is that the rindex list was supposed to be invalidated when nodes dropped their rindex glock, but the code to do that was in meta_go_inval() instead of inode_go_inval(). I can't see any reason why that code was in meta_go_inval(): it never got called during my testing, and I can't see any way that it could get called. Still, I dislike removing code that I don't understand, so if there is a reason for that meta_go_inval() code, someone please let me know, and I'll add it back.
The second problem is that on single-node setups, the node never needs to drop the rindex glock. There are multiple ways to solve this. I could have added code that manually updated the rindex list when you grew the filesystem. Instead, I just forced the node to actually drop its rindex glock, which invalidates the rindex list. The next time the node needs to allocate space, it will pick the glock back up and reinitialize the list. This is not the fastest way to do things, but it does mean that all nodes in a cluster do the same thing to invalidate and reinitialize their rindex list, and since growing a filesystem is a pretty rare event, the additional overhead seems acceptable.
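The invalidate-on-lock-drop pattern described above can be sketched in a small userspace model (this is not the actual GFS2 code; all names here are invented for illustration): dropping the lock marks the cached resource-group list stale, and the next acquisition re-reads the "on-disk" index before any allocation proceeds.

```c
#include <stdbool.h>

/* Invented userspace model of the fix described above: a cached index
 * protected by a lock. Dropping the lock invalidates the cache; the
 * next acquisition rebuilds it from the simulated on-disk rindex. */

static int on_disk_rgrps = 8;   /* simulated on-disk resource group count */

struct rindex_cache {
    bool lock_held;             /* stands in for holding the rindex glock */
    bool list_valid;            /* is the cached list still current? */
    int  rgrp_count;            /* cached copy of the rgrp count */
};

/* Dropping the lock invalidates the cached list (the inode_go_inval()
 * side of the fix in the model). */
void drop_lock(struct rindex_cache *c)
{
    c->lock_held = false;
    c->list_valid = false;
}

/* Re-acquiring the lock rebuilds the list before it is used again. */
void acquire_lock(struct rindex_cache *c)
{
    c->lock_held = true;
    if (!c->list_valid) {
        c->rgrp_count = on_disk_rgrps;  /* re-read from "disk" */
        c->list_valid = true;
    }
}

/* gfs2_grow analogue: update the on-disk index, then force the growing
 * node to drop its lock so its own stale cache is discarded too. */
void grow(struct rindex_cache *c, int new_rgrps)
{
    on_disk_rgrps = new_rgrps;
    drop_lock(c);
}
```

The point of the design choice is uniformity: every node, including the one that ran the grow, takes the same invalidate-and-rebuild path on its next allocation, rather than the grower patching its own list in place.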
Posted in kernel-2.6.18-174.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.

NOTE: From the customer: The hotfix given to me by Jeremy West and Linda did not fix the customer issue. Because it was a grow issue, I did have them install the hotfix kernel on both nodes. They attempted to grow a gfs2 volume, and were still unable to use the new space immediately. New sosreports and stack traces for the grow have been attached to this ticket. Will be attaching the following:

sosreport-mageshkumar.gajapathy.804042671-17973-811dea.tar.bz2
sosreport-mageshkumar.gajapathy.804042671-8183-a23186.tar.bz2
gfs2_grow.strace1

Created attachment 388382 [details]
gfs2_grow strace with hotfix installed
Event posted on 02-02-2010 04:14pm EST by dejohnso

Verified from sosreport that hotfix is installed:

[dejohnso@dhcp242-193 mageshkumar.gajapathy.804042671-17973]$ cat uname
Linux sbici 2.6.18-174.el5 #1 SMP Mon Nov 16 22:54:31 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

This event sent from IssueTracker by dejohnso, issue 336608

NOTE: Verified that the hotfix has the code by extracting the src rpm and checking it. I went over linux-2.6-gfs2-drop-rindex-glock-on-grows.patch line by line and it is all there.

So why are they not seeing the grow? Are they reproducing this the same way as before? Do they still have to wait for the filesystem to be remounted to see the space, or does it appear if they wait a little bit? Would it be possible to get a copy of all the commands that they run, and the output of all of them, including running lvdisplay and vgdisplay both at the start and the end of the testing?

Created attachment 388868 [details]
vgdisplay of the customer's system
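For reference against the requested command output, the expected grow sequence on a clustered setup would look roughly like the following. The volume group, logical volume, and mount point names here are hypothetical, not taken from the customer's system:

```shell
# clvmd must be running on every node before touching shared LVM metadata.
service clvmd status

# Capture the starting state (as requested above).
vgdisplay vg_shared
lvdisplay /dev/vg_shared/lv_gfs2

# Extend the logical volume, then grow the mounted GFS2 filesystem.
lvextend -L +50G /dev/vg_shared/lv_gfs2
gfs2_grow /mnt/gfs2

# The new space should be visible on all nodes without a remount.
df -h /mnt/gfs2
```

Note that gfs2_grow is run against the mount point of a mounted filesystem, and the extra space should appear cluster-wide once the grow completes.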
Thanks, but I would really like this in the context of running all of the commands. I'd also like to see what they used when they created the filesystem.

Also, looking at the vgdisplay output, it looks like they don't have clvmd running. However, they do have two nodes, right? Or are they testing with just a single node now? If they are running in a cluster with two nodes accessing the storage, they need to have clvmd running, or things can go very wrong. I'm not saying that this is the cause of their issue, but live-growing a shared volume in a cluster without clvmd running is a bad idea.

If the customer isn't running IO on both nodes (assuming that they are actually using both nodes), can they try doing some IO on the node that they didn't grow the filesystem on, after the grow completes, and see if that makes them able to see the new space? This shouldn't be necessary to see the new space, but if it clears up the problem, that narrows down where the bug could be. Also, are they mounting the filesystem with any mount options?

~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.

If this problem is still reproducible, I need the information from the debug kernel to have a chance at solving it, since I am unable to reproduce it myself.

An advisory has been issued which should help the problem described in this bug report.
This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html

From the information in the last two comments, this doesn't look like the original bug. I trust that the output from Comment #57 is from the command that caused the error in Comment #56, meaning the filesystem didn't grow the full size that it was supposed to. After this happened, did the customer unmount and remount the filesystem? If so, did it fix the problem?

If unmounting and remounting didn't fix the problem, then this is a completely different bug than was originally reported. This actually sounds a lot like bz #469773, which was a problem in the gfs2 utils that caused filesystems to grow less than they should. It was fixed in gfs2-utils-0.1.58-1.el5. According to the sosreports from the time of the original bug, the customer was using gfs2-utils-0.1.53-1.el5_3.3-x86_64. Can you check if they are currently using an updated gfs2-utils package? If they are not, could they try using gfs2-utils-0.1.58-1.el5 or newer, and see if that solves their problem?

If they saw this while using gfs2-utils-0.1.58-1.el5 or a newer version, and the problem did not fix itself when they unmounted and remounted the filesystem, can you please either open a new bug or reopen #469773. If remounting the filesystem did fix the problem, then we can probably keep the discussion under this bugzilla for now. In that case, I'd really like them to run my debug kernel, so I can see what happened to the resource group index.

The only entry I saw was:

Apr 20 15:54:46 sbidb kernel: GFS2: fsid=pbi_prd:ora_pbi_saporg.0: File system extended by 256160 blocks.

This can be found in the file messages.debugkernel.

Created attachment 409306 [details]
messages from gfs2_grow with the debug kernel