Bug 494885
| Field | Value |
|---|---|
| Summary | GFS2: gfs2_grow changes to rindex read in wrong by the kernel |
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.3 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | low |
| Reporter | Nate Straz <nstraz> |
| Assignee | Ben Marzinski <bmarzins> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | bmarzins, cluster-maint, dzickus, edamato, jtluka |
| Target Milestone | rc |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Last Closed | 2009-09-02 08:54:21 UTC |
Description (Nate Straz, 2009-04-08 14:34:30 UTC)
I'm not sure if this is related, but while trying to simplify the test case that hits this, I ran growfs on only one filesystem. I was still using a three node cluster, but only one node was actually mounting the filesystem. In this setup, I don't get any errors while growing the filesystem, but I hang on unmount.
Specifically, the super_block's s_umount rw_semaphore is blocked... boy, is it blocked! Looking at it, I see:

```
s_umount = {
  count = 851954,
  wait_lock = {
    raw_lock = {
      slock = 1
    }
  },
  wait_list = {
    next = 0xf5f1af08,
    prev = 0xf5f1af08
  }
},
```
Just under a million readers! I cannot see a single process that is actually in a section of code that would have a read lock on this, and the process list looks completely normal. So it appears we are failing to unlock this semaphore in some fairly common call path.
In fact, I hit this unmount hang problem with one node, lock_nolock, and no IO to the filesystem during the grow. Just to document the simplest reproducer for the hang, this is using the 2.6.18-150.el5 kernel:

```
# vgcreate test /dev/sdb1
# lvcreate -l 100%FREE test
# mkfs.gfs2 -p lock_nolock -j1 /dev/test/lvol0
# mount -t gfs2 /dev/test/lvol0 /mnt/test
# vgextend test /dev/sdb2
# lvextend -l +100%FREE test/lvol0
# gfs2_grow /mnt/test
# umount /mnt/test
```

For what it's worth, you can reproduce the "file system thinks it is larger than volume" error without doing any IO during the resizes. However, you must run the grows on different nodes to hit it. Also, the resource groups are set up correctly; it seems like only the statfs information is bad. Switching to statfs_slow breaks things worse.

This bug is caused by the following: in adjust_fs_space() you calculate the new space based on the resource groups, but nothing guarantees that other nodes will sync their statfs changes before you do this, so you can get into the following situation. The filesystem size starts at 1G. For both nodes A and B, the master size is 1G and the local size is 0.

1. Node A grows the fs by 1G. Node A now has a local size of 1G. Node B's local size is unchanged; the master size is unchanged.
2. Node B grows the fs by 1G. Node B now has a local size of 2G. Node A's local size is unchanged; the master size is unchanged.
3. Node A syncs its statfs changes. Node A now has a local size of 0. Node B's local size is unchanged; the master is now 2G.
4. Node B syncs its statfs changes. Node B now has a local size of 0. Node A's local size is unchanged; the master is now 4G.

Now everyone will see a 4G filesystem. It seems sensible to immediately sync filesystem size changes (calls to adjust_fs_space()); they should be infrequent enough that there are no real performance concerns.

I came up with two reasonable solutions:

1. Whenever you do a write to the sd_rindex, in adjust_fs_space you need to first read in the current value of the master statfs info (the first part of gfs2_statfs_sync), then update the values and sync them back to disk (the second part of gfs2_statfs_sync). As a manual workaround, you can already approximate this by running

   ```
   # echo 1 > /sys/fs/gfs2/<fs>/statfs_sync
   ```

   before and then again after growing the filesystem. Doing this does work around the problem.

2. Add a variable, hanging somewhere off the superblock, and initialize it to the existing size and number of free blocks before you do any writes to the sd_rindex. Then, in adjust_fs_space, you look at the new size, subtract the old size, and only add the change of the local grow to the local statfs information.

They both seem about equally complicated and invasive to me (one adds an extra variable and an extra call to gfs2_ri_total(), the other forces you to grab the statfs_inode glock on writes to the sd_rindex). Steve, if you have a preference, let me know; otherwise I'll probably start working on option 1.

It seems like after a grow, you should have the correct filesystem size, since you will know it. Option 2 doesn't guarantee this; it might need to wait until all the nodes have synced their statfs information.

Created attachment 347793 [details]
RHEL 5 fix.
This RHEL5 patch fixes the problem for me. I haven't given it much testing besides the growfs test yet.
In super.h:

```
 extern int gfs2_statfs_init(struct gfs2_sbd *sdp);
 extern void gfs2_statfs_change(struct gfs2_sbd *sdp, s64 total, s64 free,
                                s64 dinodes);
+void gfs2_statfs_change_in(struct gfs2_statfs_change_host *sc, const void *buf);
+void update_statfs(struct gfs2_sbd *sdp, struct buffer_head *m_bh,
+                   struct buffer_head *l_bh);
 extern int gfs2_statfs_sync(struct gfs2_sbd *sdp);
 extern int gfs2_statfs_i(struct gfs2_sbd *sdp, struct gfs2_statfs_change_host *sc);
 extern int gfs2_statfs_slow(struct gfs2_sbd *sdp, struct gfs2_statfs_change_host *sc);
```

Please make these extern like the others. We need an upstream patch too, but that doesn't look too tricky to do.

Sure. Another question: how exact do we need to be about how many blocks we reserve? In my patch, I only reserve one extra block on rindex writes, but that's clearly wrong, since we need to write to both the local and the master block, but our rindex write may not need to write to the local block (which the patch was counting on for one of the two reserved blocks). The easy fix is just to do:

```
+	if (&ip->i_inode == sdp->sd_rindex)
+		rblocks += 2 * RES_STATFS;
```

The less pretty, but more exact, method would be something along the lines of:

```
+	if (&ip->i_inode == sdp->sd_rindex)
+		rblocks += RES_STATFS + ((ind_blocks || data_blocks) ? 0 : 1)
```

Do we care if we reserve slightly too many blocks?

No, it shouldn't be an issue, as we refund any extra blocks at transaction commit time. A number of functions reserve more blocks than they need for this reason.

*** Bug 485924 has been marked as a duplicate of this bug. ***

Making this an RC blocker since we hit it often in our testing. GFS is able to pass this test, so a failure in GFS2 is considered a regression.

Posted

Created attachment 349126 [details]
Updated RHEL5 fix.
in kernel-2.6.18-156.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.

Patch is in -158.el5.

Adding SanityOnly.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html