Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
For bugs related to the Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA (https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745) to report new issues.

Bug 494885

Summary: GFS2: gfs2_grow changes to rindex read in wrong by the kernel
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Severity: medium
Priority: low
Status: CLOSED ERRATA
Target Milestone: rc
Reporter: Nate Straz <nstraz>
Assignee: Ben Marzinski <bmarzins>
QA Contact: Cluster QE <mspqa-list>
CC: bmarzins, cluster-maint, dzickus, edamato, jtluka
Doc Type: Bug Fix
Last Closed: 2009-09-02 08:54:21 UTC
Attachments:
RHEL 5 fix. (flags: none)
Updated RHEL5 fix. (flags: none)

Description Nate Straz 2009-04-08 14:34:30 UTC
Description of problem:

While testing gfs2_grow I found that the kernel can think that the file system is larger than the block device after a gfs2_grow.


lvextend -l +100%FREE growfs/grow1 on dash-02
  Extending logical volume grow1 to 698.16 GB
  Logical volume grow1 successfully resized
growing grow1 on dash-01
FS: Mount Point: /mnt/grow1
FS: Device:      /dev/mapper/growfs-grow1
FS: Size:        610057173 (0x245cbbd5)
FS: RG size:     523654 (0x7fd86)
DEV: Size:       732069888 (0x2ba28000)
The file system grew by 119153MB.
gfs2_grow complete.
verifying grow
getsize on dash-02
waiting for GFS file system size to update
getsize on dash-02
waiting for GFS file system size to update
getsize on dash-03
waiting for GFS file system size to update
getsize on dash-01
file system thinks it is larger than volume
fs = 873382, lv = 749639.57


Version-Release number of selected component (if applicable):
gfs2-utils-0.1.53-1.el5_3.2
kernel-2.6.18-128.el5

How reproducible:
easily

Steps to Reproduce:
1. growfs -2 -g 1 -i 5

  
Actual results:
See above

Expected results:
After a gfs2_grow, statfs should show that the file system's size is similar to the block device's size.

Additional info:

Comment 2 Ben Marzinski 2009-05-28 04:06:49 UTC
I'm not sure if this is related, but while trying to simplify the test case that hits this, I ran growfs on only one filesystem.  I was still using a three node cluster, but only one node was actually mounting the filesystem.  In this setup, I don't get any errors while growing the filesystem, but I hang on unmount.

Specifically, the super_block's s_umount rw_semaphore is blocked... boy, is it blocked! Looking at it, I see:

  s_umount = {
    count = 851954, 
    wait_lock = {
      raw_lock = {
        slock = 1
      }
    }, 
    wait_list = {
      next = 0xf5f1af08, 
      prev = 0xf5f1af08
    }
  }, 


Just under a million readers!  I cannot see a single process that is actually in a section of code that would have a read lock on this, and the process list looks completely normal.  So it appears we are failing to unlock this semaphore in some fairly common call path.

Comment 3 Ben Marzinski 2009-05-28 19:16:22 UTC
In fact, I hit this unmount hang problem with one node, lock_nolock, and no IO to the filesystem during the grow.

Comment 4 Ben Marzinski 2009-05-28 22:17:38 UTC
To document the simplest reproducer for the hang, using the 2.6.18-150.el5 kernel:

# vgcreate test /dev/sdb1
# lvcreate -l 100%FREE test
# mkfs.gfs2 -p lock_nolock -j1 /dev/test/lvol0
# mount -t gfs2 /dev/test/lvol0 /mnt/test
# vgextend test /dev/sdb2
# lvextend -l +100%FREE test/lvol0
# gfs2_grow /mnt/test
# umount /mnt/test

Comment 5 Ben Marzinski 2009-06-09 21:28:40 UTC
For what it's worth, you can reproduce the "file system thinks it is larger than volume" error without doing any I/O during the resizes. However, you must run the grows on different nodes to hit it. Also, the resource groups are set up correctly; it seems like only the statfs information is bad. Switching to statfs_slow makes things worse.

Comment 6 Ben Marzinski 2009-06-10 18:36:38 UTC
This bug is caused by the following:

In adjust_fs_space() you calculate the new space based on the resource groups, but nothing guarantees that other nodes will sync their statfs changes before you do this, so you can get into the following situation:

filesystem size starts at 1G

For both nodes A and B, the master size is 1G and the local size is 0

1. Node A grows the fs by 1G
Node A now has a local size of 1G
Node B local size is unchanged.
master size is unchanged

2. Node B grows the fs by 1G
Node B now has a local size of 2G
Node A local size is unchanged
master size is unchanged

3. Node A syncs its statfs changes
Node A now has a local size of 0
Node B local size is unchanged
master is now 2G

4. Node B syncs its statfs changes
Node B now has a local size of 0
Node A local size is unchanged
Master is now 4G

Now everyone will see a 4G filesystem.

It seems sensible to immediately sync filesystem size changes (calls to adjust_fs_space()). They should be infrequent enough that there are no real performance concerns.

Comment 7 Ben Marzinski 2009-06-11 04:57:50 UTC
I came up with two reasonable solutions

1. Whenever you write to the sd_rindex, in adjust_fs_space() you first need to read in the current value of the master statfs info (the first part of gfs2_statfs_sync()), then update the values and sync them back to disk (the second part of gfs2_statfs_sync()).

As a manual workaround, you can already approximate this by running

# echo 1 > /sys/fs/gfs2/<fs>/statfs_sync

before and then again after growing the filesystem. Doing this does work around the problem.

2. Add a variable, hanging somewhere off the superblock, and initialize it to the existing size and number of free blocks before you do any writes to the sd_rindex. Then, in adjust_fs_space(), you look at the new size, subtract the old size, and add only the change from the local grow to the local statfs information.

They both seem about equally complicated and invasive to me (one adds an extra variable and an extra call to gfs2_ri_total(), the other forces you to grab the statfs_inode glock on writes to the sd_rindex). Steve, if you have a preference, let me know; otherwise I'll probably start working on option 1. It seems like after a grow you should have the correct filesystem size, since you will know it. Option 2 doesn't guarantee this; it might need to wait until all the nodes have synced their statfs information.

Comment 8 Ben Marzinski 2009-06-14 02:05:13 UTC
Created attachment 347793 [details]
RHEL 5 fix.

This RHEL5 patch fixes the problem for me. I haven't given it much testing beyond the growfs test yet.

Comment 9 Steve Whitehouse 2009-06-15 08:36:37 UTC
In super.h:

extern int gfs2_statfs_init(struct gfs2_sbd *sdp);
 extern void gfs2_statfs_change(struct gfs2_sbd *sdp,
 			s64 total, s64 free, s64 dinodes);
+void gfs2_statfs_change_in(struct gfs2_statfs_change_host *sc, const void *buf);
+void update_statfs(struct gfs2_sbd *sdp, struct buffer_head *m_bh,
+		   struct buffer_head *l_bh);
 extern int gfs2_statfs_sync(struct gfs2_sbd *sdp);
 extern int gfs2_statfs_i(struct gfs2_sbd *sdp, struct gfs2_statfs_change_host *sc);
 extern int gfs2_statfs_slow(struct gfs2_sbd *sdp, struct gfs2_statfs_change_host *sc);


Please make these extern like the others. We need an upstream patch too, but that doesn't look too tricky to do.

Comment 10 Ben Marzinski 2009-06-15 23:14:52 UTC
Sure. Another question: how exact do we need to be about how many blocks we reserve? In my patch I only reserve one extra block on rindex writes, but that's clearly wrong, since we need to write to both the local and the master block, and our rindex write may not need to write to the local block (which the patch was counting on for one of the two reserved blocks). The easy fix is just to do:

+	if (&ip->i_inode == sdp->sd_rindex)
+		rblocks += 2 * RES_STATFS;

The less pretty, but more exact method would be something along the lines of:

+	if (&ip->i_inode == sdp->sd_rindex)
+		rblocks += RES_STATFS + ((ind_blocks || data_blocks) ? 0 : 1);

Do we care if we reserve slightly too many blocks?

Comment 11 Steve Whitehouse 2009-06-16 08:45:15 UTC
No, it shouldn't be an issue as we refund any extra blocks at transaction commit time. A number of functions reserve more blocks than they need for this reason.

Comment 12 Ben Marzinski 2009-06-23 15:14:47 UTC
*** Bug 485924 has been marked as a duplicate of this bug. ***

Comment 13 Nate Straz 2009-06-23 15:41:01 UTC
Making this an RC blocker since we hit it often in our testing. GFS is able to pass this test, so a failure in GFS2 is considered a regression.

Comment 14 Ben Marzinski 2009-06-23 18:18:37 UTC
Posted

Comment 15 Ben Marzinski 2009-06-23 18:19:30 UTC
Created attachment 349126 [details]
Updated RHEL5 fix.

Comment 17 Don Zickus 2009-06-30 20:22:19 UTC
in kernel-2.6.18-156.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 19 Jan Tluka 2009-07-20 15:25:00 UTC
Patch is in -158.el5. Adding SanityOnly.

Comment 21 errata-xmlrpc 2009-09-02 08:54:21 UTC
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html