Bug 1268045 - GFS2: fsck.gfs2 requires too much memory on large file systems
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: gfs2-utils
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Robert Peterson
QA Contact: cluster-qe@redhat.com
Docs Contact: Milan Navratil
Depends On: 1153316 1184482 1271674
Blocks: 1111393 1165285
Reported: 2015-10-01 12:48 EDT by Nate Straz
Modified: 2016-11-04 02:30 EDT (History)
CC List: 8 users

See Also:
Fixed In Version: gfs2-utils-3.1.9-1.el7
Doc Type: Release Note
Doc Text:
fsck.gfs2 has been enhanced to require considerably less memory on large file systems. Prior to this update, the Global File System 2 (GFS2) file system checker, fsck.gfs2, required a large amount of memory to run on large file systems, and running fsck.gfs2 on file systems larger than 100 TB was therefore impractical. With this update, fsck.gfs2 runs in considerably less memory, which allows for better scalability and makes running fsck.gfs2 on much larger file systems practical.
Story Points: ---
Clone Of: 1153316
Environment:
Last Closed: 2016-11-04 02:30:17 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Output from valgrind's massif memory profiling tool, 10TB 80% full GFS2 fsck.gfs2 (46.24 KB, application/x-gzip)
2015-11-09 15:22 EST, Nate Straz
Collection of 40 memory patches (56.02 KB, application/octet-stream)
2016-05-06 13:32 EDT, Robert Peterson
My latest fsck nightmare test script (15.02 KB, text/plain)
2016-05-09 13:09 EDT, Robert Peterson

Description Nate Straz 2015-10-01 12:48:03 EDT
+++ This bug was initially created as a clone of Bug #1153316 +++

Description of problem:

fsck.gfs2 uses too much memory on large (>100TB) file systems.
Comment 1 Robert Peterson 2015-10-01 13:12:49 EDT
Reassigning to myself.
Comment 2 Robert Peterson 2015-10-26 13:48:56 EDT
Hey Nate, can I get some vmstat statistics for this?
I'm curious if the memory rises immediately due to the large
number of bitmaps, or if it rises slowly throughout pass1,
due to the directory tree or inode tree built by pass1, or
something else. I'm assuming memory doesn't grow much after
pass1 is complete, right?
Comment 3 Nate Straz 2015-11-09 15:22 EST
Created attachment 1091949 [details]
Output from valgrind's massif memory profiling tool, 10TB 80% full GFS2 fsck.gfs2

Here is some memory profiling using valgrind's massif tool.  

Pass times from fsck.gfs2 output:

pass1 completed in 12h5m5.417s
pass1b completed in 0.000s
pass1c completed in 1d44m8.176s
pass2 completed in 14m29.265s
pass3 completed in 0.099s
pass4 completed in 3.894s
pass5 completed in 1m17.906s
check_statfs completed in 0.004s
Comment 4 Robert Peterson 2016-03-02 10:55:20 EST
Based on this information, the vast majority of memory is
taken for the inode tree:

struct inode_info
{
        struct osi_node node;
        struct gfs2_inum di_num;
        uint32_t   di_nlink;    /* the number of links the inode
				 * thinks it has */
        uint32_t   counted_links; /* the number of links we've found */
};

struct gfs2_inum {
	__be64 no_formal_ino;
	__be64 no_addr;
};

struct osi_node {
	unsigned long  osi_parent_color;
	struct osi_node *osi_left;
	struct osi_node *osi_right;
};

The inode tree lookup is used in these places:

link.c:24:	ii = inodetree_find(ip->i_di.di_num.no_addr);
link.c:40:	ii = inodetree_find(no.no_addr);
link.c:73:	ii = inodetree_find(inode_no);
metawalk.c:88:					ii = inodetree_find(blk);
pass1b.c:348:				ii = inodetree_find(ip->i_di.di_num.no_addr);
pass2.c:188:	ii = inodetree_find(entry.no_addr);
pass2.c:597:	ii = inodetree_find(entry->no_addr);
pass4.c:186:		if (!(ii = inodetree_find(lf_dip->i_di.di_num.no_addr))) {

There isn't much room for "give" here. If we change it from
an rb-tree to a linked list, it will crush performance, unless
we go with a hash table of linked lists, which may be acceptable.

Another thought is that we can investigate whether we can get
away with ONLY adding directory inodes to the tree, rather than
all inodes. That can potentially give us big savings.
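To put a rough number on the per-entry cost, here is a minimal sketch built from the struct definitions quoted above. The helper name `inodetree_gb` and the use of `uint64_t` in place of `__be64` are my own for illustration; malloc and rb-tree rebalancing overhead are not counted:

```c
#include <assert.h>
#include <stdint.h>

/* Struct layouts as quoted above (uint64_t standing in for __be64). */
struct osi_node {
	unsigned long osi_parent_color;
	struct osi_node *osi_left;
	struct osi_node *osi_right;
};

struct gfs2_inum {
	uint64_t no_formal_ino;
	uint64_t no_addr;
};

struct inode_info {
	struct osi_node node;
	struct gfs2_inum di_num;
	uint32_t di_nlink;	/* the link count the inode claims */
	uint32_t counted_links;	/* the link count fsck actually found */
};

/* Rough tree memory in GB for a given dinode count: one
 * inode_info per dinode, ~48 bytes each on a 64-bit build. */
static double inodetree_gb(uint64_t ninodes)
{
	return (double)(sizeof(struct inode_info) * ninodes)
	       / (1024.0 * 1024.0 * 1024.0);
}
```

At 48 bytes per node, 50 million dinodes already cost roughly 2.2 GB of tree nodes alone, which is in the same ballpark as the ~2.48 GB massif attributes to inodetree_insert in the next comment.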
Comment 5 Robert Peterson 2016-03-02 12:27:38 EST
I've analyzed the code with regard to the biggest memory hogs.
84.93% of the memory used breaks down as follows:

->52.36% (2,484,701,408B) 0x40835C: inodetree_insert (inode_hash.c:50)
->17.91% (849,760,776B) 0x426E2E: bget (buf.c:30)
->14.14% (671,089,146B) 0x42227B: gfs2_bmap_create (util.c:531)
->00.52% (24,529,057B) in 1+ places, all below ms_print's threshold (01.00%)

Based on this, I have a plan that would greatly reduce fsck.gfs2's
memory requirements.

1. The biggest saving will be gained by using the inodetree ONLY
   for inodes that have a link count greater than 1. This would be
   directories and hard linked dinodes.
2. In pass2, it currently counts inode links for every directory's
   dentries to figure out what's linked and what's not. Instead, we
   can keep a bitmap, just like the main bitmap, and use that bitmap
   to indicate all dinodes that have a link count of "1", which
   should be the vast majority. When we process a dentry, look in
   that new bitmap: If it's 0, set it to 1. If it's 1, we need a
   way to keep a larger count, so fall back to the inodetree.
   In theory there shouldn't be nearly as many inodes in the tree,
   so it will take up a lot less space. Call it 50% savings.
3. We can also add a link counter specifically for directories and
   stick it in the dirtree, because we already know they will have
   a link count greater than 1. That way, the inodetree will ONLY
   contain non-directory dinodes that have a link count greater
   than 1, which really ought to be very few.
4. When pass1 is complete, the blockmap should be completely in sync
   with the bitmap. At that point, we should be able to immediately
   skip to pass5 and fix up any bits that don't match.
   The purpose of pass5 is basically to "free" blocks that weren't
   found as "referenced" in pass1. Once this process is complete,
   we should be able to free the blockmap altogether and do all
   bit manipulation on the rgrp bitmaps directly. That frees up
   another 14% of the memory, which means that in its place,
   pass2 can re-use that memory for the new bitmap I talked about
   in item 2 above.
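The dentry-counting scheme in item 2 can be sketched as below. The names (`link1_map`, `count_dentry_link`, `inodetree_bump`) are hypothetical stand-ins, not the real fsck.gfs2 bitmap or inodetree API; the point is that the common single-link case stays in a one-bit-per-block map and only multiply-linked dinodes ever reach the tree:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One bit per dinode: set means "exactly one link seen so far". */
struct link1_map {
	uint8_t *bits;
	uint64_t nblocks;
};

static int test_bit(struct link1_map *m, uint64_t b)
{
	return (m->bits[b >> 3] >> (b & 7)) & 1;
}

static void set_bit(struct link1_map *m, uint64_t b)
{
	m->bits[b >> 3] |= (uint8_t)(1 << (b & 7));
}

/* Stand-in for the real inodetree; just counts how many dinodes
 * overflowed past one link.  In fsck.gfs2 this would be
 * inodetree_find()/inodetree_insert() bumping counted_links. */
static uint64_t inodetree_entries;

static void inodetree_bump(uint64_t block)
{
	(void)block;
	inodetree_entries++;
}

/* Called once per dentry processed in pass2: the first link seen
 * for a dinode only touches the bitmap; the second and later
 * links fall back to the (now much smaller) inodetree. */
void count_dentry_link(struct link1_map *m, uint64_t block)
{
	if (!test_bit(m, block))
		set_bit(m, block);	/* first link: cheap path */
	else
		inodetree_bump(block);	/* 2nd+ link: use the tree */
}
```

Since hard links to non-directories are rare in practice, the tree shrinks from one entry per dinode to one entry per hard-linked dinode, which is where the bulk of the projected savings comes from.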
Comment 6 Robert Peterson 2016-03-02 12:37:43 EST
Just to clarify my last comment: The new order of things would be:

1. Pass1 - Gets the blockmap in sync with the bitmaps
2. Pass5 - Gets the blockmap finalized, rgrps written out
   At this point we free the blockmap altogether and use the rgrps
3. Pass1b - Same as before, but use rgrps for any bit changes needed
4. Pass1c - Same as before, but use rgrps for any bit changes needed
5. Pass2 - Allocate a new bitmap for all inodes with link count 1.
   Link counts for directories are kept in the directory tree, not
   the inode tree. If the link count is already 1, insert an entry
   into the inodetree because we've got a special exception to the rule.
6. Pass3 - Same as before
7. Pass4 - Instead of traversing the inodetree, traverse the inode
   bitmap, just like pass1 does today. For every dinode:
   1. Check if it's in the dirtree, and if so, verify its link count
      from that, then continue.
   2. Check if it's in the bitmap as link count 1. If so, verify its
      inode link count is also 1, then continue.
   3. Check if it's in the inodetree, and if so, verify its link
      count from that, then continue.
8. The rest (syncing statfs) is business as usual.
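The per-dinode lookup order in step 7 amounts to a three-way dispatch: dirtree first, then the link-count-1 bitmap, then the residual inodetree. A toy self-contained sketch, with flat arrays standing in for the real search structures and all names (`pass4_check_dinode`, the tables) hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Toy 16-block "file system"; flat arrays stand in for the real
 * dirtree, link-1 bitmap and inodetree search structures. */
#define NBLOCKS 16

struct link_counts {
	uint32_t di_nlink;	/* link count the dinode claims */
	uint32_t counted_links;	/* link count actually found */
};

static struct link_counts dirtree[NBLOCKS];
static int in_dirtree[NBLOCKS];
static int link1_bit[NBLOCKS];
static uint32_t ondisk_nlink[NBLOCKS];
static struct link_counts inodetree[NBLOCKS];
static int in_inodetree[NBLOCKS];

/* Pass4 per-dinode check in the order comment 6 describes:
 * 1. directories (dirtree), 2. the common link-count-1 case
 * (bitmap), 3. hard-linked non-directories (small inodetree).
 * Returns 0 when the link count is consistent, -1 otherwise. */
int pass4_check_dinode(uint64_t b)
{
	if (in_dirtree[b])
		return dirtree[b].di_nlink == dirtree[b].counted_links
		       ? 0 : -1;
	if (link1_bit[b])
		return ondisk_nlink[b] == 1 ? 0 : -1;
	if (in_inodetree[b])
		return inodetree[b].di_nlink == inodetree[b].counted_links
		       ? 0 : -1;
	return -1;	/* referenced dinode with no recorded links */
}
```

The ordering matters only for speed, not correctness: a dinode lives in exactly one of the three structures, and the bitmap test, the cheapest, covers the overwhelming majority.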
Comment 7 Robert Peterson 2016-05-06 13:32 EDT
Created attachment 1154716 [details]
Collection of 40 memory patches

This tarball contains the following memory-related patches:

Bob Peterson (40):
  fsck.gfs2: Move pass5 to immediately follow pass1
  fsck.gfs2: Convert block_type to bitmap_type after pass1 and 5
  fsck.gfs2: Change bitmap_type variables to int
  fsck.gfs2: Use di_entries to determine if lost+found was created
  fsck.gfs2: pass1b shouldn't complain about non-bitmap blocks
  fsck.gfs2: Change all fsck_blockmap_set to fsck_bitmap_set
  fsck.gfs2: Move set_ip_blockmap to pass1
  fsck.gfs2: Remove unneeded parameter instree from set_ip_blockmap
  fsck.gfs2: Move leaf repair to pass2
  fsck.gfs2: Eliminate astate code
  fsck.gfs2: Move reprocess code to pass1
  fsck.gfs2: Separate out functions that may only be done after pass1
  fsck.gfs2: Divest check_metatree from fsck_blockmap_set
  fsck.gfs2: eliminate fsck_blockmap_set from check_eattr_entries
  fsck.gfs2: Move blockmap stuff to pass1.c
  fsck: make pass1 call bitmap reconciliation AKA pass5
  fsck.gfs2: make blockmap global variable only to pass1
  fsck.gfs2: Add wrapper function pass1_check_metatree
  fsck.gfs2: pass counted_links into fix_link_count in pass4
  fsck.gfs2: refactor pass4 function scan_inode_list
  fsck.gfs2: More refactoring of pass4 function scan_inode_list
  fsck.gfs2: Fix white space problems
  fsck.gfs2: move link count info for directories to directory tree
  fsck.gfs2: Use bitmaps instead of linked list for inodes w/nlink == 1
  fsck.gfs2: Refactor check_n_fix_bitmap to make it more readable
  fsck.gfs2: adjust rgrp inode count when fixing bitmap
  fsck.gfs2: blocks cannot be UNLINKED in pass1b or after that
  fsck.gfs2: Add error checks to get_next_leaf
  fsck.gfs2: re-add a non-allocating repair_leaf to pass1
  libgfs2: Allocate new GFS1 metadata as type 3, not type 1
  fsck.gfs2: Undo partially done metadata records
  fsck.gfs2: Eliminate redundant code in _fsck_bitmap_set
  fsck.gfs2: Fix inode counting bug
  fsck.gfs2: Adjust bitmap for lost+found after adding to dirtree
  GFS2: Add initialization checks for GFS1 used metadata
  fsck.gfs2: Use BLKST constants to make pass5 more clear
  fsck.gfs2: Fix GFS1 "used meta" accounting bug
  fsck.gfs2: pass1b is too noisy wrt gfs1 non-dinode metadata
  fsck.gfs2: Fix rgrp dinode accounting bug
  fsck.gfs2: Fix rgrp accounting in check_n_fix_bitmap

A few of them are cleanups, and a few are bug fixes that may be
done under the guise of another bz. Most of the bugs I've fixed
aren't going to be found on customer systems in practice, because
pass5 currently runs last and artificially compensates for them:
a bug-then-pass5 sequence is never seen. The patch set necessarily
moves pass5 to run right after pass1, so the new order of things,
pass5-then-bug, exposes those bugs, and they need to be fixed.
Comment 8 Robert Peterson 2016-05-08 17:38:17 EDT
These patches all passed my fsck nightmare test.
Comment 9 Robert Peterson 2016-05-09 13:06:42 EDT
My plan is to push the 40 patches to the upstream gfs2-utils
git tree, and rely upon the fact that bug #1271674 will pull
in those changes.
Comment 10 Robert Peterson 2016-05-09 13:09 EDT
Created attachment 1155403 [details]
My latest fsck nightmare test script

For the record, this is my latest version of the fsck nightmare
tests that passed. All the metadata should be on both systems
gfs-i24c-01 and gfs-a16c-01 (as well as elsewhere).
Comment 11 Robert Peterson 2016-05-13 09:21:01 EDT
These patches are all pushed to the master branch of the upstream
gfs2-utils git repo. They should be picked up automatically as
per comment #9. Changing status to POST.
Comment 15 Nate Straz 2016-08-10 12:09:53 EDT
A comparison of the rhel7.2 fsck.gfs2 vs rhel7.3 fsck.gfs2 shows memory usage has improved again for populated file systems.  We're now under 20GB for a 256TB file system.

Memory usage for fsck.gfs2 @ empty
        FS Size  3.1.8-6.el7/fsc  3.1.9-2.el7/fsc  
            16G           1.35MB           1.34MB  
            32G           2.12MB           2.11MB  
            64G           4.42MB           4.41MB  
           128G           9.01MB           9.00MB  
           256G          18.18MB          18.17MB  
           512G          36.54MB          36.53MB  
             1T          73.27MB          73.23MB  
             2T         146.71MB         146.64MB  
             4T         293.59MB         293.45MB  
             8T         587.34MB         587.08MB  
            16T        1174.85MB        1174.34MB  
            32T        2349.86MB        2348.84MB  
            64T        4699.90MB        4697.87MB  
           128T        9399.93MB        6413.20MB  
           256T       15173.68MB       12647.30MB  
Memory usage for fsck.gfs2 @ 80% full
        FS Size  3.1.8-6.el7/fsc  3.1.9-2.el7/fsc  
            16G           3.89MB           1.96MB  
            32G           7.16MB           3.26MB  
            64G          13.84MB           6.82MB  
           128G          27.17MB          12.07MB  
           256G          53.66MB          23.22MB  
           512G         106.78MB          45.62MB  
             1T         212.83MB          90.96MB  
             2T         425.07MB         179.81MB  
             4T         849.71MB         358.11MB  
             8T        1678.98MB         694.16MB  
            16T        3336.94MB        1419.47MB  
            32T        6654.73MB        2801.54MB  
            64T       13276.15MB        5056.02MB  
           128T       26539.59MB        9539.27MB  
           256T       43733.45MB       18164.10MB
Comment 17 errata-xmlrpc 2016-11-04 02:30:17 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2438.html
