+++ This bug was initially created as a clone of Bug #1153316 +++

Description of problem:
fsck.gfs2 uses too much memory on large (>100TB) file systems.
Reassigning to myself.
Hey Nate, can I get some vmstat statistics for this? I'm curious if the memory rises immediately due to the large number of bitmaps, or if it rises slowly throughout pass1, due to the directory tree or inode tree built by pass1, or something else. I'm assuming memory doesn't grow much after pass1 is complete, right?
Created attachment 1091949 [details]
Output from valgrind's massif memory profiling tool, 10TB 80% full GFS2 fsck.gfs2

Here is some memory profiling using valgrind's massif tool.

Pass times from fsck.gfs2 output:

pass1 completed in 12h5m5.417s
pass1b completed in 0.000s
pass1c completed in 1d44m8.176s
pass2 completed in 14m29.265s
pass3 completed in 0.099s
pass4 completed in 3.894s
pass5 completed in 1m17.906s
check_statfs completed in 0.004s
Based on this information, the vast majority of memory is taken for the inode tree:

    struct inode_info
    {
            struct osi_node node;
            struct gfs2_inum di_num;
            uint32_t di_nlink;       /* the number of links the inode
                                      * thinks it has */
            uint32_t counted_links;  /* the number of links we've found */
    };

    struct gfs2_inum {
            __be64 no_formal_ino;
            __be64 no_addr;
    };

    struct osi_node {
            unsigned long osi_parent_color;
            struct osi_node *osi_left;
            struct osi_node *osi_right;
    };

The inode tree lookup is used in these places:

    link.c:24:      ii = inodetree_find(ip->i_di.di_num.no_addr);
    link.c:40:      ii = inodetree_find(no.no_addr);
    link.c:73:      ii = inodetree_find(inode_no);
    metawalk.c:88:  ii = inodetree_find(blk);
    pass1b.c:348:   ii = inodetree_find(ip->i_di.di_num.no_addr);
    pass2.c:188:    ii = inodetree_find(entry.no_addr);
    pass2.c:597:    ii = inodetree_find(entry->no_addr);
    pass4.c:186:    if (!(ii = inodetree_find(lf_dip->i_di.di_num.no_addr))) {

There isn't much room for "give" here. If we change it from a rb_tree to a linked list, it will crush performance, unless we go with a hash table of linked lists, which may be acceptable.

Another thought is that we can investigate whether we can get away with ONLY adding directory inodes to the tree, rather than all inodes. That can potentially give us big savings.
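To put a rough number on the per-inode cost, here is a minimal sketch that reproduces the three structures quoted above and multiplies their size by an inode count. It assumes uint64_t as a stand-in for __be64 so it builds outside the gfs2-utils tree. The helpers inode_tree_bytes and print_inode_tree_estimate are hypothetical names, and the estimate ignores the allocator's per-allocation overhead, so real usage will be somewhat higher.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the structures quoted above; uint64_t replaces __be64
 * (an assumption made so the sketch builds on its own). */
struct osi_node {
	unsigned long osi_parent_color;
	struct osi_node *osi_left;
	struct osi_node *osi_right;
};

struct gfs2_inum {
	uint64_t no_formal_ino;
	uint64_t no_addr;
};

struct inode_info {
	struct osi_node node;
	struct gfs2_inum di_num;
	uint32_t di_nlink;
	uint32_t counted_links;
};

/* Hypothetical helper: lower bound on inode tree memory for n_inodes
 * entries, excluding allocator overhead. */
unsigned long long inode_tree_bytes(unsigned long long n_inodes)
{
	return n_inodes * (unsigned long long)sizeof(struct inode_info);
}

/* Hypothetical helper: print the estimate for a given inode count. */
void print_inode_tree_estimate(unsigned long long n_inodes)
{
	printf("%zu bytes per node, %llu bytes for %llu inodes\n",
	       sizeof(struct inode_info), inode_tree_bytes(n_inodes), n_inodes);
}
```

On an LP64 platform the node works out to 48 bytes (24 for the tree linkage, 16 for the inode number, 8 for the two counters), so half of every entry is bookkeeping rather than link-count data.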
I've analyzed the code with regard to the biggest memory hog, the inode tree:

84.93% memory is used. Of that:

->52.36% (2,484,701,408B) 0x40835C: inodetree_insert (inode_hash.c:50)
->17.91% (849,760,776B) 0x426E2E: bget (buf.c:30)
->14.14% (671,089,146B) 0x42227B: gfs2_bmap_create (util.c:531)
->00.52% (24,529,057B) in 1+ places, all below ms_print's threshold (01.00%)

Based on this, I have a plan that would greatly reduce fsck.gfs2's memory requirements.

1. The biggest saving will be gained by using the inodetree ONLY for inodes that have a link count greater than 1. This would be directories and hard linked dinodes.

2. In pass2, it currently counts inode links for every directory's dentries to figure out what's linked and what's not. Instead, we can keep a bitmap, just like the main bitmap, and use that bitmap to indicate all dinodes that have a link count of "1", which should be the vast majority. When we process a dentry, look in that new bitmap: If it's 0, set it to 1. If it's 1, we need a way to keep a larger count, so fall back to the inodetree. In theory there shouldn't be nearly as many inodes in the tree, so it will take up a lot less space. Call it 50% savings.

3. We can also add a link counter specifically for directories and stick it in the dirtree because we already know they will already have a link count greater than 1. That way, the inodetree will ONLY contain non-directory dinodes that have a link counter greater than 1, which really ought to be very few.

4. When pass1 is complete, the blockmap should be completely in sync with the bitmap. At that point, we should be able to immediately skip to pass5 and fix up any bits that don't match. The purpose of pass5 is basically to "free" blocks that weren't found as "referenced" in pass1. Once this process is complete, we should be able to free the blockmap altogether and do all bit manipulation on the rgrp bitmaps directly. That frees up another 14% of the memory, which means that in its place, pass2 can re-use that memory for the new bitmap I talked about in item 2 above.
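Item 2 of the plan can be sketched roughly as follows. This is only an illustration under stated assumptions, not the gfs2-utils code: every name here (link_counter, link_overflow, link_counter_add and so on) is hypothetical, and a plain linked list stands in for the inodetree so the example stays short. The first dentry seen for a block sets a bit; only a second or later dentry allocates an overflow entry.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an inodetree entry; the real tool would use
 * its rb-tree rather than a linked list. */
struct link_overflow {
	uint64_t blk;
	uint32_t counted_links;
	struct link_overflow *next;
};

struct link_counter {
	uint8_t *seen_once;             /* one bit per block: at least one link */
	uint64_t nblocks;
	struct link_overflow *overflow; /* only inodes seen more than once */
};

int link_counter_init(struct link_counter *lc, uint64_t nblocks)
{
	lc->seen_once = calloc((nblocks + 7) / 8, 1);
	lc->nblocks = nblocks;
	lc->overflow = NULL;
	return lc->seen_once ? 0 : -1;
}

void link_counter_free(struct link_counter *lc)
{
	while (lc->overflow) {
		struct link_overflow *next = lc->overflow->next;
		free(lc->overflow);
		lc->overflow = next;
	}
	free(lc->seen_once);
	lc->seen_once = NULL;
}

static struct link_overflow *overflow_find(const struct link_counter *lc,
					   uint64_t blk)
{
	struct link_overflow *ov;

	for (ov = lc->overflow; ov; ov = ov->next)
		if (ov->blk == blk)
			return ov;
	return NULL;
}

/* Record one directory entry pointing at blk. Returns 0 on success. */
int link_counter_add(struct link_counter *lc, uint64_t blk)
{
	struct link_overflow *ov;
	uint8_t mask = (uint8_t)(1u << (blk % 8));

	if (blk >= lc->nblocks)
		return -1;
	if (!(lc->seen_once[blk / 8] & mask)) {
		lc->seen_once[blk / 8] |= mask; /* first link: bitmap only */
		return 0;
	}
	ov = overflow_find(lc, blk);
	if (!ov) {
		ov = malloc(sizeof(*ov));
		if (!ov)
			return -1;
		ov->blk = blk;
		ov->counted_links = 1; /* the link already held in the bitmap */
		ov->next = lc->overflow;
		lc->overflow = ov;
	}
	ov->counted_links++;
	return 0;
}

/* Number of links counted so far for blk (0 if never seen). */
uint32_t link_counter_get(const struct link_counter *lc, uint64_t blk)
{
	const struct link_overflow *ov;

	if (blk >= lc->nblocks ||
	    !(lc->seen_once[blk / 8] & (1u << (blk % 8))))
		return 0;
	ov = overflow_find(lc, blk);
	return ov ? ov->counted_links : 1;
}
```

With this layout an inode with a single link costs one bit rather than a full tree node, and the overflow structure only grows with the number of hard-linked inodes.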
Just to clarify my last comment: The new order of things would be:

1. Pass1 - Gets the blockmap in sync with the bitmaps
2. Pass5 - Gets the blockmap finalized, rgrps written out
   At this point we free the blockmap altogether and use the rgrps
3. Pass1b - Same as before, but use rgrps for any bit changes needed
4. Pass1c - Same as before, but use rgrps for any bit changes needed
5. Pass2 - Allocate a new blockmap for all inodes with link count 1.
   Link counts for directories are kept in the directory tree, not the inode tree.
   If link count is already 1, insert an entry into the inodetree because we've got a special exception to the rule.
6. Pass3 - Same as before
7. Pass4 - Instead of traversing the inodetree, traverse the inode bitmap, just like pass1 does today. For every dinode:
   1. Check if it's in the dirtree, and if so, verify its link count from that, then continue.
   2. Check if it's in the bitmap as link count 1. If so, verify its inode link count is also 1, then continue.
   3. Check if it's in the inodetree, and if so, verify its link count from that, then continue.
8. The rest (syncing statfs) is business as usual.
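The three-step check proposed for pass4 could look roughly like the sketch below. All names (link_lookup, link_verdict, check_dinode_links) are hypothetical, and the lookup results are passed in as plain data rather than read from the real dirtree, bitmap and inodetree. Two assumptions are worth flagging: the bitmap is taken to report "link count 1" only for inodes that were not moved to the inodetree, and a dinode found in none of the three sources is reported as unreferenced, which the comment above does not spell out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of what the three sources know about one dinode. */
struct link_lookup {
	bool in_dirtree;
	uint32_t dirtree_links;   /* valid only if in_dirtree */
	bool bitmap_one;          /* bitmap records exactly one link */
	bool in_inodetree;
	uint32_t inodetree_links; /* valid only if in_inodetree */
};

enum link_verdict {
	LINKS_OK,
	LINKS_MISMATCH,
	LINKS_UNREFERENCED        /* assumption: not found in any source */
};

/* Compare the on-disk link count di_nlink against the counted value,
 * consulting the sources in the order listed for pass4. */
enum link_verdict check_dinode_links(const struct link_lookup *lk,
				     uint32_t di_nlink)
{
	if (lk->in_dirtree)
		return lk->dirtree_links == di_nlink ? LINKS_OK
						     : LINKS_MISMATCH;
	if (lk->bitmap_one)
		return di_nlink == 1 ? LINKS_OK : LINKS_MISMATCH;
	if (lk->in_inodetree)
		return lk->inodetree_links == di_nlink ? LINKS_OK
						       : LINKS_MISMATCH;
	return LINKS_UNREFERENCED;
}
```

The ordering puts the cheapest and most common cases first: directories come from the dirtree, ordinary files resolve with a single bit test, and only hard-linked non-directories reach the tree lookup.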
Created attachment 1154716 [details]
Collection of 40 memory patches

This tarball contains the following memory-related patches:

Bob Peterson (40):
  fsck.gfs2: Move pass5 to immediately follow pass1
  fsck.gfs2: Convert block_type to bitmap_type after pass1 and 5
  fsck.gfs2: Change bitmap_type variables to int
  fsck.gfs2: Use di_entries to determine if lost+found was created
  fsck.gfs2: pass1b shouldn't complain about non-bitmap blocks
  fsck.gfs2: Change all fsck_blockmap_set to fsck_bitmap_set
  fsck.gfs2: Move set_ip_blockmap to pass1
  fsck.gfs2: Remove unneeded parameter instree from set_ip_blockmap
  fsck.gfs2: Move leaf repair to pass2
  fsck.gfs2: Eliminate astate code
  fsck.gfs2: Move reprocess code to pass1
  fsck.gfs2: Separate out functions that may only be done after pass1
  fsck.gfs2: Divest check_metatree from fsck_blockmap_set
  fsck.gfs2: eliminate fsck_blockmap_set from check_eattr_entries
  fsck.gfs2: Move blockmap stuff to pass1.c
  fsck: make pass1 call bitmap reconciliation AKA pass5
  fsck.gfs2: make blockmap global variable only to pass1
  fsck.gfs2: Add wrapper function pass1_check_metatree
  fsck.gfs2: pass counted_links into fix_link_count in pass4
  fsck.gfs2: refactor pass4 function scan_inode_list
  fsck.gfs2: More refactoring of pass4 function scan_inode_list
  fsck.gfs2: Fix white space problems
  fsck.gfs2: move link count info for directories to directory tree
  fsck.gfs2: Use bitmaps instead of linked list for inodes w/nlink == 1
  fsck.gfs2: Refactor check_n_fix_bitmap to make it more readable
  fsck.gfs2: adjust rgrp inode count when fixing bitmap
  fsck.gfs2: blocks cannot be UNLINKED in pass1b or after that
  fsck.gfs2: Add error checks to get_next_leaf
  fsck.gfs2: re-add a non-allocating repair_leaf to pass1
  libgfs2: Allocate new GFS1 metadata as type 3, not type 1
  fsck.gfs2: Undo partially done metadata records
  fsck.gfs2: Eliminate redundant code in _fsck_bitmap_set
  fsck.gfs2: Fix inode counting bug
  fsck.gfs2: Adjust bitmap for lost+found after adding to dirtree
  GFS2: Add initialization checks for GFS1 used metadata
  fsck.gfs2: Use BLKST constants to make pass5 more clear
  fsck.gfs2: Fix GFS1 "used meta" accounting bug
  fsck.gfs2: pass1b is too noisy wrt gfs1 non-dinode metadata
  fsck.gfs2: Fix rgrp dinode accounting bug
  fsck.gfs2: Fix rgrp accounting in check_n_fix_bitmap

A few of them are cleanups, and a few are bug fixes that may be done under the guise of another bz. Most of the bugs I've fixed aren't going to be found on customer systems in practice, because pass5 is currently performed last, and that artificially compensates for bugs: a bug followed by pass5 will never be seen. The patch set necessarily changes the order so that the pass5 processing is done right after pass1, so with the new order, pass5 followed by the bug, the bug is exposed and needs to be fixed.
These patches all passed my fsck nightmare test.
My plan is to push the 40 patches to the upstream gfs2-utils git tree, and rely upon the fact that bug #1271674 will pull in those changes.
Created attachment 1155403 [details]
My latest fsck nightmare test script

For the record, this is my latest version of the fsck nightmare tests that passed. All the metadata should be on both systems gfs-i24c-01 and gfs-a16c-01 (as well as elsewhere).
These patches are all pushed to the master branch of the upstream gfs2-utils git repo. They should be picked up automatically as per comment #9. Changing status to POST.
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=498444
A comparison of the rhel7.2 fsck.gfs2 vs rhel7.3 fsck.gfs2 shows memory usage has improved again for populated file systems. We're now under 20GB for a 256TB file system.

Memory usage for fsck.gfs2 @ empty

FS Size   3.1.8-6.el7/fsc   3.1.9-2.el7/fsc
16G       1.35MB            1.34MB
32G       2.12MB            2.11MB
64G       4.42MB            4.41MB
128G      9.01MB            9.00MB
256G      18.18MB           18.17MB
512G      36.54MB           36.53MB
1T        73.27MB           73.23MB
2T        146.71MB          146.64MB
4T        293.59MB          293.45MB
8T        587.34MB          587.08MB
16T       1174.85MB         1174.34MB
32T       2349.86MB         2348.84MB
64T       4699.90MB         4697.87MB
128T      9399.93MB         6413.20MB
256T      15173.68MB        12647.30MB

Memory usage for fsck.gfs2 @ 80% full

FS Size   3.1.8-6.el7/fsc   3.1.9-2.el7/fsc
16G       3.89MB            1.96MB
32G       7.16MB            3.26MB
64G       13.84MB           6.82MB
128G      27.17MB           12.07MB
256G      53.66MB           23.22MB
512G      106.78MB          45.62MB
1T        212.83MB          90.96MB
2T        425.07MB          179.81MB
4T        849.71MB          358.11MB
8T        1678.98MB         694.16MB
16T       3336.94MB         1419.47MB
32T       6654.73MB         2801.54MB
64T       13276.15MB        5056.02MB
128T      26539.59MB        9539.27MB
256T      43733.45MB        18164.10MB
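For anyone who wants the savings as a percentage rather than raw megabytes, a small helper along these lines would do. The function names percent_reduction and print_reduction are hypothetical, and the inputs are simply the before and after figures from the rows above.

```c
#include <stdio.h>

/* Percentage by which after_mb is smaller than before_mb. */
double percent_reduction(double before_mb, double after_mb)
{
	return 100.0 * (before_mb - after_mb) / before_mb;
}

/* Print one comparison row with its relative saving. */
void print_reduction(const char *fs_size, double before_mb, double after_mb)
{
	printf("%-6s %10.2fMB -> %10.2fMB  (%.1f%% less)\n",
	       fs_size, before_mb, after_mb,
	       percent_reduction(before_mb, after_mb));
}
```

For the 256T row at 80% full, percent_reduction(43733.45, 18164.10) comes to roughly 58%, which is in line with the "call it 50% savings" estimate made earlier in the thread.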
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-2438.html