Bug 1268045
| Summary: | GFS2: fsck.gfs2 requires too much memory on large file systems |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Component: | gfs2-utils |
| Version: | 7.1 |
| Priority: | high |
| Severity: | unspecified |
| Status: | CLOSED ERRATA |
| Reporter: | Nate Straz <nstraz> |
| Assignee: | Robert Peterson <rpeterso> |
| QA Contact: | cluster-qe <cluster-qe> |
| Docs Contact: | Milan Navratil <mnavrati> |
| CC: | agruenba, cluster-maint, cluster-qe, gfs2-maint, mnavrati, nstraz, rpeterso, swhiteho |
| Target Milestone: | rc |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | gfs2-utils-3.1.9-1.el7 |
| Doc Type: | Release Note |
| Doc Text: | fsck.gfs2 has been enhanced to require considerably less memory on large file systems. Prior to this update, the Global File System 2 (GFS2) file system checker, fsck.gfs2, required a large amount of memory to run on large file systems, and running fsck.gfs2 on file systems larger than 100 TB was therefore impractical. With this update, fsck.gfs2 has been enhanced to run in considerably less memory, which allows for better scalability and makes it practical to run fsck.gfs2 on much larger file systems. |
| Clone Of: | 1153316 |
| Last Closed: | 2016-11-04 06:30:17 UTC |
| Type: | Bug |
| Bug Depends On: | 1153316, 1184482, 1271674 |
| Bug Blocks: | 1111393, 1165285, 1497636 |
Description (Nate Straz, 2015-10-01 16:48:03 UTC)

Reassigning to myself.

Hey Nate, can I get some vmstat statistics for this? I'm curious whether the memory rises immediately due to the large number of bitmaps, or rises slowly throughout pass1 due to the directory tree or inode tree built by pass1, or something else. I'm assuming memory doesn't grow much after pass1 is complete, right?

Created attachment 1091949 [details]
Output from valgrind's massif memory profiling tool: fsck.gfs2 on a 10TB, 80%-full GFS2 file system

Here is some memory profiling using valgrind's massif tool.
Pass times from fsck.gfs2 output:
pass1 completed in 12h5m5.417s
pass1b completed in 0.000s
pass1c completed in 1d44m8.176s
pass2 completed in 14m29.265s
pass3 completed in 0.099s
pass4 completed in 3.894s
pass5 completed in 1m17.906s
check_statfs completed in 0.004s
Based on this information, the vast majority of memory is
taken for the inode tree:
struct inode_info
{
	struct osi_node node;
	struct gfs2_inum di_num;
	uint32_t di_nlink;      /* the number of links the inode
	                         * thinks it has */
	uint32_t counted_links; /* the number of links we've found */
};

struct gfs2_inum {
	__be64 no_formal_ino;
	__be64 no_addr;
};

struct osi_node {
	unsigned long osi_parent_color;
	struct osi_node *osi_left;
	struct osi_node *osi_right;
};
The inode tree lookup is used in these places:
link.c:24: ii = inodetree_find(ip->i_di.di_num.no_addr);
link.c:40: ii = inodetree_find(no.no_addr);
link.c:73: ii = inodetree_find(inode_no);
metawalk.c:88: ii = inodetree_find(blk);
pass1b.c:348: ii = inodetree_find(ip->i_di.di_num.no_addr);
pass2.c:188: ii = inodetree_find(entry.no_addr);
pass2.c:597: ii = inodetree_find(entry->no_addr);
pass4.c:186: if (!(ii = inodetree_find(lf_dip->i_di.di_num.no_addr))) {
There isn't much room for "give" here. If we change it from
a rb_tree to a linked list, it will crush performance, unless
we go with a hash table of linked lists, which may be acceptable.
Another thought is that we can investigate whether we can get
away with ONLY adding directory inodes to the tree, rather than
all inodes. That can potentially give us big savings.
I've analyzed the code with regard to the biggest memory hog, the inode tree. Of the 84.93% of memory in use, massif attributes:

->52.36% (2,484,701,408B) 0x40835C: inodetree_insert (inode_hash.c:50)
->17.91% (849,760,776B) 0x426E2E: bget (buf.c:30)
->14.14% (671,089,146B) 0x42227B: gfs2_bmap_create (util.c:531)
->00.52% (24,529,057B) in 1+ places, all below ms_print's threshold (01.00%)

Based on this, I have a plan that would greatly reduce fsck.gfs2's memory requirements.

1. The biggest saving will be gained by using the inodetree ONLY for inodes that have a link count greater than 1. This would be directories and hard linked dinodes.

2. In pass2, fsck.gfs2 currently counts inode links for every directory's dentries to figure out what's linked and what's not. Instead, we can keep a bitmap, just like the main bitmap, and use it to indicate all dinodes that have a link count of "1", which should be the vast majority. When we process a dentry, look in that new bitmap: if the bit is 0, set it to 1. If it's already 1, we need a way to keep a larger count, so fall back to the inodetree. In theory there shouldn't be nearly as many inodes in the tree, so it will take up a lot less space. Call it 50% savings.

3. We can also add a link counter specifically for directories and stick it in the dirtree, because we already know they will have a link count greater than 1. That way, the inodetree will ONLY contain non-directory dinodes that have a link count greater than 1, which really ought to be very few.

4. When pass1 is complete, the blockmap should be completely in sync with the bitmap. At that point, we should be able to immediately skip to pass5 and fix up any bits that don't match. The purpose of pass5 is basically to "free" blocks that weren't found as "referenced" in pass1. Once this process is complete, we should be able to free the blockmap altogether and do all bit manipulation on the rgrp bitmaps directly. That frees up another 14% of the memory, which means pass2 can re-use that memory for the new bitmap described in item 2 above.

Just to clarify my last comment, the new order of things would be:
1. Pass1 - Gets the blockmap in sync with the bitmaps
2. Pass5 - Gets the blockmap finalized, rgrps written out
At this point we free the blockmap altogether and use the rgrps
3. Pass1b - Same as before, but use rgrps for any bit changes needed
4. Pass1c - Same as before, but use rgrps for any bit changes needed
5. Pass2 - Allocate a new blockmap for all inodes with link count 1.
Link counts for directories are kept in the directory tree, not the inode tree.
If link count is already 1, insert an entry into the inodetree
because we've got a special exception to the rule.
6. Pass3 - Same as before
7. Pass4 - Instead of traversing the inodetree, traverse the inode
bitmap, just like pass1 does today. For every dinode:
1. Check if it's in the dirtree, and if so, verify its link count
from that, then continue.
2. Check if it's in the bitmap as link count 1. If so, verify its
inode link count is also 1, then continue.
3. Check if it's in the inodetree, and if so, verify its link
count from that, then continue.
8. The rest (syncing statfs) is business as usual.
Created attachment 1154716 [details]
Collection of 40 memory patches
This tarball contains the following memory-related patches:
Bob Peterson (40):
fsck.gfs2: Move pass5 to immediately follow pass1
fsck.gfs2: Convert block_type to bitmap_type after pass1 and 5
fsck.gfs2: Change bitmap_type variables to int
fsck.gfs2: Use di_entries to determine if lost+found was created
fsck.gfs2: pass1b shouldn't complain about non-bitmap blocks
fsck.gfs2: Change all fsck_blockmap_set to fsck_bitmap_set
fsck.gfs2: Move set_ip_blockmap to pass1
fsck.gfs2: Remove unneeded parameter instree from set_ip_blockmap
fsck.gfs2: Move leaf repair to pass2
fsck.gfs2: Eliminate astate code
fsck.gfs2: Move reprocess code to pass1
fsck.gfs2: Separate out functions that may only be done after pass1
fsck.gfs2: Divest check_metatree from fsck_blockmap_set
fsck.gfs2: eliminate fsck_blockmap_set from check_eattr_entries
fsck.gfs2: Move blockmap stuff to pass1.c
fsck: make pass1 call bitmap reconciliation AKA pass5
fsck.gfs2: make blockmap global variable only to pass1
fsck.gfs2: Add wrapper function pass1_check_metatree
fsck.gfs2: pass counted_links into fix_link_count in pass4
fsck.gfs2: refactor pass4 function scan_inode_list
fsck.gfs2: More refactoring of pass4 function scan_inode_list
fsck.gfs2: Fix white space problems
fsck.gfs2: move link count info for directories to directory tree
fsck.gfs2: Use bitmaps instead of linked list for inodes w/nlink == 1
fsck.gfs2: Refactor check_n_fix_bitmap to make it more readable
fsck.gfs2: adjust rgrp inode count when fixing bitmap
fsck.gfs2: blocks cannot be UNLINKED in pass1b or after that
fsck.gfs2: Add error checks to get_next_leaf
fsck.gfs2: re-add a non-allocating repair_leaf to pass1
libgfs2: Allocate new GFS1 metadata as type 3, not type 1
fsck.gfs2: Undo partially done metadata records
fsck.gfs2: Eliminate redundant code in _fsck_bitmap_set
fsck.gfs2: Fix inode counting bug
fsck.gfs2: Adjust bitmap for lost+found after adding to dirtree
GFS2: Add initialization checks for GFS1 used metadata
fsck.gfs2: Use BLKST constants to make pass5 more clear
fsck.gfs2: Fix GFS1 "used meta" accounting bug
fsck.gfs2: pass1b is too noisy wrt gfs1 non-dinode metadata
fsck.gfs2: Fix rgrp dinode accounting bug
fsck.gfs2: Fix rgrp accounting in check_n_fix_bitmap
A few of them are cleanups, and a few are bug fixes that may be
done under the guise of another bz. In practice, most of the bugs
I've fixed would never be seen on customer systems, because pass5
currently runs last and artificially compensates for them: with
the old bug-then-pass5 order, the damage is never visible. But the
patch set necessarily moves pass5 to run right after pass1, so the
new pass5-then-bug order would expose those bugs, and they need to
be fixed.
These patches all passed my fsck nightmare test. My plan is to push the 40 patches to the upstream gfs2-utils git tree, and rely upon the fact that bug #1271674 will pull in those changes.

Created attachment 1155403 [details]
My latest fsck nightmare test script
For the record, this is my latest version of the fsck nightmare
tests that passed. All the metadata should be on both systems
gfs-i24c-01 and gfs-a16c-01 (as well as elsewhere).
These patches are all pushed to the master branch of the upstream gfs2-utils git repo. They should be picked up automatically as per comment #9. Changing status to POST.

A comparison of the RHEL 7.2 fsck.gfs2 vs the RHEL 7.3 fsck.gfs2 shows memory usage has improved again for populated file systems. We're now under 20GB for a 256TB file system.
Memory usage for fsck.gfs2 @ empty
FS Size 3.1.8-6.el7/fsc 3.1.9-2.el7/fsc
16G 1.35MB 1.34MB
32G 2.12MB 2.11MB
64G 4.42MB 4.41MB
128G 9.01MB 9.00MB
256G 18.18MB 18.17MB
512G 36.54MB 36.53MB
1T 73.27MB 73.23MB
2T 146.71MB 146.64MB
4T 293.59MB 293.45MB
8T 587.34MB 587.08MB
16T 1174.85MB 1174.34MB
32T 2349.86MB 2348.84MB
64T 4699.90MB 4697.87MB
128T 9399.93MB 6413.20MB
256T 15173.68MB 12647.30MB
Memory usage for fsck.gfs2 @ 80% full
FS Size 3.1.8-6.el7/fsc 3.1.9-2.el7/fsc
16G 3.89MB 1.96MB
32G 7.16MB 3.26MB
64G 13.84MB 6.82MB
128G 27.17MB 12.07MB
256G 53.66MB 23.22MB
512G 106.78MB 45.62MB
1T 212.83MB 90.96MB
2T 425.07MB 179.81MB
4T 849.71MB 358.11MB
8T 1678.98MB 694.16MB
16T 3336.94MB 1419.47MB
32T 6654.73MB 2801.54MB
64T 13276.15MB 5056.02MB
128T 26539.59MB 9539.27MB
256T 43733.45MB 18164.10MB
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2438.html