Ric was running some tests of fsck by filling a 1T filesystem with 20KB files,
using his fs_mark benchmark (in fedora9, epel, and rawhide):
fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0 -l
This creates around 40-50 million inodes before the filesystem fills, depending
on the fs.
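As a rough cross-check on that inode count, dividing the capacity by the
20480-byte file size from the -s option gives an upper bound on how many
files fit. The sketch below is an illustration only: max_files is a
hypothetical helper, and whether "1T" means 10^12 or 2^40 bytes is an
assumption either way.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: upper bound on the number of fixed-size files
 * that fit in a given capacity, ignoring all metadata overhead. */
static uint64_t max_files(uint64_t capacity_bytes, uint64_t file_bytes)
{
    assert(file_bytes > 0);
    return capacity_bytes / file_bytes;
}
```

That works out to roughly 48.8 million files for a decimal terabyte and
53.7 million for a binary one, before any metadata overhead, which is in
the same range as the 40-50 million reported above.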
For comparison, on my test box, ext3 fsck ran in 60s, xfs in 12s, ext4 in 6s.
I stopped fsck.gfs2 after about 19 hours somewhere inside pass2 (I did not get
any % complete notices, so I don't know if that means it was still in early
phases or what). pass1 took on the order of 17 hours.
I blktraced fsck.gfs2 for about 20 mins and throughput was less than 0.5MB/s
(I'll attach a graph)
Created attachment 311755 [details]
seekwatcher during pass2
A semi-random datapoint; about 20 mins of seekwatcher during pass2.
The writes are interesting too, I'd not have expected much write activity for a
I'd have not expected to see _any_ writes :-) It does look like it's a bug, and
it sounds as if it won't be too tricky to speed that phase up anyway.
I'm guessing that the main problem is that your fsck ran out of
memory and started swapping. I've devised some plans to reduce
the memory footprint and I'll do some profiling as well to see
where we can make other improvements.
It wouldn't be swapping to the device being repaired, though....
Sorry if I misled you. I didn't mean to imply that the writes were
related to swapping. I'm just saying that the most common reason
for fsck.gfs2 to run very slowly is when it runs out of ram and
starts swapping memory to disk.
The problem with fsck.gfs2 writing to the disk is actually not
surprising, given what I know about how it works. I just opened
bug #500484 to fix that issue. It's a complex problem, so I'd like
to treat it separately from this one.
So to summarize: fsck.gfs2 has three problems:
(1) It uses way too much memory, and often that causes it to slow down.
(2) It has its own problems with slowness
(3) It writes to the file system when it should not.
My plan is to use this bug record to fix problems #1 and #2, and
bug #500484 to work on #3.
(In reply to comment #5)
> Sorry if I misled you. I didn't mean to imply that the writes were
> related to swapping. I'm just saying that the most common reason
> for fsck.gfs2 to run very slowly is when it runs out of ram and
> starts swapping memory to disk.
Gotcha, my misunderstanding.
Without meddling too much I'd add
4) the read IO that it does doesn't seem very linear, and seeks kill....
maybe, based on the attachment in comment #1 ?
Created attachment 343669 [details]
This preliminary patch is a starting point for disentangling the
issues fsck has with libgfs2/buf.c. Rather than read in all the
rgrp bitmaps into memory, all in a non-volatile linked list, this
patch keeps the same linked list of rgrps, but only reads in the
bitmaps on an as-needed basis, and only for short-term. This is
also an attempt to eliminate the fairly new non-volatile linked
buffer list and change all the code in gfs2-utils so that it
does not depend on the list being non-volatile. The whole
non-volatile thing was a stop-gap temporary solution so that we
could minimize the possibility of regression. In addition to
getting rid of the non-volatile linked list, it also reduces the
remaining linked list of buffers from 128MB to 1MB so that far
fewer buffers will be kept in memory. That could drive a lot of
bugs out of their hiding places.
This could have a negative performance impact, because some code that
used to assume that given buffers were always in memory will now have
to occasionally re-read them from disk. However, it may also have a
positive performance impact because there are far fewer buffers on
the linked list to run when searching, and the memory usage should be
considerably less, which might keep fsck.gfs2 from swapping.
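The as-needed bitmap handling described above might look roughly like the
following. This is a minimal sketch, not the actual libgfs2 code: struct
fake_dev, struct rgrp, rgrp_bits_get and rgrp_bits_put are all hypothetical
names, and the "device" is just a byte array so the example stays
self-contained.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the block device: a flat byte array. */
struct fake_dev {
    const uint8_t *data;
    size_t size;
    unsigned long reads;    /* number of simulated disk reads */
};

/* Hypothetical resource-group record. The record itself stays on the
 * rgrp list for the whole run; only the bitmap comes and goes. */
struct rgrp {
    size_t offset;          /* where the bitmap lives on the "device" */
    size_t length;          /* bitmap size in bytes */
    uint8_t *bits;          /* NULL while the bitmap is not resident */
};

/* Make the bitmap resident, reading it only if it is not already. */
static int rgrp_bits_get(struct fake_dev *dev, struct rgrp *rg)
{
    assert(dev != NULL && rg != NULL);
    if (rg->bits)
        return 0;                       /* already in memory */
    if (rg->offset > dev->size || rg->length > dev->size - rg->offset)
        return -1;                      /* bitmap lies outside the device */
    rg->bits = malloc(rg->length);
    if (!rg->bits)
        return -1;
    memcpy(rg->bits, dev->data + rg->offset, rg->length);
    dev->reads++;
    return 0;
}

/* Drop the bitmap again once the caller is done with it. */
static void rgrp_bits_put(struct rgrp *rg)
{
    free(rg->bits);
    rg->bits = NULL;
}
```

The reads counter makes the trade-off visible: a second get after a put
costs another read, which is the re-read penalty mentioned above, while
the memory held at any one time is limited to the bitmaps currently in use.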
Since the changes are pervasive, we need to thoroughly test all the
gfs2-utils functions as well as we can. I've only tested fsck.gfs2
with this patch, so buyer beware. The patch is likely to be revised.
Additional work needs to be done: The bitmaps in memory are still the
biggest memory hog of the whole process. When I ran fsck.gfs2 today
on a 9TB file system, it used 1.2GB of memory and most of that was
for the huge in-core bitmap.
We've talked about making fsck.gfs2 work on a per-rg basis and
Steve had some good ideas for improving it as well. So this patch
is just a start.
Created attachment 344521 [details]
Preliminary patch #2
This patch fixes a few bugs that the first patch had.
Created attachment 344527 [details]
Preliminary patch #2--upstream
This is the upstream version of the Preliminary Patch. Actually,
I'm doing my development on the upstream code and testing it on
Fedora (system "camel"). Preliminary results are encouraging:
mkfs on 1TB stays at 31s (but uses much less memory). The fsck
on 1TB goes from 1m26s down to 50s (and uses less memory). This
patch clears the way for me to eliminate buf.c in favor of using
mmap buffer management. I believe Steve's suggestion of letting
vfs manage the buffers will gain us a significant performance
Created attachment 365112 [details]
Preliminary patch #3--upstream
I know this is outrageously huge, but it's also a major
performance improvement. Performance gains were realized
by several methods, but the major improvements are these:
1. buf.c was given a lobotomy. There are no more linked lists
of buffers to search through, which saves a lot of time.
Instead, the code is just a means for reading and writing
on an as-needed basis. The penalty is that the changes are
pervasive; virtually every user-space tool is affected.
And the code that assumed the linked lists were in place had
to be reworked so it no longer has that requirement.
Which could mean there are bugs. Many areas of gfs2-utils
have not been tested with the new code.
2. The duplicate blocks list has been totally scrapped for a
new red-black tree taken largely from the kernel's rbtree.c
and rbtree.h. This gives a huge performance gain in many
areas, but especially in the excruciatingly slow pass1b where
duplicate blocks are resolved. The code in pass1b has been
reworked so it only needs to make one pass through the file
system, and quickly scans the rbtree for what it needs.
3. The generic block status code (block_list.c) has been retooled
completely so that it only does nibble manipulation. It used
to keep track of blocks in three ways: (1) a nibble to indicate
block type, (2) a linked list for dinodes with extended
attributes, and (3) a linked list for blocks that are
duplicates. Now it only keeps (1), and that saves a tremendous
amount of time. In addition, block_list.c has been streamlined,
and fsck code that requires the other two lists now uses and
manipulates them on an as-needed basis only, rather than letting
block_list.c do all the work.
4. All RGs are kept in memory (as they were before) but now their
buffers (buf.c) are linked to the structure in memory, so it
never has to read and write the disk to get the information.
5. Almost all ondisk.c code has been transformed to use buffer
headers (buf.c) to ensure that data structure changes will
cause the buffer to be written to disk, and only if it is
really changed. (In other words, there should be no more writes
done to a file system unless absolutely necessary).
6. A bunch of small optimizations of the code have been made.
Basically, I profiled the code with valgrind, figured out
where we were wasting the most time, and retooled those areas.
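The nibble-only block map from item 3 can be pictured with a sketch like
the one below: one 4-bit type code per block, two blocks per byte. The
names (struct block_map, block_map_init, block_map_set, block_map_get) and
the layout (even blocks in the low nibble) are assumptions made for
illustration, not the real block_list.c interface.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical block map: one 4-bit type code per block. */
struct block_map {
    uint64_t nblocks;
    uint8_t *nibbles;       /* two blocks packed into each byte */
};

/* Allocate a zeroed map; nblocks must be non-zero. */
static int block_map_init(struct block_map *bm, uint64_t nblocks)
{
    assert(nblocks > 0);
    bm->nblocks = nblocks;
    bm->nibbles = calloc((size_t)((nblocks + 1) / 2), 1);
    return bm->nibbles ? 0 : -1;
}

/* Store a 4-bit type for a block without disturbing its neighbour. */
static void block_map_set(struct block_map *bm, uint64_t blk, uint8_t type)
{
    unsigned shift = (blk & 1) ? 4 : 0; /* odd blocks use the high nibble */
    uint8_t *byte;

    assert(blk < bm->nblocks && type <= 0x0f);
    byte = &bm->nibbles[blk / 2];
    *byte = (uint8_t)((*byte & ~(0x0f << shift)) | (type << shift));
}

/* Fetch the 4-bit type recorded for a block. */
static uint8_t block_map_get(const struct block_map *bm, uint64_t blk)
{
    unsigned shift = (blk & 1) ? 4 : 0;

    assert(blk < bm->nblocks);
    return (uint8_t)((bm->nibbles[blk / 2] >> shift) & 0x0f);
}
```

Both operations are constant-time shifts and masks with no list walking,
which is where the time saving would come from. At half a byte per block,
a 9TB file system with 4KB blocks (the block size is an assumption here)
would need roughly 1.1 to 1.2GB for such a map.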
This version of the code has _not_ finished testing. It still has
a lot of testing that needs to be done. I'm sure there are bugs.
The good news is that preliminary tests show significant speed
improvement. In one case, west-0406 (from Nate's "nsew" cluster)
went from 49 hours to fsck to 5 minutes 30 seconds. That's a
special case with a lot of very big data files. (This was converted
from gfs via gfs2_convert).
In another case, 1773738-metasave, I've never actually gotten
through fsck.gfs2 with the old fsck because it takes prohibitively
long. I've killed it after several hours. In one case, I let the
old version run overnight and after > 12 hours, it was 1% into pass1b.
With the new code, it gets all the way to pass2 within about an hour.
(Which is most of the way through the fsck).
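To illustrate why a tree helps pass1b so much, here is a minimal sketch of
duplicate-reference tracking keyed by block address. It does not use the
kernel-derived rbtree from the patch; POSIX tsearch()/tfind() stand in for
it (glibc implements them as a red-black tree, though POSIX does not
require that). struct dup_block, dup_add_ref and dup_refs are hypothetical
names.

```c
#include <assert.h>
#include <search.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one block that has been seen referenced. */
struct dup_block {
    uint64_t block;     /* block address, used as the tree key */
    unsigned refs;      /* number of references recorded so far */
};

/* Order records by block address. */
static int dup_cmp(const void *a, const void *b)
{
    const struct dup_block *x = a, *y = b;

    return (x->block > y->block) - (x->block < y->block);
}

/* Record one more reference to a block.
 * Returns the new reference count, or 0 if memory ran out. */
static unsigned dup_add_ref(void **root, uint64_t block)
{
    struct dup_block *d, **slot;

    assert(root != NULL);
    d = malloc(sizeof(*d));
    if (!d)
        return 0;
    d->block = block;
    d->refs = 0;
    slot = tsearch(d, root, dup_cmp);   /* finds or inserts */
    if (!slot) {
        free(d);
        return 0;
    }
    if (*slot != d)
        free(d);                        /* block was already in the tree */
    return ++(*slot)->refs;
}

/* Look up how many references a block has; 0 if it was never seen. */
static unsigned dup_refs(void *const *root, uint64_t block)
{
    struct dup_block key = { block, 0 };
    struct dup_block **slot = tfind(&key, root, dup_cmp);

    return slot ? (*slot)->refs : 0;
}
```

With a balanced tree each lookup takes a number of comparisons that grows
with the logarithm of the number of tracked blocks, whereas a linked list
has to be walked end to end, which adds up quickly when there are millions
of blocks to check.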
To be sure: This patch is still preliminary and proof-of-concept.
I'm sure there are bugs. But it's very close. The problem here
is that it takes a long time to test all the different scenarios,
especially with multiple-terabyte metadata sets.
*** Bug 500484 has been marked as a duplicate of this bug. ***
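The unwanted writes tracked in bug 500484 are what item 5 of preliminary
patch #3 targets: a buffer goes back to disk only if it was really
changed. A minimal sketch of that idea follows. It is not the actual
buf.c code; struct fake_dev, struct buf_head, bh_read, bh_update and
bh_release are hypothetical, and the tiny block size exists only to keep
the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16   /* tiny block size, just for this sketch */
#define NUM_BLOCKS 4

/* Hypothetical in-memory "device" that counts block writes. */
struct fake_dev {
    uint8_t blocks[NUM_BLOCKS][BLOCK_SIZE];
    unsigned long writes;
};

/* Hypothetical buffer head: block data plus a modified flag. */
struct buf_head {
    unsigned blkno;
    int modified;
    uint8_t data[BLOCK_SIZE];
};

/* Load a block into the buffer; a fresh buffer starts out clean. */
static void bh_read(struct fake_dev *dev, struct buf_head *bh, unsigned blkno)
{
    assert(blkno < NUM_BLOCKS);
    bh->blkno = blkno;
    bh->modified = 0;
    memcpy(bh->data, dev->blocks[blkno], BLOCK_SIZE);
}

/* Route every update through one function, so that storing bytes
 * identical to what is already there never marks the buffer modified. */
static void bh_update(struct buf_head *bh, size_t off,
                      const void *src, size_t len)
{
    assert(off <= BLOCK_SIZE && len <= BLOCK_SIZE - off);
    if (memcmp(bh->data + off, src, len) == 0)
        return;
    memcpy(bh->data + off, src, len);
    bh->modified = 1;
}

/* Write the buffer back only if something really changed. */
static void bh_release(struct fake_dev *dev, struct buf_head *bh)
{
    if (!bh->modified)
        return;
    memcpy(dev->blocks[bh->blkno], bh->data, BLOCK_SIZE);
    dev->writes++;
    bh->modified = 0;
}
```

Under this scheme a run that finds nothing to fix performs no writes at
all, which is the behavior one would expect from a read-mostly fsck.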
The fix isn't ready for 5.5. Retargeting for 5.6.
Created attachment 385954 [details]
First 20 STABLE3 patches
This tarball contains the first twenty gfs2-utils patches to
speed things up and fix a bunch of bugs I found in testing.
So far these patches are mostly "reform" patches. That is,
most of them merely reorganize, simplify and reshape the code.
Some of them improve performance though.
The patches that fix real bugs and really speed things up are
yet to come. (I'll post them when they've been separated out
from each other, which should be shortly.) My goal here is to
separate the "noise" from the fixes that make a real difference
because frankly, most of the "noise patches" are huge.
1. fb1fd632 Remove nvbuf_list and use fewer buffers
2. 3bf7c1db Eliminate bad_block linked block list
3. 132c5949 Simplify bitmap/block list structures
4. e7d40b55 Streamline the bitmap code by always using 4-bit size per block
5. 2c35613f Misc blocklist optimizations
6. 2d0a9030 Separate eattr_block list from the rest for efficiency
7. 30761ada gfs2: remove update_flags everywhere
8. a7b3bc3d fsck.gfs2: give comfort when processing lots of data blocks
9. a2e35f9d make query() count errors_found, errors_fixed
10. bcab5aac attach buffers to rgrp_list structs
11. 4d4250a0 Make struct_out functions operate on bh's
12. d97e46be Attach bh's to inodes
13. 179e8f85 gfs2: Remove buf_lists
14. efc48665 fsck.gfs2: Verify rgrps free space against bitmap
15. 4a50aa66 bit_map -> block_map for clarity
16. 0bd10812 Move duplicate code from libgfs2 to fsck.gfs2
17. f4b1e5e7 libgfs2, fsck.gfs2: simplify block_query code
18. 80c8c107 gfs2: libgfs2 and fsck.gfs2 cleanups
19. 0bb39a86 libgfs2: fs_bits speed up bitmap operations
20. 3247bc3e libgfs2: gfs2_log reform
Created attachment 386054 [details]
STABLE3 patches 21 - 30
This tarball contains STABLE3 patches 21 through 30. There are
still many more to come.
Created attachment 386247 [details]
STABLE3 patches 31 - 40
This tarball contains STABLE3 patches 31 through 40. I forgot to
list the previous set of patches, so I'm listing them here along
with this set of ten:
21. 7a89d6f fsck.gfs2: convert dup_list to a rbtree
22. f8a1efa fsck.gfs2: convert dir_info list to rbtree
23. 287cce1 fsck.gfs2: convert inode hash to rbtree
24. 7b32b47 fsck.gfs2: pass1 should use gfs2_special_add not _set
25. 17fa2f9 libgfs2: Remove unneeded sdp parameter in gfs2_block_set
26. 49944e7 libgfs2: dir_split_leaf needs to zero out the new leaf
27. 6cc2baf libgfs2: dir_split_leaf needs to check for allocation failure
28. 898b355 libgfs2: Set block range based on rgrps, not device size
29. 4926dcb fsck.gfs2: should use the libgfs2 is_system_directory
30. b718f95 fsck.gfs2: Journal replay should report what it's doing
31. 9391ded fsck.gfs2: fix directories that have odd number of pointers.
32. aedd06c libgfs2: Get rid of useless constants
33. bb0667a fsck.gfs2: link.c should log why it's making a change for debugging
34. dc990cf fsck.gfs2: Enforce consistent behavior in directory processing
35. bb05c32 fsck.gfs2: enforce consistency between bitmap and blockmap
36. a374bdf fsck.gfs2: metawalk needs to check for no valid leaf blocks
37. 17d9ff8 fsck.gfs2: metawalk was not checking many directories
38. 0a1b673 fsck.gfs2: separate check_data function in check_metatree
39. 0afb359 lost+found link count and connections were not properly managed
40. 8a5d6bf fsck.gfs2: reprocess lost+found and other inode metadata when blocks are added
There are still several more patches to come.
Created attachment 386732 [details]
STABLE3 patches 41 - 57
This tarball contains STABLE3 patches 41 through 57. Here is
a list of these patches:
41. bbde081 Misc cleanups
42. f56043a fsck.gfs2: Check for massive amounts of pointer corruption
43. af99475 fsck.gfs2: use gfs2_meta_inval vs. gfs2_inval_inode
44. da84bfd Eliminate unnecessary block_list from gfs2_edit
45. d57579f fsck.gfs2: rename gfs2_meta_other to gfs2_meta_rgrp.
46. c395eee Create a standard metadata delete interface
47. cb328f4 fsck.gfs2: cleanup: refactor pass3
48. 43b3b80 fsck.gfs2: Make pass1 undo its work for unrecoverable inodes
49. 0d098b2 fsck.gfs2: Overhaul duplicate reference processing
50. 0c8b1af fsck.gfs2: invalidate invalid mode inodes
51. 8697753 fsck.gfs2: Force intermediate lost+found inode updates
52. c490dd0 fsck.gfs2: Free metadata list memory we don't need
53. 8664ae2 fsck.gfs2: Don't add extended attrib blocks to list twice
54. 71c293b fsck.gfs2: small parameter passing optimization
55. 4b1b6e0 fsck.gfs2: Free, don't invalidate, dinodes with bad depth
56. aa34c47 Misc cleanups
57. c5f6413 fsck.gfs2: If journal replay fails, give option to reinitialize journal
I don't have any more patches planned at the moment.
I did one more patch to clean up white-space errors introduced by
the previous patches. These patches have passed the toughest of
my tests with flying colors. They have fixed all of the gfs2 metadata
sets I own that could fit within the boundaries of the 9TB device
on system kool.
All 58 patches have been pushed to the master branch of the
gfs2-utils git tree and the STABLE3 branch of the cluster git tree.
Most of the bugs I discovered exist in the original gfs_fsck, back
from the old days. This especially holds true for the bugs where
lost+found accounting and bitmap setting are done improperly.
Unfortunately, STABLE3 has diverged from the RHEL branches so much
that it will be a nightmare to try to get these patches to apply
to RHEL and even harder to go back to the original gfs_fsck.
Created attachment 394829 [details]
Here are all the crosswrite patches for RHEL55 in one tarball.
I'm testing them now and they may need some revisions.
Created attachment 395124 [details]
RHEL55 patches (two more)
This tarball is the same as the previous one, containing the
same 65 patches. However, in testing all the GFS2 metadata in
my collection, I discovered two more patches were necessary
(bringing it up to 67). They are:
43afa82 Convert check_statfs function to the new rgrp method
af9445e GFS2: Dinode #33129 (0x8169) has bad height Found 1, Expected >= 2
The first patch is necessary because it fixes a discrepancy between
the new code and code added since I started this project.
The second patch is a crosswrite from STABLE3 that had previously
not been in RHEL55, but is now necessary.
This version has passed all the tests I've thrown at it and is
ready to be QE tested. Due to limited disk space on my RHEL5
test box, I can't test bigger file systems with this code, which
excludes about a dozen metadata sets I have. However, the
STABLE3 equivalent code has been tested on them all (I have more
storage available there).
This pre-release version (WARNING: not QE tested) is available
on my people page at this location:
Created attachment 396079 [details]
This is another patch that has to be shipped. This fixes a problem
where mkfs.gfs2 was unable to make any gfs2 file systems.
Created attachment 396370 [details]
This patch fixes some regressions introduced by the previous
patches. With this patch and the previous ones, I was able
to run mkfs.gfs2, fsck.gfs2 and the growfs test successfully,
verifying they did the right things.
Created attachment 400558 [details]
osi_tree "each_safe" patch
There were a few places in fsck.gfs2 where each element of the
new rbtrees was being processed. Some of those places deleted
items from the rbtree, which meant I needed to use the equivalent
of osi_list_foreach_safe. In other words, I needed to prevent
rbtree deletes from interfering with the next() function.
The result was a segfault in fsck.gfs2 in a few places. This
patch implements the "safe" rbtree traversal and fixes the problem.
Patch #70 for anyone out there counting.
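The "safe" traversal pattern boils down to capturing the successor before
the current element can be deleted. The sketch below shows it on a plain
singly linked list to keep things short; the same ordering applies when
the successor comes from an rbtree next() function. struct node, push and
remove_odd are hypothetical names, not the osi_list or osi_tree API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list node standing in for a tree element. */
struct node {
    int key;
    struct node *next;
};

/* Push a new node on the front of the list and return the new head. */
static struct node *push(struct node *head, int key)
{
    struct node *n = malloc(sizeof(*n));

    assert(n != NULL);
    n->key = key;
    n->next = head;
    return n;
}

/* Delete every node with an odd key while walking the list.
 * Returns the number of nodes removed. */
static int remove_odd(struct node **head)
{
    struct node **link = head;  /* pointer that currently leads to cur */
    struct node *cur, *next;
    int removed = 0;

    for (cur = *head; cur; cur = next) {
        next = cur->next;       /* capture the successor BEFORE any free */
        if (cur->key % 2 != 0) {
            *link = next;       /* unlink cur from the list */
            free(cur);
            removed++;
        } else {
            link = &cur->next;
        }
    }
    return removed;
}
```

An unsafe loop would read cur->next in the for-statement's increment step,
after cur may already have been freed, which is the kind of use-after-free
that can show up as a segfault.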
All 70 patches (crosswritten from STABLE3) were pushed to the RHEL56
branch of the cluster.git tree. It was sanity-check tested on
system roth-01. Customers in the field have vouched for the patches
working properly on a RHEL5.5 prototype. Changing status to POST.
The RHEL56 patch IDs are as follows:
I can't add this to 5.6 until I get a QA ack.
Erg, this bug record once had both QA_ACK and PM_ACK. I'll see
if I can round them up again.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
This is a major speed improvement for fsck.gfs2 compared with previous versions. There are also a large number of improvements to the internal structure of fsck.gfs2 which although not directly evident to the user, should make further development much easier and improve reliability.
Any chance of getting a back-port of this patch for 5.5? The fsck.gfs2 performance is very problematic to us and after testing the pre-release version mentioned in comment 22, we would very much like to move to this in our environment.
There is already a test version of this on my people page and
it's received very positive feedback:
Are you asking for a z-stream 5.5.z?
Build 2767711 successful. This fix is in gfs2-utils-0.1.62-22.el5.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.