Ric was running some tests of fsck by filling a 1T filesystem with 20k (20480-byte) files, using his fs_mark benchmark (in Fedora 9, EPEL, and rawhide): fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0 -l fill.txt This creates around 40-50 million inodes before the filesystem fills, depending on the fs. For comparison, on my test box, ext3 fsck ran in 60s, xfs in 12s, ext4 in 6s. I stopped fsck.gfs2 after about 19 hours, somewhere inside pass2 (I did not get any "% complete" notices, so I don't know whether that means it was still in early phases or what). pass1 took on the order of 17 hours. I blktraced fsck.gfs2 for about 20 mins and throughput was less than 0.5MB/s (I'll attach a graph). -Eric
Created attachment 311755 [details] seekwatcher during pass2 A semi-random datapoint; about 20 mins of seekwatcher during pass2. The writes are interesting too, I'd not have expected much write activity for a clean fs.
I'd not have expected to see _any_ writes :-) It does look like it's a bug, and it sounds as if it won't be too tricky to speed that phase up anyway.
I'm guessing that the main problem is that your fsck ran out of memory and started swapping. I've devised some plans to reduce the memory footprint, and I'll do some profiling as well to see where we can make other improvements.
It wouldn't be swapping to the device being repaired, though....
Sorry if I misled you. I didn't mean to imply that the writes were related to swapping. I'm just saying that the most common reason for fsck.gfs2 to run very slowly is that it runs out of RAM and starts swapping memory to disk. The problem of fsck.gfs2 writing to the disk is actually not surprising, given what I know about how it works. I just opened bug #500484 to fix that issue. It's a complex problem, so I'd like to treat it separately from this one. So to summarize, fsck.gfs2 has three problems: (1) it uses way too much memory, which often causes it to slow down; (2) it has its own problems with slowness; (3) it writes to the file system when it should not. My plan is to use this bug record to fix problems #1 and #2, and bug #500484 to work on #3.
(In reply to comment #5) > Sorry if I misled you. I didn't mean to imply that the writes were > related to swapping. I'm just saying that the most common reason > for fsck.gfs2 to run very slowly is when it runs out of ram and > starts swapping memory to disk. Gotcha, my misunderstanding. Without meddling too much I'd add 4) the read IO that it does doesn't seem very linear, and seeks kill.... maybe, based on the attachment in comment #1 ?
Created attachment 343669 [details] Preliminary patch This preliminary patch is a starting point for disentangling the issues fsck has with libgfs2/buf.c. Rather than read all the rgrp bitmaps into memory in a non-volatile linked list, this patch keeps the same linked list of rgrps but only reads in the bitmaps on an as-needed basis, and only for the short term. This is also an attempt to eliminate the fairly new non-volatile linked buffer list and change all the code in gfs2-utils so that it no longer depends on the list being non-volatile. The whole non-volatile thing was a stop-gap solution so that we could minimize the possibility of regression. In addition to getting rid of the non-volatile linked list, it also reduces the remaining linked list of buffers from 128MB to 1MB, so that far fewer buffers will be kept in memory. That could drive a lot of bugs out of their hiding places. This could have a negative performance impact, because some code that used to assume that given buffers were always in memory will now have to occasionally re-read them from disk. However, it may also have a positive performance impact, because there are far fewer buffers on the linked list to walk when searching, and the memory usage should be considerably less, which might keep fsck.gfs2 from swapping. Since the changes are pervasive, we need to test all the gfs2-utils functions as thoroughly as we can. I've only tested fsck.gfs2 with this patch, so buyer beware. The patch is likely to be revised. Additional work needs to be done: the bitmaps in memory are still the biggest memory hog of the whole process. When I ran fsck.gfs2 today on a 9TB file system, it used 1.2GB of memory, and most of that was for the huge in-core bitmap. We've talked about making fsck.gfs2 work on a per-rg basis, and Steve had some good ideas for improving it as well. So this patch is just a start.
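For readers curious what "only reads in the bitmaps on an as-needed basis" looks like in practice, here is a minimal, self-contained C sketch. The names (`bread`, `CACHE_SLOTS`, `disk_read`) are illustrative only, not the actual libgfs2 interface: a small fixed-size cache replaces the non-volatile list, and a block that falls out of the cache is simply re-read from disk on the next access.

```c
#include <stdint.h>
#include <string.h>

/* Minimal sketch of the "as-needed" buffer scheme (hypothetical names,
 * not the real libgfs2 API): instead of pinning every rgrp bitmap in a
 * non-volatile list, keep a small fixed-size cache and re-read blocks
 * from "disk" when they fall out of it. */

#define CACHE_SLOTS 4          /* tiny cap, standing in for the 1MB limit */
#define BLOCK_SIZE  16

struct buf {
    uint64_t blkno;
    int      valid;
    char     data[BLOCK_SIZE];
};

static struct buf cache[CACHE_SLOTS];
static unsigned next_victim;       /* round-robin eviction */
static unsigned disk_reads;        /* counts simulated disk I/O */

/* Stand-in for a real pread() of the device. */
static void disk_read(uint64_t blkno, char *out)
{
    disk_reads++;
    memset(out, (int)(blkno & 0xff), BLOCK_SIZE);
}

/* Return a cached buffer, reading from "disk" only on a miss. */
struct buf *bread(uint64_t blkno)
{
    for (unsigned i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].blkno == blkno)
            return &cache[i];

    struct buf *b = &cache[next_victim];
    next_victim = (next_victim + 1) % CACHE_SLOTS;
    b->blkno = blkno;
    b->valid = 1;
    disk_read(blkno, b->data);
    return b;
}

unsigned bread_disk_reads(void) { return disk_reads; }
```

The trade-off the comment describes falls directly out of this shape: a re-accessed block that was evicted costs a second disk read, but the search loop only ever walks a handful of entries.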
Created attachment 344521 [details] Preliminary patch #2 This patch fixes a few bugs that the first patch had.
Created attachment 344527 [details] Preliminary patch #2--upstream This is the upstream version of the Preliminary Patch. Actually, I'm doing my development on the upstream code and testing it on Fedora (system "camel"). Preliminary results are encouraging: mkfs on 1TB stays at 31s (but uses much less memory). The fsck on 1TB goes from 1m26s down to 50s (and uses less memory). This patch clears the way for me to eliminate buf.c in favor of mmap-based buffer management. I believe Steve's suggestion of letting the VFS manage the buffers will gain us a significant performance improvement.
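The "letting the VFS manage the buffers" suggestion can be sketched like this (a speculative illustration, not code from the patch; the function name and sizes are made up): mmap() the region of the device you need and let the kernel's page cache decide what stays resident, instead of maintaining private buffer lists.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical sketch: dirty a region of a file/device through a
 * shared mapping, letting the kernel page in and write back data
 * instead of managing buffers ourselves.  'offset' must be
 * page-aligned. */
int touch_region_via_mmap(const char *path, off_t offset, size_t len,
                          unsigned char fill)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    unsigned char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, offset);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memset(map, fill, len);        /* dirty the pages in place */
    msync(map, len, MS_SYNC);      /* flush only what changed */
    munmap(map, len);
    close(fd);
    return 0;
}
```

The appeal is that eviction, read-ahead, and writeback all become the kernel's problem; the cost is less explicit control over when I/O happens.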
Created attachment 365112 [details] Preliminary patch #3--upstream I know this is outrageously huge, but it's also a major performance improvement. Performance gains were realized by several methods, but the major improvements are these: 1. buf.c was given a lobotomy. There are no more linked lists of buffers to search through, which saves a lot of time. Instead, the code is just a means for reading and writing on an as-needed basis. The penalty is that the changes are pervasive; virtually every user space tool is affected. The code that assumed the linked lists were in place had to be reworked so it no longer has that requirement, which could mean there are bugs. Many areas of gfs2-utils have not been tested with the new code. 2. The duplicate blocks list has been totally scrapped in favor of a new red-black tree taken largely from the kernel's rbtree.c and rbtree.h. This gives a huge performance gain in many areas, but especially in the excruciatingly slow pass1b, where duplicate blocks are resolved. The code in pass1b has been reworked so it only needs to make one pass through the file system, and it quickly scans the rbtree for what it needs. 3. The generic block status code (block_list.c) has been retooled completely so that it only does nibble manipulation. It used to keep track of blocks in three ways: (1) a nibble to indicate block type, (2) a linked list for dinodes with extended attributes, and (3) a linked list for blocks that are duplicates. Now it only keeps (1), and that saves a tremendous amount of time. In addition, block_list.c has been streamlined, and the fsck code that required the other two lists now uses and manipulates them on an as-needed basis only, rather than letting block_list.c do all the work. 4. All RGs are kept in memory (as they were before), but now their buffers (buf.c) are linked to the structure in memory, so it never has to read and write the disk to get the information. 5.
Almost all ondisk.c code has been transformed to use buffer headers (buf.c) to ensure that data structure changes will cause the buffer to be written to disk, and only if it has really changed. (In other words, there should be no more writes done to a file system unless absolutely necessary.) 6. A bunch of small optimizations have been made. Basically, I profiled the code with valgrind, figured out where we were wasting the most time, and retooled those areas. This version of the code has _not_ finished testing; a lot of testing remains to be done. I'm sure there are bugs. The good news is that preliminary tests show significant speed improvement. In one case, west-0406 (from Nate's "nsew" cluster) went from 49 hours to 5 minutes 30 seconds to fsck. That's a special case with a lot of very big data files. (This was converted from gfs via gfs2_convert.) In another case, 1773738-metasave, I've never actually gotten through fsck.gfs2 with the old fsck because it takes prohibitively long. I've killed it after several hours. In one case, I let the old version run overnight, and after > 12 hours it was 1% into pass1b. With the new code, it gets all the way to pass2 within about an hour (which is most of the way through the fsck). To be sure: this patch is still preliminary and proof-of-concept. I'm sure there are bugs. But it's very close. The problem here is that it takes a long time to test all the different scenarios, especially with multiple-terabyte metadata sets.
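The nibble-only block status map described in item 3 above is simple enough to sketch directly. This is a self-contained illustration with made-up names (`block_map_*`), not the actual block_list.c interface: one 4-bit nibble per filesystem block encodes its type, so tracking N blocks costs only N/2 bytes and no side lists.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of a 4-bits-per-block status map (hypothetical names, not the
 * real block_list.c code).  Even-numbered blocks live in the low
 * nibble of a byte, odd-numbered blocks in the high nibble. */

struct block_map {
    uint8_t  *map;
    uint64_t  nblocks;
};

int block_map_create(struct block_map *bm, uint64_t nblocks)
{
    bm->nblocks = nblocks;
    bm->map = calloc((nblocks + 1) / 2, 1);   /* half a byte per block */
    return bm->map ? 0 : -1;
}

void block_map_set(struct block_map *bm, uint64_t blk, uint8_t state)
{
    uint8_t *byte = &bm->map[blk / 2];
    if (blk & 1)
        *byte = (*byte & 0x0f) | (uint8_t)(state << 4);  /* high nibble */
    else
        *byte = (*byte & 0xf0) | (state & 0x0f);         /* low nibble */
}

uint8_t block_map_get(const struct block_map *bm, uint64_t blk)
{
    uint8_t byte = bm->map[blk / 2];
    return (blk & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0f);
}

void block_map_destroy(struct block_map *bm)
{
    free(bm->map);
    bm->map = NULL;
}
```

For a 1TB filesystem with 4KB blocks (~268 million blocks), such a map needs roughly 128MB, which is why the in-core bitmap remains the dominant memory cost even after the side lists are gone.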
*** Bug 500484 has been marked as a duplicate of this bug. ***
The fix isn't ready for 5.5. Retargeting for 5.6.
Created attachment 385954 [details] First 20 STABLE3 patches This tarball contains the first twenty gfs2-utils patches to speed things up and fix a bunch of bugs I found in testing. So far these patches are mostly "reform" patches. That is, most of them merely reorganize, simplify and reshape the code. Some of them improve performance though. The patches that fix real bugs and really speed things up are yet to come. (I'll post them when they've been separated out from each other, which should be shortly.) My goal here is to separate the "noise" from the fixes that make a real difference because frankly, most of the "noise patches" are huge. 1. fb1fd632 Remove nvbuf_list and use fewer buffers 2. 3bf7c1db Eliminate bad_block linked block list 3. 132c5949 Simplify bitmap/block list structures 4. e7d40b55 Streamline the bitmap code by always using 4-bit size per block 5. 2c35613f Misc blocklist optimizations 6. 2d0a9030 Separate eattr_block list from the rest for efficiency 7. 30761ada gfs2: remove update_flags everywhere 8. a7b3bc3d fsck.gfs2: give comfort when processing lots of data blocks 9. a2e35f9d make query() count errors_found, errors_fixed 10. bcab5aac attach buffers to rgrp_list structs 11. 4d4250a0 Make struct_out functions operate on bh's 12. d97e46be Attach bh's to inodes 13. 179e8f85 gfs2: Remove buf_lists 14. efc48665 fsck.gfs2: Verify rgrps free space against bitmap 15. 4a50aa66 bit_map -> block_map for clarity 16. 0bd10812 Move duplicate code from libgfs2 to fsck.gfs2 17. f4b1e5e7 libgfs2, fsck.gfs2: simplify block_query code 18. 80c8c107 gfs2: libgfs2 and fsck.gfs2 cleanups 19. 0bb39a86 libgfs2: fs_bits speed up bitmap operations 20. 3247bc3e libgfs2: gfs2_log reform
Created attachment 386054 [details] STABLE3 patches 21 - 30 This tarball contains STABLE3 patches 21 through 30. There are still many more to come.
Created attachment 386247 [details] STABLE3 patches 31 - 40 This tarball contains STABLE3 patches 31 through 40. I forgot to list the previous set of patches, so I'm listing them here along with this set of ten: Patches 21-30: 21. 7a89d6f fsck.gfs2: convert dup_list to a rbtree 22. f8a1efa fsck.gfs2: convert dir_info list to rbtree 23. 287cce1 fsck.gfs2: convert inode hash to rbtree 24. 7b32b47 fsck.gfs2: pass1 should use gfs2_special_add not _set 25. 17fa2f9 libgfs2: Remove unneeded sdp parameter in gfs2_block_set 26. 49944e7 libgfs2: dir_split_leaf needs to zero out the new leaf 27. 6cc2baf libgfs2: dir_split_leaf needs to check for allocation failure 28. 898b355 libgfs2: Set block range based on rgrps, not device size 29. 4926dcb fsck.gfs2: should use the libgfs2 is_system_directory 30. b718f95 fsck.gfs2: Journal replay should report what it's doing Patches 31-40: 31. 9391ded fsck.gfs2: fix directories that have odd number of pointers. 32. aedd06c libgfs2: Get rid of useless constants 33. bb0667a fsck.gfs2: link.c should log why it's making a change for debugging 34. dc990cf fsck.gfs2: Enforce consistent behavior in directory processing 35. bb05c32 fsck.gfs2: enforce consistency between bitmap and blockmap 36. a374bdf fsck.gfs2: metawalk needs to check for no valid leaf blocks 37. 17d9ff8 fsck.gfs2: metawalk was not checking many directories 38. 0a1b673 fsck.gfs2: separate check_data function in check_metatree 39. 0afb359 lost+found link count and connections were not properly managed 40. 8a5d6bf fsck.gfs2: reprocess lost+found and other inode metadata when blocks are added There are still several more patches to come.
Created attachment 386732 [details] STABLE3 patches 41 - 57 This tarball contains STABLE3 patches 41 through 57. Here is a list of these patches: 41. bbde081 Misc cleanups 42. f56043a fsck.gfs2: Check for massive amounts of pointer corruption 43. af99475 fsck.gfs2: use gfs2_meta_inval vs. gfs2_inval_inode 44. da84bfd Eliminate unnecessary block_list from gfs2_edit 45. d57579f fsck.gfs2: rename gfs2_meta_other to gfs2_meta_rgrp. 46. c395eee Create a standard metadata delete interface 47. cb328f4 fsck.gfs2: cleanup: refactor pass3 48. 43b3b80 fsck.gfs2: Make pass1 undo its work for unrecoverable inodes 49. 0d098b2 fsck.gfs2: Overhaul duplicate reference processing 50. 0c8b1af fsck.gfs2: invalidate invalid mode inodes 51. 8697753 fsck.gfs2: Force intermediate lost+found inode updates 52. c490dd0 fsck.gfs2: Free metadata list memory we don't need 53. 8664ae2 fsck.gfs2: Don't add extended attrib blocks to list twice 54. 71c293b fsck.gfs2: small parameter passing optimization 55. 4b1b6e0 fsck.gfs2: Free, don't invalidate, dinodes with bad depth 56. aa34c47 Misc cleanups 57. c5f6413 fsck.gfs2: If journal replay fails, give option to reinitialize journal I don't have any more patches planned at the moment.
I did one more patch to clean up white-space errors introduced by the previous patches. These patches have passed the toughest of my tests with flying colors. They have fixed all of the gfs2 metadata sets I own that could be fixed within the boundaries of the 9TB device on system kool. All 58 patches have been pushed to the master branch of the gfs2-utils git tree and the STABLE3 branch of the cluster git tree. Most of the bugs I discovered exist in the original gfs_fsck, back from the old days. This especially holds true for the bugs where lost+found accounting and bitmap setting are done improperly. Unfortunately, STABLE3 has diverged from the RHEL branches so much that it will be a nightmare to try to get these patches to apply to RHEL, and even harder to go back to the original gfs_fsck.
Created attachment 394829 [details] RHEL55 patches Here are all the crosswrite patches for RHEL55 in one tarball. I'm testing them now and they may need some revisions.
Created attachment 395124 [details] RHEL55 patches (two more) This tarball is the same as the previous one, containing the same 65 patches. However, in testing all the GFS2 metadata in my collection, I discovered two more patches were necessary (bringing it up to 67). They are: 43afa82 Convert check_statfs function to the new rgrp method af9445e GFS2: Dinode #33129 (0x8169) has bad height Found 1, Expected >= 2 The first patch is necessary because it fixes a discrepancy between the new code and code added since I started this project. The second patch is a crosswrite from STABLE3 that had previously not been in RHEL55, but is now necessary.
This version has passed all the tests I've thrown at it and is ready to be QE tested. Due to limited disk space on my RHEL5 test box, I can't test bigger file systems with this code, which excludes about a dozen metadata sets I have. However, the STABLE3 equivalent code has been tested on them all (I have more storage available there). This pre-release version (WARNING: not QE tested) is available on my people page at this location: http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/
Created attachment 396079 [details] Another patch This is another patch that has to be shipped. This fixes a problem where mkfs.gfs2 was unable to make any gfs2 file systems.
Created attachment 396370 [details] Regressions patch This patch fixes some regressions introduced by the previous patches. With this patch and the previous ones, I was able to run mkfs.gfs2, fsck.gfs2 and the growfs test successfully, verifying they did the right things.
Created attachment 400558 [details] osi_tree "each_safe" patch There were a few places in fsck.gfs2 where each element of the new rbtrees was being processed. Some of those places deleted items from the rbtree, which meant I needed to use the equivalent of osi_list_foreach_safe. In other words, I needed to prevent rbtree deletes from interfering with the next() function. The result was a segfault in fsck.gfs2 in a few places. This patch implements the "safe" rbtree traversal and fixes the problem. Patch #70 for anyone out there counting.
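The "safe" traversal idea is the same one osi_list_foreach_safe uses: fetch the successor *before* the current node can be freed, so deleting a node never invalidates the iterator. Here is a minimal sketch of the pattern on a plain linked list (illustrative code, not from the patch; the rbtree version applies the identical trick to its next() function):

```c
#include <stdlib.h>

/* Sketch of "safe" iteration: save the next pointer before any
 * free(), so removal of the current node doesn't break the walk. */

struct node {
    int value;
    struct node *next;
};

/* Delete every node matching 'value'; returns the number deleted. */
int delete_matching(struct node **head, int value)
{
    int deleted = 0;
    struct node **link = head;
    struct node *n = *head;

    while (n) {
        struct node *next = n->next;   /* saved BEFORE any free() */
        if (n->value == value) {
            *link = next;
            free(n);
            deleted++;
        } else {
            link = &n->next;
        }
        n = next;                      /* safe even if n was freed */
    }
    return deleted;
}

struct node *push(struct node *head, int value)
{
    struct node *n = malloc(sizeof(*n));
    n->value = value;
    n->next = head;
    return n;
}
```

Without the saved `next` pointer, the loop would dereference freed memory right after a deletion, which is exactly the kind of bug that produced the segfaults described above.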
All 70 patches (crosswritten from STABLE3) were pushed to the RHEL56 branch of the cluster.git tree. It was sanity-check tested on system roth-01. Customers in the field have vouched for the patches working properly on a RHEL5.5 prototype. Changing status to POST. The RHEL56 patch IDs are as follows: 0fb59c1416efb28367ca5fb06221edb9ecf49423 2040434f74cb6cd43e3b7781c7f6c11d3adfaa2b 4fbd5fa224b6dbcf0998c4cb352a92c4b49f6cdc 3f47f6f9b2dc9f8e1dbfc27158d6bc1f64ad5cac d6e3399a259167c1762651586ea062cf7d0c6c30 8639e740d7b51c046d2cffd5b96c6319214b839f f6cdc874f5caa267d13f0f6741765bfaa2c24284 e9316e5043402ba0cdc66fe70e90538fd88a4797 e7282c8a25ee830c068b3598814bc285d4398e24 6c6638cd6055c03ef74351dc7a179c5a178b0b16 bf0822c4fd7bd701f36ddaef045fb8d587b47f96 965554a2d3d18d877bb4ce3f1cf369775b882e7a 8bc98f2fad5ecf63e42bd44e93f5b5349c7f1253 a6d56a0938ad396bda77a04102c644df3202ec93 1a6be82a56458a55a6e6a489c0672b1bec444704 c418417c6c16447f388fc568dfdcc9ad0a6a0138 0895e0b00958a36645c5250b988c4ba4cc2939ab 2ac3e54d005e9fdd3103bf5a78502f92ac8a4aec b486ce1bf1519f42a252a0b49ea177a1c87b4593 8d3521017ebe5e9b8d8e190f1b33af601b461972 9b487b83bb443ab2f55e39d2f81f26e67b6ad791 7d8b8ae7f3f3a425c11b58def216d45881e36ac9 7079b2141b66a78e0484d56025ad3e6c238a021c b4eba17564b216b75eb89f6b06a62b572a69ac2e 221038f9a959a0ac4f457afec470e7d3135b090a cb66bfe79239309ace78c06f3548fb402be4e201 4d41d1c30a8a80aef09abf8fb18facb3f68ea5bb 4284f2110fc813cb30d3611d16728d4640aed6f3 10d21248358791b837772aabe06a524a01381356 1faaeeacae5c868e72e327e821ba118da7cbd6cd c952ad46246b8619b723852ab967a69f2bb80290 5a2b3a48aea599e441cfb893884c14e41336ae8e ac9968edda6cd6d42e91adc6dc8440f8ed480011 ff0e3708e441d15d620a611a6d25cbc9d89d57f1 afc7e08118c14ab73ea58907529ed456fea3c602 5ea3c62f38d726c39a520b60de0acecb0ca046ae 7c8cb8d5d6d1e08c5f83eef844ceff89a6567d64 daa3ee9620a1448f944a426df54d96b26e2816c6 9107ed8993d8efc19be0c34e4874db3f229c1869 0e489a26b950339de685d22a4a2269310689885e 03fff858449f5491fcd5b134975819bf9d1842db 
d5225059548822d41cc15fa8c754e2ec9a4d2c90 aaaf96f17ca0167f3fad3f2f63b9258d2e3ffcc1 a04bba916c90b702c59773894861c061ffe71d03 17bcd7e515bf1c03e9464377dbb1cfbd3a27a7e3 5b7eb0c7649899cc57a473317ea30e91528853b9 ae16ca4e3c1a4dfdecafb124947fd43c836de8c5 d770495f365c8ded3ecc3e83a67fdc02b7dceb4d 84eebbec2ab855b10d23efbfc71487b5fa300a84 d454ee8d7944deb1bbdc26e5d84a79f08e46ee91 c30fd6a2560452a111dad8584951765d9ddb787b ee54c19b7f4783f3a609fbab09b45a87bea6de79 ca8891b7a427fd7a8c4a8d42949f2f8318e56c3c ff3e9ddd806878935a8c5e0702b84d9230409e08 424bf2626e6b4f6b935d200452774de6d898cbc4 d7710e6d9a4df4b28e715382bb028b817e514ba0 cd6e9217b85f13c85b250180cf5590526d697c61 0827d8708cef94e2f063a8fd9071f468f7eeec54 800465eaed0ecdee931feb52f9e30467f624c00e 259dcbe89d0edf896f4a504bc2d627df0d6e4e9f 7704574fe2fc2cc831f3cb46f8fca526c09c6d4b 97f9ce088161a48df490b304c1866aa4a0694d33 20c699b363c8ddd38ad29c35a20b5978384645e6 7ccf5c0c60ff29fc6e8a1ef0fea01d510f2df79b 1b06b8a4e4f9b36566b6ea15a094243ae06dc0e2 27ddb92542bfa62d94e7e2496e717f84aaef53f6 efc08c2e2456e59cc5098ec89071343f2d72757c 7b9e48f534a1b3aa0ff2138b9038379bf3d33ab7 4c15e60d5a1750cf43316e8530727afbc63fe57c bbd213cf02b9f427342d75e4c05d4cf95d80d8ef b83b1ee7f3cf0cf04ed52932e5c1e75259b9ecd2
I can't add this to 5.6 until I get a QA ack.
Erg, this bug record once had both QA_ACK and PM_ACK. I'll see if I can round them up again.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: This is a major speed improvement for fsck.gfs2 compared with previous versions. There are also a large number of improvements to the internal structure of fsck.gfs2 which, although not directly evident to the user, should make further development much easier and improve reliability.
Any chance of getting a back-port of this patch for 5.5? The fsck.gfs2 performance is very problematic to us and after testing the pre-release version mentioned in comment 22, we would very much like to move to this in our environment.
There is already a test version of this on my people page and it's received very positive feedback: http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/fsck.gfs2 Are you asking for a z-stream 5.5.z?
Build 2767711 successful. This fix is in gfs2-utils-0.1.62-22.el5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0135.html