Description of problem:
I was doing some testing for bug 223893 (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=223893) and even I was surprised at how long it took to run gfs2_fsck on an 8G partition. Granted, because I was unit testing, I had gdb debugging enabled, which slowed it down from its normal time. Just for the fun of it, I profiled gfs2_fsck with gprof to see where it was spending its time, and was surprised to find that most of its CPU time was spent doing bitmap manipulation. Digging into the code, I discovered that it was using some extremely inefficient math in the bitmap manipulation routines. Just for the heck of it, I rewrote a couple of functions to be longer and more complex, but more CPU efficient: for example, rather than doing 64-bit divides and 64-bit modulus operations, I used masking and shifting. In doing so, I greatly improved the efficiency of these functions.

Then, while doing various testing, I discovered that the Makefile for libgfs2 accidentally still had debugging turned on for our RHEL5 gfs2 tools. That should get turned off.

Version-Release number of selected component (if applicable): RHEL5

How reproducible: Always

Steps to Reproduce:
1. gfs2_fsck

Actual results: Very slow

Expected results: A tiny bit less slow.

Additional info: I have three ideas for improving gfs2_fsck at the moment, but I'm sure there will be more:
1. Change the libgfs2 Makefile so it uses -O2 rather than -ggdb
2. Change the bitmap.c functions to do more efficient math
3. Maybe change the libgfs2 buffering scheme so it finds block buffers more efficiently, perhaps using a btree rather than a linked list. (Just something to experiment with.)
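A minimal sketch of the math reduction idea 2 refers to (the function and constant names below are illustrative, not the actual bitmap.c code): GFS2 keeps a 2-bit state per block, so each byte holds 4 entries, and because 4 is a power of two, the 64-bit divide and modulus that gprof flagged (__udivdi3/__umoddi3) can be replaced with a shift and a mask.

```c
#include <stdint.h>

#define BIT_MASK 0x3   /* mask for one 2-bit bitmap entry */

/* Divide/modulus version -- on 32-bit builds the 64-bit '/' and '%'
 * become calls to __udivdi3/__umoddi3, which the profile shows eating
 * roughly 40% of the CPU time: */
static uint8_t bitmap_get_slow(const uint8_t *bits, uint64_t block)
{
    uint64_t byte_off = block / 4;           /* 64-bit divide  */
    unsigned int bit  = (block % 4) * 2;     /* 64-bit modulus */
    return (bits[byte_off] >> bit) & BIT_MASK;
}

/* Shift/mask version -- same result: dividing by a power of two is a
 * right shift, and the remainder is just the low bits: */
static uint8_t bitmap_get_fast(const uint8_t *bits, uint64_t block)
{
    uint64_t byte_off = block >> 2;                      /* block / 4     */
    unsigned int bit  = (unsigned int)(block & 0x3) * 2; /* (block % 4)*2 */
    return (bits[byte_off] >> bit) & BIT_MASK;
}
```

Modern compilers perform this strength reduction automatically at -O2, which is part of why idea 1 (dropping -ggdb in favor of -O2) matters as much as rewriting the math by hand.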
Created attachment 151303 [details] First go at a fix This patch probably needs a little cleanup: I have long horrid comments in there explaining the math reduction (mostly for my own sanity).
Created attachment 151304 [details] First go at a fix This patch probably needs a little cleanup: I have long horrid comments in there explaining the math reduction (mostly for my own sanity).
Here's some gprof profiling data for gfs2_fsck.

Flat profile excerpts (each sample counts as 0.01 seconds):

Without the patch:
--------------------------------------------------------
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
31.57      1.19      1.19                             gfs2_bitmap_get
27.06      2.21      1.02                             __udivdi3
13.53      2.72      0.51                             __umoddi3
 2.92      2.83      0.11                             bget_generic
 2.65      2.93      0.10   139301     0.00     0.00  inode_hash_search
 2.65      3.03      0.10                             gfs2_bitmap_set
 2.39      3.12      0.09                             gfs2_block_check
 2.12      3.20      0.08                             bfind
 1.99      3.28      0.08    26594     0.00     0.00  build_and_check_metalist
 1.86      3.35      0.07                             gfs2_bitmap_clear
 1.06      3.39      0.04      160     0.25     0.34  check_block_status
 0.80      3.42      0.03  4247564     0.00     0.00  warm_fuzzy_stuff
 0.80      3.45      0.03                             gfs2_block_mark
 0.80      3.48      0.03                             gfs2_disk_hash
 0.80      3.51      0.03                             gfs2_find_next_block_type
 0.80      3.54      0.03                             __divdi3

With the patch:
--------------------------------------------------------
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
30.48      0.57      0.57                             gfs2_bitmap_get
11.23      0.78      0.21                             bget_generic
 7.49      0.92      0.14    26594     0.01     0.01  build_and_check_metalist
 5.88      1.03      0.11   139301     0.00     0.00  inode_hash_search
 5.35      1.13      0.10                             gfs2_block_check
 4.81      1.22      0.09                             gfs2_disk_hash
 4.28      1.30      0.08                             gfs2_bitmap_set
 3.74      1.37      0.07                             gfs2_bitmap_clear
 3.21      1.43      0.06   217620     0.00     0.00  hash_more_internal
 2.14      1.47      0.04                             print_fsck_log
 1.60      1.50      0.03                             add_buffer
 1.60      1.53      0.03                             bfind
 1.60      1.56      0.03                             gfs2_bitfit
 1.60      1.59      0.03                             gfs2_block_set
 1.60      1.62      0.03                             gfs2_find_next_block_type
 1.34      1.65      0.03  1880692     0.00     0.00  check_data
 1.07      1.67      0.02    83750     0.00     0.00  check_dentry
 1.07      1.69      0.02    55551     0.00     0.00  handle_di
 1.07      1.71      0.02      160     0.13     0.22  check_block_status
 1.07      1.73      0.02                             gfs2_check_meta

Call graph excerpts (granularity: each sample hit covers 2 byte(s) for 0.27% of 3.77 seconds). Note: this is only the top 4 in the list.

Without fix:
-------------------------------------------------------
index % time    self  children    called     name
[1]     31.6    1.19      0.00               gfs2_bitmap_get [1]
[2]     27.1    1.02      0.00               __udivdi3 [2]
[3]     13.5    0.51      0.00               __umoddi3 [3]
[4]      9.3    0.00      0.35               main [4]

With the fix:
-------------------------------------------------------
index % time    self  children    called     name
[1]     30.5    0.57      0.00               gfs2_bitmap_get [1]
[2]     25.1    0.00      0.47               main [2]
                0.00      0.29       1/1         pass1 [3]
                0.01      0.12       1/1         pass2 [10]
                0.00      0.03       1/1         pass5 [27]
                0.00      0.01       1/1         pass4 [44]
                0.00      0.01       1/1         destroy [41]
                0.00      0.00       1/1         pass3 [56]
                0.00      0.00       1/1         read_cmdline [74]
                0.00      0.00       1/1         initialize [71]
                0.00      0.00       1/1         pass1b [72]
                0.00      0.00       1/1         pass1c [73]
-------------------------------------------------------
                0.00      0.29       1/1         main [2]
[3]     15.3    0.00      0.29       1       pass1 [3]
                0.01      0.28   55551/55551     scan_meta [4]
                0.00      0.00   55551/4247564   warm_fuzzy_stuff [38]
-------------------------------------------------------
                0.01      0.28   55551/55551     pass1 [3]
[4]     15.3    0.01      0.28   55551       scan_meta [4]
                0.02      0.26   55551/55551     handle_di [5]

As you can see from the data, this change saves a lot of CPU time. However, with gfs2_fsck the bottleneck is probably more in IO than in CPU time, so I don't know if this will make a difference in real run time. Also, the above profile was taken before I fixed some things in the patch that made it run slower.
Okay, here are some real numbers.

Without the fix:
real    2m20.487s
user    0m6.144s
sys     0m1.464s

With the fix:
real    2m18.141s
user    0m3.356s
sys     0m1.476s

So the user CPU time was almost cut in half, but it only saved about 2.3 seconds of "real time". That is roughly a 1.7 percent improvement in run time for 8G. If this were an 8-hour gfs2_fsck, it would equate to saving about 8 minutes.
devel & 5.3 ACKs please.
Deferring to 5.4; there are some other things that would make gfs2_fsck more efficient. For example, getting rid of the buf.c interface that mimics kernel buffer_heads would likely improve performance by a lot.
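As a hedged sketch of what idea 3 from the original report (finding block buffers more efficiently than a linear list walk) might look like: a chained hash table keyed on block number turns an O(n) list scan like bfind() into an O(1) average lookup. All of the names below are made up for illustration; this is not the actual buf.c API.

```c
#include <stdint.h>
#include <stdlib.h>

#define BUF_HASH_SIZE 4096   /* power of two, so we can mask instead of mod */

struct blk_buf {
    uint64_t blkno;          /* disk block number (the lookup key)  */
    struct blk_buf *next;    /* hash-chain link                     */
    /* ... data pointer, refcount, etc. would live here ...         */
};

static struct blk_buf *buf_hash[BUF_HASH_SIZE];

static unsigned int buf_hash_fn(uint64_t blkno)
{
    return (unsigned int)(blkno & (BUF_HASH_SIZE - 1));
}

/* Walk only the (short) chain for this bucket, not every buffer: */
static struct blk_buf *buf_find(uint64_t blkno)
{
    struct blk_buf *b;
    for (b = buf_hash[buf_hash_fn(blkno)]; b; b = b->next)
        if (b->blkno == blkno)
            return b;
    return NULL;
}

/* Insert at the head of the bucket's chain: */
static struct blk_buf *buf_add(uint64_t blkno)
{
    struct blk_buf *b = calloc(1, sizeof(*b));
    if (!b)
        return NULL;
    b->blkno = blkno;
    unsigned int h = buf_hash_fn(blkno);
    b->next = buf_hash[h];
    buf_hash[h] = b;
    return b;
}
```

A btree, as the original report suggests, would work too and would additionally give ordered traversal; a hash table is simply the smaller change if order doesn't matter.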
There are two major things we can do to improve the speed of gfs2_fsck:

(1) Reduce the memory requirements. A huge part of the slowdown on many customer systems is the fact that it uses far too much memory. That pushes the system into using swap space, which brings the system to its knees. I'm working on the memory issue.

(2) Make gfs2_fsck multi-threaded. For a lot of its operation, gfs2_fsck is using 1% of the CPU, and mostly what it's doing is waiting for disk IO. So if I make it create several threads, all working on separate Resource Groups simultaneously, maybe it can use the CPU(s) more effectively.
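A minimal sketch of the multi-threading idea in (2), assuming workers pull resource-group indices from a shared counter so the per-rgrp IO overlaps. Everything here is hypothetical scaffolding, not actual gfs2_fsck code; the real version would also need to worry about shared state like the block map.

```c
#include <pthread.h>

#define NUM_RGRPS   8   /* stand-in for the file system's rgrp count */
#define NUM_THREADS 4

static pthread_mutex_t next_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_rgrp;            /* next unclaimed resource group     */
static int scanned[NUM_RGRPS];   /* set once a worker has finished it */

/* Stand-in for per-rgrp pass 1 work (reading bitmaps, dinodes, etc.) */
static void scan_rgrp(int rg)
{
    scanned[rg] = 1;
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* Claim the next unscanned resource group, if any remain: */
        pthread_mutex_lock(&next_lock);
        int rg = (next_rgrp < NUM_RGRPS) ? next_rgrp++ : -1;
        pthread_mutex_unlock(&next_lock);
        if (rg < 0)
            return NULL;
        scan_rgrp(rg);   /* this rgrp's IO overlaps with other threads */
    }
}

/* Spawn the workers, wait for them, and report how many rgrps got done: */
static int scan_all_rgrps(void)
{
    pthread_t tid[NUM_THREADS];
    int i, done = 0;

    for (i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    for (i = 0; i < NUM_RGRPS; i++)
        done += scanned[i];
    return done;
}
```

The work-stealing counter keeps all threads busy even when resource groups take uneven time, which matters because a badly fragmented rgrp can cost far more IO than a clean one.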
I've decided to close this bug. The userland bitmap speedup code (attachments 1 and 2) has been incorporated into the fix for bug #404611. I did this to try to make my fsck testing faster for that bug. At one point I had a 2TB file system that had been running benchp14 for days, and I tried to run gfs2_fsck on it. After days of no progress, problems running out of memory (swapping), and the CPU gridlocked at 100%, I killed it. So I reduced the memory requirements, found a machine that had more memory, and, to keep it from being CPU-bound, I implemented the fix attached below. It did help, but not enough for that particular file system: it managed to get 7% through pass 1 after running overnight, but by then it was page-bound rather than CPU-bound. On a smaller file system, this fix took my gfs2_fsck time from 19 seconds down to 6 seconds. The fix has never been shipped because there are larger issues with speeding up gfs2_fsck. With bug #404611, I decided it was better to get something out in the field while I worked on the other issues. The other issues will be addressed in bug #455300.