Optimise the loop in gfs2_bitfit by removing unneeded tests from the fast path and by applying the applicable techniques from the following tutorial: http://lwn.net/Articles/250967/
A bit more on this: the optimization is more than a "nice to have", since the performance issue relates to a key target area.
I wrote a tiny little C program that lets me experiment with this. I can basically plug in gfs2_bitfit implementations and test them in userspace. My first finding is that the current method, from my 25 Jan 2008 optimization (http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=5fdc2eeb5d1d3800367f471690b01fcd1fd5b963), is approximately six times faster than the code before 25 Jan, at least on my x86_64 platform. It would be interesting to try this program on other architectures like ppc. From here it's easy to experiment and see what works more efficiently.
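For illustration, here is a minimal sketch of what such a pluggable userspace harness might look like. This is not the attached test program; the function names (bitfit_naive, time_one), the bit ordering within a byte, and the setup constants are my own assumptions, chosen to match the scenarios described later in this bug:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/time.h>

/* Candidate implementations all share the kernel gfs2_bitfit signature,
 * so they can be swapped in and out of the same timing loop. */
typedef uint32_t (*bitfit_fn)(const unsigned char *buf, unsigned int len,
                              uint32_t goal, unsigned char state);

#define BFITNOENT 0xFFFFFFFFU

/* A deliberately naive 2-bits-per-block scan, standing in for one of the
 * real candidates (bit ordering here is illustrative). */
static uint32_t bitfit_naive(const unsigned char *buf, unsigned int len,
                             uint32_t goal, unsigned char state)
{
	uint32_t blk;

	for (blk = goal; blk < len * 4; blk++) {	/* 4 blocks per byte */
		unsigned char byte = buf[blk / 4];
		if (((byte >> ((blk % 4) * 2)) & 3) == state)
			return blk;
	}
	return BFITNOENT;
}

static long time_one(bitfit_fn fn, const unsigned char *bmap,
                     unsigned int len, uint32_t goal)
{
	struct timeval t0, t1;

	gettimeofday(&t0, NULL);
	fn(bmap, len, goal, 0);			/* state 0 == free block */
	gettimeofday(&t1, NULL);
	return (t1.tv_sec - t0.tv_sec) * 1000000L +
	       (t1.tv_usec - t0.tv_usec);
}

int main(void)
{
	unsigned int len = 50 << 20;		/* 50MB bitmap, as in the runs below */
	unsigned char *bmap = malloc(len);
	bitfit_fn fns[] = { bitfit_naive };	/* plug additional candidates in here */
	unsigned int i;

	if (!bmap)
		return 1;
	memset(bmap, 0xff, len);		/* every block marked in use...       */
	memset(bmap + len - 100, 0, 100);	/* ...except 400 blocks near the end  */
	for (i = 0; i < sizeof(fns) / sizeof(fns[0]); i++)
		printf("tdiff[%u] = %ld us\n", i,
		       time_one(fns[i], bmap, len, len));  /* goal 25% into the blocks */
	free(bmap);
	return 0;
}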
Sounds good. I know this might be a silly question, but did you use the same -O level as the kernel when doing your user mode tests?
Created attachment 297261 [details]
First go at a patch

This patch is untested in a real gfs2 file system, but I have tested it in userspace using a test program. I've studied this problem for some time now and here's what I've found:

I ported the code to user space so I could debug it more easily and run timing tests with gdb and the like. So I have one test program that tests all of the algorithms at once.

The performance of gfs2_bitfit of course varies greatly depending on how fragmented with metadata vs. data the gfs2 file system is. It isn't right to optimize for either nearly-full or nearly-empty bitmaps, so I took a good look at Barry's specsfs test system and saw just how fragmented those bitmaps typically are at any given moment. I'm taking that as a good indication that customers' file systems will be similarly fragmented. Before looking at this fragmentation, I made bad assumptions and got bad numbers as a result.

I tested six algorithms under conditions similar to Barry's. They are:

1. The algorithm we're using today.
2. The algorithm we were using prior to my 25 Jan 2008 optimization.
3. Optimized version of #1.
4. More optimized variation of #3.
5. New word-alignment algorithm (sketched below).
6. Method that tries to use memcmp on large chunks as a shortcut.

Algorithm 6 performed so horribly that I threw it out altogether and never looked back; I was just curious. Algorithm 3 outperformed all the others on a bitmap that is mostly used blocks with rare metadata/data fragmentation. Algorithm 4 performed better than 3 with metadata fragmentation. Algorithm 5 outperformed all the others when there were periodic metadata allocations typical of the specsfs file systems.

My test program searches a 50MB bitmap for a "free" block that doesn't appear until 100 bytes (400 blocks) from the end, starting with a "goal" block 25 percent into the bitmap. This is a typical run on my x86_64 box. The numbers are in microseconds:

1. 679633
2. 972714
3. 685488
4. 242310
5. 169487

So for this performance run, algorithm 5 performed 4 times faster than the one we're using today. On my 32-bit test box I tested 20MB bitmaps and got similar ratios:

1. 1997501
2. 1351896
3. 1603124
4. 773022
5. 477915

Again, algorithm 5 performed almost 4 times faster than 1. It's interesting to note that the algorithm we're using today actually performed worse than the pre-rewrite code on my 32-bit system under some types of metadata/data fragmentation.
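To show the idea behind the word-alignment approach, here is a minimal sketch of the word-at-a-time test it relies on. This illustrates the technique, not the patch itself; it assumes 64-bit words, the 0..3 two-bits-per-block state encoding, and the helper name first_field_matching is mine:

#include <stdint.h>

#define ODD_MASK 0x5555555555555555ULL	/* low bit of every 2-bit field */

/* Return the index (0..31) of the first 2-bit field in "word" whose
 * value equals "state", or -1 if none.  Fields are numbered from the
 * least significant end. */
static int first_field_matching(uint64_t word, unsigned int state)
{
	/* Replicate "state" into every 2-bit field:
	 * 0 -> 0000..., 1 -> 0101..., 2 -> 1010..., 3 -> 1111... */
	uint64_t pattern = state * ODD_MASK;
	uint64_t diff = word ^ pattern;		/* a zero field here is a match */
	/* A 2-bit field of diff is zero iff both of its bits are zero:
	 * OR each field's high bit down onto its low bit, then invert. */
	uint64_t hit = ~(diff | (diff >> 1)) & ODD_MASK;

	if (hit == 0)
		return -1;
	return __builtin_ctzll(hit) / 2;	/* bit position -> field index */
}

One comparison rejects 32 blocks at a time, which is where the win over a byte-at-a-time scan comes from.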
I did a few quick tests on my roth cluster using the upstream kernel and this faster bitfit. It seems to work okay. I did some basic copies, looked at the bitmaps, then ran gfs2_fsck, which found no problems. I also ran one of my "hell" tests on it and it passed.
This version passed all my "hell" tests with flying colors, so I posted it to cluster-devel for inclusion in the nmw git kernel tree.
Created attachment 297459 [details]
Second attempt at a patch

This version is much faster thanks to Steve Whitehouse's suggestions. It is typically 6 times faster than today's algorithm on my x86_64 test box, depending on the metadata/data fragmentation. In cases of extreme fragmentation, this algorithm is up to 22 times faster. Here are performance numbers from a few consecutive runs of my test program:

[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 91
tdiff[0] = 972208 - 1.000000 times today's algorithm
tdiff[1] = 1133521 - 0.857689 times today's algorithm
tdiff[2] = 854765 - 1.137398 times today's algorithm
tdiff[3] = 354196 - 2.744831 times today's algorithm
tdiff[4] = 194574 - 4.996598 times today's algorithm
tdiff[5] = 139421 - 6.973182 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 116
tdiff[0] = 823955 - 1.000000 times today's algorithm
tdiff[1] = 1133362 - 0.727001 times today's algorithm
tdiff[2] = 709204 - 1.161803 times today's algorithm
tdiff[3] = 324066 - 2.542553 times today's algorithm
tdiff[4] = 199325 - 4.133726 times today's algorithm
tdiff[5] = 136314 - 6.044537 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 85
tdiff[0] = 1002704 - 1.000000 times today's algorithm
tdiff[1] = 1130705 - 0.886795 times today's algorithm
tdiff[2] = 884881 - 1.133151 times today's algorithm
tdiff[3] = 366433 - 2.736391 times today's algorithm
tdiff[4] = 199193 - 5.033832 times today's algorithm
tdiff[5] = 138418 - 7.244029 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 28
tdiff[0] = 2295333 - 1.000000 times today's algorithm
tdiff[1] = 1129925 - 2.031403 times today's algorithm
tdiff[2] = 2203636 - 1.041612 times today's algorithm
tdiff[3] = 441685 - 5.196764 times today's algorithm
tdiff[4] = 198743 - 11.549253 times today's algorithm
tdiff[5] = 138429 - 16.581301 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 49
tdiff[0] = 1429324 - 1.000000 times today's algorithm
tdiff[1] = 1131068 - 1.263694 times today's algorithm
tdiff[2] = 1309161 - 1.091786 times today's algorithm
tdiff[3] = 287547 - 4.970749 times today's algorithm
tdiff[4] = 191206 - 7.475309 times today's algorithm
tdiff[5] = 138709 - 10.304480 times today's algorithm
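For context, the trandfactor value printed above controls how fragmented the synthetic bitmap is. A hedged guess at how such a bitmap might be built (the helper name, the state values chosen, and the use of rand() are my assumptions, not the actual bmapspeed code):

#include <stdlib.h>

/* Hypothetical sketch: fill a zeroed 2-bits-per-block bitmap with runs
 * of "used" blocks, flipping the run's state roughly every trandfactor
 * blocks, so a smaller trandfactor means heavier metadata/data
 * fragmentation.  Caller must pass a zero-filled buffer and may seed
 * rand() beforehand. */
static void fragment_bitmap(unsigned char *bmap, unsigned int len,
                            int trandfactor)
{
	unsigned int blk, nblocks = len * 4;	/* 4 blocks per byte */
	unsigned char state = 1;		/* start in a "data" run */

	for (blk = 0; blk < nblocks; blk++) {
		if (rand() % trandfactor == 0)	/* occasionally flip run type */
			state = (state == 1) ? 3 : 1;
		bmap[blk / 4] |= state << ((blk % 4) * 2);
	}
}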
Created attachment 297460 [details] Test Program I'm using This is the test program I'm using to gather performance numbers and test algorithms.
Created attachment 297533 [details]
Third attempt at a patch

This version has a few more of Steve's suggestions. They turned out not to make much of a performance difference, except for one: I added a prefetch to the section that parses an unsigned long at a time, and performance improved by about 24% over the version without prefetch. The kernel's prefetch isn't available in userspace, so that made the timing a bit trickier to test. What I did was use a toggle to send half of the bitfit calls through the algorithm with prefetch and half through the one without, then did some big copy operations to see how each performed. This version averages about 8 times faster than the one we're using today, and is quite often 10-15 times faster. The algorithm might still have room for improvement, especially by experimenting more with prefetch, but I need to draw the line somewhere or we'll never ship it.
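A minimal sketch of the idea, assuming the common case of searching for a free (00) block: in the kernel the hint would come from prefetch() in <linux/prefetch.h>, while userspace tests can approximate it with GCC's __builtin_prefetch(). The prefetch distance and the helper name skip_full_words are illustrative guesses, not the committed code:

#include <stdint.h>

#define ODD_BITS 0x5555555555555555ULL	/* low bit of every 2-bit field */

/* Skip whole 64-bit words that contain no free (00) block, issuing a
 * read prefetch roughly one cache line ahead of the scan. */
static const uint64_t *skip_full_words(const uint64_t *ptr,
                                       const uint64_t *end)
{
	while (ptr < end) {
		__builtin_prefetch(ptr + 8, 0, 0);	/* 64 bytes ahead, read-only */
		uint64_t w = *ptr;
		/* Some field is 00 iff (w | w >> 1) has a clear even bit. */
		if (((w | (w >> 1)) & ODD_BITS) != ODD_BITS)
			break;
		ptr++;
	}
	return ptr;
}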
Created attachment 297654 [details] RHEL52 patch Here is the RHEL5 patch I posted to rhkernel-list.
The patch now appears in the upstream gfs2 git kernel. The RHEL52 patch was tested on my roth cluster and posted to rhkernel-list. Reassigning to Don Zickus for inclusion in the kernel.
This is fixed in kernel-2.6.18-86.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html