+++ This bug was initially created as a clone of Bug #435456 +++

I think we should port the improved GFS2 bitfit algorithm back to GFS, so I cloned the GFS2 bug. Should be an easy port. Optimise the loop in gfs2_bitfit by removing tests which are not required from the fast path and applying the applicable techniques from the following tutorial: http://lwn.net/Articles/250967/

-- Additional comment from rkenna on 2008-02-29 09:41 EST --

Needed post-beta.

-- Additional comment from rkenna on 2008-02-29 09:42 EST --

A bit more on this: the optimization is more than a "nice to have", as the performance issue is related to a key target area.

-- Additional comment from rpeterso on 2008-03-04 17:28 EST --

I wrote a tiny C program that allows me to experiment with this. I can basically plug in gfs2_bitfit implementations and test them in userspace. My first finding is that the current method from my 25 Jan 2008 optimization, which can be seen here:

http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=5fdc2eeb5d1d3800367f471690b01fcd1fd5b963

is approximately six times faster than it was before 25 Jan, at least on my x86_64 platform. It would be interesting to try this program on other architectures like ppc. From here it's easy to experiment to see what works more efficiently.

-- Additional comment from swhiteho on 2008-03-05 09:50 EST --

Sounds good. I know this might be a silly question, but did you use the same -O level as the kernel when doing your user-mode tests?

-- Additional comment from rpeterso on 2008-03-07 18:40 EST --

Created an attachment (id=297261)
First go at a patch

This patch is untested in a real gfs2 file system, but I have tested it in userspace using a test program. I've studied this problem for some time now and here's what I've found: I ported the code to user-space so I could debug it better and do timing tests more easily using gdb and such. So I have one test program that tests all of the algorithms at once.
The performance of gfs2_bitfit of course varies greatly depending on how fragmented with metadata vs. data the gfs2 file system is. It's not right to optimize for either nearly-full or nearly-empty bitmaps, so I took a good look at Barry's specsfs test system and saw just how fragmented those bitmaps typically are at any given moment. I'm taking that as a good indication that customers' file systems will be similarly fragmented. Before I looked at that fragmentation, I was making bad assumptions and getting bad numbers. I tested six algorithms under conditions similar to Barry's:

1. The algorithm we're using today.
2. The algorithm we were using prior to my 25 Jan 2008 optimization.
3. Optimized version of #1.
4. More optimized variation of #3.
5. New word-alignment algorithm.
6. Method that tries to use memcmp of large chunks for shortcuts.

Algorithm 6 performed so horribly that I threw it out altogether and never looked back; I was just curious. Algorithm 3 outperformed all the others on a bitmap that contains mostly used blocks and rare metadata/data fragmentation. Algorithm 4 performed better than 3 with metadata fragmentation. Algorithm 5 outperformed all the others when there were periodic metadata allocations typical of the specsfs file systems.

My test program searches a 50MB bitmap for a "free" block that doesn't appear until 100 bytes (400 blocks) from the end, starting with a "goal" block that's 25 percent into the bitmap. This is a typical run on my x86_64 box; the numbers are in microseconds:

1. 679633
2. 972714
3. 685488
4. 242310
5. 169487

So for this performance run, algorithm 5 performed 4 times faster than the one we're using today. On my 32-bit test box I tested 20MB bitmaps and got similar ratios:

1. 1997501
2. 1351896
3. 1603124
4. 773022
5. 477915

Again, algorithm 5 performed almost 4 times faster than 1.
It's interesting to note that the algorithm we're using today actually performed worse than before the rewrite, on my 32-bit system under some types of metadata/data fragmentation.

-- Additional comment from rpeterso on 2008-03-07 23:26 EST --

I did a few quick tests on my roth cluster using the upstream kernel and this faster bitfit. It seems to work okay. I just did some basic copies, looked at the bitmaps, then ran gfs2_fsck, which saw no problems. I also ran one of my "hell" tests on it and it passed.

-- Additional comment from rpeterso on 2008-03-08 00:21 EST --

This version passed all my "hell" tests with flying colors, so I posted it to cluster-devel for inclusion in the nmw git tree kernel.

-- Additional comment from rpeterso on 2008-03-10 12:44 EST --

Created an attachment (id=297459)
Second attempt at a patch

This version is much faster thanks to Steve Whitehouse's suggestions. It is typically 6 times faster than today's algorithm on my x86_64 test box, depending on the metadata/data fragmentation. In cases of extreme fragmentation, it is up to 22 times faster.
Here are performance numbers from a few consecutive runs of my test program:

[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 91
tdiff[0] = 972208 - 1.000000 times today's algorithm
tdiff[1] = 1133521 - 0.857689 times today's algorithm
tdiff[2] = 854765 - 1.137398 times today's algorithm
tdiff[3] = 354196 - 2.744831 times today's algorithm
tdiff[4] = 194574 - 4.996598 times today's algorithm
tdiff[5] = 139421 - 6.973182 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 116
tdiff[0] = 823955 - 1.000000 times today's algorithm
tdiff[1] = 1133362 - 0.727001 times today's algorithm
tdiff[2] = 709204 - 1.161803 times today's algorithm
tdiff[3] = 324066 - 2.542553 times today's algorithm
tdiff[4] = 199325 - 4.133726 times today's algorithm
tdiff[5] = 136314 - 6.044537 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 85
tdiff[0] = 1002704 - 1.000000 times today's algorithm
tdiff[1] = 1130705 - 0.886795 times today's algorithm
tdiff[2] = 884881 - 1.133151 times today's algorithm
tdiff[3] = 366433 - 2.736391 times today's algorithm
tdiff[4] = 199193 - 5.033832 times today's algorithm
tdiff[5] = 138418 - 7.244029 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 28
tdiff[0] = 2295333 - 1.000000 times today's algorithm
tdiff[1] = 1129925 - 2.031403 times today's algorithm
tdiff[2] = 2203636 - 1.041612 times today's algorithm
tdiff[3] = 441685 - 5.196764 times today's algorithm
tdiff[4] = 198743 - 11.549253 times today's algorithm
tdiff[5] = 138429 - 16.581301 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 49
tdiff[0] = 1429324 - 1.000000 times today's algorithm
tdiff[1] = 1131068 - 1.263694 times today's algorithm
tdiff[2] = 1309161 - 1.091786 times today's algorithm
tdiff[3] = 287547 - 4.970749 times today's algorithm
tdiff[4] = 191206 - 7.475309 times today's algorithm
tdiff[5] = 138709 - 10.304480 times today's algorithm

-- Additional comment from rpeterso on 2008-03-10 12:46 EST --

Created an attachment (id=297460)
Test program I'm using

This is the test program I'm using to gather performance numbers and test algorithms.

-- Additional comment from rpeterso on 2008-03-10 19:05 EST --

Created an attachment (id=297533)
Third attempt at a patch

This version has a few more of Steve's suggestions. They turned out not to make much performance difference, except for one: I added a prefetch to the section that parses an unsigned long at a time, and performance increased about 24% over the version without prefetch. The prefetch statement isn't available in userspace, so that made the timing a bit trickier to test. What I did was use a toggle to send half of the bitfits through the algorithm with the prefetch and half through the one without, then I did some big copy operations to see how it performed. This version averages about 8 times faster than the one we're using today, and it's quite often 10-15 times faster. The algorithm might still have room for improvement, especially by experimenting more with prefetch, but I need to draw the line somewhere or we'll never ship it.

-- Additional comment from rpeterso on 2008-03-11 12:48 EST --

I did a crude, unscientific comparison test of before and after with the RHEL5.2 code, using my roth cluster: two nearly identical x86_64 boxes, one using the old gfs2_bitfit code and one using the revised code, both using clvmd and the dlm protocol, against the same SAN storage device. I copied 16GB of mp3 files of varying sizes:

old gfs2_bitfit: 8m18s
new gfs2_bitfit: 6m45s

So that shows a 19 percent clock-time performance improvement. Next I'll try the single-node AOL test before and after.

-- Additional comment from rpeterso on 2008-03-11 14:20 EST --

Created an attachment (id=297654)
RHEL52 patch

Here is the RHEL5 patch I posted to rhkernel-list.

-- Additional comment from rpeterso on 2008-03-11 14:23 EST --

The patch now appears in the upstream git kernel for gfs2.
The RHEL52 patch was tested on my roth cluster and posted to rhkernel-list. Reassigning to Don Zickus for inclusion in the kernel.

-- Additional comment from rpeterso on 2008-03-12 10:17 EST --

FYI: AOL tests show a 10 percent increase in performance when maximum-size RGs (2GB) are used with a 1K block size. The average time to do one iteration went from 65433586 to 58877961 over 272 iterations. I'm running the default RG size with 1K blocks now to see how it looks. I've already done the old bitfit, and the new algorithm is running now. I have a spreadsheet with a chart, but I might as well combine them.

-- Additional comment from rpeterso on 2008-03-12 12:28 EST --

Created an attachment (id=297802)
The numbers and chart

This file is an OpenOffice spreadsheet showing and charting the numbers. With a 2GB RG size, we're seeing a 10% increase in performance. With the default RG size, we're seeing a 6% increase. Still, something is better than nothing. It goes to show that the benchp13 test exercises a lot more code than just the bitmap code.

-- Additional comment from swhiteho on 2008-03-12 12:38 EST --

Marking comments which mention the customer private. The results in comment #15 do bear out what we have suspected, though: with larger RGs, the search of the individual RG is more important, and with larger numbers of RGs, it's probably the lookup of the RG itself that's taking up the time. I'm going to open an RFE bug about speeding up the RG lookup during block allocation so that we don't forget to do it at some stage.

-- Additional comment from dzickus on 2008-03-19 12:24 EST --

In kernel-2.6.18-86.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

-- Additional comment from errata-xmlrpc on 2008-03-26 16:38 EST --

Bug report changed to ON_QA status by Errata System. A QE request has been submitted for advisory RHBA-2008:8013-04 http://errata.devel.redhat.com/errata/show/6625
Created attachment 306122 [details] simzone output before my changes This is from a run of "simzone" without my bitfit changes. This is not a very good test because it was run on a slow machine, on a hard drive partition that is on the same disk as the root partition. Still, it gives us an idea of where we stand before the changes. The test was run on an upstream kernel from the gfs2 nmw git tree, in lock_nolock mode.
Created attachment 306123 [details] simzone output after my changes This is from a simzone run with the same physical conditions, but with my new bitfit algorithm in place.
Comparing some of the numbers, you can see most of the IO rates go up:

[bob@ganesha ~/bugzilla]$ grep "recsize 128" simzone.without.bitfit.txt
filesize 128 KB recsize 128 KB IOs2do 1 iwrite 250980 rewrite 544680 read 14222222 reread 21333333
filesize 256 KB recsize 128 KB IOs2do 2 iwrite 250244 rewrite 624390 read 18285714 reread 23272727
filesize 512 KB recsize 128 KB IOs2do 4 iwrite 257286 rewrite 707182 read 21333333 reread 24380952
filesize 1024 KB recsize 128 KB IOs2do 8 iwrite 253590 rewrite 737221 read 23272727 reread 24380952
filesize 2048 KB recsize 128 KB IOs2do 16 iwrite 267572 rewrite 724442 read 24094117 reread 24975609
filesize 4096 KB recsize 128 KB IOs2do 32 iwrite 216845 rewrite 739884 read 24380952 reread 24824242
filesize 8192 KB recsize 128 KB IOs2do 64 iwrite 270363 rewrite 748127 read 24526946 reread 24824242
filesize 16384 KB recsize 128 KB IOs2do 128 iwrite 272286 rewrite 755475 read 24786686 reread 24600600
filesize 32768 KB recsize 128 KB IOs2do 256 iwrite 250966 rewrite 680229 read 24749244 reread 25148119
filesize 65536 KB recsize 128 KB IOs2do 512 iwrite 252593 rewrite 633051 read 24591369 reread 24956587
filesize 131072 KB recsize 128 KB IOs2do 1024 iwrite 218476 rewrite 601708 read 24966095 reread 25080750
filesize 262144 KB recsize 128 KB IOs2do 2048 iwrite 183480 rewrite 582399 read 24728233 reread 24918631
filesize 524288 KB recsize 128 KB IOs2do 4096 iwrite 204610 rewrite 105852 read 25314470 reread 25285170
filesize 1048576 KB recsize 128 KB IOs2do 8192 iwrite 90809 rewrite 141231 read 25056776 reread 25139076

[bob@ganesha ~/bugzilla]$ grep "recsize 128" simzone.with.bitfit.txt
filesize 128 KB recsize 128 KB IOs2do 1 iwrite 288939 rewrite 551724 read 15999999 reread 21333333
filesize 256 KB recsize 128 KB IOs2do 2 iwrite 276457 rewrite 635235 read 19692307 reread 23272727
filesize 512 KB recsize 128 KB IOs2do 4 iwrite 265147 rewrite 693766 read 21333333 reread 24380952
filesize 1024 KB recsize 128 KB IOs2do 8 iwrite 260096 rewrite 740419 read 23272727 reread 24380952
filesize 2048 KB recsize 128 KB IOs2do 16 iwrite 259503 rewrite 719353 read 24094117 reread 24975609
filesize 4096 KB recsize 128 KB IOs2do 32 iwrite 209257 rewrite 741894 read 24674698 reread 24975609
filesize 8192 KB recsize 128 KB IOs2do 64 iwrite 248347 rewrite 761267 read 24749244 reread 25051987
filesize 16384 KB recsize 128 KB IOs2do 128 iwrite 252061 rewrite 752042 read 24824242 reread 24899696
filesize 32768 KB recsize 128 KB IOs2do 256 iwrite 241896 rewrite 699140 read 24767951 reread 24994660
filesize 65536 KB recsize 128 KB IOs2do 512 iwrite 224253 rewrite 632843 read 24693293 reread 25128834
filesize 131072 KB recsize 128 KB IOs2do 1024 iwrite 216140 rewrite 503221 read 25119202 reread 24899696
filesize 262144 KB recsize 128 KB IOs2do 2048 iwrite 217153 rewrite 588882 read 24982750 reread 24800756
filesize 524288 KB recsize 128 KB IOs2do 4096 iwrite 195343 rewrite 224257 read 25067559 reread 24966095
filesize 1048576 KB recsize 128 KB IOs2do 8192 iwrite 153666 rewrite 206219 read 25037033 reread 25125221

So initial write performance is 15% to 69% faster in some cases. Other cases seem to show an odd performance loss, but I attribute that to the hardware and conditions. I should redo this test using a separate hard drive and better hardware; perhaps when I port it to RHEL I can do that. I used an old trin node because I wanted to test it with upstream code first, and I didn't want to bring any of my good RHEL testing machines over to the upstream kernel, openais, cluster code, etc.
Created attachment 306127 [details] Upstream version of the proposed patch This is the patch to master that I tested.
This fix was delayed a bit due to bug #448866. That issue is now known and the fix incorporated into a newer patch, which I pushed to the master, STABLE2 and RHEL5 git repositories. Changing status to MODIFIED.
gfs2-utils-2.03.05-1.fc9, rgmanager-2.03.05-1.fc9, and cman-2.03.05-1.fc9 have been pushed to the Fedora 9 stable repository. If problems still persist, please make note of it in this bug report.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2009-0132.html