Bug 435456 - GFS2: Optimise loop in gfs2_bitfit
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assigned To: Don Zickus
QA Contact: GFS Bugs
Depends On:
Reported: 2008-02-29 09:26 EST by Steve Whitehouse
Modified: 2008-05-21 11:11 EDT (History)
CC: 5 users

Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Last Closed: 2008-05-21 11:11:07 EDT

Attachments:
- First go at a patch (2.93 KB, patch), 2008-03-07 18:40 EST, Robert Peterson
- Second attempt at a patch (2.92 KB, patch), 2008-03-10 12:44 EDT, Robert Peterson
- Test Program I'm using (10.87 KB, text/plain), 2008-03-10 12:46 EDT, Robert Peterson
- Third attempt at a patch (3.56 KB, patch), 2008-03-10 19:05 EDT, Robert Peterson
- RHEL52 patch (3.59 KB, patch), 2008-03-11 14:20 EDT, Robert Peterson
- The numbers and chart (183.31 KB, application/octet-stream), 2008-03-12 12:28 EDT, Robert Peterson

Description Steve Whitehouse 2008-02-29 09:26:21 EST
Optimise the loop in gfs2_bitfit by removing unneeded tests from the fast path
and by applying the applicable techniques from the following tutorial:

Comment 2 Rob Kenna 2008-02-29 09:42:27 EST
A bit more on this.  The optimization is more than a "nice to have", as the
performance issue is related to a key target area.
Comment 3 Robert Peterson 2008-03-04 17:28:41 EST
I wrote a tiny little C program that allows me to experiment with this.
I can basically plug in gfs2_bitfit implementations and test them in
userspace.  My first finding is that the current method from my 
25 Jan 2008 optimization, which can be seen here:
is approximately six times faster than it was before 25 Jan, at least
on my x86_64 platform.  It would be interesting to try this program on
other architectures like ppc.  From here it's easy to experiment to see
what works more efficiently.
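The plug-in harness described above might look roughly like this minimal
sketch (function and type names are illustrative, not from the actual test
program): each candidate implementation shares one signature, so candidates
can be swapped in and timed back to back over the same bitmap.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Common signature so any candidate gfs2_bitfit implementation can be
 * plugged into the same timing loop (hypothetical names). */
typedef size_t (*bitfit_fn)(const unsigned char *map, size_t nblocks,
                            size_t goal);

/* Reference candidate: scan one 2-bit field at a time (00 = free). */
static size_t bitfit_bytewise(const unsigned char *map, size_t nblocks,
                              size_t goal)
{
    for (size_t blk = goal; blk < nblocks; blk++)
        if (((map[blk / 4] >> ((blk % 4) * 2)) & 3) == 0)
            return blk;
    return (size_t)-1;
}

/* Time `runs` searches with one candidate; returns elapsed microseconds. */
static long time_bitfit(bitfit_fn fn, const unsigned char *map,
                        size_t nblocks, size_t goal, int runs)
{
    clock_t t0 = clock();
    for (int i = 0; i < runs; i++)
        fn(map, nblocks, goal);
    return (long)((clock() - t0) * 1000000L / CLOCKS_PER_SEC);
}
```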
Comment 4 Steve Whitehouse 2008-03-05 09:50:58 EST
Sounds good. I know this might be a silly question, but did you use the same -O
level as the kernel when doing your user mode tests?
Comment 5 Robert Peterson 2008-03-07 18:40:37 EST
Created attachment 297261 [details]
First go at a patch

This patch is untested in a real gfs2 file system, but I have tested
it in userspace using a test program.  I've studied this problem for
some time now and here's what I've found:

I ported the code to userspace so I could debug it better and run
timing tests more easily, using gdb and such.  So I have one test
program that tests all of the algorithms at once.

The performance of gfs2_bitfit of course varies greatly depending on
how fragmented with metadata vs. data the gfs2 file system is.
It's not right to optimize for either nearly-full or nearly-empty
bitmaps, so I took a good look at Barry's specsfs test system and saw 
just how fragmented those bitmaps typically are at any given moment.
I'm taking that as a good indication that customers' file systems
will be similarly fragmented.  Before looking at this fragmentation,
I was making bad assumptions and getting misleading numbers.

I tested six algorithms under conditions similar to Barry's.  They are:

1. The algorithm we're using today.
2. The algorithm we were using prior to my 25 Jan 2008 optimization.
3. Optimized version of #1
4. More optimized variation of #3
5. New word alignment algorithm
6. Method that tries to use memcmp of large chunks for shortcuts

Algorithm 6 performed so horribly that I threw it out altogether
and never looked back; I was just curious.
Algorithm 3 outperformed all the others on a bitmap where most
blocks are in use and metadata/data fragmentation is rare.
Algorithm 4 performed better than 3 with metadata fragmentation.
Algorithm 5 outperformed all the others when there were periodic
metadata allocations typical of the specsfs file systems.
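The "word alignment" idea behind algorithm 5 can be sketched roughly as
follows (a hypothetical reconstruction, not the actual patch): GFS2 bitmaps
store 2 bits per block, with 00 meaning free, so an unsigned long whose 2-bit
fields all have at least one bit set contains no free block and can be
skipped with a single comparison.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define BLKST_FREE 0                 /* 2-bit block state 00 = free */
#define LOW_BITS (~0UL / 3)          /* low bit of every 2-bit field: 0x5555... */

/* Return the index of the first free block at or after `goal`,
 * or (size_t)-1 if none is found.  Illustrative sketch only. */
static size_t bitfit_wordwise(const unsigned long *map, size_t nblocks,
                              size_t goal)
{
    const size_t per_word = sizeof(unsigned long) * CHAR_BIT / 2;
    size_t nwords = (nblocks + per_word - 1) / per_word;

    for (size_t w = goal / per_word; w < nwords; w++) {
        unsigned long word = map[w];
        /* (word | word >> 1) has the low bit of a field set iff that
         * field is non-zero; if every field is non-zero, no block in
         * this word is free, so skip the whole word. */
        if (((word | (word >> 1)) & LOW_BITS) == LOW_BITS)
            continue;
        /* Slow path: scan the 2-bit fields of this one word. */
        size_t first = w * per_word;
        size_t start = (first > goal) ? first : goal;
        for (size_t blk = start; blk < first + per_word && blk < nblocks; blk++)
            if (((word >> ((blk - first) * 2)) & 3) == BLKST_FREE)
                return blk;
    }
    return (size_t)-1;
}
```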

My test program searches a 50MB bitmap for a "free" block that
doesn't appear until 100 bytes (400 blocks) from the end,
starting with a "goal" block that's 25 percent into the bitmap.
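The setup just described could be constructed along these lines (a guess at
what the test program does, not its actual code, with the sizes taken from
the description above):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build a bitmap of `nbytes` bytes, 2 bits per block, with every block
 * marked in use except one free block at index `free_blk`.  For the run
 * described: nbytes = 50MB and free_blk is 400 blocks from the end,
 * searched from a goal 25 percent into the bitmap. */
static unsigned char *make_bitmap(size_t nbytes, size_t free_blk)
{
    unsigned char *map = malloc(nbytes);
    if (!map)
        return NULL;
    memset(map, 0xff, nbytes);           /* every 2-bit field = 11 (used) */
    /* Clear the two bits of the single free block (00 = free). */
    map[free_blk / 4] &= (unsigned char)~(3u << ((free_blk % 4) * 2));
    return map;
}
```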

This is a typical run on my x86_64 box.  The numbers are in

1. 679633
2. 972714
3. 685488
4. 242310
5. 169487

So for this performance run, algorithm 5 performed 4 times faster
than the one we're using today.

On my 32-bit test box I tested 20MB bitmaps and got similar ratios:

1. 1997501
2. 1351896
3. 1603124
4. 773022
5. 477915

Again, algorithm 5 performed almost 4 times faster than 1.  It's
interesting to note that the algorithm we're using today actually
performed worse than before the rewrite, on my 32-bit system under
some types of metadata/data fragmentation.
Comment 6 Robert Peterson 2008-03-07 23:26:52 EST
I did a few quick tests on my roth cluster using the upstream kernel
and this faster bitfit.  It seems to work okay.  I just did some basic
copies, looked at the bitmaps, then did gfs2_fsck, which saw no problems.
I also ran one of my "hell" tests on it and it passed.
Comment 7 Robert Peterson 2008-03-08 00:21:00 EST
This version passed all my "hell" tests with flying colors so I posted
it to cluster-devel for inclusion in the nmw git tree kernel.
Comment 8 Robert Peterson 2008-03-10 12:44:51 EDT
Created attachment 297459 [details]
Second attempt at a patch

This version is much faster thanks to Steve Whitehouse's suggestions.
This version is typically 6 times faster than today's algorithm on my
x86_64 test box, depending on the metadata / data fragmentation.
In cases of extreme fragmentation, this algorithm is up to 22 times faster.
Here are performance numbers from a few consecutive runs of my test program:

[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 91
tdiff[0] =  972208   - 1.000000 times today's algorithm
tdiff[1] = 1133521   - 0.857689 times today's algorithm
tdiff[2] =  854765   - 1.137398 times today's algorithm
tdiff[3] =  354196   - 2.744831 times today's algorithm
tdiff[4] =  194574   - 4.996598 times today's algorithm
tdiff[5] =  139421   - 6.973182 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 116
tdiff[0] =  823955   - 1.000000 times today's algorithm
tdiff[1] = 1133362   - 0.727001 times today's algorithm
tdiff[2] =  709204   - 1.161803 times today's algorithm
tdiff[3] =  324066   - 2.542553 times today's algorithm
tdiff[4] =  199325   - 4.133726 times today's algorithm
tdiff[5] =  136314   - 6.044537 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 85
tdiff[0] = 1002704   - 1.000000 times today's algorithm
tdiff[1] = 1130705   - 0.886795 times today's algorithm
tdiff[2] =  884881   - 1.133151 times today's algorithm
tdiff[3] =  366433   - 2.736391 times today's algorithm
tdiff[4] =  199193   - 5.033832 times today's algorithm
tdiff[5] =  138418   - 7.244029 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 28
tdiff[0] = 2295333   - 1.000000 times today's algorithm
tdiff[1] = 1129925   - 2.031403 times today's algorithm
tdiff[2] = 2203636   - 1.041612 times today's algorithm
tdiff[3] =  441685   - 5.196764 times today's algorithm
tdiff[4] =  198743   - 11.549253 times today's algorithm
tdiff[5] =  138429   - 16.581301 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 49
tdiff[0] = 1429324   - 1.000000 times today's algorithm
tdiff[1] = 1131068   - 1.263694 times today's algorithm
tdiff[2] = 1309161   - 1.091786 times today's algorithm
tdiff[3] =  287547   - 4.970749 times today's algorithm
tdiff[4] =  191206   - 7.475309 times today's algorithm
tdiff[5] =  138709   - 10.304480 times today's algorithm
Comment 9 Robert Peterson 2008-03-10 12:46:36 EDT
Created attachment 297460 [details]
Test Program I'm using

This is the test program I'm using to gather performance numbers and
test algorithms.
Comment 10 Robert Peterson 2008-03-10 19:05:47 EDT
Created attachment 297533 [details]
Third attempt at a patch

This version has a few more of Steve's suggestions.  Most of them did
not make much performance difference, except for one: I added a
prefetch to the section that parses an unsigned long at a time, and
performance improved by about 24% over the version without prefetch.
The prefetch statement isn't available in userspace, so the timing was
a bit trickier to test.  What I did was use a toggle to send half of
the bitfits through the algorithm with prefetch and half through the
one without, then ran some big copy operations to see how each performed.
This version is averaging about 8 times faster than the one we're
using today, but it's quite often 10 - 15 times faster.
The algorithm might still have room for improvement, especially by
experimenting more with prefetch, but I need to draw the line somewhere
or we'll never ship it.
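The prefetch experiment can be sketched in userspace roughly like this
(hypothetical reconstruction; the kernel patch would use the kernel's
prefetch(), for which GCC's __builtin_prefetch() is the closest userspace
analogue): while testing word N, hint the CPU to start loading a word
several iterations ahead, so the memory fetch overlaps the bit tests.

```c
#include <assert.h>
#include <stddef.h>

/* Count words whose 2-bit fields are all non-zero (no free block),
 * prefetching ahead of the scan.  Illustrative sketch only. */
static size_t count_full_words(const unsigned long *map, size_t nwords)
{
    const unsigned long low_bits = ~0UL / 3;   /* 0x5555... */
    size_t full = 0;

    for (size_t w = 0; w < nwords; w++) {
        if (w + 8 < nwords)
            __builtin_prefetch(&map[w + 8]);   /* start fetching ahead */
        unsigned long word = map[w];
        /* All fields non-zero means no free 2-bit field in this word. */
        if (((word | (word >> 1)) & low_bits) == low_bits)
            full++;
    }
    return full;
}
```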
Comment 12 Robert Peterson 2008-03-11 14:20:26 EDT
Created attachment 297654 [details]
RHEL52 patch

Here is the RHEL5 patch I posted to rhkernel-list.
Comment 13 Robert Peterson 2008-03-11 14:23:40 EDT
Patch now appears in the upstream git kernel for gfs2.
The RHEL52 patch was tested on my roth cluster and posted to rhkernel-list.
Reassigning to Don Zickus for inclusion in the kernel.
Comment 17 Don Zickus 2008-03-19 12:24:37 EDT
in kernel-2.6.18-86.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 20 errata-xmlrpc 2008-05-21 11:11:07 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

