Bug 435456

Summary:          GFS2: Optimise loop in gfs2_bitfit
Product:          Red Hat Enterprise Linux 5
Component:        kernel
Version:          5.2
Hardware:         All
OS:               Linux
Status:           CLOSED ERRATA
Severity:         high
Priority:         high
Reporter:         Steve Whitehouse <swhiteho>
Assignee:         Don Zickus <dzickus>
QA Contact:       GFS Bugs <gfs-bugs>
CC:               adas, edamato, lwang, rpeterso, swhiteho
Target Milestone: rc
URL:              http://lwn.net/Articles/250967/
Fixed In Version: RHBA-2008-0314
Doc Type:         Bug Fix
Last Closed:      2008-05-21 15:11:07 UTC
Attachments (flags):
  First go at a patch (none)
  Second attempt at a patch (none)
  Test Program I'm using (none)
  Third attempt at a patch (none)
  RHEL52 patch (none)
  The numbers and chart (none)

Description Steve Whitehouse 2008-02-29 14:26:21 UTC
Optimise the loop in gfs2_bitfit by removing unnecessary tests from the
fast path and by applying the relevant techniques from the following article:

http://lwn.net/Articles/250967/
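
For context, GFS2 tracks the state of each filesystem block with a two-bit
field in its resource group bitmaps (four blocks per byte), and gfs2_bitfit
scans those bitmaps for the first block in a wanted state.  The straight
byte-at-a-time search being optimised here looks roughly like the following
userspace sketch (the constants mirror the kernel's GFS2_BIT_SIZE, GFS2_NBBY
and BFITNOENT definitions; the function itself is a simplified stand-in,
not the kernel code):

#include <stdint.h>

#define GFS2_BIT_SIZE  2               /* two bits of state per block */
#define GFS2_BIT_MASK  0x3
#define GFS2_NBBY      4               /* four blocks packed per byte */
#define BFITNOENT      ((uint32_t)~0)  /* no matching block found */

/* Naive reference search: test every two-bit field from the goal block
 * onward and return the first block whose state matches old_state. */
static uint32_t bitfit_naive(const uint8_t *buffer, unsigned int buflen,
                             uint32_t goal, uint8_t old_state)
{
    uint32_t blk;

    for (blk = goal; blk < buflen * GFS2_NBBY; blk++) {
        uint8_t byte = buffer[blk / GFS2_NBBY];
        unsigned int bit = (blk % GFS2_NBBY) * GFS2_BIT_SIZE;

        if (((byte >> bit) & GFS2_BIT_MASK) == old_state)
            return blk;
    }
    return BFITNOENT;
}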

Comment 2 Rob Kenna 2008-02-29 14:42:27 UTC
A bit more on this.  The optimization is more than a "nice to have", as the
performance issue is related to a key target area.

Comment 3 Robert Peterson 2008-03-04 22:28:41 UTC
I wrote a tiny little C program that allows me to experiment with this.
I can basically plug in gfs2_bitfit implementations and test them in
userspace.  My first finding is that the current method from my 
25 Jan 2008 optimization, which can be seen here:
http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=5fdc2eeb5d1d3800367f471690b01fcd1fd5b963
is approximately six times faster than it was before 25 Jan, at least
on my x86_64 platform.  It would be interesting to try this program on
other architectures like ppc.  From here it's easy to experiment to see
what works more efficiently.
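
The timing side of such a harness can be quite small.  A hypothetical
sketch (not the attached program) that runs candidate implementations,
passed as function pointers, over the same bitmap and reports microseconds:

#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

typedef uint32_t (*bitfit_fn)(const uint8_t *buffer, unsigned int buflen,
                              uint32_t goal, uint8_t old_state);

/* Time one candidate gfs2_bitfit implementation over a prepared bitmap
 * and print how long the search took. */
static void time_bitfit(const char *name, bitfit_fn fn,
                        const uint8_t *bitmap, unsigned int buflen,
                        uint32_t goal, uint8_t state)
{
    struct timeval t0, t1;
    uint32_t blk;
    long usec;

    gettimeofday(&t0, NULL);
    blk = fn(bitmap, buflen, goal, state);
    gettimeofday(&t1, NULL);
    usec = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
    printf("%-10s found block %u in %ld us\n", name, blk, usec);
}

The interesting part is pre-filling the bitmap to mimic realistic
metadata/data fragmentation, since that dominates the results.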


Comment 4 Steve Whitehouse 2008-03-05 14:50:58 UTC
Sounds good. I know this might be a silly question, but did you use the same -O
level as the kernel when doing your user mode tests?


Comment 5 Robert Peterson 2008-03-07 23:40:37 UTC
Created attachment 297261 [details]
First go at a patch

This patch is untested in a real gfs2 file system, but I have tested
it in userspace using a test program.  I've studied this problem for
some time now and here's what I've found:

I ported the code to user-space so I could debug it more easily and run
timing tests using gdb and similar tools.  So I have one test program
that tests all of the algorithms at once.

The performance of gfs2_bitfit of course varies greatly depending on
how fragmented with metadata vs. data the gfs2 file system is.
It's not right to optimize for either nearly-full or nearly-empty
bitmaps, so I took a good look at Barry's specsfs test system and saw 
just how fragmented those bitmaps typically are at any given moment.
I'm taking that as a good indication that customers' file systems
will be similarly fragmented.  Before looking at this fragmentation,
I had made bad assumptions and gotten misleading numbers because of it.

I tested six algorithms under conditions similar to Barry's.  They are:

1. The algorithm we're using today.
2. The algorithm we were using prior to my 25 Jan 2008 optimization.
3. Optimized version of #1
4. More optimized variation of #3
5. New word alignment algorithm
6. Method that tries to use memcmp of large chunks for shortcuts

Algorithm 6 performed so poorly that I threw it out altogether and
never looked back; I had included it only out of curiosity.
Algorithm 3 outperformed all the others on a bitmap that is mostly
used blocks with rare metadata/data fragmentation.
Algorithm 4 performed better than 3 with metadata fragmentation.
Algorithm 5 outperformed all the others when there were periodic
metadata allocations typical of the specsfs file systems.
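
The idea behind algorithm 5, sketched below under the same userspace
assumptions as before (this is a condensed illustration, not the posted
patch): scan byte-by-byte only until the pointer reaches an unsigned long
boundary, then skip whole words that provably contain no block in the
wanted state.  When searching for a free block (state 00), any word
matching the 0x55... pattern on the low bit of every two-bit field can
be skipped outright:

#include <stdint.h>

#define GFS2_BIT_SIZE  2
#define GFS2_BIT_MASK  0x3
#define GFS2_NBBY      4
#define BFITNOENT      ((uint32_t)~0)

/* Word-alignment search for a FREE (00) block.  If every two-bit field
 * in an unsigned long has its low bit set (the 0x55... pattern), that
 * word cannot contain a 00 field, so whole words are skipped at once. */
static uint32_t bitfit_aligned(const uint8_t *buffer, unsigned int buflen,
                               uint32_t goal)
{
    const unsigned long lbitmask = ~0UL / 3;   /* 0x5555... */
    const uint8_t *byte = buffer + goal / GFS2_NBBY;
    const uint8_t *end = buffer + buflen;
    unsigned int bit = (goal % GFS2_NBBY) * GFS2_BIT_SIZE;

    while (byte < end) {
        /* On an unsigned long boundary, fast-forward over words that
         * hold no free blocks. */
        if (bit == 0 && ((uintptr_t)byte % sizeof(unsigned long)) == 0) {
            const unsigned long *plong = (const unsigned long *)byte;

            while ((const uint8_t *)(plong + 1) <= end &&
                   (*plong & lbitmask) == lbitmask)
                plong++;
            byte = (const uint8_t *)plong;
            if (byte >= end)
                break;
        }
        if (((*byte >> bit) & GFS2_BIT_MASK) == 0)  /* GFS2_BLKST_FREE */
            return (byte - buffer) * GFS2_NBBY + bit / GFS2_BIT_SIZE;
        bit += GFS2_BIT_SIZE;
        if (bit >= GFS2_NBBY * GFS2_BIT_SIZE) {
            bit = 0;
            byte++;
        }
    }
    return BFITNOENT;
}

The word loop is why fragmentation matters so much: long runs of fully
allocated bitmap collapse into one compare per unsigned long.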

My test program searches a 50MB bitmap for a "free" block that
doesn't appear until 100 bytes (400 blocks, at four blocks per byte)
from the end, starting with a "goal" block that's 25 percent into
the bitmap.

This is a typical run on my x86_64 box.  The numbers are in
microseconds:

1. 679633
2. 972714
3. 685488
4. 242310
5. 169487

So for this performance run, algorithm 5 performed 4 times faster
than the one we're using today.

On my 32-bit test box I tested 20MB bitmaps and got similar ratios:

1. 1997501
2. 1351896
3. 1603124
4. 773022
5. 477915

Again, algorithm 5 ran almost 4 times faster than algorithm 1.  It's
interesting to note that the algorithm we're using today actually
performed worse on my 32-bit system than the pre-rewrite version did,
under some types of metadata/data fragmentation.

Comment 6 Robert Peterson 2008-03-08 04:26:52 UTC
I did a few quick tests on my roth cluster using the upstream kernel
and this faster bitfit.  It seems to work okay.  I just did some basic
copies, looked at the bitmaps, then did gfs2_fsck, which saw no problems.
I also ran one of my "hell" tests on it and it passed.


Comment 7 Robert Peterson 2008-03-08 05:21:00 UTC
This version passed all my "hell" tests with flying colors so I posted
it to cluster-devel for inclusion in the nmw git tree kernel.


Comment 8 Robert Peterson 2008-03-10 16:44:51 UTC
Created attachment 297459 [details]
Second attempt at a patch

This version is much faster thanks to Steve Whitehouse's suggestions.
This version is typically 6 times faster than today's algorithm on my
x86_64 test box, depending on the metadata / data fragmentation.
In cases of extreme fragmentation, this algorithm is up to 22 times faster.
Here are performance numbers from a few consecutive runs of my test
program:

[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 91
tdiff[0] =  972208   - 1.000000 times today's algorithm
tdiff[1] = 1133521   - 0.857689 times today's algorithm
tdiff[2] =  854765   - 1.137398 times today's algorithm
tdiff[3] =  354196   - 2.744831 times today's algorithm
tdiff[4] =  194574   - 4.996598 times today's algorithm
tdiff[5] =  139421   - 6.973182 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 116
tdiff[0] =  823955   - 1.000000 times today's algorithm
tdiff[1] = 1133362   - 0.727001 times today's algorithm
tdiff[2] =  709204   - 1.161803 times today's algorithm
tdiff[3] =  324066   - 2.542553 times today's algorithm
tdiff[4] =  199325   - 4.133726 times today's algorithm
tdiff[5] =  136314   - 6.044537 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 85
tdiff[0] = 1002704   - 1.000000 times today's algorithm
tdiff[1] = 1130705   - 0.886795 times today's algorithm
tdiff[2] =  884881   - 1.133151 times today's algorithm
tdiff[3] =  366433   - 2.736391 times today's algorithm
tdiff[4] =  199193   - 5.033832 times today's algorithm
tdiff[5] =  138418   - 7.244029 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 28
tdiff[0] = 2295333   - 1.000000 times today's algorithm
tdiff[1] = 1129925   - 2.031403 times today's algorithm
tdiff[2] = 2203636   - 1.041612 times today's algorithm
tdiff[3] =  441685   - 5.196764 times today's algorithm
tdiff[4] =  198743   - 11.549253 times today's algorithm
tdiff[5] =  138429   - 16.581301 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 49
tdiff[0] = 1429324   - 1.000000 times today's algorithm
tdiff[1] = 1131068   - 1.263694 times today's algorithm
tdiff[2] = 1309161   - 1.091786 times today's algorithm
tdiff[3] =  287547   - 4.970749 times today's algorithm
tdiff[4] =  191206   - 7.475309 times today's algorithm
tdiff[5] =  138709   - 10.304480 times today's algorithm

Comment 9 Robert Peterson 2008-03-10 16:46:36 UTC
Created attachment 297460 [details]
Test Program I'm using

This is the test program I'm using to gather performance numbers and
test algorithms.

Comment 10 Robert Peterson 2008-03-10 23:05:47 UTC
Created attachment 297533 [details]
Third attempt at a patch

This version has a few more of Steve's suggestions.  Most of them turned
out not to make much performance difference, except for one: I added a
prefetch to the section that parses an unsigned long at a time, which made
it about 24 percent faster than without the prefetch.
The prefetch statement isn't available in userspace, so that made the
timing a bit trickier to test.  What I did was use a toggle to send half
of the bitfit calls through the algorithm with the prefetch and half
through the one without, then I did some big copy operations to see how
each performed.

This version is averaging about 8 times faster than the one we're
using today, but it's quite often 10 - 15 times faster.
The algorithm might still have room for improvement, especially by
experimenting more with prefetch, but I need to draw the line somewhere
or we'll never ship it.

Comment 12 Robert Peterson 2008-03-11 18:20:26 UTC
Created attachment 297654 [details]
RHEL52 patch

Here is the RHEL5 patch I posted to rhkernel-list.

Comment 13 Robert Peterson 2008-03-11 18:23:40 UTC
Patch now appears in the upstream git kernel for gfs2.
The RHEL52 patch was tested on my roth cluster and posted to rhkernel-list.
Reassigning to Don Zickus for inclusion in the kernel.


Comment 17 Don Zickus 2008-03-19 16:24:37 UTC
in kernel-2.6.18-86.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 20 errata-xmlrpc 2008-05-21 15:11:07 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html