Bug 446085 - RFE: GFS: Optimise loop in gfs_bitfit
Summary: RFE: GFS: Optimise loop in gfs_bitfit
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: gfs-kmod
Version: 5.3
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Assignee: Robert Peterson
QA Contact: GFS Bugs
URL: http://lwn.net/Articles/250967/
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-05-12 17:12 UTC by Robert Peterson
Modified: 2010-01-12 03:28 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 21:18:38 UTC
Target Upstream Version:
Embargoed:


Attachments
simzone output before my changes (21.00 KB, text/plain)
2008-05-20 14:08 UTC, Robert Peterson
simzone output after my changes (20.99 KB, text/plain)
2008-05-20 14:10 UTC, Robert Peterson
Upstream version of the proposed patch (4.34 KB, patch)
2008-05-20 14:27 UTC, Robert Peterson


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2009:0132 0 normal SHIPPED_LIVE gfs-kmod bug-fix update 2009-01-20 16:04:57 UTC

Description Robert Peterson 2008-05-12 17:12:02 UTC
+++ This bug was initially created as a clone of Bug #435456 +++
I think we should port the improved GFS2 bitfit algorithm back to GFS,
so I cloned the GFS2 bug.  Should be an easy port.

Optimise the loop in gfs2_bitfit by removing tests which are not required from
the fast path and applying the applicable techniques from the following tutorial:

http://lwn.net/Articles/250967/
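
For illustration only (this is not the actual patch), the word-at-a-time
idea looks roughly like the sketch below: a GFS-style bitmap is assumed,
with two bits per block where a value of 00 means "free", and block
numbering within a word is simplified.

#include <stddef.h>
#include <limits.h>

#define ENTRIES_PER_LONG (sizeof(unsigned long) * CHAR_BIT / 2)
#define LOW_BITS ((unsigned long)-1 / 3)  /* 0x5555...: low bit of each entry */

/* Sketch only: return the first free (00) block at or after "goal",
 * or (size_t)-1 if none is found.  The goal is rounded down to a
 * word boundary for simplicity. */
static size_t bitfit_free(const unsigned long *map, size_t nlongs, size_t goal)
{
	size_t i, b;

	for (i = goal / ENTRIES_PER_LONG; i < nlongs; i++) {
		unsigned long w = map[i];

		/* Fast path: if every 2-bit entry has at least one bit
		 * set, no block in this word is free, so one compare
		 * skips the whole word. */
		if (((w | (w >> 1)) & LOW_BITS) == LOW_BITS)
			continue;

		/* Slow path: locate the first 00 pair in this word. */
		for (b = 0; b < ENTRIES_PER_LONG; b++)
			if (((w >> (2 * b)) & 3UL) == 0)
				return i * ENTRIES_PER_LONG + b;
	}
	return (size_t)-1;
}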

-- Additional comment from rkenna on 2008-02-29 09:41 EST --
Needed post beta.

-- Additional comment from rkenna on 2008-02-29 09:42 EST --
A bit more on this.  The optimization is a bit more than a "nice to have": the
performance issue is related to a key target area.

-- Additional comment from rpeterso on 2008-03-04 17:28 EST --
I wrote a tiny little C program that allows me to experiment with this.
I can basically plug in gfs2_bitfit implementations and test them in
userspace.  My first finding is that the current method from my 
25 Jan 2008 optimization, which can be seen here:
http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=5fdc2eeb5d1d3800367f471690b01fcd1fd5b963
is approximately six times faster than it was before 25 Jan, at least
on my x86_64 platform.  It would be interesting to try this program on
other architectures like ppc.  From here it's easy to experiment to see
what works more efficiently.
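
As a rough idea of what such a userspace harness can look like (this is
not the actual test program from this bug; the names, sizes and fill
pattern are made up), candidate implementations can be plugged into a
function-pointer table and timed against the same bitmap:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef size_t (*bitfit_fn)(const unsigned char *map, size_t len, size_t goal);

/* Placeholder candidate: scan 2-bit entries one at a time for a 00 pair. */
static size_t naive_bitfit(const unsigned char *map, size_t len, size_t goal)
{
	size_t blk;

	for (blk = goal; blk < len * 4; blk++)
		if (((map[blk / 4] >> ((blk % 4) * 2)) & 3) == 0)
			return blk;
	return (size_t)-1;
}

static long long run_usec(bitfit_fn fn, const unsigned char *map,
			  size_t len, size_t goal)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	fn(map, len, goal);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) * 1000000LL +
	       (t1.tv_nsec - t0.tv_nsec) / 1000;
}

int main(void)
{
	const size_t len = 4 * 1024 * 1024;	/* arbitrary 4MB bitmap */
	unsigned char *map = malloc(len);
	bitfit_fn candidates[] = { naive_bitfit /* , optimized versions... */ };
	size_t i;

	if (!map)
		return 1;
	srand(12345);				/* deterministic "fragmentation" */
	for (i = 0; i < len; i++)		/* random mix of non-free states */
		map[i] = (unsigned char)(rand() | 0x55);

	for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++)
		printf("candidate %zu: %lld us\n", i,
		       run_usec(candidates[i], map, len, len)); /* goal 25% in */

	free(map);
	return 0;
}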


-- Additional comment from swhiteho on 2008-03-05 09:50 EST --
Sounds good. I know this might be a silly question, but did you use the same -O
level as the kernel when doing your user mode tests?


-- Additional comment from rpeterso on 2008-03-07 18:40 EST --
Created an attachment (id=297261)
First go at a patch

This patch is untested in a real gfs2 file system, but I have tested
it in userspace using a test program.  I've studied this problem for
some time now and here's what I've found:

I ported the code to user-space so I could debug it better and
do timing tests more easily using gdb and such.  So I have one test
program that tests all of the algorithms at once.

The performance of gfs2_bitfit of course varies greatly depending on
how fragmented the gfs2 file system is with metadata vs. data.
It's not right to optimize for either nearly-full or nearly-empty
bitmaps, so I took a good look at Barry's specsfs test system and saw
just how fragmented those bitmaps typically are at any given moment.
I'm taking that as a good indication that customers' file systems
will be similarly fragmented.  Before looking at this fragmentation,
I had made bad assumptions and was getting misleading numbers as a result.

I tested six algorithms under conditions similar to Barry's.  They are:

1. The algorithm we're using today.
2. The algorithm we were using prior to my 25 Jan 2008 optimization.
3. Optimized version of #1
4. More optimized variation of #3
5. New word alignment algorithm
6. Method that tries to use memcmp of large chunks for shortcuts

Algorithm 6 performed so horribly that I threw it out altogether
and never looked back.  I was just curious.
Algorithm 3 outperformed all the others on a bitmap that contains
mostly used blocks and little metadata/data fragmentation.
Algorithm 4 performed better than 3 with metadata fragmentation.
Algorithm 5 outperformed all the others when there were periodic
metadata allocations typical of the specsfs file systems.
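
As an aside, "word alignment" is not spelled out here, but it generally
means handling the odd leading bytes one at a time and then scanning
whole aligned unsigned longs; a hedged sketch of just the alignment
step (an illustration, not the posted patch):

#include <stdint.h>

/* Round a byte pointer up to the next unsigned long boundary so that
 * the hot loop only ever loads aligned whole words; the few bytes
 * before the returned pointer would still be checked one at a time. */
static const unsigned long *align_up(const unsigned char *byte_ptr)
{
	uintptr_t addr = (uintptr_t)byte_ptr;
	uintptr_t mask = sizeof(unsigned long) - 1;

	return (const unsigned long *)((addr + mask) & ~mask);
}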

My test program searches a 50MB bitmap for a "free" block that
doesn't appear until 100 bytes (400 blocks) from the end, starting
with a "goal" block that's 25 percent into the bitmap.
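
A minimal sketch of building that scenario in userspace (the 50MB size,
the 100-byte free tail and the 25 percent goal come straight from the
description above; the 0x55 "all used" fill pattern and the helper name
are assumptions):

#include <stdlib.h>
#include <string.h>

#define BITMAP_BYTES   (50u * 1024 * 1024)  /* 50MB bitmap, 2 bits per block */
#define FREE_TAIL      100u                 /* 100 bytes = 400 free blocks   */

/* Sketch: build the bitmap described above and report the goal block. */
static unsigned char *make_test_bitmap(size_t *goal_blk)
{
	unsigned char *map = malloc(BITMAP_BYTES);
	size_t total_blocks = (size_t)BITMAP_BYTES * 4;

	memset(map, 0x55, BITMAP_BYTES);             /* every entry = 01 (used) */
	memset(map + BITMAP_BYTES - FREE_TAIL, 0, FREE_TAIL);  /* free tail */

	*goal_blk = total_blocks / 4;                /* 25 percent into the map */
	return map;
}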

This is a typical run on my x86_64 box.  The numbers are in
microseconds:

1. 679633
2. 972714
3. 685488
4. 242310
5. 169487

So for this performance run, algorithm 5 performed 4 times faster
than the one we're using today.

On my 32-bit test box I tested 20MB bitmaps and got similar ratios:

1. 1997501
2. 1351896
3. 1603124
4. 773022
5. 477915

Again, algorithm 5 performed almost 4 times faster than 1.  It's
interesting to note that the algorithm we're using today actually
performed worse than before the rewrite, on my 32-bit system under
some types of metadata/data fragmentation.


-- Additional comment from rpeterso on 2008-03-07 23:26 EST --
I did a few quick tests on my roth cluster using the upstream kernel
and this faster bitfit.  It seems to work okay.  I just did some basic
copies, looked at the bitmaps, then did gfs2_fsck, which saw no problems.
I also ran one of my "hell" tests on it and it passed.


-- Additional comment from rpeterso on 2008-03-08 00:21 EST --
This version passed all my "hell" tests with flying colors so I posted
it to cluster-devel for inclusion in the nmw git tree kernel.


-- Additional comment from rpeterso on 2008-03-10 12:44 EST --
Created an attachment (id=297459)
Second attempt at a patch

This version is much faster thanks to Steve Whitehouse's suggestions.
This version is typically 6 times faster than today's algorithm on my
x86_64 test box, depending on the metadata / data fragmentation.
In cases of extreme fragmentation, this algorithm is up to 22 times faster.
Here are performance numbers from a few consecutive runs of my test
program:

[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 91
tdiff[0] =  972208   - 1.000000 times today's algorithm
tdiff[1] = 1133521   - 0.857689 times today's algorithm
tdiff[2] =  854765   - 1.137398 times today's algorithm
tdiff[3] =  354196   - 2.744831 times today's algorithm
tdiff[4] =  194574   - 4.996598 times today's algorithm
tdiff[5] =  139421   - 6.973182 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 116
tdiff[0] =  823955   - 1.000000 times today's algorithm
tdiff[1] = 1133362   - 0.727001 times today's algorithm
tdiff[2] =  709204   - 1.161803 times today's algorithm
tdiff[3] =  324066   - 2.542553 times today's algorithm
tdiff[4] =  199325   - 4.133726 times today's algorithm
tdiff[5] =  136314   - 6.044537 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 85
tdiff[0] = 1002704   - 1.000000 times today's algorithm
tdiff[1] = 1130705   - 0.886795 times today's algorithm
tdiff[2] =  884881   - 1.133151 times today's algorithm
tdiff[3] =  366433   - 2.736391 times today's algorithm
tdiff[4] =  199193   - 5.033832 times today's algorithm
tdiff[5] =  138418   - 7.244029 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 28
tdiff[0] = 2295333   - 1.000000 times today's algorithm
tdiff[1] = 1129925   - 2.031403 times today's algorithm
tdiff[2] = 2203636   - 1.041612 times today's algorithm
tdiff[3] =  441685   - 5.196764 times today's algorithm
tdiff[4] =  198743   - 11.549253 times today's algorithm
tdiff[5] =  138429   - 16.581301 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed 
trandfactor = 49
tdiff[0] = 1429324   - 1.000000 times today's algorithm
tdiff[1] = 1131068   - 1.263694 times today's algorithm
tdiff[2] = 1309161   - 1.091786 times today's algorithm
tdiff[3] =  287547   - 4.970749 times today's algorithm
tdiff[4] =  191206   - 7.475309 times today's algorithm
tdiff[5] =  138709   - 10.304480 times today's algorithm


-- Additional comment from rpeterso on 2008-03-10 12:46 EST --
Created an attachment (id=297460)
Test Program I'm using

This is the test program I'm using to gather performance numbers and
test algorithms.


-- Additional comment from rpeterso on 2008-03-10 19:05 EST --
Created an attachment (id=297533)
Third attempt at a patch

This version has a few more of Steve's suggestions.  They turned out
not to make a whole lot of performance improvement, except for one:
I added a prefetch to the section that parses an unsigned long at a time,
and performance increased by about 24% compared to the version without
prefetch.  The prefetch statement isn't available in userspace, so that made
the timing a bit trickier to test.  What I did was use a toggle to send half
of the bitfits through the algorithm with the prefetch and half through the
one without, then I did some big copy operations to see how it performed.
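
For reference, a kernel-side prefetch hint in a word-at-a-time loop
generally looks something like this sketch (an illustration, not the
posted patch; the skip test assumes a search for free, i.e. 00, entries).
In userspace, gcc's __builtin_prefetch() is the nearest equivalent.

#include <linux/prefetch.h>

/* Sketch: advance over whole words that cannot contain a free (00)
 * entry, prefetching the next word while testing the current one. */
static const unsigned long *skip_full_words(const unsigned long *ptr,
					    const unsigned long *end)
{
	const unsigned long low_bits = ~0UL / 3;   /* 0x5555... */

	while (ptr < end) {
		prefetch(ptr + 1);        /* start loading the next word */
		if (((*ptr | (*ptr >> 1)) & low_bits) != low_bits)
			break;            /* this word may hold a free entry */
		ptr++;
	}
	return ptr;
}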

This version is averaging about 8 times faster than the one we're
using today, but it's quite often 10 - 15 times faster.
The algorithm might still have room for improvement, especially by
experimenting more with prefetch, but I need to draw the line somewhere
or we'll never ship it.


-- Additional comment from rpeterso on 2008-03-11 12:48 EST --
I did a crude, unscientific comparison test of before and after with
the RHEL5.2 code, using my roth cluster: two nearly identical x86_64
boxes, one running the old gfs2_bitfit code and one running the revised
code, both using clvmd and the dlm protocol against the same SAN storage
device.  I copied 16GB of mp3 files of varying sizes.

old gfs2_bitfit: 8m18s
new gfs2_bitfit: 6m45s

So that shows a 19 percent clock-time performance improvement.
Next I'll try the single-node AOL test before and after.


-- Additional comment from rpeterso on 2008-03-11 14:20 EST --
Created an attachment (id=297654)
RHEL52 patch

Here is the RHEL5 patch I posted to rhkernel-list.


-- Additional comment from rpeterso on 2008-03-11 14:23 EST --
Patch now appears in the upstream git kernel for gfs2.
The RHEL52 patch was tested on my roth cluster and posted to rhkernel-list.
Reassigning to Don Zickus for inclusion in the kernel.


-- Additional comment from rpeterso on 2008-03-12 10:17 EST --
FYI--AOL tests show a 10 percent increase in performance when
maximum-size RGs (2GB) and a 1K block size are used.  The average time to do
one iteration went from 65433586 to 58877961 for 272 iterations.
I'm doing default RG size now with 1K to see how it looks.  I've
already done the old bitfit, and the new algorithm is running now.
I have a spreadsheet with a chart, but I might as well combine them.


-- Additional comment from rpeterso on 2008-03-12 12:28 EST --
Created an attachment (id=297802)
The numbers and chart

This file is an OpenOffice spreadsheet showing and charting the numbers.
With 2GB RG size, we're showing a 10% increase in performance.	With the
default RG size, we're showing a 6% increase in performance.
Still, something is better than nothing.  It goes to show that the benchp13
test is giving a workout to a lot more code than the bitmap code.


-- Additional comment from swhiteho on 2008-03-12 12:38 EST --
Marking comments which mention the customer private.

The results in comment #15 do bear out what we have suspected though, which is
that with larger RGs, the search of the individual RG is more important, and
with larger numbers of RGs, it's probably the lookup of the RG itself that's
taking up the time.

I'm going to open an RFE bug relating to speeding up the RG lookup during block
allocation so that we will not forget to do that at some stage.


-- Additional comment from dzickus on 2008-03-19 12:24 EST --
in kernel-2.6.18-86.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

-- Additional comment from errata-xmlrpc on 2008-03-26 16:38 EST --
Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2008:8013-04
http://errata.devel.redhat.com/errata/show/6625

Comment 2 Robert Peterson 2008-05-20 14:08:40 UTC
Created attachment 306122 [details]
simzone output before my changes

This is from a run of "simzone" without my bitfit changes.  This is not
a very good test because it was run on a slow machine, on a hard drive
partition that is on the same disk as the root partition.  Still, it
gives us an idea of where we stand before the changes.	The test was run
on an upstream kernel from the gfs2 nmw git tree, in lock_nolock mode.

Comment 3 Robert Peterson 2008-05-20 14:10:04 UTC
Created attachment 306123 [details]
simzone output after my changes

This is from a simzone run with the same physical conditions, but with
my new bitfit algorithm in place.

Comment 4 Robert Peterson 2008-05-20 14:21:32 UTC
Comparing some of the numbers, you can see most of the IO rates go up:

[bob@ganesha ~/bugzilla]$ grep "recsize    128" simzone.without.bitfit.txt 
filesize        128 KB  recsize    128 KB  IOs2do      1  iwrite   250980
rewrite   544680 read 14222222 reread 21333333 
filesize        256 KB  recsize    128 KB  IOs2do      2  iwrite   250244
rewrite   624390 read 18285714 reread 23272727 
filesize        512 KB  recsize    128 KB  IOs2do      4  iwrite   257286
rewrite   707182 read 21333333 reread 24380952 
filesize       1024 KB  recsize    128 KB  IOs2do      8  iwrite   253590
rewrite   737221 read 23272727 reread 24380952 
filesize       2048 KB  recsize    128 KB  IOs2do     16  iwrite   267572
rewrite   724442 read 24094117 reread 24975609 
filesize       4096 KB  recsize    128 KB  IOs2do     32  iwrite   216845
rewrite   739884 read 24380952 reread 24824242 
filesize       8192 KB  recsize    128 KB  IOs2do     64  iwrite   270363
rewrite   748127 read 24526946 reread 24824242 
filesize      16384 KB  recsize    128 KB  IOs2do    128  iwrite   272286
rewrite   755475 read 24786686 reread 24600600 
filesize      32768 KB  recsize    128 KB  IOs2do    256  iwrite   250966
rewrite   680229 read 24749244 reread 25148119 
filesize      65536 KB  recsize    128 KB  IOs2do    512  iwrite   252593
rewrite   633051 read 24591369 reread 24956587 
filesize     131072 KB  recsize    128 KB  IOs2do   1024  iwrite   218476
rewrite   601708 read 24966095 reread 25080750 
filesize     262144 KB  recsize    128 KB  IOs2do   2048  iwrite   183480
rewrite   582399 read 24728233 reread 24918631 
filesize     524288 KB  recsize    128 KB  IOs2do   4096  iwrite   204610
rewrite   105852 read 25314470 reread 25285170 
filesize    1048576 KB  recsize    128 KB  IOs2do   8192  iwrite    90809
rewrite   141231 read 25056776 reread 25139076 

[bob@ganesha ~/bugzilla]$ grep "recsize    128" simzone.with.bitfit.txt 
filesize        128 KB  recsize    128 KB  IOs2do      1  iwrite   288939
rewrite   551724 read 15999999 reread 21333333 
filesize        256 KB  recsize    128 KB  IOs2do      2  iwrite   276457
rewrite   635235 read 19692307 reread 23272727 
filesize        512 KB  recsize    128 KB  IOs2do      4  iwrite   265147
rewrite   693766 read 21333333 reread 24380952 
filesize       1024 KB  recsize    128 KB  IOs2do      8  iwrite   260096
rewrite   740419 read 23272727 reread 24380952 
filesize       2048 KB  recsize    128 KB  IOs2do     16  iwrite   259503
rewrite   719353 read 24094117 reread 24975609 
filesize       4096 KB  recsize    128 KB  IOs2do     32  iwrite   209257
rewrite   741894 read 24674698 reread 24975609 
filesize       8192 KB  recsize    128 KB  IOs2do     64  iwrite   248347
rewrite   761267 read 24749244 reread 25051987 
filesize      16384 KB  recsize    128 KB  IOs2do    128  iwrite   252061
rewrite   752042 read 24824242 reread 24899696 
filesize      32768 KB  recsize    128 KB  IOs2do    256  iwrite   241896
rewrite   699140 read 24767951 reread 24994660 
filesize      65536 KB  recsize    128 KB  IOs2do    512  iwrite   224253
rewrite   632843 read 24693293 reread 25128834 
filesize     131072 KB  recsize    128 KB  IOs2do   1024  iwrite   216140
rewrite   503221 read 25119202 reread 24899696 
filesize     262144 KB  recsize    128 KB  IOs2do   2048  iwrite   217153
rewrite   588882 read 24982750 reread 24800756 
filesize     524288 KB  recsize    128 KB  IOs2do   4096  iwrite   195343
rewrite   224257 read 25067559 reread 24966095 
filesize    1048576 KB  recsize    128 KB  IOs2do   8192  iwrite   153666
rewrite   206219 read 25037033 reread 25125221 

So initial write performance is 15% to 69% faster in some cases.
Other cases seem to show an odd performance loss, but I attribute that
to the hardware and conditions.  I should re-do this test using a
separate hard drive and better hardware.  Perhaps when I port it to RHEL
I can do that.  I used an old trin node because I wanted to test it with
upstream code first, and I didn't want to bring any of my good rhel
testing machines to the upstream kernel, openais, cluster code, etc.


Comment 5 Robert Peterson 2008-05-20 14:27:55 UTC
Created attachment 306127 [details]
Upstream version of the proposed patch

This is the patch to master that I tested.

Comment 6 Robert Peterson 2008-06-18 19:43:06 UTC
This fix was delayed a bit due to bug #448866.  That issue is now known
and the fix incorporated into a newer patch, which I pushed to the master,
STABLE2 and RHEL5 git repositories.  Changing status to MODIFIED.


Comment 7 Fedora Update System 2008-07-30 20:04:00 UTC
gfs2-utils-2.03.05-1.fc9, rgmanager-2.03.05-1.fc9, cman-2.03.05-1.fc9 have been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 10 errata-xmlrpc 2009-01-20 21:18:38 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0132.html

