Bug 435456
Summary: GFS2: Optimise loop in gfs2_bitfit
Product: Red Hat Enterprise Linux 5
Reporter: Steve Whitehouse <swhiteho>
Component: kernel
Assignee: Don Zickus <dzickus>
Status: CLOSED ERRATA
QA Contact: GFS Bugs <gfs-bugs>
Severity: high
Priority: high
Version: 5.2
CC: adas, edamato, lwang, rpeterso, swhiteho
Target Milestone: rc
Hardware: All
OS: Linux
URL: http://lwn.net/Articles/250967/
Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Last Closed: 2008-05-21 15:11:07 UTC
Description
Steve Whitehouse
2008-02-29 14:26:21 UTC
A bit more on this. The optimization is more than a "nice to have", since the performance issue relates to a key target area.

I wrote a tiny C program that allows me to experiment with this: I can plug in gfs2_bitfit implementations and test them in user space. My first finding is that the current method, from my 25 Jan 2008 optimization (http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=5fdc2eeb5d1d3800367f471690b01fcd1fd5b963), is approximately six times faster than the pre-25-Jan code, at least on my x86_64 platform. It would be interesting to try this program on other architectures such as ppc. From here it's easy to experiment and see what works more efficiently.

Sounds good. I know this might be a silly question, but did you use the same -O level as the kernel when doing your user-mode tests?

Created attachment 297261 [details]
First go at a patch
This patch is untested in a real gfs2 file system, but I have tested
it in userspace using a test program. I've studied this problem for
some time now and here's what I've found:
I ported the code to user space so I could debug it more easily and
run timing tests with gdb and such. So I have one test program that
tests all of the algorithms at once.
The performance of gfs2_bitfit of course varies greatly depending on
how fragmented with metadata vs. data the gfs2 file system is.
It's not right to optimize for either nearly-full or nearly-empty
bitmaps, so I took a good look at Barry's specsfs test system and saw
just how fragmented those bitmaps typically are at any given moment.
I'm taking that as a good indication that customers' file systems
will be similarly fragmented. Before I looked at that fragmentation,
I was making bad assumptions and getting misleading numbers.
I tested six algorithms under conditions similar to Barry's. They are:
1. The algorithm we're using today.
2. The algorithm we were using prior to my 25 Jan 2008 optimization.
3. Optimized version of #1
4. More optimized variation of #3
5. New word alignment algorithm
6. Method that tries to use memcmp of large chunks for shortcuts
Algorithm 6 performed so horribly that I threw it out altogether
and never looked back. I was just curious.
Algorithm 3 outperformed all the others on a bitmap where most blocks
are in use and metadata/data fragmentation is rare.
Algorithm 4 performed better than 3 with metadata fragmentation.
Algorithm 5 outperformed all the others when there were periodic
metadata allocations typical of the specsfs file systems.
My test program searches a 50MB bitmap for a "free" block that
doesn't appear until 100 bytes (400 blocks) from the end, and
starting with a "goal" block that's 25 percent into the bitmap.
This is a typical run on my x86_64 box. The numbers are in
microseconds:
1. 679633
2. 972714
3. 685488
4. 242310
5. 169487
So for this performance run, algorithm 5 performed 4 times faster
than the one we're using today.
On my 32-bit test box I tested 20MB bitmaps and got similar ratios:
1. 1997501
2. 1351896
3. 1603124
4. 773022
5. 477915
Again, algorithm 5 performed almost 4 times faster than algorithm 1.
It's interesting to note that the algorithm we're using today actually
performed worse than the pre-rewrite one on my 32-bit system, under
some types of metadata/data fragmentation.
I did a few quick tests on my roth cluster using the upstream kernel and this faster bitfit. It seems to work okay. I did some basic copies, looked at the bitmaps, then ran gfs2_fsck, which saw no problems. I also ran one of my "hell" tests on it and it passed.

This version passed all my "hell" tests with flying colors, so I posted it to cluster-devel for inclusion in the nmw git tree kernel.

Created attachment 297459 [details]
Second attempt at a patch
This version is much faster thanks to Steve Whitehouse's suggestions.
This version is typically 6 times faster than today's algorithm on my
x86_64 test box, depending on the metadata / data fragmentation.
In cases of extreme fragmentation, this algorithm is up to 22 times faster.
Here are performance numbers from a few consecutive runs of my test
program:
[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 91
tdiff[0] = 972208 - 1.000000 times today's algorithm
tdiff[1] = 1133521 - 0.857689 times today's algorithm
tdiff[2] = 854765 - 1.137398 times today's algorithm
tdiff[3] = 354196 - 2.744831 times today's algorithm
tdiff[4] = 194574 - 4.996598 times today's algorithm
tdiff[5] = 139421 - 6.973182 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 116
tdiff[0] = 823955 - 1.000000 times today's algorithm
tdiff[1] = 1133362 - 0.727001 times today's algorithm
tdiff[2] = 709204 - 1.161803 times today's algorithm
tdiff[3] = 324066 - 2.542553 times today's algorithm
tdiff[4] = 199325 - 4.133726 times today's algorithm
tdiff[5] = 136314 - 6.044537 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 85
tdiff[0] = 1002704 - 1.000000 times today's algorithm
tdiff[1] = 1130705 - 0.886795 times today's algorithm
tdiff[2] = 884881 - 1.133151 times today's algorithm
tdiff[3] = 366433 - 2.736391 times today's algorithm
tdiff[4] = 199193 - 5.033832 times today's algorithm
tdiff[5] = 138418 - 7.244029 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 28
tdiff[0] = 2295333 - 1.000000 times today's algorithm
tdiff[1] = 1129925 - 2.031403 times today's algorithm
tdiff[2] = 2203636 - 1.041612 times today's algorithm
tdiff[3] = 441685 - 5.196764 times today's algorithm
tdiff[4] = 198743 - 11.549253 times today's algorithm
tdiff[5] = 138429 - 16.581301 times today's algorithm
[root@roth-01 /home/bob]# ./bmapspeed
trandfactor = 49
tdiff[0] = 1429324 - 1.000000 times today's algorithm
tdiff[1] = 1131068 - 1.263694 times today's algorithm
tdiff[2] = 1309161 - 1.091786 times today's algorithm
tdiff[3] = 287547 - 4.970749 times today's algorithm
tdiff[4] = 191206 - 7.475309 times today's algorithm
tdiff[5] = 138709 - 10.304480 times today's algorithm
Created attachment 297460 [details]
Test Program I'm using
This is the test program I'm using to gather performance numbers and
test algorithms.
Created attachment 297533 [details]
Third attempt at a patch
This version has a few more of Steve's suggestions. They turned out
not to make much of a performance improvement, except for one:
I added a prefetch to the section that parses an unsigned long at a time,
and performance improved by about 24% over the version without prefetch.
The prefetch statement isn't available in user space, so that made the
timing a bit trickier to test. What I did was use a toggle to send half
of the bitfit calls through the algorithm with prefetch and half through
the one without, then do some big copy operations to see how each
performed.
This version is averaging about 8 times faster than the one we're
using today, but it's quite often 10 - 15 times faster.
The algorithm might still have room for improvement, especially by
experimenting more with prefetch, but I need to draw the line somewhere
or we'll never ship it.
Created attachment 297654 [details]
RHEL52 patch
Here is the RHEL5 patch I posted to rhkernel-list.
The patch now appears in the upstream git kernel for gfs2. The RHEL52 patch was tested on my roth cluster and posted to rhkernel-list. Reassigning to Don Zickus for inclusion in the kernel.

In kernel-2.6.18-86.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2008-0314.html