Red Hat Bugzilla – Bug 437149
GFS2: [RFE] Optimise rgrp lookup during block allocation/deallocation
Last modified: 2011-10-31 04:37:00 EDT
Description of problem:
Looking up rgrps can be inefficient. There are a number of things we might do to
speed it up:
1. Use a red-black tree to do the block -> rgrp mapping, rather than the current
list (the reason for using rbtrees is that (a) there is already a kernel
implementation of them, (b) they are efficient even when normal binary trees
would otherwise be unbalanced, (c) the kernel implemention can be used with RCU)
2. Dispose of the current set of two lists which are intended to keep recently
used rgrps near the head of the list (instead use either rbtree from #1, or
per-inode cache from #3)
3. Add a way to cache the most recently used rgrp pointer in the inode. Due to
the requirement for updates to the rgrps (e.g. when we expand the fs) we need to
have an "invalid" flag. Also need to be careful of the locking required to do this.
4. Avoid reading all the rgrps into memory at once. Ideally we only want to read
the rindex & rgrp itself when its needed. Since the rindex entries are a fixed
size it shouldn't be too tricky to quickly find a particular entry. We might
need to do readahead or similar tricks to hide the delay in reading in the
rgrps, but this is critical to be able to scale to large fs sizes.
We ought to also review our allocation algorithms to see if those can be
It would also be a good plan to see if we can optimise (remove?) the function in
go_lock which reads the rgrp in. If we are going to cache the rgrp info in the
lvb (another TBD for the rgrp code), then we don't want to be forced to do a
disk read each time we lock the rgrp.
It would be nice to eliminate the disk reads entirely from this position in the
code and just read the blocks when we need them.
While item #1 in the opening comment is well worth doing, there are some simpler
things we can do as well. Take for example recent_rgrp_first() this function is
intended to find the rgrp which has a particular block (the goal block) within it.
It looks down the list of recently used rgrps checking each one in turn and if
it doesn't find it in the list returns NULL.
Now there are two things which are obviously wrong there... firstly we want to
find the rgrp based upon the goal block, and we don't care how recently used or
not the rgrp is. If you consider a system with a large number of slowly growing
files where each node appends a few bytes every now and again, the net result
would be to very rarely find the rgrp and thus get_local_rgrp() will tend to
return rgrps which will result in the files being spread all around the disk,
instead of being kept local to each other. So in other words, the likelihood of
reusing the same rgrp decreases rapidly as the number of threads & nodes
increases, and this particular optimisation is probably bogus.
The other problem is that the list is a list. So the time taken to scan this
list is O(N) where N is the list length. Now each time an rgrp is returned for
allocation, its added to the list if its not already there. This also requires
(at the moment) a scan of the list, so we are up to O(2N) now. The reason for
scanning the list when we add items to it is both to test that its not already
on the list (can be done just be testing the entry for list_empty() - much
faster) and also to count the number of entries in the list (we can just add a
counter to keep track of that).
The limit on the number of entries in the recently used list is limited by (num
rgrps/num journals) which also seems to be a rather useless limit. On a single
node that means that the "recently used" list might grow to contain every rgrp
on the filesystem. So on a large filesystem we might be traversing this list
twice for every block thats allocated, and it might have thousands of entries.
The list_mru used in deallocation by blk2rgrpd is even worse than the recent
list in some respects. In that case, all the rgrps are added to the list when
the rindex is read in. That results in potentially the whole list being scanned
each time we look for a block's rgrp in order to unlink it. Again, with any
sizeable fs and non-local access patterns, that will quickly degrade to linear
Created attachment 311273 [details]
Idea to help speed up rgrp stuff
So the idea here is just to get rif of the "recent list". Instead we use the
mru list and that removes a whole bunch of code as well as speeding up various
cases. To see the most benefit this needs to be run on a fairly large fs which
is at least part full. It should also improve the locality of cluster
operations too I think.
Anyway it's fairly simple, and should be fairly low risk.
Some further info on the patch in comment #4.... the idea is to merge the
functions of the recent list and the mru list. The mru list was only used when
removing files, but there seems no reason not to use it for allocation as well.
It is the logic which is behind the gfs2_blk2rgrpd() function.
The recent list on the other hand, was not very well thought out. It was not a
true "mru" style list in that once an item was added to the list it was never
removed or even moved to the top of the list. The "next" function for the mru
list was such that it was effectively causing list scans to be N^2 wrt the list
length. There was also the artificial limit mentioned in comment #2 (last pg).
So switching this to the mru list seemed the thing to do.
Also, if we want to go on in the future and fix up our gfs2_blk2rgrpd() function
by (for example) making it a tree, then this is a good first step.
Patch in comment #4 now posted to cluster-devel.
Created attachment 312361 [details]
Idea to try and speed up writes
A fairly small patch to try and speed up writes. Seems to have a positive
affect on postmark, will be interesting to see if it improves other benchmarks
Created attachment 326366 [details]
Updated to the latest kernel
1. Add a "full" flag to each bitmap so that we don't waste time during allocations by scanning full bitmaps.
2. Use the rgrp glock's lvb to store useful information such as:
a) Useful things for stat (to avoid having to read the rgrp itself) and
b) A hint as to where free blocks might be found to avoid long searches during allocation
c) Stats to help us decide whether a resource group is contended or not (maybe this should be a part of the glock code rather than an rgrp specific thing)
3. For SMP performance we need to allow multiple local allocations in the same rgrp at the same time. Currently we treat other local glock holders the same as remote holders.
4. Something like ext3's reservation scheme might be a good plan, or maybe we'll just go straight to delayed allocation like ext4. Might be interesting locking issues with the latter though.
5. Some solution to the current N^2 allocation scheme which follows from the way we allocate new inodes and assign goal blocks. In some cases it would be a very good plan to find a new rgrp for allocations, rather than use the existing one even when we hold a local lock on it.
Created attachment 397212 [details]
Proof-of-concept rbtree patch
Proof-of-concept rbtree patch
In an attempt to make gfs2 faster I developed this patch to
switch the rgrps from a linked list to a (red-black) rbtree. In
theory, that should be it much faster to find the rgrps that
correspond to a particular block. Searching a binary tree
should be faster than running a linked list. However, in practice
this runs slower for both postmark and iozone benchmarks, except
for some corner cases (maybe anomalies?).
This is a very preliminary patch: It doesn't account for multiple
nodes having starting allocations in different rgrps, and that
definitely needs to change.
Perhaps the slowdown can be accounted for by some mistake or
oversight I made, so maybe the attempt is worth keeping and
Can you use RCU for the rbtree locking? We could perhaps do that as a follow on patch.
The other thing we should try to do is to keep a "last rgrp" pointer in each inode. That means that we need an "obsolete" flag in the rgrp which can be checked under a suitable lock (rgrp hold probaly). It should speed things like allocation up quite a lot as they are likely to access the same rgrp repeatedly.
Either (or both) of those things might account for the slowdown that you've seen.
Since we don't support fs shrink, we should never need to reread rgrps once they have been read in. As a result of that, we could alter the fs extend code so that we only read in rgrps which are new when the fs is extended. I'm not totally sure that we want to do that yet, but it might be worth considering.
Also, we could ignore the rindex completely when reading in rgrps initially... since we know from the length of the rgrp where (roughly) the next one should be, we could just search forward to find the next one. The downside of that would be that we could only read them sequentially and that we would have no protection against reading too many when mount & fs extent were going on at the same time. The upside is that the rindex is redundant and we could mount filesystems even when it is damaged or missing.
It should be possible to use RCU with an rbtree if an seqlock is used as the write side locking. Each rgrp structure will need a flag to say whether it is valid or not, independent of the ref count which will be used only for object lifetime management. An example of how to do this was discussed in relation to IMA a little while back.
Reminder to self:
This is the thread about using seq_lock and RCU to protect a tree.
Another thought... if we had an ordering on the glocks, rather than exact match lookup, we wouldn't actually need a separate data structure for the rgrps. Not sure if that is a good idea yet, but worth some consideration.
Created attachment 519803 [details]
Updated version of Bob's patch with extra bits
Here is an update of Bob's original patch which, in addition, also resolves the rather strange ref counting that was being done relating to the bitmap blocks.
Here is an explanation of whats changed:
Originally we had a dual system for journaling resource groups. The metadata blocks were journaled and also the rgrp itself was added to a list. The reason for adding the rgrp to the list in the journal was so that the "repolish clones" code could be run to update the free space, and potentially send any discard requests when the log was flushed. This was done by comparing the "cloned" bitmap with what had been written back on disk during the transaction commit.
Due to this, there was a requirement to hang on to the rgrps' bitmap buffers until the journal had been flushed. For that reason, there was a rather complicated set up in the ->go_lock ->go_unlock functions for rgrps involving both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference count on the buffers.
However, the journal maintains a reference count on the buffers anyway, since they are being journaled as metadata buffers. So by moving the code which deals with the post-journal accounting for bitmap blocks to the metadata journaling code, we can entirely dispense with the rather strange buffer ref counting scheme and also the requirement to journal the rgrps.
The net result of all this is that the ->sd_rindex_spin is left to do exactly one job, and that is to look after the rbtree or rgrps.
This patch is designed to be a stepping stone towards using RCU for the rbtree of resource groups, however the reduction in the number of uses of the ->sd_rindex_spin is likely to have benefits for multi-threaded workloads, anyway.
The patch retains ->go_lock and ->go_unlock for rgrps, however these maybe also be removed in future in favour of calling the functions directly where required in the code. That will allow locking of resource groups without needing to actually read them in - something that could be useful in speeding up statfs.
In the mean time though it is valid to dereference ->bi_bh only when the rgrp is locked. This is basically the same rule as before, modulo the references not being valid until the following journal flush.
Created attachment 519826 [details]
A better version
Fixes a bug in the previous version
Thoughts on further developments....
1. "append only" rgrp rbtree when growing the fs (bug #719956)
2. RCU/seq_lock protection for rgrp rbtree
3. incore inode to gain a pointer to "last rgrp used" to avoid frequent lookup
4. Merge gfs2_alloc_dinode() into gfs2_alloc_block() so that we pass a flag which indicates whether the first block in the extent is to be an inode or not.
5. Try to find a complete extent which is contiguous, during allocation, rather than just returning the first block that is found
6. Adjust algorithm which chooses another rgrp when there is congestion so that instead of using "try locks" it measures the time taken to get the rgrp lock, and chooses another rgrp only if this becomes excessive.
Anything else I should have mentioned here?
Various patches relating to this work have been posted to cluster-devel now. Plan is to merge them into -nmw once hera is back. There is still more work to do, but it is looking promising already.
From comment #17 there are patches for #1 and #3 so far.
Moving to POST since a number of patches have been posted. They'll be in -nmw as soon as hera is back up and running again.
It would be nice to extend this work with a patch to do the following...
In gfs2_alloc_block (and really this is something that just affects the bitfit part of that) we currently look for the first free block after the given starting point and return that, plus as extent whose length is bounded by the requested length and the number of succeeding free blocks.
This means that if a single free block exists in the bitmaps, only a single block will be allocated, no matter how many longer extents exist in the rgrp.
I'd like to see a patch which changes that behaviour to:
1. If there is an extent starting at exactly the block given as a goal, then return that (i.e. contiguous with the existing allocation for that inode) however long that might be
2. Search for an extent which is at least as long as the requested extent, and return that if found
3. If we still have not found a suitable extent, then return the largest extent which was found during the search (i.e. we need to keep a note of the largest extent found during step #2 so we can return that without needing to search again)
The goal block currently has to be a valid block, so we cannot set it to the last allocated block plus one (think of what happens at the end of an rgrp). We can however set it to the last allocated block (currently it is set to the first block in the previously allocated extent, which needs to be fixed).
Another issue is what happens when an rgrp gets full.... current behaviour is to wrap to the start of the rgrp in order to find free blocks. I think this is wrong as multi-rgrp files are going to land up with blocks which go backwards in the middle of the file quite often. Eventually when all callers of gfs2_block_map() can cope with -ENOSPC returns, we can return that rather than search backwards so that we move into the next rgrp. This will ensure that larger files are written in a linear fashion.
We don't need to do all this immediately, but this is the general direction in which I'd like to move over time.
Patch posted to cluster-devel to fix the goal block issue.
Also, rgd->rd_last_alloc is rather suspect, and it would be good to allocate all uses of that. I think i_goal should be enough, and if it is not, we need to use something thats stored on disk with the rgrp, rather than a per-node variable.
s/allocate all/eliminate all in comment #21
My current thoughts are to split rgblk_search into two functions. One to search for blocks only, and one to deal with counting extent sizes and setting the bitmap bits to perform the actual allocations.
Several of the rgblk_search callers only need to use the search functions anyway.
The only callers which use the setting parts of rgblk_search are gfs2_alloc_block and gfs2_alloc_di and those can be merged into one in due course. Might even be worth doing that first, not sure at the moment how easy that will be.