Bug 486572 - libgfs2 requires arborial enlightenment
Summary: libgfs2 requires arborial enlightenment
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: gfs2-utils
Version: 11
Hardware: All
OS: Linux
low
medium
Target Milestone: ---
Assignee: Robert Peterson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2009-02-20 12:07 UTC by Steve Whitehouse
Modified: 2010-03-10 21:14 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-10 21:14:58 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)

Description Steve Whitehouse 2009-02-20 12:07:35 UTC
I was looking at the following code recently:

static int gfs2_bitmap_create(struct gfs2_bmap *bmap, uint64_t size,
                                           uint8_t chunksize)
{
        if((((chunksize >> 1) << 1) != chunksize) && chunksize != 1)
                return -1;
        if(chunksize > 8)
                return -1;
        bmap->chunksize = chunksize;
        bmap->chunks_per_byte = 8 / chunksize;

        bmap->size = size;

        /* Have to add 1 to BITMAP_SIZE since it's 0-based and mallocs
         * must be 1-based */
        bmap->mapsize = BITMAP_SIZE(size, bmap->chunks_per_byte)+1;

        if(!(bmap->map = malloc(sizeof(char) * bmap->mapsize)))
                return -ENOMEM;
        if(!memset(bmap->map, 0, sizeof(char) * bmap->mapsize)) {
                free(bmap->map);
                bmap->map = NULL;
                return -ENOMEM;
        }

Now there are several things which immediately struck me about this bit of code. Firstly the initial test on chunksize seems to be rather long winded and it took me a while to realise that those shifts were in fact just implementing (chunksize&1). Secondly it rather looks like chunksize of anything other than a power of two would be nonsense, and that isn't tested for at all.

The comment about BITMAP_SIZE looks like total nonsense, it isn't based on anything, its just a size and the +1 is most likely due to rounding in the division which has been obscured by this pointless macro.

The thing which most amazed me though was the call to memset() and I wonder just how that is expected to fail...

Looking further through this code, there is this:

int gfs2_block_mark(struct gfs2_sbd *sdp, struct gfs2_block_list *il,
                    uint64_t block, enum gfs2_mark_block mark)
{
        int err = 0;

        if(mark == gfs2_bad_block)
                gfs2_special_set(&sdp->bad_blocks, block);
        else if(mark == gfs2_dup_block)
                gfs2_special_set(&sdp->dup_blocks, block);
        else if(mark == gfs2_eattr_block)
                gfs2_special_set(&sdp->eattr_blocks, block);
        else
                err = gfs2_bitmap_set(&il->list.gbmap.group_map, block,
                                      mark_to_gbmap[mark]);
        return err;
}

and this also worries me. Each of the calls to gfs2_special_set() hides a linear lookup over a list which presumably might become very long, so that the operation is O(n^2) wrt the list length. In addition to that, the bitmap is not a very efficient data structure in terms of memory usage, either.

I think we really need to replace this with an extent cache of some kind, where each extent can be tagged with various useful bits of information about the block type and allocation state. Such a thing would need to take the form of a tree indexed by the block number (and potentially also by the tags attached). Probably an rbtree or something along similar lines would suffice, with the important feature of keeping the data structure down to a minimum size.

Also a useful feature would be to change the "default" state so that extents not in the tree could be considered as being free (in a mostly empty filesystem) or allocated (in a mostly full filesystem).

It seems that one of the main problems with the utilities is that they rely almost totally on lists of objects, and that is the source of a lot of inefficiency.

This isn't an easy task, and it will take some thinking about to do it right, but it seems like a generally useful thing to have, and will certainly speed up a number of operations.

Comment 1 Robert Peterson 2009-05-05 19:08:46 UTC
The gfs2_bitmap_create code is old code, ported from gfs_fsck.
The "0-based" versus "1-based" as I interpret it, means that the
value goes from 0 to (size-1) rather than 1 to (size).  Since
malloc needs the size, they did the "+ 1".  The code seems to
work, so I opted not to mess with it.  That's not to say there
isn't room for improvement though.  :7)

As for the gfs2_block_mark code, that's for special block lists.
These lists were done in order to save gfs2_fsck memory.  In theory,
the lists should be relatively small.  I would hope gfs2_fsck would
not find many "bad" blocks and I would hope it wouldn't find many
duplicate blocks.  If you have more than 100 of these blocks, your
file system has some serious corruption and you're in trouble anyway.
In actual practice, allocating a small linked list saves a lot of
memory as opposed to a bitmap that spans to represent the entire
file system which might be many TB in size.  Historically speaking,
when gfs_fsck runs out of memory, it's almost always due to having
a bunch of huge bitmaps in memory.  The CPU time to run the
linked lists is cheap and fast, but if your node runs out of memory
and into swap space, then gfs_fsck takes an enormous amount of time
and agony.  So there's a trade-off there until we can get a better
system in place.

The third list, eattr_blocks, is for extended attribute blocks.
For most people, this list will also be very small.  However, when
we start talking about selinux labeling and such, the list could
get quite large.  So we'll have to re-think this.

This all goes back to the best way to implement gfs_fsck.  We've
discussed doing this on an rg-by-rg basis, which would improve its
memory usage enormously and allow us to run several threads in
parallel, each cleaning up its own rg.  The problem, of course, is
when inodes in one RG references blocks in another RG, and we
haven't sorted out the best way to implement that.

Comment 2 Bug Zapper 2009-06-09 11:32:43 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 3 Robert Peterson 2010-03-10 19:27:01 UTC
Question for Steve: Most of this code has been completely reworked
and now appears in the STABLE3 branch.  Most of the linked lists
have been replaced with rbtrees.  The bitmap code has been simplified.
Does that satisfy this bugzilla's requirements?  Can I not just close
this or is there still more work to be done?

Setting NEEDINFO.

Comment 4 Steve Whitehouse 2010-03-10 21:14:58 UTC
Sounds like we are done with it then, thanks for fixing that up.


Note You need to log in before you can comment on or make changes to this bug.