Bug 548215 - GFS2: Improve fast statfs performance with LVBs
Summary: GFS2: Improve fast statfs performance with LVBs
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Ben Marzinski
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-12-16 22:51 UTC by Ben Marzinski
Modified: 2012-08-10 13:11 UTC
CC: 8 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2012-08-10 13:11:57 UTC
Type: ---
Embargoed:


Attachments
use lvbs to track resource group information (8.49 KB, patch)
2012-02-23 21:58 UTC, Ben Marzinski
use lvbs to track resource group space and unlinked inodes (9.42 KB, patch)
2012-05-22 05:28 UTC, Ben Marzinski
add mount option to lvb code (11.17 KB, patch)
2012-05-30 03:25 UTC, Ben Marzinski

Description Ben Marzinski 2009-12-16 22:51:54 UTC
Keeping the GFS2 statfs information in LVBs could increase fast statfs performance. However, right now it isn't possible to do this due to assumptions the code makes about GL_SKIP (see comment #2 of BZ #4948469).

Comment 1 Steve Whitehouse 2011-11-24 11:16:56 UTC
Here are a few more thoughts wrt statfs:

1. Recovery

Whilst the statfs files are journaled, recovery will only recover the content of the files, so if there are items in the local statfs file which have not been put into the master statfs file, then those will not be recovered until that node is mounted again.

It ought to be possible to fix that without too much difficulty.

2. Performance

I don't like the fact that we have to update the "in place" local statfs data. There seems little point in keeping the in-place data up to date for the local statfs files, since we can update them from the journal when required; really their only purpose is to provide the data from which the master file can be updated.

If we can pick out the local statfs updates from the journal, then we could simply update the master statfs file directly at recovery time. We should then not need to ever write back "in place" the local statfs data, and those files can be zeroed and left alone.

That will remove a single block write which is almost certainly non-contiguous for every journal flush. That should help improve performance overall.

3. Scalability

One of the problems with the fast statfs info that we keep now is that we are using an fs-wide lock to keep that data structure up to date. That is a pain when trying to scale to large numbers of cpus. It is one of the very few data structures which we share during writes (maybe the only one) and it would be nice to fix this.

One solution would be to use a per-cpu set of variables and to divide the update percentage limits by the number of cpus so that we continue to ensure that we meet the requested update limits.

Some issues: how do we deal with resetting the cpu-local stats when the master file is being updated? We may need an RCU-based scheme for that... Alternatively, do the master update "in line" with the local stats, but for that cpu only, so that we don't have to care about what's going on on the other cpus.

There are various possible solutions, and we can take a look to see what the best way to do this would be.

4. Using LVBs

There is plenty of space in the LVB to keep the stats for each rgrp, such that they can be read from there without needing to read in the rgrp's disk blocks.

There are two issues:

 1. Decoupling the locking of the rgrp from reading in the disk blocks associated with it. This means removing the go_lock/go_unlock functions for rgrps and calling them manually next to each existing user of the rgrp glock. Then rgrps can be locked without needing to read in the data off disk.

 2. Backwards compatibility. We need to know whether the data in the LVB is uptodate, or whether there is an older node in the cluster which hasn't updated the LVB for the rgrp. This is a more tricky problem to solve.

5. General code clean up

It would be a good plan to extract the statfs code into a source file of its own, rather than having it spread through super.c so that we have it all in one place. That should make it easier to spot problems and opportunities for further clean up.

Comment 2 Ben Marzinski 2012-02-23 21:58:56 UTC
Created attachment 565390 [details]
use lvbs to track resource group information

This patch uses LVBs to track resource group information. Right now, the only thing that uses the LVBs is get_local_rgrp(). I've done some testing, but I only have two nodes to use, and I'd like to run this on a bigger cluster and get more recovery testing in. One of the issues is that even though GFS2 can now see whether a resource group fits without reading it in, it still needs to read in the rgrp to run try_rgrp_unlink(). It's possible to change how often we check for unlinked inodes, so that it isn't forcing us to read in the resource groups so often.

Right now, I'm just looking for comments on what is currently done.

Comment 3 Ben Marzinski 2012-05-22 05:28:49 UTC
Created attachment 585929 [details]
use lvbs to track resource group space and unlinked inodes

This version of the patch also stores the number of unlinked inodes in the LVB. This lets GFS2 skip searching for unlinked inodes in the resource groups where there won't be any.

Comment 4 Ben Marzinski 2012-05-30 03:25:58 UTC
Created attachment 587583 [details]
add mount option to lvb code

This is the same code as before, with a mount option "rgrplvb" added. The option defaults to off. With it off, the LVBs still get created and updated, but they are never verified or used.

Comment 5 Steve Whitehouse 2012-08-10 13:11:57 UTC
This was merged at the last merge window.

