920425 – GFS2: [RFE] replace gfs2_ail structure with gfs2_trans

Bug 920425 - GFS2: [RFE] replace gfs2_ail structure with gfs2_trans

Summary: GFS2: [RFE] replace gfs2_ail structure with gfs2_trans

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Assignee:	Steve Whitehouse
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-03-12 05:36 UTC by Ben Marzinski
Modified:	2013-08-21 11:37 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-08-21 11:37:44 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
This patch removes the gfs2_ail struture, and active item lists to gfs2_trans (16.09 KB, patch) 2013-03-12 05:36 UTC, Ben Marzinski	no flags	Details \| Diff
Sort buffer lists by inplace block number (1.25 KB, patch) 2013-04-24 19:02 UTC, Ben Marzinski	no flags	Details \| Diff
Revoke all the on log blocks that have been already written to disk (4.93 KB, patch) 2013-04-24 19:18 UTC, Ben Marzinski	no flags	Details \| Diff
updated version of the revoke all written blocks patch (4.38 KB, patch) 2013-06-04 14:49 UTC, Ben Marzinski	no flags	Details \| Diff
Show Obsolete (1) View All

Description Ben Marzinski 2013-03-12 05:36:35 UTC

Created attachment 708769 [details]
This patch removes the gfs2_ail struture, and active item lists to gfs2_trans

In order to allow transactions and log flushes to happen at the same time, gfs2 needs to move the transaction accounting and active items list code into the gfs2_trans structure.  As a first step toward this, gfs2 should remove the gfs2_ail structure, and handle the active items list in the gfs_trans structure.

With the attached patch, at the end of a transaction, gfs2 will add the gfs2_trans structure to the superblock if there is not one already.  This structure now has the active items fields that were previously in gfs2_ail.

This is not necessary in the case where the transaction was simply used to add remokes, since these are never written outside of the journal, and thus, don't need an active items list.

Also, in order to make sure that the transaction structure is not removed while it's still in use by gfs2_trans_end, unlocking the sd_log_flush_lock has to happen slightly later in ending the transaction.

Comment 1 Steve Whitehouse 2013-03-15 12:26:02 UTC

I think this patch looks good, so if you are reasonably confident in it, and it passes a few tests, something involving lots of cross-node invalidations would be a good one, then it would be good to get this posted upstream.

There are some further possible improvements we may be able to make too....

One possible idea is to add a list_sort() into the beginning of gfs2_before_commit(). That might seem at first a bit odd, since the writes to the journal will be sequential and contiguous anyway, however my thought is that this will result in the AIL being ordered in the same way (after the log flush) and thus giving us better i/o patterns when writing the AIL list. It will be interesting to see whether the extra cpu time is worth it or not.

My other thought relates to gfs2_ail_flush() and what happens when we have a remote node that is iterating through a set of glocks requesting them in turn (e.g. readdir + stat for a directory) when they've been in use locally. In that case we'd have something like this:

1. Log flush for first glock (if dirty)
2. Writeback for first glock (if dirty)
3. remove buffers from AIL (if in AIL)
4. Second log flush for first glock
5. Log flush for second glock (if dirty, but probably isn't due to #1)
6. Writeback for second glock (if dirty)
7. remove buffers from AIL (if in AIL)
8. Second log flush for second glock
and so on....

Now if there were no dirty pages to write back, and if there were no buffers to be removed from the AIL, then we could release the glock without any further log flushes. In step #1 above, any glocks dirty at that time would have been written into the log, so that unless the other glocks were in active use at the time, we should not need to repeat that.

If we were to always write back the glocks' data on every glock demotion, we'd be basically implementing the OCFS2 scheme, and I'm not sure that we want to go that far.

However, we could look into always writing back all of the revokes when we remove the buffers from the AIL. So that at least in the case that the glocks has been synced, we wouldn't need to have one log flush per glock, but instead a single log flush for the first glock to be requested, and after that we may be able to free up the glocks without the extra flush. That should make certain workloads considerably faster.

It would be nice if we could solve the writeback issue too, but lets go one step at a time and do the easier things first I think.

Comment 2 Ben Marzinski 2013-04-24 19:02:23 UTC

Created attachment 739570 [details]
Sort buffer lists by inplace block number

This patch simply sorts the the data and metadata buffer lists by their block number, make the inplace IO get issued in sequential order.

Comment 3 Ben Marzinski 2013-04-24 19:18:26 UTC

Created attachment 739575 [details]
Revoke all the on log blocks that have been already written to disk

This patch looks at all the outstanding blocks in all the transactions on the log, and moves the completed ones to the ail2 list.  Then it issues revokes for all of these transactions.  This will hopefully speed things up in situations where there is a lot of contention for glocks, especially if they are acquired serially. 

This patch could definitely use another pass, but it passed my testing.

I can probably reorder things so revoke_lo_before_commit is only called once.  This will simplify the code, as well as not wasting space at the end of the revoke blocks from the first call.

Also, instead of doing all the work of reserving log space, we could simply have one block that is always reserved for additional revokes, and just do one blocks worth (plus any extra space on an existing revoke block) of extra ones on each flush. That would still allow us at least around 500 extra revokes per journal flush (assuming 4096 byte blocksize).

Comment 4 Ben Marzinski 2013-06-04 14:49:08 UTC

Created attachment 756842 [details]
updated version of the revoke all written blocks patch

This version of the patch puts the extra revokes inside of revoke_lo_before_commit. It also only allocates at most one block for revokes.  If there are already blocks for revokes, it simply adds to the last block.  In all my testing, I never saw more than one block used for revokes.  Aside from simplicity, there is no reason why I couldn't add back the code to allocate as many blocks as necessary, but this seems like it will rarely be more than one.

Comment 5 Steve Whitehouse 2013-08-21 11:37:44 UTC

This is upstream, so I think we can close this one now...

Note You need to log in before you can comment on or make changes to this bug.