Bug 1027451 - GFS2: [RFE] remove freeze lock from transaction code path
Summary: GFS2: [RFE] remove freeze lock from transaction code path
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Ben Marzinski
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 1130325
 
Reported: 2013-11-06 21:25 UTC by Ben Marzinski
Modified: 2014-08-14 21:12 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1130325 (view as bug list)
Environment:
Last Closed: 2014-06-09 17:46:31 UTC
Type: Bug
Embargoed:


Attachments
not final patch to move freeze locking (23.20 KB, patch) - 2013-11-06 23:49 UTC, Ben Marzinski
Better version of fsfreeze patch (22.73 KB, patch) - 2013-11-16 03:16 UTC, Ben Marzinski
So close but yet so far (24.50 KB, patch) - 2013-12-05 22:20 UTC, Ben Marzinski
yet another not quite working fsfreeze patch (30.08 KB, patch) - 2014-01-07 20:19 UTC, Ben Marzinski
Working version of the patch (28.95 KB, patch) - 2014-04-21 19:55 UTC, Ben Marzinski
updated version to fix a small bug (28.96 KB, patch) - 2014-04-23 15:22 UTC, Ben Marzinski
RHEL7 version of the patch (28.69 KB, patch) - 2014-04-30 04:05 UTC, Ben Marzinski

Description Ben Marzinski 2013-11-06 21:25:41 UTC
Description of problem:  GFS2's transaction lock is really used to implement filesystem freezes.  The problem is that it is still acquired during every transaction.  Transactions cannot occur while the log is being flushed, and a log flush is necessary to freeze the filesystem, to make sure that it is stable on disk.  This makes gfs2_log_flush a sensible place to move the code that freezes the filesystem.  The goal is to speed up transactions and simplify the code.
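
For illustration, the change amounts to roughly the following (a minimal sketch, not the actual patch; error handling and flags are simplified, and everything beyond the gfs2_trans_begin()/gfs2_log_flush() entry points and the sd_trans_gl/sd_freeze_gl glocks should be treated as illustrative):

/* Before: every gfs2_trans_begin() pays for a shared acquisition of the
 * transaction glock:
 *
 *     error = gfs2_glock_nq_init(sdp->sd_trans_gl, LM_ST_SHARED,
 *                                0, &tr->tr_t_gh);
 *
 * After: that acquisition is gone, and gfs2_log_flush() synchronizes with
 * the freeze glock instead:
 */
void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl)
{
        struct gfs2_holder freeze_gh;

        /* Stalls here while another node holds sd_freeze_gl in EX,
         * i.e. while the filesystem is frozen. */
        if (gfs2_glock_nq_init(sdp->sd_freeze_gl, LM_ST_SHARED, 0,
                               &freeze_gh))
                return;

        /* ... write out the log exactly as before ... */

        gfs2_glock_dq_uninit(&freeze_gh);
}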

Comment 1 Ben Marzinski 2013-11-06 23:49:24 UTC
Created attachment 820790 [details]
not final patch to move freeze locking

This is a version of the freeze lock moving patch that still needs some work.  There is an issue that I don't understand with free block accounting when you freeze, which triggers a warning.  This gets cleared up so that it doesn't cause problems on unmount, however.  The patch also causes the kernel to complain when gfs2_logd has been stuck for 120s.  But I just realized a design issue with both my patch and the original version, unless I'm misunderstanding things.

The way freezing causes gfs2_log_flush to pause in my patch is that gfs2_log_flush will try to acquire the sd_freeze_gl in the shared state again when it gets stopped for a freeze.  When the freezer unfreezes the filesystem, it drops its exclusive lock on the sd_freeze_gl.  The problem is that 1) you can only unfreeze the filesystem from the node that froze it, and 2) if that freezing node crashes, the filesystem automatically unfreezes.  I believe that since the old version also relied on a glock to do the blocking, the same issue should exist there.  It seems like this is something we should try to fix.
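
For reference, the freezing node's side of that mechanism looks roughly like this (a hedged sketch of the existing ->freeze_fs/->unfreeze_fs callbacks; the bodies are heavily simplified and the sd_freeze_gh holder is shown only for illustration):

/* Hedged sketch; the real gfs2_freeze()/gfs2_unfreeze() do more work. */
static int gfs2_freeze(struct super_block *sb)
{
        struct gfs2_sbd *sdp = sb->s_fs_info;

        /* Taking sd_freeze_gl in EX makes every node's gfs2_log_flush()
         * stall on its SH request, freezing the whole cluster. */
        return gfs2_glock_nq_init(sdp->sd_freeze_gl, LM_ST_EXCLUSIVE, 0,
                                  &sdp->sd_freeze_gh);
}

static int gfs2_unfreeze(struct super_block *sb)
{
        struct gfs2_sbd *sdp = sb->s_fs_info;

        /* Dropping the EX lock lets the blocked log flushes proceed.
         * Since only this node ever held EX, only this node can thaw,
         * and a crash here thaws the cluster implicitly, which is the
         * problem noted above. */
        gfs2_glock_dq_uninit(&sdp->sd_freeze_gh);
        return 0;
}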

Comment 2 Ben Marzinski 2013-11-16 03:16:45 UTC
Created attachment 824779 [details]
Better version of fsfreeze patch

This patch works for most of the simple tests I've been doing on it.  I haven't tried doing things like recovery testing, and I'm not sure all of the issues that were only occasionally reproducing are fixed, but it's definitely worth having another set of eyes on.

Comment 3 Steve Whitehouse 2013-11-18 10:18:52 UTC
That looks good to me. Just a few minor comments:

 - The NORMAL_FLUSH etc. should probably be an enum type, and using ">" with them is a bit confusing.  It would be clearer to check for the cases individually (see the sketch after this list).

 - I'm sure there will be a few corner cases that we need to consider. For example, what will happen if we have a frozen fs and then a new node mounts it? Will it join frozen too? And, like you say, it will need some recovery testing.
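
A minimal sketch of the enum suggestion in the first point above (the names follow the NORMAL_FLUSH naming in the patch, but the exact set of values is illustrative):

/* Illustrative only: explicit flush types instead of ordering comparisons. */
enum gfs2_flush_type {
        NORMAL_FLUSH = 0,
        SYNC_FLUSH,
        SHUTDOWN_FLUSH,
        FREEZE_FLUSH,
};

/* Call sites can then name the special cases instead of using ">":
 *     if (type == SHUTDOWN_FLUSH || type == FREEZE_FLUSH)
 *             ...
 */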

However, overall it looks like it achieves the main objective of keeping the trans/freeze lock out of the main transaction code path - so I hope that we'll see some speed improvements with this too, particularly with multi-threaded loads doing small writes to fast disks/SSDs. It also seems to be relatively clean and easy to follow what's going on.

Comment 4 Ben Marzinski 2013-12-05 22:20:09 UTC
Created attachment 833334 [details]
So close but yet so far

This patch makes gfs2_log_flush call freeze_super() and thaw_super() on the nodes that did not initiate the freeze, so that __sb_start_write() can stop writes to the frozen filesystem before they exclusively lock the root directory.

This fixes a problem where:

1. Node A freezes the filesystem.
2. Node B attempts to create a file in the root directory, exclusively locking
   the directory, but unable to start the transaction because the filesystem
   is frozen.
3. Node A tries to unfreeze the filesystem, but can't get a shared lock on the
   root directory, which is needed to open it so that it can issue the ioctl.


However, there is a more complicated issue where:

1. Node A creates a file in the root directory, leaving a cached exclusive
   glock for that directory on the node.
2. Node A freezes the filesystem
3. Node B attempts to create a file in the root directory.
   a. In order to acquire a shared lock on the root directory, Node A must drop
      its cached exclusive glock. The glock is dirty, which forces a log flush
      in inode_go_sync. This log flush can't happen, since another log flush
      is currently stalled because of the freeze.
4. Node A tries to unfreeze the filesystem, but it can't get a shared lock on
   the root directory because that glock is currently being demoted.

As far as I can tell, once the freezing log flush has gotten everything stable on disk, there's no way for the glocks to get redirtied.  The only function that sets GLF_DIRTY on a glock outside of a transaction is gfs2_page_mkwrite(), and it calls sb_start_pagefault() before this, which will keep it from running during a freeze.  This means that during a freeze, I believe we should be able to clear the GLF_DIRTY flags from all glocks, which should solve this issue.  Steve, do you have any idea if this is a reasonable solution?
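
If that reasoning holds, the fix could be a single pass over the glocks once the freezing log flush has made everything stable; a hedged sketch of the per-glock helper (the walker that would invoke it is assumed rather than shown, and the function name is hypothetical):

/* Sketch only: must run after the freezing log flush, when nothing can
 * redirty a glock because gfs2_page_mkwrite() is held off by
 * sb_start_pagefault() for the duration of the freeze. */
static void gfs2_clear_glock_dirty(struct gfs2_glock *gl)
{
        clear_bit(GLF_DIRTY, &gl->gl_flags);
}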

There's another possible issue. If any code paths call gfs2_log_flush() between __sb_start_write() and __sb_end_write() calls, they will cause freezing to hang the node.  The freezing gfs2_log_flush's call to freeze_super() will block in sb_wait_write(), and the code path that called __sb_start_write() will block trying to call gfs2_log_flush().  I'll see if I can find any, but they might be tricky to spot.
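
Spelled out, the hazardous pattern would look like this (sb_start_write()/sb_end_write() are the usual wrappers around __sb_start_write()/__sb_end_write(); the surrounding code is illustrative):

        sb_start_write(sb);
        /* ... modify the filesystem ... */
        gfs2_log_flush(sdp, gl);   /* blocks behind the stalled freezing
                                    * flush, while freeze_super() blocks in
                                    * sb_wait_write() waiting for this write
                                    * section to end: deadlock */
        sb_end_write(sb);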

Comment 5 Ben Marzinski 2013-12-06 01:39:17 UTC
I'm not at all sure how I want to go forward anymore.

Instead of solving the first issue the way I did, by making all nodes call freeze_super() and dealing with the ordering issues that come along with that, I could just have the freezing node grab and hold a shared lock on the root directory before grabbing the freeze glock in exclusive.  On unfreeze, it would drop the shared lock on the root directory after dropping the freeze glock.  This would make sure that it always had access to the root directory.
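
That alternative would amount to roughly the following ordering in the freeze path (a hedged sketch; the sd_root_gh holder and the function name are illustrative and do not exist in the current code):

static int gfs2_freeze_alternative(struct super_block *sb)
{
        struct gfs2_sbd *sdp = sb->s_fs_info;
        struct gfs2_inode *rip = GFS2_I(sb->s_root->d_inode);
        int error;

        /* Hold the root directory in SH for the whole freeze, so the
         * unfreeze can always open the root and issue the thaw ioctl. */
        error = gfs2_glock_nq_init(rip->i_gl, LM_ST_SHARED, 0,
                                   &sdp->sd_root_gh);
        if (error)
                return error;

        error = gfs2_glock_nq_init(sdp->sd_freeze_gl, LM_ST_EXCLUSIVE, 0,
                                   &sdp->sd_freeze_gh);
        if (error)
                gfs2_glock_dq_uninit(&sdp->sd_root_gh);
        /* On unfreeze: drop sd_freeze_gh first, then sd_root_gh. */
        return error;
}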

Another idea is to not block in gfs2_log_flush at all.  Instead, when a node grabbed the freeze glock, it would cause all the other nodes to call freeze_super() and do a shutdown log flush, which would complete without blocking.  The nodes would rely on freeze_super() to make sure that no activity happened on the filesystem.  This is just like the trans glock situation, except that we're using the locking already present in the vfs layer.  In this setup, the freezing node wouldn't have to hold on to the freeze glock in exclusive the whole time.  Once the nodes were all frozen, everyone would go back to holding the freeze glock in the shared state.  There would probably need to be a worker thread that got kicked off to reacquire the freeze glock in the shared state after the nodes had frozen.  This would allow the freezing node to crash without unfreezing the cluster, and would mean that any node could unfreeze the filesystem, if that's what we want.  On unfreeze, all the nodes would call thaw_super().  One issue is that the command for freezing and thawing would be the same: grabbing the freeze glock in exclusive.  It might be necessary to have a freeze glock and a thaw glock.  Also, since the freezing node wouldn't be holding the lock exclusively anymore, we would need to do some extra work to make sure that new nodes joining the cluster were frozen.

Comment 6 Ben Marzinski 2014-01-07 20:19:22 UTC
Created attachment 846826 [details]
yet another not quite working fsfreeze patch

This patch has two known issues.

1. When the initiating node unfreezes, it first drops the EX lock on the freeze glock, then reacquires a SH lock. In this time, it is possible for someone to start an fsfreeze on another node in the cluster.  In this case the original initiator wouldn't be able to grab the SH lock and complete the unfreeze.  However, the other node wouldn't be able to complete the freeze, since the original initiator is still stuck in thaw_super(), holding the sb->s_umount lock.  Moving the code to reacquire the SH lock outside of thaw_super() won't work either, because then the other node will be able to successfully return from the freeze_super() call; however, the original initiator won't now be frozen, since it never had a SH lock on the freeze glock to get demoted and call the freeze code.

However, in this case, the other node can't start the freeze, or even request the EX lock on the freeze glock, until after it has acquired the glock in the shared state.  As long as the DLM can guarantee that if the original initiator demotes the freeze glock from EX to SH, it will be able to acquire that SH lock before another node can acquire the SH lock and then request and acquire the EX lock, this issue shouldn't happen. I'm not sure that this guarantee exists.

2. If two nodes try to freeze the filesystem at the same time, they will both first call freeze_super() and write lock sb->s_umount. Then they will race to
acquire the EX lock on the freeze glock.  Unfortunately, the one that wins won't be able to complete the fsfreeze, since the other node will be stuck in 
freeze_super() with the sb->s_umount lock held.

One way to solve this would be a hook in the vfs layer to allow gfs2 to grab a cluster lock before sb->s_umount gets locked. This would allow for consistent lock ordering across the cluster, so that the freeze glock is always grabbed before the sb->s_umount lock, if both are necessary. I haven't looked yet to see if there are other locking problems with going this route, but there quite possibly are.
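
Purely as an illustration of that idea (the ->freeze_lock hook shown here does not exist in the VFS; the name, signature, and placement are hypothetical):

/* Hypothetical only: a super_operations callback taken before s_umount is
 * write-locked, so a cluster filesystem could impose a consistent order of
 * "cluster freeze lock, then sb->s_umount" on every node. */
int freeze_super(struct super_block *sb)
{
        int error;

        if (sb->s_op->freeze_lock) {                /* hypothetical hook */
                error = sb->s_op->freeze_lock(sb);  /* e.g. EX on sd_freeze_gl */
                if (error)
                        return error;
        }
        down_write(&sb->s_umount);
        /* ... existing freeze_super() body, unchanged ... */
        return 0;                                   /* placeholder return */
}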

Comment 7 Ben Marzinski 2014-04-21 19:55:54 UTC
Created attachment 888221 [details]
Working version of the patch

This is a version of the patch which I haven't run into any issues with. It is very similar to my original version, which locked in gfs2_log_flush() to stop all the freezing nodes.  The biggest difference is that in this version, the freezing node also grabs a shared lock on the root directory of the filesystem.  The unfreezing code will call gfs2_permission() and gfs2_getattr(). If these two functions notice that they are being called on the filesystem root directory while the filesystem is frozen, instead of grabbing a shared lock themselves, they will just piggyback on the shared lock grabbed by the freeze code.  The freeze code makes sure that it won't release the shared lock while another function is relying on it.
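
Roughly, the piggybacking looks like this (a hedged sketch, not a quote of the patch; the sd_frozen_root counter and the exact checks are illustrative):

/* Hedged sketch of the piggyback logic; sd_frozen_root is an illustrative
 * counter that the freeze code holds nonzero while it owns the SH lock on
 * the root directory, and waits on before releasing that lock. */
static int gfs2_permission(struct inode *inode, int mask)
{
        struct gfs2_inode *ip = GFS2_I(inode);
        struct gfs2_sbd *sdp = GFS2_SB(inode);
        struct gfs2_holder i_gh;
        int frozen_root = 0;
        int unlock = 0;
        int error;

        if (gfs2_glock_is_locked_by_me(ip->i_gl) == NULL) {
                if (inode == sdp->sd_root_dir->d_inode &&
                    atomic_inc_not_zero(&sdp->sd_frozen_root)) {
                        /* Frozen: reuse the SH lock already held on the
                         * root directory by the freeze code. */
                        frozen_root = 1;
                } else {
                        error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED,
                                                   LM_FLAG_ANY, &i_gh);
                        if (error)
                                return error;
                        unlock = 1;
                }
        }

        error = generic_permission(inode, mask);

        if (unlock)
                gfs2_glock_dq_uninit(&i_gh);
        else if (frozen_root)
                atomic_dec(&sdp->sd_frozen_root); /* freeze code waits for 0 */
        return error;
}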

The problem that this is trying to solve is that if another node tries to grab an exclusive lock on the filesystem root directory, the unfreeze code will not
be able to grab the necessary shared lock.

The real goal of this patch is to remove the transaction lock while making the freeze code work correctly.  While there are better ways to implement freezing if we change the VFS code, the code to remove the transaction lock is still the same. However, I don't want to make this any hackier than it needs to be, so any suggestions on improvements are certainly welcome.

Comment 8 Ben Marzinski 2014-04-23 15:22:21 UTC
Created attachment 888975 [details]
updated version to fix a small bug.

This patch is the same as the previous one, except that it fixes a typo.

Comment 9 Ben Marzinski 2014-04-30 04:05:07 UTC
Created attachment 891000 [details]
RHEL7 version of the patch

This is the same as the last patch, but ported to the RHEL7 kernel.

Comment 10 Steve Whitehouse 2014-05-23 09:18:00 UTC
Moving to POST since this is now pending merge to Linus' tree.

