+++ This bug was initially created as a clone of Bug #1027451 +++
Description of problem: GFS2's transaction lock is really used to implement filesystem freezes. The problem is that it is still acquired during every transaction. Transactions cannot occur while the log is being flushed, and a log flush is necessary to freeze the filesystem, to make sure that it is stable on disk. This makes gfs2_log_flush a sensible place to move the code that
freezes the filesystem. The goal is to speed up transactions and simplify the code.
--- Additional comment from Ben Marzinski on 2013-11-06 18:49:24 EST ---
This is a version of the freeze lock moving patch that still needs some work. There is an issue that I don't understand with free block accounting when you freeze that triggers a warning. This gets cleared up so that it doesn't cause problems on unmount, however. The patch also causes the kernel to complain when gfs2_logd has been stuck for 120s. But I just realized a design issue with both my patch and the original version, unless I'm misunderstanding things.
The way freezing causes gfs2_log_flush to pause in my patch is that gfs2_log_flush will try to acquire the sd_freeze_gl in the shared state again when it gets stopped for a freeze. When the freezer unfreezes the filesystem, it drops its exclusive lock on the sd_freeze_gl. The problem is that 1: you can only unfreeze the filesystem from the node that froze it. 2: if that freezing node crashes, the filesystem automatically unfreezes. I believe that since the old version also relied on a glock to do the blocking, the same issue should exist there. It seems like this is something we should try to fix.
--- Additional comment from Ben Marzinski on 2013-11-15 22:16:45 EST ---
This patch works for most of the simple tests I've been doing on it. I haven't tried doing things like recovery testing, and I'm not sure all of the issues that were only occasionally reproducing are fixed, but it's definitely worth having another set of eyes on.
--- Additional comment from Steve Whitehouse on 2013-11-18 05:18:52 EST ---
That looks good to me. Just a few minor comments:
- The NORMAL_FLUSH etc. should probably be an enum type, and using ">" with them is a bit confusing... better just to check for the cases individually, I think, to make the code clearer
- I'm sure there will be a few corner cases that we need to consider. For example what will happen if we have a frozen fs and then a new node mounts the fs? Will it join frozen too? And like you say it will need some recovery testing.
However overall it looks like it achieves the main objective of keeping the trans/freeze lock out of the main transaction code path - so I hope that we'll see some speed improvements with this too - particularly with multi-threaded loads doing small writes to fast disks/SSDs. Also it seems to be relatively clean and easy to follow what's going on too.
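The enum suggestion above could be sketched roughly as follows. All names here are placeholders modeled on the discussion (NORMAL_FLUSH etc.); the actual identifiers in the patch may differ, and this is a standalone illustration, not the kernel code itself:

```c
#include <stdbool.h>

/* Hypothetical flush types -- stand-ins for the constants discussed above. */
enum gfs2_flush_type {
	NORMAL_FLUSH,
	SYNC_FLUSH,
	SHUTDOWN_FLUSH,
	FREEZE_FLUSH,
};

/*
 * Instead of writing something like "type > NORMAL_FLUSH", which relies on
 * the ordering of the constants, check each case explicitly so the intent
 * is obvious to the reader.
 */
static bool flush_bypasses_freeze_block(enum gfs2_flush_type type)
{
	switch (type) {
	case SHUTDOWN_FLUSH:
	case FREEZE_FLUSH:
		return true;
	case NORMAL_FLUSH:
	case SYNC_FLUSH:
		return false;
	}
	return false;
}
```

The switch makes adding a new flush type a compile-visible decision rather than an accident of numeric ordering.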
--- Additional comment from Ben Marzinski on 2013-12-05 17:20:09 EST ---
This patch makes gfs2_log_flush call freeze_super() and thaw_super() on the nodes that did not initiate the freeze, so that __sb_start_write() can stop writes to the frozen filesystem before they exclusively lock the root directory.
This fixes a problem where:
1. Node A freezes the filesystem.
2. Node B attempts to create a file in the root directory, exclusively locking
   the directory, but is unable to start the transaction because the filesystem
   is frozen.
3. Node A tries to unfreeze the filesystem, but can't get a shared lock on the
   root directory, which is needed to open it so that it can issue the ioctl.
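The hang above is a classic wait-for cycle between the two nodes. A toy model (not GFS2 code; the resource names and the two-node cycle check are purely illustrative) shows why neither side can make progress:

```c
#include <stdbool.h>

/* Toy wait-for model of the unfreeze hang: two abstract resources. */
enum resource { ROOT_DIR_GLOCK, FREEZE, NRES };

struct node_state {
	int holds[NRES]; /* 1 if this node holds the resource */
	int wants[NRES]; /* 1 if this node is blocked waiting for it */
};

/*
 * Two nodes are deadlocked if each is waiting on some resource the
 * other holds: A -> B -> A forms a cycle in the wait-for graph.
 */
static bool deadlocked(const struct node_state *a, const struct node_state *b)
{
	for (int i = 0; i < NRES; i++)
		for (int j = 0; j < NRES; j++)
			if (a->wants[i] && b->holds[i] &&
			    b->wants[j] && a->holds[j])
				return true;
	return false;
}
```

Here Node A holds the freeze and waits on the root directory glock, while Node B holds the root directory glock and its transaction waits on the freeze, so the cycle check fires. Calling freeze_super() on every node keeps Node B from ever reaching the state where it holds the directory lock while blocked on the freeze.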
However, there is a more complicated issue where:
1. Node A creates a file in the root directory, leaving a cached exclusive
glock for that directory on the node.
2. Node A freezes the filesystem
3. Node B attempts to create a file in the root directory.
   a. In order to acquire a shared lock on the root directory, Node A must drop
      its cached exclusive glock. The glock is dirty, which forces a log flush
      in inode_go_sync. This log flush can't happen, since another log flush
      is currently stalled because of the freeze.
4. Node A tries to unfreeze the filesystem, but it can't get a shared lock on
the root directory because that glock is currently being demoted.
As far as I can tell, once the freezing log flush has gotten everything stable on disk, there's no way for the glocks to get redirtied. The only function that sets GLF_DIRTY on a glock outside of a transaction is gfs2_page_mkwrite(), and it calls sb_start_pagefault() before this, which will keep it from running during a freeze. This means that during a freeze, I believe we should be able to clear the GLF_DIRTY flags from all glocks, which should solve this issue. Steve, do you have any idea if this is a reasonable solution?
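The proposed fix amounts to one pass over the glocks after the freezing log flush has made everything stable. A minimal sketch, with stand-in types and an assumed flag value (the real struct gfs2_glock, GLF_DIRTY bit, and glock iteration live in the kernel and look different):

```c
#include <stddef.h>

/* Stand-in for the real GLF_DIRTY bit; the kernel value differs. */
#define GLF_DIRTY 0x0004UL

/* Toy glock: just a flags word on a singly linked list. */
struct glock {
	unsigned long flags;
	struct glock *next;
};

/*
 * After the freezing log flush has everything stable on disk, clear
 * GLF_DIRTY everywhere so that a later demote (e.g. another node
 * requesting the root directory) does not try to force a log flush
 * that the freeze has stalled.  Other flag bits are left untouched.
 */
static void clear_dirty_for_freeze(struct glock *head)
{
	for (struct glock *gl = head; gl != NULL; gl = gl->next)
		gl->flags &= ~GLF_DIRTY;
}
```

The safety argument is exactly the one in the comment above: nothing can redirty a glock mid-freeze, because the only out-of-transaction dirtier, gfs2_page_mkwrite(), is held off by sb_start_pagefault().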
There's another possible issue. If any code paths call gfs2_log_flush() between the __sb_start_write() and __sb_end_write() calls, they will cause freezing to hang the node. The freezing gfs2_log_flush's call to freeze_super() will block in sb_wait_write(), and the code path that called __sb_start_write() will block trying to call gfs2_log_flush(). I'll see if I can find any, but they might be tricky to spot.
--- Additional comment from Ben Marzinski on 2013-12-05 20:39:17 EST ---
I'm not at all sure how I want to go forward anymore.
Instead of solving the first issue the way I did, by making all nodes call freeze_super() and dealing with the ordering issues that come along with that, I could just have the freezing node grab and hold a shared lock on the root directory before grabbing the freeze glock in exclusive. On unfreeze, it would drop the shared lock on the root directory after dropping the freeze glock. This would make sure that it always had access to the root directory.
Another idea is to not block in gfs2_log_flush at all. Instead, when a node grabbed the freeze glock, it would cause all the other nodes to call freeze_super() and do a shutdown log flush, which would complete without blocking. The nodes would rely on freeze_super() to make sure that no activity happened on the filesystem. This is just like the trans glock situation, except that we're using the locking already present in the vfs layer.

In this setup, the freezing node wouldn't have to hold on to the freeze glock in exclusive the whole time. Once the nodes were all frozen, everyone would go back to holding the freeze glock in the shared state. There would probably need to be a worker thread that got kicked off to reacquire the freeze glock in the shared state after the nodes had frozen. This would allow the freezing node to crash without unfreezing the cluster, and would mean that any node could unfreeze the filesystem, if that's what we want. On unfreeze, all the nodes would call thaw_super().

One issue is that the command for freezing and thawing would be the same: grabbing the freeze glock in exclusive. It might be necessary to have a freeze glock and a thaw glock. Also, since the freezing node wouldn't be holding the lock exclusively anymore, we would need to do some extra work to make sure that new nodes joining the cluster were frozen.
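The per-node lifecycle in that scheme can be modeled as a tiny state machine. The states and transition names below are invented for illustration; real GFS2 would drive these from glock demote callbacks and a worker thread:

```c
/*
 * Toy per-node state machine for the alternative freeze protocol:
 * every node normally holds the freeze glock shared; a freeze request
 * demotes everyone, each node freezes locally via freeze_super(), and
 * then a worker reacquires the shared lock while frozen.
 */
enum node_state {
	ACTIVE_SH,       /* normal operation, freeze glock held shared */
	FROZEN_UNLOCKED, /* froze locally, dropped the shared lock */
	FROZEN_SH,       /* frozen, shared lock reacquired by the worker */
};

/* Demote callback: freeze locally and drop the shared lock. */
static enum node_state on_demote_request(enum node_state s)
{
	return (s == ACTIVE_SH) ? FROZEN_UNLOCKED : s;
}

/*
 * Worker thread: reacquire SH after the local freeze completes, so a
 * crash of the initiating node no longer implies an automatic thaw.
 */
static enum node_state on_reacquire_worker(enum node_state s)
{
	return (s == FROZEN_UNLOCKED) ? FROZEN_SH : s;
}
```

The FROZEN_SH state is what decouples "who froze it" from "who may thaw it": once every node is back to holding the glock shared, any node can initiate the thaw.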
--- Additional comment from Ben Marzinski on 2014-01-07 15:19:22 EST ---
This patch has two known issues.
1. When the initiating node unfreezes, it first drops the EX lock on the freeze glock, then reacquires a SH lock. In this window, it is possible for someone to start an fsfreeze on another node in the cluster. In this case the original initiator wouldn't be able to grab the SH lock and complete the unfreeze. However, the other node wouldn't be able to complete the freeze either, since the original initiator is still stuck in thaw_super(), holding the sb->s_umount lock. Moving the code that reacquires the SH lock outside of thaw_super() won't work either, because then the other node would be able to successfully return from the freeze_super() call, but the original initiator wouldn't now be frozen, since it never had a SH lock on the freeze glock to get demoted and call the freeze code.
However, in this case, the other node can't start the freeze, or even request the EX lock on the freeze glock, until after it has acquired the glock in the shared state. As long as the DLM can guarantee that if the original initiator demotes the freeze glock from EX to SH, it will be able to acquire that SH lock before another node can acquire the SH, and then request and acquire the EX lock, this issue shouldn't happen. I'm not sure that this guarantee exists.
2. If two nodes try to freeze the filesystem at the same time, they will both first call freeze_super() and write lock sb->s_umount. Then they will race to
acquire the EX lock on the freeze glock. Unfortunately, the one that wins won't be able to complete the fsfreeze, since the other node will be stuck in
freeze_super() with the sb->s_umount lock held.
One way to solve this would be a hook in the vfs layer to allow gfs2 to grab a cluster lock before sb->s_umount gets locked. This would allow for consistent lock ordering across the cluster, so that the freeze glock is always grabbed before the sb->s_umount lock, if both are necessary. I haven't looked yet to see if there are other locking problems with going this route, but there quite possibly are.
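The fix in issue 2 is classic lock-ordering: define one cluster-wide order (freeze glock before sb->s_umount) and only ever acquire in that order. A toy checker, with invented lock identifiers, makes the invariant concrete:

```c
#include <stdbool.h>

/*
 * Assumed cluster-wide lock ranks: lower rank must always be taken
 * first.  With the proposed vfs hook, every node takes FREEZE_GLOCK
 * before write-locking its local s_umount, so two racing freezers
 * serialize on the cluster lock instead of deadlocking.
 */
enum lock_id { FREEZE_GLOCK = 0, S_UMOUNT = 1 };

/* Verify a sequence of acquisitions never takes a lower rank later. */
static bool order_ok(const enum lock_id *seq, int n)
{
	for (int i = 1; i < n; i++)
		if (seq[i] < seq[i - 1])
			return false;
	return true;
}
```

The current code effectively performs the S_UMOUNT-then-FREEZE_GLOCK sequence, which the checker rejects, and which is exactly the ordering inversion behind the two-node freeze race.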
--- Additional comment from Ben Marzinski on 2014-04-21 15:55:54 EDT ---
This is a version of the patch which I haven't run into any issues with. It is very similar to my original version, which blocked in gfs2_log_flush() to stop all the freezing nodes. The biggest difference is that in this version, the freezing node also grabs a shared lock on the root directory of the filesystem. The unfreezing code will call gfs2_permission() and gfs2_getattr(). If these two functions notice that they are being called on the filesystem root directory while the filesystem is frozen, instead of grabbing a shared lock themselves, they will just piggyback on the shared lock grabbed by the freeze code. The freeze code makes sure that it won't release the shared lock while another function is relying on it.
The problem that this is trying to solve is that if another node tries to grab an exclusive lock on the filesystem root directory, the unfreeze code will not
be able to grab the necessary shared lock.
The real goal of this patch is to remove the transaction lock while making the freeze code work correctly. While there are better ways to implement freezing if we change the VFS code, the code to remove the transaction lock is still the same. However, I don't want to make this any hackier than it needs to be, so any suggestions on improvements are certainly welcome.
--- Additional comment from Ben Marzinski on 2014-04-23 11:22:21 EDT ---
This patch is the same as the previous one, except that it fixes a typo bug.
--- Additional comment from Ben Marzinski on 2014-04-30 00:05:07 EDT ---
This is the same as the last patch, but ported to the RHEL7 kernel.
--- Additional comment from Steve Whitehouse on 2014-05-23 05:18:00 EDT ---
Moving to POST since this is now pending merge to Linus' tree.
--- Additional comment from Steve Whitehouse on 2014-06-09 13:46:31 EDT ---
Now merged upstream:
The commit referenced in the last comment went into 3.16-rc1. Rawhide is already onto the 3.17 merge window kernels.
Do you need this rawhide bug open for some reason?
Hmm, not sure.
Ben, did you clone the original bug to track the vfs hook work? If so then we should assign this to you, to make it clearer.
Yeah. This is a bug for a new version of the freeze code. I reassigned it to myself.
Created attachment 927253 [details]
Patches to implement the new gfs2 freeze method
The attached tarball contains two sets of patches.
The 1130325_patches directory contains a patch for the new vfs hook, new_hook.patch, and a patch for the new freeze method that uses this hook, new_freeze.patch.
There seemed to me to be two ways to go about getting the necessary vfs hooks. One method was to add two more hooks that only gfs2 would use to do locking before the call to freeze_super(). The other method was to change the existing freeze/unfreeze hooks so that if they existed, the filesystem was responsible for calling freeze_super/thaw_super. I eventually went with the second method, which requires changing all the filesystems that implement their own freezing code, but didn't add additional hooks just for gfs2.
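The second method's dispatch can be sketched as follows. This is a standalone toy, not the actual fs/super.c change: the struct layouts are stand-ins, and the generic path is reduced to a flag. The idea it illustrates is just the new rule that a filesystem-provided freeze hook takes over completely, with the filesystem responsible for calling freeze_super() itself (after taking its cluster lock, in gfs2's case):

```c
#include <stddef.h>

struct super_block;

/* Stand-in for the real struct super_operations. */
struct super_operations {
	int (*freeze_super)(struct super_block *sb); /* new-style hook */
	int (*freeze_fs)(struct super_block *sb);    /* old-style hook */
};

struct super_block {
	const struct super_operations *s_op;
	int frozen;
};

/* Stand-in for the generic vfs freeze sequence. */
static int freeze_super(struct super_block *sb)
{
	sb->frozen = 1;
	return 0;
}

/*
 * New rule: if the filesystem supplies freeze_super, the vfs delegates
 * entirely; the filesystem must call freeze_super() itself once it has
 * done its own locking.  Otherwise the generic path runs as before.
 */
static int vfs_freeze(struct super_block *sb)
{
	if (sb->s_op && sb->s_op->freeze_super)
		return sb->s_op->freeze_super(sb);
	return freeze_super(sb);
}
```

The cost noted above is real: every filesystem that already implements its own freeze hook has to be converted to the new calling convention, but no gfs2-only hooks get added to the vfs.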
The gfs2 code is very similar to an earlier attempt I made in bz #1027451, where I found out about the lock ordering issue. The code is significantly simpler than the old freezing code, and the same series of events happens on all of the nodes of the cluster to freeze and unfreeze a filesystem.
The second directory, patches_split, contains two patches that, when applied together, recreate new_freeze.patch. reversed_old_freeze.patch reverts my old freeze fix, and freeze_needs_vfs_hook.patch adds the new one from scratch. This makes it easier to read all the logic of the new freeze code, since it shares a chunk of it with my earlier freeze patch.
Created attachment 931677 [details]
new version of patches to implement gfs2 freeze method
This is exactly the same as the old patch, with some fixes to new_hook.patch to make reiserfs work correctly.
Created attachment 935766 [details]
version that fixes gfs2 in new_hook.patch
In the previous version, I accidentally left the gfs2 changes out of new_hook.patch. This didn't affect anything once both patches were applied, but it meant that freezing wouldn't work on gfs2 with just the first patch applied. This version makes new_hook.patch keep freezing working the current way on gfs2.
This version also doesn't have the split-out version of the patches, where I first removed the old freeze code, and then added the new freeze code from scratch.
Is this now complete? I have a feeling we could close this bug now.