Red Hat Bugzilla – Bug 229081
GFS2: umount locks up when attempting to umount a withdrawn filesystem
Last modified: 2016-04-15 14:02:47 EDT
Description of problem:
When trying to umount a withdrawn filesystem, umount will hang. I am using
Linus's git tree that was pulled down 2/15/07.
I've only done it once, but once I gather all this information I will attempt
to reproduce it some more.
Steps to Reproduce:
1. Get your filesystem to withdraw.
2. Try to umount the filesystem.

Actual results:
umount will hang

Expected results:
umount shouldn't hang

I will provide logs with sysrq information.
Created attachment 148240 [details]
logs with sysrq
I don't think we can allow gfs to do a withdraw in the context of a dlm callback.
Well, this is where we withdrew:

void gfs2_meta_inval(struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_sbd;
	struct inode *aspace = gl->gl_aspace;
	struct address_space *mapping = gl->gl_aspace->i_mapping;
	...

at the gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count)) call. Instead of
withdrawing, wouldn't it be best to just return? I cannot find any comments at
all in the code explaining what gl_ail_count is or what it refers to. Could
somebody explain this to me so I can try and figure out why this problem may
have happened to begin with?
If gl_ail_count is non-zero, then this means that this lock is still part of a
transaction. This should be impossible since a journal flush should have been
done before trying to demote this lock. The journal flush is conditional upon
this glock being part of a transaction, so it looks to me like a race condition.
It could potentially be that GLF_DIRTY is getting set or reset in the wrong
place, since the code in meta_go_sync() which flushes the blocks relating to
the glock in question appears to be conditional upon that flag.
To be honest I'm rather skeptical that that particular code is correct anyway,
since the functions which are called are no-ops in the case where there are no
buffers etc. to be flushed, so it would probably be perfectly OK to just remove
that conditional and see if that improves the situation.
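To make the reasoning concrete, here is a minimal user-space sketch (this is
not GFS2 source; fake_glock, flush_ail and so on are invented names) of the
pattern in question: a flush gated on a dirty flag even though the flush
helper is already a no-op when there is nothing to write back, so a mis-handled
flag can leave the AIL count non-zero at demote time, while dropping the
conditional would be harmless:

/* toy model of the dirty-flag-gated flush discussed above */
#include <stdio.h>
#include <stdbool.h>

struct fake_glock {
	bool dirty;	/* stands in for GLF_DIRTY */
	int ail_count;	/* stands in for gl_ail_count */
};

/* already a no-op when ail_count is zero */
static void flush_ail(struct fake_glock *gl)
{
	while (gl->ail_count > 0)
		gl->ail_count--;	/* "write back" one buffer */
}

static void demote(struct fake_glock *gl)
{
	/* the suspect conditional: only flush when the flag says dirty */
	if (gl->dirty) {
		flush_ail(gl);
		gl->dirty = false;
	}

	if (gl->ail_count != 0)
		printf("assert would fire: ail_count=%d\n", gl->ail_count);
	else
		printf("demote ok\n");
}

int main(void)
{
	struct fake_glock normal = { .dirty = true, .ail_count = 2 };
	struct fake_glock racy = { .dirty = false, .ail_count = 2 };

	demote(&normal);	/* prints "demote ok" */
	demote(&racy);		/* prints "assert would fire: ail_count=2" */
	return 0;
}

Calling flush_ail() unconditionally in demote() makes the second case behave
like the first, which is the sort of change being suggested above.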
What's also kind of worrying is that only lock_dlm1 is supposed to be allowed
to do blocking gfs callbacks, but according to the traceback lock_dlm2 is in a
blocking callback, so there is something not quite right about all of this.
Hmm, OK, disregard comment #6; I'm a little slow. It looks like we got here via
lock_dlm2 doing this:

ls->fscb(ls->sdp, LM_CB_DROPLOCKS, NULL);

from where we go all the way down to the withdraw.
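For anyone not familiar with the lock module interface, the line above is a
callback through a function pointer held in the lockspace. A rough user-space
sketch of just that dispatch (not the lock_dlm source; everything except fscb,
sdp and LM_CB_DROPLOCKS is made up, and the LM_CB_DROPLOCKS value here is
arbitrary) looks like this:

#include <stdio.h>

#define LM_CB_DROPLOCKS 1	/* placeholder value for illustration */

struct fake_sbd {
	const char *name;
};

struct fake_lockspace {
	struct fake_sbd *sdp;
	void (*fscb)(struct fake_sbd *sdp, unsigned int type, void *data);
};

/* stands in for the fs-side callback that can end up in the withdraw path */
static void fs_callback(struct fake_sbd *sdp, unsigned int type, void *data)
{
	(void)data;
	if (type == LM_CB_DROPLOCKS)
		printf("%s: drop-locks callback received\n", sdp->name);
}

int main(void)
{
	struct fake_sbd sbd = { .name = "gfs2-test" };
	struct fake_lockspace ls_obj = { .sdp = &sbd, .fscb = fs_callback };
	struct fake_lockspace *ls = &ls_obj;

	/* the call quoted in the comment above */
	ls->fscb(ls->sdp, LM_CB_DROPLOCKS, NULL);
	return 0;
}

So the withdraw ends up happening on the lock_dlm2 thread's stack because that
is the thread invoking the filesystem's callback.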
Fixing Product Name. Cluster Suite components were integrated into Enterprise
Linux for version 5.0.
Adding GFS2 into the bug summary so it appears on my list.
Probably won't be resolved for 5.1. Need to consider it for release notes.
The bug encountered in comment #1 may well be the bug which Ben fixed in this patch:
As Dave says in comment #2, we need to ensure that the lock_dlm threads do not
get stuck if we are to do a graceful withdraw. If they do, then we will not be
able to unlock and hence will be unable to umount, so there is nothing that could
reasonably be done in this case to improve things, other than not calling
withdraw from a function which may potentially be used by lock_dlm during a callback.
Regarding comment #7, the default is to never do droplocks callbacks in the
current code, so it's very unlikely someone will turn this feature on.
I tested withdraw from process context by changing the gfs2_assert_withdraw() in
the symlink operation to always trigger and then creating a symlink. The symlink
operation hung after the withdraw occurred, but a ctrl-c allowed me to exit that
process and upon moving to a directory outside of the gfs2 filesystem, I was
able to umount the filesystem cleanly.
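For reference, a user-space sketch of the kind of test described above (this is
not the real gfs2_assert_withdraw() or symlink code; all names are illustrative)
would look something like this, with the assertion forced to fail in a
process-context operation:

#include <stdio.h>
#include <stdbool.h>

struct fake_sbd {
	bool withdrawn;
};

static int do_withdraw(struct fake_sbd *sdp, const char *assertion)
{
	if (!sdp->withdrawn) {
		sdp->withdrawn = true;
		printf("withdrawing: assertion \"%s\" failed\n", assertion);
	}
	return -1;
}

/* evaluates to 0 when the assertion holds, otherwise withdraws the fs */
#define fake_assert_withdraw(sdp, assertion) \
	((assertion) ? 0 : do_withdraw((sdp), #assertion))

static int fake_symlink(struct fake_sbd *sdp)
{
	/* passing 0 forces the assertion to fail, mimicking the test above */
	if (fake_assert_withdraw(sdp, 0))
		return -1;	/* the operation fails once we have withdrawn */
	return 0;
}

int main(void)
{
	struct fake_sbd sbd = { .withdrawn = false };

	if (fake_symlink(&sbd))
		printf("symlink failed; fs withdrawn, but umount should still work\n");
	return 0;
}

The point of the test was simply that a withdraw triggered from process
context, unlike one triggered from a lock_dlm callback, leaves the lock_dlm
threads free to run, so the subsequent umount can complete.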
So it looks like the only problem is the one mentioned in comment #2, where we
need to audit the lock_dlm thread paths to ensure that we don't call
gfs2_assert_withdraw(), or (as an alternate solution) we do call it, but accept
that it will not be possible to umount the filesystem (i.e. leave things as they are).
I have no real strong feelings one way or the other, but we should bear in mind
that there is still a plan to update the glock state machine in the future, as
per bz #235697, #236404, #236088 and #221152, and this is likely to solve the
problem at that time.
So I guess I'm canvassing for opinion at this stage as to whether we need to do
anything here or not...
Moving this one to the 5.2 proposed list. Not convinced the remaining scenarios
can be addressed.
Moving to 5.3.
Moving to 5.4, there is nothing we can do about this in the short term.
*** Bug 496884 has been marked as a duplicate of this bug. ***
Just doing some routine cleanup of GFS2 bug records.
Ben Marzinski posted a patch to cluster-devel for this problem
on 22 March 2016. I'm not sure what we want to do with this
bz, or where we want to propagate the patch, but I'm reassigning
this bug to Ben.