Description of problem:
When trying to umount a withdrawn filesystem, umount will hang. I am using Linus's git tree, pulled down 2/15/07.

How reproducible:
I've only done it once, but once I gather all this information I will attempt to reproduce it some more.

Steps to Reproduce:
1. Get your filesystem to withdraw.
2. Try to umount the filesystem.

Actual results:
umount will hang.

Expected results:
umount shouldn't hang.

Additional info:
I will provide logs with sysrq information.
Created attachment 148240 [details] logs with sysrq
I don't think we can allow gfs to do a withdraw in the context of a dlm callback.
Well, this is where we withdrew:

    void gfs2_meta_inval(struct gfs2_glock *gl)
    {
            struct gfs2_sbd *sdp = gl->gl_sbd;
            struct inode *aspace = gl->gl_aspace;
            struct address_space *mapping = gl->gl_aspace->i_mapping;

            gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count));

            atomic_inc(&aspace->i_writecount);
            truncate_inode_pages(mapping, 0);
            atomic_dec(&aspace->i_writecount);

            gfs2_assert_withdraw(sdp, !mapping->nrpages);
    }

at the first gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count)). Instead of withdrawing, wouldn't it be best to just return? I cannot find any comments in the code explaining what gl_ail_count is or what it refers to. Could somebody explain this to me so I can try to figure out why this problem happened in the first place?
If gl_ail_count is non-zero, then this means that this lock is still part of a transaction. This should be impossible since a journal flush should have been done before trying to demote this lock. The journal flush is conditional upon this glock being part of a transaction, so it looks to me like a race condition.
It could potentially be that GLF_DIRTY is getting set or cleared in the wrong place, since the code in meta_go_sync() which flushes the blocks relating to the glock in question appears to be conditional upon that flag. To be honest, I'm rather skeptical that that particular code is correct anyway: the functions which are called are no-ops when there are no buffers to be flushed, so it would probably be perfectly ok to just remove that conditional and see if that improves the situation.
What's also kind of worrying is that only lock_dlm1 is supposed to be allowed to do blocking GFS callbacks, but according to the traceback lock_dlm2 is in a blocking callback, so there is something not quite right about all of this.
Hmm, ok, disregard comment 6; I'm a little slow. It looks like we got here via lock_dlm2 doing this:

    if (drop)
            ls->fscb(ls->sdp, LM_CB_DROPLOCKS, NULL);

from which the call chain runs:

    gfs2_quota_scan
      gfs2_lvb_unhold
        gfs2_glmutex_unlock
          run_queue
            rq_demote
              gfs2_glock_drop_th
                drop_bh

and so on, all the way down to the withdraw.
Fixing Product Name. Cluster Suite components were integrated into Enterprise Linux for version 5.0.
Adding GFS2 into the bug summary so it appears on my list.
Probably won't be resolved for 5.1. Need to consider it for release notes.
The bug encountered in comment #1 may well be the bug which Ben fixed in this patch: http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=4ec0ffc04e621d115a68386500adde5cdea9fb34 As Dave says in comment #2, we need to ensure that the lock_dlm threads do not get stuck if we are to do a graceful withdraw. If they do, then we will not be able to unlock and hence will be unable to umount, so there is nothing that could reasonably be done in this case to improve things, other than not calling withdraw from a function which may potentially be used by lock_dlm during a callback. Regarding comment #7, the default is to never do droplocks callbacks in the current code, so it's very unlikely anyone will turn this feature on.
I tested withdraw from process context by changing the gfs2_assert_withdraw() in the symlink operation to always trigger and then creating a symlink. The symlink operation hung after the withdraw occurred, but a ctrl-c allowed me to exit that process, and upon moving to a directory outside of the gfs2 filesystem I was able to umount the filesystem cleanly. So it looks like the only problem is the one mentioned in comment #2: we need to audit the lock_dlm thread paths to ensure that we don't call gfs2_assert_withdraw(), or (as an alternate solution) we do call it but accept that it will not be possible to umount the filesystem (i.e. leave things as they are). I have no strong feelings one way or the other, but we should bear in mind that there is still a plan to update the glock state machine in the future, as per bz #235697, #236404, #236088 and #221152, and this is likely to solve the problem at that time. So I guess I'm canvassing for opinion at this stage as to whether we need to do anything here or not...
Moving this one to the 5.2 proposed list. Not convinced the remaining scenarios can be addressed.
Moving to 5.3.
Moving to 5.4, there is nothing we can do about this in the short term.
*** Bug 496884 has been marked as a duplicate of this bug. ***
Just doing some routine cleanup of GFS2 bug records. Ben Marzinski posted a patch to cluster-devel for this problem on 22 March 2016. I'm not sure what we want to do with this bz, or where we want to propagate the patch, but I'm reassigning this bug to Ben.