Bug 229081 - GFS2: umount locks up when attempting to umount a withdrawn filesystem
Status: NEW
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: All
OS: Linux
Priority: low
Severity: medium
Assigned To: Ben Marzinski
QA Contact: Cluster QE
Keywords: FutureFeature
Duplicates: 496884
Depends On:
Blocks: 533192
Reported: 2007-02-16 15:35 EST by Josef Bacik
Modified: 2016-04-15 14:02 EDT
CC: 11 users

Doc Type: Enhancement


Attachments
logs with sysrq (651.68 KB, text/plain), 2007-02-16 15:35 EST, Josef Bacik

Description Josef Bacik 2007-02-16 15:35:27 EST
Description of problem:
When trying to umount a withdrawn filesystem, umount will hang.  I am using
Linus's git tree that was pulled down on 2/15/07.

How reproducible:
I've only done it once, but once I gather all this information I will attempt
to reproduce it some more.

Steps to Reproduce:
1. Get your filesystem to withdraw.
2. Try to umount the filesystem.
Actual results:
umount will hang

Expected results:
umount shouldn't hang

Additional info:
I will provide logs with sysrq information.
Comment 1 Josef Bacik 2007-02-16 15:35:27 EST
Created attachment 148240 [details]
logs with sysrq
Comment 2 David Teigland 2007-02-16 16:34:27 EST
I don't think we can allow gfs to do a withdraw in the context of a dlm callback.

Comment 3 Josef Bacik 2007-02-18 16:37:43 EST
Well, this is where we withdrew:

void gfs2_meta_inval(struct gfs2_glock *gl)
{
        struct gfs2_sbd *sdp = gl->gl_sbd;
        struct inode *aspace = gl->gl_aspace;
        struct address_space *mapping = gl->gl_aspace->i_mapping;

        gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count));

        atomic_inc(&aspace->i_writecount);
        truncate_inode_pages(mapping, 0);
        atomic_dec(&aspace->i_writecount);

        gfs2_assert_withdraw(sdp, !mapping->nrpages);
}

at the gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count)) call.  Instead
of withdrawing, wouldn't it be best to just return?  I cannot find any comments
at all in the code explaining what gl_ail_count is or what it refers to.  Could
somebody explain this to me so I can try and figure out why this problem may
have happened to begin with?

Comment 4 Steve Whitehouse 2007-02-19 07:50:17 EST
If gl_ail_count is non-zero, then this means that this lock is still part of a
transaction. This should be impossible since a journal flush should have been
done before trying to demote this lock. The journal flush is conditional upon
this glock being part of a transaction, so it looks to me like a race condition.
Comment 5 Steve Whitehouse 2007-02-19 08:54:29 EST
It could potentially be that GLF_DIRTY is getting set or reset in the wrong
place since the code in meta_go_sync() which flushes the blocks relating to the
glock in question appears to be conditional upon that.

To be honest I'm rather skeptical that that particular code is correct anyway
since the functions which are called are no-ops in the case where there are no
buffers etc to be flushed, so it would probably be perfectly ok to just remove
that conditional and see if that improves the situation.
Comment 6 Josef Bacik 2007-02-19 14:50:29 EST
What's also kind of worrying is that only lock_dlm1 is supposed to be allowed
to do blocking gfs cb's, but according to the traceback lock_dlm2 is in a
blocking cb, so there is something not quite right about all of this.
Comment 7 Josef Bacik 2007-02-19 15:04:15 EST
Hmm, OK, disregard comment 6, I'm a little slow; it looks like we got here via
lock_dlm2 doing this

if (drop)
       ls->fscb(ls->sdp, LM_CB_DROPLOCKS, NULL);

where we do

gfs2_quota_scan
  gfs2_lvb_unhold
    gfs2_glmutex_unlock
      run_queue
        rq_demote
          gfs2_glock_drop_th
            drop_bh

blah blah all the way down to the withdraw.
Comment 9 Kiersten (Kerri) Anderson 2007-04-23 13:43:04 EDT
Fixing Product Name.  Cluster Suite components were integrated into Enterprise
Linux for version 5.0.
Comment 10 Steve Whitehouse 2007-04-27 08:16:09 EDT
Adding GFS2 into the bug summary so it appears on my list.
Comment 11 Kiersten (Kerri) Anderson 2007-06-05 11:29:29 EDT
Probably won't be resolved for 5.1.  Need to consider it for release notes.
Comment 12 Steve Whitehouse 2007-06-21 09:24:55 EDT
The bug encountered in comment #1 may well be the bug which Ben fixed in this patch:

http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=4ec0ffc04e621d115a68386500adde5cdea9fb34

As Dave says in comment #2, we need to ensure that the lock_dlm threads do not
get stuck if we are to do a graceful withdraw. If they do, then we will not be
able to unlock and hence will be unable to umount, so there is nothing that
could reasonably be done in this case to improve things, other than not calling
withdraw from a function which may potentially be used by lock_dlm during a callback.

Regarding comment #7, the default is to never do droplocks callbacks in the
current code, so it's very unlikely anyone will turn this feature on.
Comment 13 Steve Whitehouse 2007-06-21 09:55:39 EDT
I tested withdraw from process context by changing the gfs2_assert_withdraw() in
the symlink operation to always trigger and then creating a symlink. The symlink
operation hung after the withdraw occurred, but a ctrl-c allowed me to exit that
process and upon moving to a directory outside of the gfs2 filesystem, I was
able to umount the filesystem cleanly.

So it looks like the only problem is the one mentioned in comment #2, where we
need to audit the lock_dlm thread paths to ensure that we don't call
gfs2_assert_withdraw(), or (as an alternate solution) we do call it, but accept
that it will not be possible to umount the filesystem (i.e. leave things as
they are).

I have no real strong feelings one way or the other, but we should bear in mind
that there is still a plan to update the glock state machine in the future as
per bz #235697, #236404, #236088 and #221152, and this is likely to solve the
problem at that time.

So I guess I'm canvassing for opinion at this stage as to whether we need to do
anything here or not...
Comment 14 Kiersten (Kerri) Anderson 2007-07-17 11:31:55 EDT
Moving this one to the 5.2 proposed list.  Not convinced the remaining
scenarios can be addressed.
Comment 16 Kiersten (Kerri) Anderson 2007-11-02 17:13:15 EDT
Moving to 5.3.
Comment 17 Steve Whitehouse 2008-07-09 10:50:44 EDT
Moving to 5.4, there is nothing we can do about this in the short term.
Comment 21 Steve Whitehouse 2010-01-27 06:04:24 EST
*** Bug 496884 has been marked as a duplicate of this bug. ***
Comment 22 Robert Peterson 2016-04-15 14:02:47 EDT
Just doing some routine cleanup of GFS2 bug records.

Ben Marzinski posted a patch to cluster-devel for this problem
on 22 March 2016. I'm not sure what we want to do with this
bz, or where we want to propagate the patch, but I'm reassigning
this bug to Ben.
