Description of problem:
When trying to umount a withdrawn filesystem, umount will hang. I am using Linus's git tree, pulled down 2/15/07.

How reproducible:
I've only done it once, but once I gather all this information I will attempt to reproduce it some more.

Steps to Reproduce:
1. Get your filesystem to withdraw.
2. Try to umount the filesystem.

Actual results:
umount will hang.

Expected results:
umount shouldn't hang.

Additional info:
I will provide logs with sysrq information.
Created attachment 148240 [details] logs with sysrq
I don't think we can allow gfs to do a withdraw in the context of a dlm callback.
Well, this is where we withdrew:

    void gfs2_meta_inval(struct gfs2_glock *gl)
    {
            struct gfs2_sbd *sdp = gl->gl_sbd;
            struct inode *aspace = gl->gl_aspace;
            struct address_space *mapping = gl->gl_aspace->i_mapping;

            gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count));

            atomic_inc(&aspace->i_writecount);
            truncate_inode_pages(mapping, 0);
            atomic_dec(&aspace->i_writecount);

            gfs2_assert_withdraw(sdp, !mapping->nrpages);
    }

at the first gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count)). Instead of withdrawing, wouldn't it be best to just return? I cannot find any comments in the code explaining what gl_ail_count is or what it refers to. Could somebody explain this to me so I can try to figure out why this problem happened in the first place?
If gl_ail_count is non-zero, then this means that this lock is still part of a transaction. This should be impossible since a journal flush should have been done before trying to demote this lock. The journal flush is conditional upon this glock being part of a transaction, so it looks to me like a race condition.
It could potentially be that GLF_DIRTY is getting set or cleared in the wrong place, since the code in meta_go_sync() which flushes the blocks relating to the glock in question appears to be conditional upon that flag. To be honest, I'm rather skeptical that that particular code is correct anyway: the functions which are called are no-ops when there are no buffers to be flushed, so it would probably be perfectly ok to just remove that conditional and see if that improves the situation.
What's also kind of worrying is that only lock_dlm1 is supposed to be allowed to do blocking GFS callbacks, but according to the traceback lock_dlm2 is in a blocking callback, so there is something not quite right about all of this.
Hmm, ok, disregard comment 6; I'm a little slow. It looks like we got here via lock_dlm2 doing this:

    if (drop)
            ls->fscb(ls->sdp, LM_CB_DROPLOCKS, NULL);

from which the call chain runs:

    gfs2_quota_scan
      gfs2_lvb_unhold
        gfs2_glmutex_unlock
          run_queue
            rq_demote
              gfs2_glock_drop_th
                drop_bh

and so on, all the way down to the withdraw.
Fixing Product Name. Cluster Suite components were integrated into Enterprise Linux for version 5.0.
Adding GFS2 into the bug summary so it appears on my list.
Probably won't be resolved for 5.1. Need to consider it for release notes.
The bug encountered in comment #1 may well be the bug which Ben fixed in this patch: http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=4ec0ffc04e621d115a68386500adde5cdea9fb34 As Dave says in comment #2, we need to ensure that the lock_dlm threads do not get stuck if we are to do a graceful withdraw. If they do, then we will not be able to unlock and hence will be unable to umount, so there is nothing that could reasonably be done in this case to improve things, other than not calling withdraw from a function which may potentially be used by lock_dlm during a callback. Regarding comment #7, the default is to never do droplocks callbacks in the current code, so it's very unlikely anyone will turn this feature on.
I tested withdraw from process context by changing the gfs2_assert_withdraw() in the symlink operation to always trigger and then creating a symlink. The symlink operation hung after the withdraw occurred, but a ctrl-c allowed me to exit that process, and upon moving to a directory outside of the gfs2 filesystem I was able to umount the filesystem cleanly. So it looks like the only problem is the one mentioned in comment #2: we need to audit the lock_dlm thread paths to ensure that we don't call gfs2_assert_withdraw(), or (as an alternate solution) we do call it but accept that it will not be possible to umount the filesystem (i.e. leave things as they are). I have no strong feelings one way or the other, but we should bear in mind that there is still a plan to update the glock state machine in the future, as per bz #235697, #236404, #236088 and #221152, and this is likely to solve the problem at that time. So I guess I'm canvassing for opinion at this stage as to whether we need to do anything here or not...
Moving this one to the 5.2 proposed list. Not convinced the remaining scenarios can be addressed.
Moving to 5.3.
Moving to 5.4, there is nothing we can do about this in the short term.
*** Bug 496884 has been marked as a duplicate of this bug. ***
Just doing some routine cleanup of GFS2 bug records. Ben Marzinski posted a patch to cluster-devel for this problem on 22 March 2016. I'm not sure what we want to do with this bz, or where we want to propagate the patch, but I'm reassigning this bug to Ben.