Bug 447920 - GFS2: deadlock running d_io w/ jdata on lock_nolock
Summary: GFS2: deadlock running d_io w/ jdata on lock_nolock
Keywords:
Status: CLOSED DUPLICATE of bug 432057
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: i386
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Assignee: Steve Whitehouse
QA Contact: GFS Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-05-22 13:43 UTC by Nate Straz
Modified: 2009-05-28 03:39 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-30 15:27:55 UTC
Target Upstream Version:
Embargoed:


Attachments
glock list from tank-01 (139.46 KB, application/x-bzip2)
2008-05-22 13:43 UTC, Nate Straz
no flags Details
sysrq-t output from tank-01 (18.42 KB, application/x-bzip2)
2008-05-22 13:46 UTC, Nate Straz
no flags Details
Call traces for a hang on my exxon cluster (19.63 KB, application/octet-stream)
2008-05-22 23:06 UTC, Robert Peterson
no flags Details
Glock dump from the latest hang (48.42 KB, text/plain)
2008-05-22 23:08 UTC, Robert Peterson
no flags Details
Differences between latest readv and what we gave aol (47.24 KB, patch)
2008-05-23 13:43 UTC, Robert Peterson
no flags Details | Diff
Bob's 27May2008 patch (1.33 KB, patch)
2008-05-27 18:27 UTC, Robert Peterson
no flags Details | Diff
Upstream attempt at a readpage lock ordering fix (1.75 KB, patch)
2008-05-28 09:12 UTC, Steve Whitehouse
no flags Details | Diff
Latest upstream fix (1.60 KB, patch)
2008-05-29 12:36 UTC, Steve Whitehouse
no flags Details | Diff

Description Nate Straz 2008-05-22 13:43:47 UTC
Description of problem:

While running through single node GFS2 testing, d_io hung while running with
jdata enabled.  The tags running at the time of the hang were laio_lock3 and
rwransyncsmall.

I will attach the sysrq-t output and glock dump.


Version-Release number of selected component (if applicable):
kernel-2.6.18-92.el5
kmod-gfs2-1.92-1.13 -- GFS2 Overlay (built May 21 2008 14:34:22)

How reproducible:
Unknown

Comment 1 Nate Straz 2008-05-22 13:43:47 UTC
Created attachment 306378 [details]
glock list from tank-01

Comment 2 Nate Straz 2008-05-22 13:46:20 UTC
Created attachment 306379 [details]
sysrq-t output from tank-01

Comment 3 Robert Peterson 2008-05-22 14:35:50 UTC
Initial analysis of locked gfs2 processes:

glock_workque D 00003657  2740  8008     19          8009  8006 (L-TLB)
	      Waiting on sd_log_flush_lock semaphore
	      gfs2_log_flush
glock_workque D 00003657  2740  8009     19          8010  8008 (L-TLB)
	      Waiting on sd_log_flush_lock semaphore
	      gfs2_log_flush
gfs2_logd     D 000013FD  2860  8531     19          8532  8530 (L-TLB)
	      Waiting on sd_log_flush_lock semaphore
	      gfs2_ail1_empty->gfs2_log_flush
gfs2_quotad   D 00001401  2788  8532     19         10161  8531 (L-TLB)
	      Waiting on sd_log_flush_lock semaphore
	      gfs2_do_trans_begin->gfs2_log_reserve
pdflush       D 00001402  2532 10161     19                8532 (L-TLB)
	      down_read must be for sd_log_flush_lock
	      gfs2_log_reserve
xdoio         D 000013FB  2672 11277  11274         11278       (NOTLB)
	      gfs2_readpage->gfs2_glock_nq_atime->gfs2_glock_wait->...wait_on_bit
xdoio         D 000013FB  2720 11278  11274         11279 11277 (NOTLB)
	      find_lock_page (page lock)
xdoio         D 000013FB  2720 11280  11274         11282 11279 (NOTLB)
	      gfs2_permission->gfs2_glock_wait
xdoio         D 000013FB  2428 11285  11284                     (NOTLB)
	      gfs2_jdata_writepages->gfs2_log_flush->sd_log_flush_lock
updatedb      D 00003664  2612 13171  13166                     (NOTLB)
	      down_read must be for sd_log_flush_lock
	      gfs2_log_reserve

Now we need to track down who locks what and who holds what, which is
basically digging through the code.


Comment 4 Nate Straz 2008-05-22 17:47:50 UTC
I was able to hit this again w/ jdata and the test cases were laio_lock3 and
mtfile_lock2.  That's twice in a row that laio_lock3 was running during the
hang.  laio_lock3 uses io_submit async I/O and works from the end of the file to
the beginning with five processes handling requests.

Comment 5 Robert Peterson 2008-05-22 23:06:14 UTC
Created attachment 306427 [details]
Call traces for a hang on my exxon cluster

We've got a problem and I don't know if it's only jdata.  I was able to
recreate a hang similar to the one posted above, on my exxon cluster.
To recreate the problem, mount the gfs2 file system in lock_nolock and
from "try" run these commands:

cd /usr/tests/sts-rhel5.2/gfs/bin
./d_io -j -u root -n exxon-01 -m /mnt/gfs2 -S REG -l /usr/tests/sts-rhel5.2/ -r
/usr/tests/sts-rhel5.2/

The -j says to focus on the jdata tests, but I don't know if it tests
jdata exclusively.  Maybe.  Maybe not.	At any rate, the test runs for a
while, then all the processes hang up.	Up to now, I haven't gotten any
useful information, so I've been adding various instrumentation trying
to figure this out.  Here's what I know for a fact so far:

In this example, we've got a xdoio process, pid 6661, that is currently
holding the sd_log_flush_lock.	It took the lock at log.c line 323,
which is in gfs2_log_reserve where it says:

	if (wait){
		mutex_unlock(&sdp->sd_log_reserve_mutex);
		down_read(&sdp->sd_log_flush_lock);

However, the call trace doesn't show any gfs2 functions in there.
So apparently it's taken the log flush lock and gone off to do some
other menial task.  It's certainly not doing the log flush it should
have done.  Actually, it's waiting for a page lock, which I suspect is
locked by another one of our tasks, possibly a read.

There are 40 different calls to gfs2_trans_begin that cause that
particular code path to happen.  We probably need to go through them all.

Some other processes are sitting waiting for the log flush lock it
holds, and still others are waiting for glocks taken by them.  So the
next task is to figure out what happened to process 6661.

Additional information:

I also know, through instrumentation, that the sd_log_lock spinlock is 
NOT held by anyone.  The version of log.c represented in the call trace
has extra instrumentation and I've removed most of the occurrences of
"static" for the functions so the call trace will show better which
function called which other function.  I haven't gone through all the
gfs2 processes from the call trace in depth yet, but it's something we
should do.

Comment 6 Robert Peterson 2008-05-22 23:08:00 UTC
Created attachment 306428 [details]
Glock dump from the latest hang

Here's the matching glock dump from this hang.

Comment 7 Nate Straz 2008-05-22 23:23:03 UTC
(In reply to comment #5)
> To recreate the problem, mount the gfs2 file system in lock_nolock and
> from "try" run these commands:
> 
> cd /usr/tests/sts-rhel5.2/gfs/bin
> ./d_io -j -u root -n exxon-01 -m /mnt/gfs2 -S REG -l /usr/tests/sts-rhel5.2/ -r
> /usr/tests/sts-rhel5.2/
> 
> The -j says to focus on the jdata tests, but I don't know if it tests
> jdata exclusively.  Maybe.  Maybe not.

All the -j flag does to d_io is tell it to exclude the tests which aren't
supported when jdata is used.  That means tests which depend on O_DIRECT and
mmap are disabled.  The rest of the tests are exactly the same as without the -j flag.

Comment 8 Steve Whitehouse 2008-05-23 08:33:00 UTC
I thought we were now testing mmap and jdata for gfs2, but still excluding mmap
tests for gfs1?

Comment 9 Robert Peterson 2008-05-23 13:43:20 UTC
Created attachment 306497 [details]
Differences between latest readv and what we gave aol

Comment 10 Robert Peterson 2008-05-23 18:40:22 UTC
Notes on this problem:

1. The problem recreates with the version we gave to AOL, so it's
   not something new we've introduced since then.
2. The problem recreates regardless of normal data or jdata.
3. The problem recreates regardless of data=writeback or ordered.

The problem almost always recreates for me when d_io gets to the
"lock3" test.  I'm not sure why we haven't seen it before, except
that most of my testing was focused on dlm, whereas this is
happening when I'm using the nolock protocol.

This appears to be a brand new problem, similar to the problems we
were having earlier.  Like before, the contention seems to be
between a writer and a reader.

The writer does this:

1. gfs2_trans_begin(wait==1)
      gfs2_log_reserve(wait==1)
         down_read(&sdp->sd_log_flush_lock);
2. Acquire EX glock for an inode
3. find_lock_page()

The reader does this:

1. find_lock_page()
2. Acquire a SH "trylock" glock for the same inode

The majority of processes won't run because they're waiting for
the sd_log_flush_lock semaphore.  The writer who holds that
semaphore is blocked on the page lock that's held by the reader.
The reader is blocked on the glock held by the writer.
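
To make the inversion concrete, here's a stand-alone sketch (mine, not GFS2
code) of the same AB/BA pattern, with plain pthread mutexes standing in for
the inode glock and the page lock; the sd_log_flush_lock the writer also
holds is left out since it only widens the pile-up.  Built with gcc -pthread,
it reliably hangs, which is the point:

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t glock    = PTHREAD_MUTEX_INITIALIZER; /* stands in for the inode glock */
static pthread_mutex_t pagelock = PTHREAD_MUTEX_INITIALIZER; /* stands in for the page lock   */

static void *writer(void *arg)          /* write path */
{
	pthread_mutex_lock(&glock);     /* acquire EX glock for the inode  */
	sleep(1);                       /* window in which the reader runs */
	pthread_mutex_lock(&pagelock);  /* find_lock_page() -- blocks here */
	pthread_mutex_unlock(&pagelock);
	pthread_mutex_unlock(&glock);
	return NULL;
}

static void *reader(void *arg)          /* readpage path */
{
	pthread_mutex_lock(&pagelock);  /* find_lock_page()                */
	sleep(1);
	pthread_mutex_lock(&glock);     /* SH glock request -- blocks here */
	pthread_mutex_unlock(&glock);
	pthread_mutex_unlock(&pagelock);
	return NULL;
}

int main(void)
{
	pthread_t w, r;
	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);          /* never returns: classic AB/BA deadlock */
	pthread_join(r, NULL);
	return 0;
}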

There are four ways I can think of to fix this:

1. Fix vfs to have a prepare_read function that functions like
   the prepare_write function, in which we can acquire the glocks.
   That's a hard sell to upstream, a hard sell to rhkernel-list, and
   we can't give it to AOL because they'll be using a 5.2 kernel.
2. Kludge vfs so that it behaves better when it gets the
   AOP_TRUNCATED_PAGE return code.  Same problems as 1.
3. Change the GFS2 lock ordering like Wendy wanted.  This is a
   very big design change, and Steve has already said it isn't
   possible for reasons given in prior discussions.
4. Kludge GFS2 to get around it, which seems like our only viable
   option.

The reader is using a try lock, LM_FLAG_TRY_1CB, so presumably we
can find a way to kludge through it.  I tried one method and that
is to add a condition to add_to_queue to fail any try locks that
are done while a transaction is open.  Like this:

if (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)) {
	if (test_bit(GLF_LOCK, &gl->gl_flags))
		try_lock = 1;
	if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags) ||
	    current->journal_info != NULL)
		goto fail;
}

That seemed to make no difference; I was still able to recreate
the failure.  Perhaps the reader needs to fail its own try lock
when it realizes the page it needs is already locked.

I'm not sure how the trylock got queued while there was an EX
holder in the first place.  This is all magic that Steve Whitehouse
understands.  I would have thought the promotion of the EX lock
should have failed the trylock, or the add_to_queue of the
trylock should have failed the "try" because of the EX holder.

At any rate, we've got to find a good way to fail the try lock
reader, which will cause it to let go of the page lock, which will
free up the writer to finish its transaction and set everything back
into motion.


Comment 11 Steve Whitehouse 2008-05-27 10:39:13 UTC
The promotion of EX locks does fail try locks, but that doesn't stop them from
being queued after the EX lock is granted (but before the thread holding the
EX lock has grabbed the page lock).

So we have:

Thread A              Thread B
1. Get EX glock
2.                    Get page lock
3.                    Queue SH "try lock"
4. Try to get page lock (hang)

Now if we fail all "try locks" when we have an EX holder, the result is that
EX holders can starve out the SH locks, and processes which call readpage end
up hanging with 100% CPU usage as they spin waiting for an SH "try lock".


Comment 12 Robert Peterson 2008-05-27 18:27:25 UTC
Created attachment 306817 [details]
Bob's 27May2008 patch

This patch passes the same single-node d_io test that was previously
failing.  The fix basically changes the gfs2_readpage function so that
it takes a normal (blocking) glock, rather than a "try lock."  But before
it does, it unlocks the page passed in from vfs to get the same lock
ordering as prepare_write and the write side.  Then, after the glock is
acquired, it needs to reacquire the page lock, but in order to prevent
deadlock, it uses a "page try lock".  If the page try lock is successful,
everybody's happy and we finish the read.  If the page try lock is
unsuccessful, we unlock the glock we acquired, yield the cpu to give the
contending process a chance to run, then return AOP_TRUNCATED_PAGE to
make the VFS call the gfs2_readpage function again.
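
For reference, here's a rough sketch of that flow as I read the description
above -- an approximation, not the attached patch.  The glock calls
(gfs2_holder_init, gfs2_glock_nq, gfs2_glock_dq, gfs2_holder_uninit) are the
normal glock API; __gfs2_readpage_locked() is a hypothetical stand-in for the
actual read; and the page re-lock step is written with trylock_page(), which
on a RHEL 5.2-era kernel would instead be the deprecated find_trylock_page()
or TestSetPageLocked() discussed below:

/* Sketch only: approximates the flow described above, not the real patch. */
static int gfs2_readpage(struct file *file, struct page *page)
{
	struct gfs2_inode *ip = GFS2_I(page->mapping->host);
	struct gfs2_holder gh;
	int error;

	/* Drop the page lock the VFS handed us so the glock can be taken
	 * first, matching the lock order of prepare_write/the write side. */
	unlock_page(page);

	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
	error = gfs2_glock_nq(&gh);             /* blocking, no LM_FLAG_TRY */
	if (error)
		goto out_uninit;

	if (!trylock_page(page)) {              /* re-take the page lock, try only */
		gfs2_glock_dq(&gh);
		yield();                        /* let the contending writer run   */
		error = AOP_TRUNCATED_PAGE;     /* VFS will call readpage again    */
		goto out_uninit;
	}

	error = __gfs2_readpage_locked(file, page);  /* hypothetical helper */
	gfs2_glock_dq(&gh);
out_uninit:
	gfs2_holder_uninit(&gh);
	return error;
}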

This patch goes on top of the patch I posted to rhkernel-list for
bug #432057.

I did run into an issue: the VFS page trylock function is deprecated,
so it isn't likely to fly upstream, and I'm not sure what it will take
to get it coded properly for upstream.  Here's the compiler message:

/home/bob/27May2008.aol.debug/ops_address.c:604: warning:
‘find_trylock_page’ is deprecated (declared at include/linux/pagemap.h:76)

So the function is deprecated, but I'm not sure what to code in its place.

Other concerns:
Steve may want a crack at rewriting it.  Also, I don't know if this
affects Abhi's code for page_mkwrite fix (bug #315191).  I haven't had
a chance to review those changes yet.

For now I'm going to run it through dd_io and also see if Dean can
recreate the problem on this version.

Comment 13 Steve Whitehouse 2008-05-28 09:12:00 UTC
Created attachment 306906 [details]
Upstream attempt at a readpage lock ordering fix

This is an attempt at fixing the lock ordering for the upstream kernel. It's
based on Bob's idea and means that we now always look up the page twice (at a
minimum) on each readpage. We could potentially reduce the cost of doing this
by making readpages cover more of the corner cases (stuffed inodes being the
main one), but I'm not sure that it will make a huge difference. Really, the
only proper fix is to sort out the lock ordering at the VFS/VM level.

Note that we do not need (or want) Abhi's patch for AOL. It says clearly in
that bug that it's the first item _after_ AOL. Also, when that patch does get
applied, it will mean that gfs2_readpage in upstream and in RHEL will be
identical, which will make a lot of things much easier in the future as we
will no longer need two versions of the changes in that area.

Comment 14 Steve Whitehouse 2008-05-29 12:36:06 UTC
Created attachment 307044 [details]
Latest upstream fix

I've not been able to figure out the latest hang so far, but here is an updated
upstream version of the patch which is much simpler, at least.

Comment 15 Robert Peterson 2008-05-30 15:27:55 UTC
This bug was caused by a shortcoming of the patch I originally posted to
rhkernel-list for bug #432057.  Based on these findings, we decided to
re-spin the 432057 patch to include this fix.  That new patch was posted
to rhkernel-list today, so I'm closing this as a duplicate of that bug.


*** This bug has been marked as a duplicate of 432057 ***

