Bug 447920
| Summary: | GFS2: deadlock running d_io w/ jdata on lock_nolock | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Nate Straz <nstraz> |
| Component: | kernel | Assignee: | Steve Whitehouse <swhiteho> |
| Status: | CLOSED DUPLICATE | QA Contact: | GFS Bugs <gfs-bugs> |
| Severity: | low | Priority: | low |
| Version: | 5.2 | CC: | adas, cluster-maint, edamato, rpeterso |
| Target Milestone: | rc | Target Release: | --- |
| Hardware: | i386 | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2008-05-30 15:27:55 UTC |
Description
Nate Straz
2008-05-22 13:43:47 UTC
Created attachment 306378 [details]
glock list from tank-01
Created attachment 306379 [details]
sysrq-t output from tank-01
Initial analysis of locked gfs2 processes:

glock_workque D 00003657  2740  8008    19  8009  8006 (L-TLB)
    Waiting on sd_log_flush_lock semaphore
    gfs2_log_flush
glock_workque D 00003657  2740  8009    19  8010  8008 (L-TLB)
    Waiting on sd_log_flush_lock semaphore
    gfs2_log_flush
gfs2_logd     D 000013FD  2860  8531    19  8532  8530 (L-TLB)
    Waiting on sd_log_flush_lock semaphore
    gfs2_ail1_empty->gfs2_log_flush
gfs2_quotad   D 00001401  2788  8532    19 10161  8531 (L-TLB)
    Waiting on sd_log_flush_lock semaphore
    gfs2_do_trans_begin->gfs2_log_reserve
pdflush       D 00001402  2532 10161    19  8532 (L-TLB)
    down_read must be for sd_log_flush_lock
    gfs2_log_reserve
xdoio         D 000013FB  2672 11277 11274 11278 (NOTLB)
    gfs2_readpage->gfs2_glock_nq_atime->gfs2_glock_wait->...wait_on_bit
xdoio         D 000013FB  2720 11278 11274 11279 11277 (NOTLB)
    find_lock_page (page lock)
xdoio         D 000013FB  2720 11280 11274 11282 11279 (NOTLB)
    gfs2_permission->gfs2_glock_wait
xdoio         D 000013FB  2428 11285 11284 (NOTLB)
    gfs2_jdata_writepages->gfs2_log_flush->sd_log_flush_lock
updatedb      D 00003664  2612 13171 13166 (NOTLB)
    down_read must be for sd_log_flush_lock
    gfs2_log_reserve

Now we need to track down who locks what and who holds what, which is basically digging through the code.

I was able to hit this again w/ jdata and the test cases were laio_lock3 and mtfile_lock2. That's twice in a row that laio_lock3 was running during the hang. laio_lock3 uses io_submit async I/O and works from the end of the file to the beginning with five processes handling requests.

Created attachment 306427 [details]
Call traces for a hang on my exxon cluster
We've got a problem and I don't know if it's only jdata. I was able to
recreate a hang similar to the one posted above, on my exxon cluster.
To recreate the problem, mount the gfs2 file system in lock_nolock and
from "try" run these commands:
cd /usr/tests/sts-rhel5.2/gfs/bin
./d_io -j -u root -n exxon-01 -m /mnt/gfs2 -S REG -l /usr/tests/sts-rhel5.2/ -r
/usr/tests/sts-rhel5.2/
The -j says to focus on the jdata tests, but I don't know if it tests
jdata exclusively. Maybe. Maybe not. At any rate, the test runs for a
while, then all the processes hang up. Up to now, I haven't gotten any
useful information, so I've been adding various instrumentation trying
to figure this out. Here's what I know for a fact so far:
In this example, we've got a xdoio process, pid 6661, that is currently
holding the sd_log_flush_lock. It took the lock at log.c line 323,
which is in gfs2_log_reserve where it says:
if (wait) {
	mutex_unlock(&sdp->sd_log_reserve_mutex);
	down_read(&sdp->sd_log_flush_lock);
However, the call trace doesn't show any gfs2 functions in there.
So apparently it's taken the log flush lock and gone off to do some
other menial task. It's certainly not doing the log flush it should
have done. Actually, it's waiting for a page lock, which I suspect is
locked by another one of our tasks, possibly a read.
There are 40 different calls to gfs2_trans_begin that cause that
particular code path to happen. We probably need to go through them all.
Some other processes are sitting waiting for the log flush lock it
holds, and still others are waiting for glocks taken by them. So the
next task is to figure out what happened to process 6661.
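To make the lock's role concrete: sd_log_flush_lock is a rw_semaphore held in read mode for the lifetime of a transaction, while the log flush needs the write side. Here is a minimal userspace sketch of that pattern, assuming nothing beyond what's described above; the pthread calls are stand-ins, not the kernel API:

/* Minimal sketch, NOT kernel code: models how sd_log_flush_lock is held.
 * A transaction holds the read side for its whole lifetime, so the log
 * flush (the write side) must wait for every open transaction to end.
 * If a transaction holder then blocks elsewhere (e.g. on a page lock),
 * everything queued behind the flush backs up, as seen in the traces. */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t sd_log_flush_lock = PTHREAD_RWLOCK_INITIALIZER;

static void trans_begin(void) { pthread_rwlock_rdlock(&sd_log_flush_lock); }
static void trans_end(void)   { pthread_rwlock_unlock(&sd_log_flush_lock); }

static void log_flush(void)
{
	pthread_rwlock_wrlock(&sd_log_flush_lock); /* blocks while any transaction is open */
	/* ... write out the journal ... */
	pthread_rwlock_unlock(&sd_log_flush_lock);
}

int main(void)
{
	trans_begin();   /* like gfs2_trans_begin() -> gfs2_log_reserve() */
	/* if this thread now sleeps on an unrelated lock, log_flush() in
	 * another thread -- and everyone queued behind it -- waits too */
	trans_end();
	log_flush();
	printf("done\n");
	return 0;
}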
Additional information:
I also know, through instrumentation, that the sd_log_lock spinlock is
NOT held by anyone. The version of log.c represented in the call trace
has extra instrumentation, and I've removed most of the occurrences of
"static" from the functions so the call trace will show more clearly which
function called which other function. I haven't gone through all the
gfs2 processes from the call trace in depth yet, but it's something we
should do.
Created attachment 306428 [details]
Glock dump from the latest hang
Here's the matching glock dump from this hang.
(In reply to comment #5)
> To recreate the problem, mount the gfs2 file system in lock_nolock and
> from "try" run these commands:
>
> cd /usr/tests/sts-rhel5.2/gfs/bin
> ./d_io -j -u root -n exxon-01 -m /mnt/gfs2 -S REG -l /usr/tests/sts-rhel5.2/ -r
> /usr/tests/sts-rhel5.2/
>
> The -j says to focus on the jdata tests, but I don't know if it tests
> jdata exclusively. Maybe. Maybe not.

All the -j flag does to d_io is tell it to exclude the tests which aren't supported when jdata is used. That means tests which depend on O_DIRECT and mmap are disabled. The rest of the tests are exactly the same as without the -j flag.

I thought we were now testing mmap and jdata for gfs2, but still excluding mmap tests for gfs1?

Created attachment 306497 [details]
Differences between latest readv and what we gave aol
Notes on this problem:
1. The problem recreates with the version we gave to AOL, so it's
not something new we've introduced since then.
2. The problem recreates regardless of normal data or jdata.
3. The problem recreates regardless of data=writeback or ordered.
The problem almost always recreates for me when d_io gets to the
"lock3" test. I'm not sure why we haven't seen it before, except
that most of my testing was focused on dlm, whereas this is
happening when I'm using nolock protocol.
This appears to be a brand new problem, similar to the problems we
were having earlier. Like before, the contention seems to be
between a writer and a reader.
The writer does this:
1. gfs2_trans_begin(wait==1)
gfs2_log_reserve(wait==1)
down_read(&sdp->sd_log_flush_lock);
2. Acquire EX glock for an inode
3. find_lock_page()
The reader does this:
1. find_lock_page()
2. Acquire a SH "trylock" glock for the same inode
The majority of processes won't run because they're waiting for
the sd_log_flush_lock semaphore. The writer who holds that
semaphore is blocked on the page lock that's held by the reader.
The reader is blocked on the glock held by the writer.
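This is the classic ABBA lock inversion. For illustration only, here is a compile-checkable sketch with plain pthread mutexes standing in for the glock and the page lock (the names are illustrative, not GFS2 API, and it omits the sd_log_flush_lock dimension; run with unlucky timing it deadlocks by design):

/* ABBA deadlock sketch, NOT kernel code: 'glock' and 'page_lock' are
 * ordinary mutexes standing in for the inode glock and the page lock. */
#include <pthread.h>

static pthread_mutex_t glock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

static void *writer(void *unused)       /* the write path described above */
{
	pthread_mutex_lock(&glock);     /* acquire EX glock                */
	pthread_mutex_lock(&page_lock); /* find_lock_page() -- blocks      */
	pthread_mutex_unlock(&page_lock);
	pthread_mutex_unlock(&glock);
	return NULL;
}

static void *reader(void *unused)       /* the readpage path */
{
	pthread_mutex_lock(&page_lock); /* find_lock_page()                */
	pthread_mutex_lock(&glock);     /* SH glock -- blocks: deadlock
	                                 * (in GFS2 this is a try lock,
	                                 * but it has already been queued) */
	pthread_mutex_unlock(&glock);
	pthread_mutex_unlock(&page_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, writer, NULL);
	pthread_create(&b, NULL, reader, NULL);
	pthread_join(a, NULL);          /* with unlucky timing, never returns */
	pthread_join(b, NULL);
	return 0;
}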
There are four ways I can think of to fix this:
1. Fix vfs to have a prepare_read function that works like
the prepare_write function, in which we can acquire the glocks.
That's a hard sell upstream, a hard sell to rhkernel-list, and
we can't give it to AOL because they'll be using a 5.2 kernel.
2. Kludge vfs so that it behaves better when it gets the
AOP_TRUNCATED_PAGE return code. Same problems as 1.
3. Change the GFS2 lock ordering like Wendy wanted. This is a
very big design change, and Steve has already said it isn't
possible for reasons given in prior discussions.
4. Kludge GFS2 to get around it, which seems like our only viable
option.
The reader is using a try lock, LM_FLAG_TRY_1CB, so presumably we
can find a way to kludge through it. I tried one method and that
is to add a condition to add_to_queue to fail any try locks that
are done while a transaction is open. Like this:
if (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)) {
	/* note when a try lock is queued against a busy glock */
	if (test_bit(GLF_LOCK, &gl->gl_flags))
		try_lock = 1;
	/* fail the try lock if an invalidate is in progress, or
	 * (the new condition) if the caller has a transaction
	 * open, i.e. current->journal_info is set */
	if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags) ||
	    current->journal_info != NULL)
		goto fail;
}
That seemed to make no difference; I was still able to recreate
the failure. Perhaps the reader needs to fail his own try lock
when he realizes the page he will be needing is already locked.
I'm not sure how the trylock got queued while there was an EX
holder in the first place. This is all magic that Steve Whitehouse
understands. I would have thought the promotion of the EX lock
should have failed the trylock, or the add_to_queue of the
trylock should have failed the "try" because of the EX holder.
At any rate, we've got to find a good way to fail the try lock
reader, which will cause it to let go of the page lock, which will
free up the writer to finish its transaction and set everything back
into motion.
The promotion of EX locks does fail try locks, but that doesn't stop them being queued after the EX lock is granted (but before the EX lock holding thread has grabbed the page lock). So we have:

Thread A                          Thread B
1. Get EX glock
                                  2. Get page lock
                                  3. Queue SH "try lock"
4. Try to get page lock (hang)

Now if we fail all "try lock"s when we have an EX holder, then the result is that EX holders can starve out the SH locks, and we get processes which call readpage hanging with 100% cpu usage as they spin waiting for an SH "try lock".

Created attachment 306817 [details]
Bob's 27May2008 patch

This patch passes the same single-node d_io test that was previously failing. The fix basically changes the gfs2_readpage function so that it takes a normal (blocking) glock, rather than a "try lock." But before it does, it unlocks the page passed in from vfs to get the same lock ordering as prepare_write and the write side. Then, after the glock is acquired, it needs to reacquire the page lock, but in order to prevent deadlock, it uses a "page try lock." If the page try lock is successful, everybody's happy and we finish the read. If the page try lock is unsuccessful, we unlock the glock we acquired, yield the cpu to give the contending process a chance to run, then return AOP_TRUNCATED_PAGE to make vfs try calling the gfs2_readpage function again.

This patch goes on top of the patch I posted to rhkernel-list for bug #432057.

I did run into an issue: the vfs page trylock function is deprecated, so it isn't likely to fly upstream. So I'm not sure what it will take to get it coded properly for upstream. Here's the compiler message:

/home/bob/27May2008.aol.debug/ops_address.c:604: warning: ‘find_trylock_page’ is deprecated (declared at include/linux/pagemap.h:76)

So the function is deprecated, but I'm not sure what to code in its place.

Other concerns: Steve may want a crack at rewriting it. Also, I don't know if this affects Abhi's code for the page_mkwrite fix (bug #315191). I haven't had a chance to review those changes yet. For now I'm going to run it through d_io and also see if Dean can recreate the problem on this version.
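As a reading aid, the flow just described maps to roughly the shape below. This is a paraphrase of the comment above, not the attachment itself; grab_sh_glock(), drop_glock(), page_trylock() and do_readpage() are placeholder names, not real GFS2 or vfs functions:

/* Paraphrase of the flow described above, NOT the actual patch.
 * grab_sh_glock(), drop_glock(), page_trylock() and do_readpage()
 * are placeholder names for the real GFS2/vfs calls. */
static int readpage_flow(struct file *file, struct page *page)
{
	struct inode *inode = page->mapping->host;
	int error;

	unlock_page(page);                 /* match the write side: glock before page lock */
	error = grab_sh_glock(inode);      /* normal blocking glock, not a "try lock" */
	if (error)
		return error;

	if (!page_trylock(page)) {         /* reacquire the page lock, but only "try" */
		drop_glock(inode);         /* back out so we can't deadlock */
		yield();                   /* give the contending process a chance to run */
		return AOP_TRUNCATED_PAGE; /* vfs calls readpage again */
	}

	error = do_readpage(file, page);   /* both locks held: finish the read */
	drop_glock(inode);
	return error;
}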
Created attachment 306906 [details]
Upstream attempt at a readpage lock ordering fix
This is an attempt at fixing the lock ordering for the upstream kernel. It's
based upon Bob's idea and means that we now always look up the page twice (at a
minimum) on each readpage. We could potentially reduce the cost of doing this
by making readpages cover more of the corner cases (stuffed inodes is the main
one), but I'm not sure that it will make a huge difference. Really the only
proper fix is to sort out the lock ordering at the VFS/VM level.
Note that we do not need (or want) Abhi's patch for AOL. It says clearly in
that bug that it's the first item _after_ AOL. Also, when that patch does get
applied, then it will mean that gfs2_readpage in upstream and in RHEL will be
identical, so that will make a lot of things much easier in the future as we
will no longer need two versions of the changes in that area.
Created attachment 307044 [details]
Latest upstream fix
I've not been able to figure out the latest hang so far, but here is an updated
upstream version of the patch which is much simpler, at least.
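The attachment itself isn't reproduced in this bug, so the following is only a hedged guess at what "much simpler" could mean: once readpage always takes the glock before the page lock, the page lock can be retaken with a plain blocking lock, dropping the trylock/yield/retry dance. Placeholder names are as in the earlier sketch:

/* Hedged guess only -- NOT the contents of the attachment. Assumes the
 * simplification is a blocking retake of the page lock, which is safe
 * once every path orders glock before page lock. */
static int readpage_flow_simpler(struct file *file, struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct inode *inode = mapping->host;
	int error;

	unlock_page(page);
	error = grab_sh_glock(inode);      /* placeholder, as in the earlier sketch */
	if (error)
		return error;

	lock_page(page);                   /* plain blocking lock: no trylock needed */
	if (page->mapping != mapping) {    /* page was truncated while we slept */
		unlock_page(page);
		drop_glock(inode);
		return AOP_TRUNCATED_PAGE; /* vfs repeats the page lookup */
	}

	error = do_readpage(file, page);
	drop_glock(inode);
	return error;
}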
This bug was caused by a shortcoming of the patch I originally posted to rhkernel-list for bug #432057. Based on these findings, we decided to re-spin the 432057 patch to include this fix. That new patch was posted to rhkernel-list today, therefore I'm closing this as a duplicate of that.

*** This bug has been marked as a duplicate of 432057 ***