Description of problem:
While running through single-node GFS2 testing, d_io hung while running with jdata enabled. The test tags running at the time of the hang were laio_lock3 and rwransyncsmall. I will attach the sysrq-t output and glock dump.

Version-Release number of selected component (if applicable):
kernel-2.6.18-92.el5
kmod-gfs2-1.92-1.13 -- GFS2 Overlay (built May 21 2008 14:34:22)

How reproducible:
Unknown
Created attachment 306378 [details] glock list from tank-01
Created attachment 306379 [details] sysrq-t output from tank-01
Initial analysis of locked gfs2 processes:

glock_workque D 00003657  2740  8008     19  8009  8006 (L-TLB)
  Waiting on sd_log_flush_lock semaphore: gfs2_log_flush
glock_workque D 00003657  2740  8009     19  8010  8008 (L-TLB)
  Waiting on sd_log_flush_lock semaphore: gfs2_log_flush
gfs2_logd     D 000013FD  2860  8531     19  8532  8530 (L-TLB)
  Waiting on sd_log_flush_lock semaphore: gfs2_ail1_empty->gfs2_log_flush
gfs2_quotad   D 00001401  2788  8532     19 10161  8531 (L-TLB)
  Waiting on sd_log_flush_lock semaphore: gfs2_do_trans_begin->gfs2_log_reserve
pdflush       D 00001402  2532 10161     19  8532       (L-TLB)
  down_read must be for sd_log_flush_lock: gfs2_log_reserve
xdoio         D 000013FB  2672 11277  11274 11278       (NOTLB)
  gfs2_readpage->gfs2_glock_nq_atime->gfs2_glock_wait->...wait_on_bit
xdoio         D 000013FB  2720 11278  11274 11279 11277 (NOTLB)
  find_lock_page (page lock)
xdoio         D 000013FB  2720 11280  11274 11282 11279 (NOTLB)
  gfs2_permission->gfs2_glock_wait
xdoio         D 000013FB  2428 11285  11284             (NOTLB)
  gfs2_jdata_writepages->gfs2_log_flush->sd_log_flush_lock
updatedb      D 00003664  2612 13171  13166             (NOTLB)
  down_read must be for sd_log_flush_lock: gfs2_log_reserve

Now we need to track down who locks what and who holds what, which is basically digging through the code.
I was able to hit this again w/ jdata and the test cases were laio_lock3 and mtfile_lock2. That's twice in a row that laio_lock3 was running during the hang. laio_lock3 uses io_submit async I/O and works from the end of the file to the beginning with five processes handling requests.
Created attachment 306427 [details] Call traces for a hang on my exxon cluster

We've got a problem, and I don't know if it's limited to jdata. I was able to recreate a hang similar to the one posted above, on my exxon cluster. To recreate the problem, mount the gfs2 file system with lock_nolock and, from "try", run these commands:

cd /usr/tests/sts-rhel5.2/gfs/bin
./d_io -j -u root -n exxon-01 -m /mnt/gfs2 -S REG -l /usr/tests/sts-rhel5.2/ -r /usr/tests/sts-rhel5.2/

The -j says to focus on the jdata tests, but I don't know if it tests jdata exclusively. Maybe. Maybe not. At any rate, the test runs for a while, then all the processes hang. Up to now I haven't gotten any useful information, so I've been adding various instrumentation trying to figure this out. Here's what I know for a fact so far:

In this example, we've got an xdoio process, pid 6661, that is currently holding the sd_log_flush_lock. It took the lock at log.c line 323, which is in gfs2_log_reserve where it says:

	if (wait) {
		mutex_unlock(&sdp->sd_log_reserve_mutex);
		down_read(&sdp->sd_log_flush_lock);

However, the call trace doesn't show any gfs2 functions in there. So apparently it has taken the log flush lock and gone off to do some other menial task; it's certainly not doing the log flush it should have done. Actually, it's waiting for a page lock, which I suspect is held by another one of our tasks, possibly a read. There are 40 different calls to gfs2_trans_begin that can cause that particular code path to happen. We probably need to go through them all. Some other processes are sitting waiting for the log flush lock it holds, and still others are waiting for glocks taken by them. So the next task is to figure out what happened to process 6661.

Additional information: I also know, through instrumentation, that the sd_log_lock spinlock is NOT held by anyone.
The version of log.c represented in the call trace has extra instrumentation and I've removed most of the occurrences of "static" for the functions so the call trace will show better which function called which other function. I haven't gone through all the gfs2 processes from the call trace in depth yet, but it's something we should do.
Created attachment 306428 [details] Glock dump from the latest hang Here's the matching glock dump from this hang.
(In reply to comment #5) > To recreate the problem, mount the gfs2 file system in lock_nolock and > from "try" run these commands: > > cd /usr/tests/sts-rhel5.2/gfs/bin > ./d_io -j -u root -n exxon-01 -m /mnt/gfs2 -S REG -l /usr/tests/sts-rhel5.2/ -r > /usr/tests/sts-rhel5.2/ > > The -j says to focus on the jdata tests, but I don't know if it tests > jdata exclusively. Maybe. Maybe not. All the -j flag does to d_io is tell it to exclude the tests which aren't supported when jdata is used. That means tests which depend on O_DIRECT and mmap are disabled. The rest of the tests are exactly the same as without -j flag.
I thought we were now testing mmap and jdata for gfs2, but still excluding mmap tests for gfs1?
Created attachment 306497 [details] Differences between latest readv and what we gave aol
Notes on this problem:

1. The problem recreates with the version we gave to AOL, so it's not something new we've introduced since then.
2. The problem recreates regardless of normal data or jdata.
3. The problem recreates regardless of data=writeback or data=ordered.

The problem almost always recreates for me when d_io gets to the "lock3" test. I'm not sure why we haven't seen it before, except that most of my testing was focused on dlm, whereas this is happening when I'm using the nolock protocol. This appears to be a brand new problem, similar to the problems we were having earlier. Like before, the contention seems to be between a writer and a reader.

The writer does this:
1. gfs2_trans_begin(wait==1) -> gfs2_log_reserve(wait==1) -> down_read(&sdp->sd_log_flush_lock)
2. Acquire an EX glock for an inode
3. find_lock_page()

The reader does this:
1. find_lock_page()
2. Acquire a SH "try lock" glock for the same inode

The majority of processes won't run because they're waiting for the sd_log_flush_lock semaphore. The writer who holds that semaphore is blocked on the page lock that's held by the reader. The reader is blocked on the glock held by the writer.

There are four ways I can think of to fix this:

1. Fix vfs to have a prepare_read function that works like the prepare_write function, in which we can acquire the glocks. That's a hard sell to upstream, a hard sell to rhkernel-list, and we can't give it to AOL because they'll be using a 5.2 kernel.
2. Kludge vfs so that it behaves better when it gets the AOP_TRUNCATED_PAGE return code. Same problems as 1.
3. Change the GFS2 lock ordering like Wendy wanted. This is a very big design change, and Steve has already said it isn't possible for reasons given in prior discussions.
4. Kludge GFS2 to get around it, which seems like our only viable option. The reader is using a try lock, LM_FLAG_TRY_1CB, so presumably we can find a way to kludge through it.
I tried one method, and that is to add a condition to add_to_queue to fail any try locks that are done while a transaction is open. Like this:

	if (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)) {
		if (test_bit(GLF_LOCK, &gl->gl_flags))
			try_lock = 1;
		if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags) ||
		    current->journal_info != NULL)
			goto fail;
	}

That seemed to make no difference; I was still able to recreate the failure. Perhaps the reader needs to fail its own try lock when it realizes the page it will be needing is already locked.

I'm not sure how the try lock got queued while there was an EX holder in the first place. This is all magic that Steve Whitehouse understands. I would have thought the promotion of the EX lock should have failed the try lock, or the add_to_queue of the try lock should have failed the "try" because of the EX holder. At any rate, we've got to find a good way to fail the try-lock reader, which will cause it to let go of the page lock, which will free up the writer to finish its transaction and set everything back into motion.
The promotion of EX locks does fail try locks, but that doesn't stop them being queued after the EX lock is granted (but before the EX-lock-holding thread has grabbed the page lock). So we have:

   Thread A                          Thread B
1. Get EX glock
2.                                   Get page lock
3.                                   Queue SH "try lock"
4. Try to get page lock (hang)

Now if we fail all "try lock"s when we have an EX holder, then the result is that EX holders can starve out the SH locks, and we get processes which call readpage hanging with 100% cpu usage as they spin waiting for an SH "try lock".
Created attachment 306817 [details] Bob's 27May2008 patch

This patch passes the same single-node d_io test that was previously failing. The fix basically changes the gfs2_readpage function so that it takes a normal (blocking) glock rather than a "try lock." But before it does, it unlocks the page passed in from vfs, to get the same lock ordering as prepare_write and the write side. Then, after the glock is acquired, it needs to reacquire the page lock, but in order to prevent deadlock it uses a "page try lock." If the page try lock is successful, everybody's happy and we finish the read. If the page try lock is unsuccessful, we unlock the glock we acquired, yield the cpu to give the contending process a chance to run, then return AOP_TRUNCATED_PAGE to make vfs try calling the gfs2_readpage function again.

This patch goes on top of the patch I posted to rhkernel-list for bug #432057.

I did run into an issue: the vfs page trylock function is deprecated, so it isn't likely to fly upstream, and I'm not sure what it will take to get it coded properly for upstream. Here's the compiler message:

/home/bob/27May2008.aol.debug/ops_address.c:604: warning: 'find_trylock_page' is deprecated (declared at include/linux/pagemap.h:76)

So the function is deprecated, but I'm not sure what to code in its place.

Other concerns: Steve may want a crack at rewriting it. Also, I don't know if this affects Abhi's code for the page_mkwrite fix (bug #315191); I haven't had a chance to review those changes yet. For now I'm going to run it through d_io and also see if Dean can recreate the problem on this version.
Created attachment 306906 [details] Upstream attempt at a readpage lock ordering fix

This is an attempt at fixing the lock ordering for the upstream kernel. It's based upon Bob's idea and means that we now always look up the page twice (at a minimum) on each readpage. We could potentially reduce the cost of doing this by making readpages cover more of the corner cases (stuffed inodes is the main one), but I'm not sure that it will make a huge difference. Really the only proper fix is to sort out the lock ordering at the VFS/VM level.

Note that we do not need (or want) Abhi's patch for AOL. It says clearly in that bug that it's the first item _after_ AOL. Also, when that patch does get applied, gfs2_readpage in upstream and in RHEL will be identical, which will make a lot of things much easier in the future, as we will no longer need two versions of the changes in that area.
Created attachment 307044 [details] Latest upstream fix I've not been able to figure out the latest hang so far, but here is an updated upstream version of the patch which is much simpler, at least.
This bug was caused by a shortcoming of the patch I originally posted to rhkernel-list for bug #432057. Based on these findings, we decided to re-spin the 432057 patch to include this fix. That new patch was posted to rhkernel-list today, therefore I'm closing this as a duplicate of that. *** This bug has been marked as a duplicate of 432057 ***