Description of problem:
Application threads get stuck in "D" state while doing GFS posix lock calls and can't be killed. The following are two thread backtraces that show the issue (taken from Eric Ayers' live system yesterday):

PID: 355    TASK: 100ebd8b030   CPU: 1   COMMAND: "dcs_event_mgr"
 #0 [101de649c28] schedule at ffffffff8030a1c1
 #1 [101de649d00] wait_for_completion at ffffffff8030a405
 #2 [101de649d80] wait_async at ffffffffa029fb2b
 #3 [101de649da0] wait_async_list at ffffffffa029fc42
 #4 [101de649dd0] lm_dlm_punlock at ffffffffa02a03a6
 #5 [101de649e30] gfs_lm_punlock at ffffffffa0263df2
 #6 [101de649e40] gfs_lock at ffffffffa026edad
 #7 [101de649e70] locks_remove_posix at ffffffff8018e49b
 #8 [101de649e80] wait_on_page_writeback_range at ffffffff80159f49
 #9 [101de649f30] filp_close at ffffffff80178e3c
#10 [101de649f50] sys_close at ffffffff80178ecd
#11 [101de649f80] cstar_do_call at ffffffff80125d33

PID: 32632  TASK: 100dbb1b030   CPU: 1   COMMAND: "dcs_event_mgr"
 #0 [100e749db28] schedule at ffffffff8030a1c1
 #1 [100e749dc00] wait_for_completion at ffffffff8030a405
 #2 [100e749dc80] wait_async at ffffffffa029fb2b
 #3 [100e749dca0] wait_async_list at ffffffffa029fc42
 #4 [100e749dcd0] lm_dlm_plock at ffffffffa02a0fd2
 #5 [100e749de10] gfs_lm_plock at ffffffffa0263db7
 #6 [100e749de20] gfs_lock at ffffffffa026edbf
 #7 [100e749de50] fcntl_setlk at ffffffff8018e244
 #8 [100e749de70] __dentry_open at ffffffff801789ea
 #9 [100e749dea0] filp_open at ffffffff80178b88
#10 [100e749df10] sys_fcntl at ffffffff8018a613
#11 [100e749df40] compat_sys_fcntl64 at ffffffff8019ecad
#12 [100e749df80] cstar_do_call at ffffffff80125d33

Version-Release number of selected component (if applicable):
* RHEL4 U4 x86_64 kernel (2.6.9-42.0.2.ELsmp)
* gfs/dlm R4.4 errata built by Chris on Sept 26
  http://post-office.corp.redhat.com/archives/cluster-list/2006-September/msg00312.html
I've recreated the hang by putting a 300 usec delay into wait_async() before it waits for completion:

static int wait_async_new(dlm_lock_t *lp)
{
	udelay(300);	/* injected delay before waiting */
	wait_for_completion(&lp->uast_wait);
	del_async(lp->posix, lp->posix->resource);
	return lp->lksb.sb_status;
}

Then I altered wait_async_list() so it alternates between the old (no-delay) and new (300 usec) wait_async():

static void wait_async_list(struct dlm_resource *r, unsigned long owner)
{
	struct posix_lock *po;
	int error, found;
	static int i;
	.....
	spin_unlock(&r->async_spin);

	if (found) {
		DLM_ASSERT(po->lp, );
		i++;
		if (i % 2)
			error = wait_async_new(po->lp);
		else
			error = wait_async(po->lp);
		DLM_ASSERT(!error, );
		goto restart;
	}

This is to simulate processes getting interrupted by various kernel interrupt events. So I would assume GFS *will* lose the wake-up calls. I need some time to think about the implications (whether my test is valid) and/or how to fix this.
Created attachment 137433 [details] the thread backtrace from my test machines that show this hang
Beginner's luck yesterday - got a hang on the first try. The second one happened 6 hours later, after I rebooted the node. After examining more (on my test machine), the hang looks different from Eric's. The thread that blocks everyone else (this morning) is:

crash> bt 6080
PID: 6080   TASK: 1006fee87f0   CPU: 0   COMMAND: "fl_64"
 #0 [10064a69c58] schedule at ffffffff8030a1c1
 #1 [10064a69d30] wait_for_completion at ffffffff8030a405
 #2 [10064a69db0] do_dlm_lock_sync at ffffffffa02fea4c
 #3 [10064a69dd0] unlock_resource at ffffffffa03006be
 #4 [10064a69df0] lm_dlm_punlock at ffffffffa0301384
 #5 [10064a69e50] gfs_lm_punlock at ffffffffa02c4df2
 #6 [10064a69e60] gfs_lock at ffffffffa02cfdad
 #7 [10064a69e90] fcntl_setlk at ffffffff8018e244
 #8 [10064a69ea8] thread_return at ffffffff8030a219
 #9 [10064a69f50] sys_fcntl at ffffffff8018a613
#10 [10064a69f80] system_call at ffffffff8011026a

Looks like the DLM callback (process_complete and "uast_wait") either gets lost or never happened. I also discussed this with the base kernel folks yesterday. It is ok for complete() to be called before wait_for_completion(). So we don't have a clue or a solution yet.
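For reference, the point about ordering follows from the semantics of the kernel completion API: a completion keeps a "done" count, so a complete() that fires before the waiter arrives is not lost. A minimal sketch (illustrative only, not code from lock_dlm):

#include <linux/completion.h>

static DECLARE_COMPLETION(demo);

static void producer(void)
{
	complete(&demo);	/* increments demo.done */
}

static void consumer(void)
{
	/* returns immediately if demo.done > 0, otherwise sleeps
	   until complete() is called - the signal cannot be lost */
	wait_for_completion(&demo);
}

So if the waiter never wakes up with uast_wait.done == 0, the likelier explanation is that complete() was never called at all, not that the wake-up raced with the wait.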
My thread consistently hangs (it actually took another 10 hours for this to happen) in:

PID: 5160   TASK: 1006fde27f0   CPU: 0   COMMAND: "fl_64"
 #0 [100674edb78] schedule at ffffffff8030a1c1
 #1 [100674edc50] wait_for_completion at ffffffff8030a405
 #2 [100674edcd0] do_dlm_lock_sync at ffffffffa02feb98
 #3 [100674edcf0] unlock_resource at ffffffffa030080a
 #4 [100674edd10] lm_dlm_plock at ffffffffa0302106
 #5 [100674edd90] inode_has_perm at ffffffff801d1e3f
 #6 [100674ede50] gfs_lm_plock at ffffffffa02c4db7
 #7 [100674ede60] gfs_lock at ffffffffa02cfdbf
 #8 [100674ede90] fcntl_setlk at ffffffff8018e244
 #9 [100674eded0] sys_select at ffffffff8018bcc5
#10 [100674edf50] sys_fcntl at ffffffff8018a613
#11 [100674edf80] system_call at ffffffff8011026a

The following is the "lp" structure that causes this hang. Every field looks ok, except lp->flags, which is (oddly) set to 0x9080 ?? The LFL_INLOCK bit had probably been cleared, so I would expect this request to have been processed:

lp = 1007a78d9c0
dlm_t *dlm = 10037d75a00
ln_number = c3830
ln_type = 17
char *lvb = 0
lksb.sb_status = 0
lksb.sb_lkid = 13f00c5
lksb.sb_flags = 0
int16_t cur = 5 (DLM_LOCK_EX)
req = 0
prev_req = 0
lkf = 4
type = 0
flags = 9080
base_mode = 0
uast_wait.done = 0
uast_wait.prev = 1007a78da28
uast_wait.next = 1007a78da28

All the completion queue lists in the lp->dlm structure are NULL. So now we go back to square one - the following two puzzles hold the info for this mystery:

1. why is lp->flags = 0x9080 ?
2. why did this line show up in dlm_debug (and console printk):
   dlm: mygfs: process_lockqueue_reply id 13f00c5 state 0
For every file, there's a "plock_update" lock that's held in EX while doing anything with the dlm locks representing the actual plocks on the file. The update lock is locked (or promoted) to EX in lock_resource() and demoted to NL in unlock_resource(). This update lock has the INLOCK and NOBAST lp flags set on it (it's a dlm lock internal to lock_dlm; it doesn't represent anything to gfs), and we don't need any blocking callbacks for it. The WAIT_COMPLETE flag is set when we make a dlm request and cleared when we get the completion callback for it. So it appears that unlock_resource() is demoting the lock from EX to NL and is waiting for a completion callback for that conversion. A dump of the dlm locks should show whether or not the dlm has done the EX->NL conversion. The process_lockqueue_reply() message might be a clue as to whether or why a completion callback is missed; I'm not sure.
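To make the description above concrete, here is a rough sketch of the update-lock protocol (the shape is inferred from the backtraces; r->update and the field names are illustrative, not the actual lock_dlm source):

static void lock_resource(struct dlm_resource *r)
{
	/* take/promote the per-file "plock_update" lock to EX before
	   touching the dlm locks that represent the file's plocks;
	   LFL_INLOCK and LFL_NOBAST mark it as internal to lock_dlm */
	do_dlm_lock_sync(r->update, DLM_LOCK_EX);
}

static void unlock_resource(struct dlm_resource *r)
{
	/* demote EX -> NL; do_dlm_lock_sync() sets LFL_WAIT_COMPLETE,
	   calls dlm_lock(), and blocks in wait_for_completion() until
	   the dlm's completion callback fires - the step that never
	   arrives in the hangs above */
	do_dlm_lock_sync(r->update, DLM_LOCK_NL);
}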
I see - so 0x9080 is actually LFL_NOBAST + LFL_INLOCK + LFL_WAIT_COMPLETE. This is exactly what do_dlm_lock_sync() passes to dlm_lock(). So somehow the DLM astd thread never calls back (queue_complete()) ... This doesn't look like a GFS issue at all.
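A quick check of that decomposition, using hypothetical bit values (the authoritative definitions live in lock_dlm's header; these are chosen only so the arithmetic is visible):

/* hypothetical flag values - see lock_dlm's header for the real ones */
#define LFL_WAIT_COMPLETE	0x0080	/* assumed bit */
#define LFL_INLOCK		0x1000	/* assumed bit */
#define LFL_NOBAST		0x8000	/* assumed bit */

/* 0x0080 | 0x1000 | 0x8000 == 0x9080, matching the lp->flags dump above */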
Created attachment 137689 [details]
test case to recreate the hang

* compile with "gcc -o fl fl.c"
* run with "fl <filename>"; the filename should be a gfs file.
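For readers without access to the attachment, a minimal sketch of this kind of plock exerciser (NOT the actual fl.c, just an illustration of hammering fcntl() locks on a GFS file so the lm_dlm_plock/punlock paths are exercised in a tight loop):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct flock fl;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <gfs file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (;;) {
		fl.l_type = F_WRLCK;	/* plock -> lm_dlm_plock */
		fl.l_whence = SEEK_SET;
		fl.l_start = 0;
		fl.l_len = 1;
		if (fcntl(fd, F_SETLKW, &fl) < 0)
			perror("F_SETLKW");
		fl.l_type = F_UNLCK;	/* punlock -> lm_dlm_punlock */
		if (fcntl(fd, F_SETLK, &fl) < 0)
			perror("F_UNLCK");
	}
	return 0;
}

Running several instances concurrently on the same file from multiple nodes would be the obvious way to stress the dlm plock paths.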
A quick summary of the problems:

1) The GFS crash was stopped by:
   * removing EMC PowerPath (which implied a newly formatted GFS filesystem)
   * turning off panic-on-oops on the x86_64 node

2) The newest problem is that applications hang inside the kernel (D state) waiting for dlm posix lock operations to complete - unkillable until the box gets rebooted. Recreating the hang has been attempted inside the Red Hat lab. One hang was found, as described in comment #6: the thread is blocked waiting for dlm internal synchronization to complete. Intec has a different hang - it hangs while waiting for dlm asynchronous posix lock requests to complete.
Sorry if I was unclear about this, but before I tried it, I assumed the worst - that I would have to reformat GFS. In fact, I did not have to do that and was able to fsck and remount the existing filesystem after using 'rpm' to remove the PowerPath installation from our blades.
Sorry - I thought you had reformatted GFS. In theory, fsck should be good enough. Dave is looking into this issue as well. We'll keep you informed.
Given the current status of development (the posix lock module has been rewritten in RHEL5) and the difficulty of recreating the problem (we could not recreate it in-house), we would like to close this bugzilla for now. If there is more info (or new leads on how to recreate this problem), we will re-open.