From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041111 Firefox/1.0

Description of problem:
I've run into this twice in the past week (once on my 4-node va cluster and once on my 8-node bench cluster). Both times it happened while running my mu_loop script on all nodes:

while (1)
  mount /gfs
  sleeprand 8
  umount /gfs
  sleeprand 8

I'm thinking this must be related to a recent change because I often run this test and have never seen this before. It's hung in exactly the same spot both times.

lock_dlm is processing a start and has just done:

3971 lk 10,0 id 0 -1,3 9

The dlm lock dump shows the lock has been granted. The ast lock_dlm is waiting for is therefore for this lock 10,0. Both times lock_dlm is hung waiting here:

Stack traceback for pid 3971
0xcecec130 3971 7 0 1 D 0xcecec370 lock_dlm1
EBP EIP Function (args)
0xca6efe04 0xc033eccc schedule+0x2fc (0xc035d0d3, 0xb7c, 0xca7331c8, 0x0, 0xcecec130)
0xca6efe64 0xc033f0b4 wait_for_completion+0xa4 (0xca73318c, 0x0, 0x3, 0x5, 0x0)
0xca6efe84 0xd08dc1f9 [lock_dlm]lm_dlm_lock_sync+0x59 (0xca73318c, 0x0, 0x3, 0x5, 0x5a000020)
0xca6efec8 0xd08da163 [lock_dlm]id_test_and_set+0xa3 (0xca65db48, 0x0, 0x2, 0xca65db74, 0xcebd6c78)
0xca6efef4 0xd08da597 [lock_dlm]claim_jid+0x47 (0xca65db48, 0xd0, 0x2, 0x1f6, 0x250)
0xca6eff3c 0xd08dad00 [lock_dlm]process_start+0x480 (0xca65db48, 0xcfdac64c, 0xc9e6b390, 0xca65db80, 0xca65db78)
0xca6effbc 0xd08e1494 [lock_dlm]dlm_async+0x284 (0x0, 0x0, 0xca7ddc94, 0xca7ddc88, 0xc9e92a48)

The ast isn't being delivered to lock_dlm because dlm_astd is hung here:

0xce3a3630 2583 7 0 1 D 0xce3a3870 dlm_astd
EBP EIP Function (args)
0xcc2dfee4 0xc033eccc schedule+0x2fc (0x1, 0xcceac7f0, 0xca65d1b8, 0xc0119fbc, 0xce3a3630)
0xcc2dff1c 0xc033f5fc rwsem_down_read_failed+0x9c (0xd099b1e0, 0xca2d5aec, 0xd09928f8, 0x2b, 0xcec0b257)
0xd097d191 [dlm].text.lock.ast+0x7f
0xcc2dff60 0xd097c4a2 [dlm]process_asts+0xe2 (0xd099b260, 0x65bf68, 0x0)
0xcc2dffbc 0xd097cf55 [dlm]dlm_astd+0x1c5 (0x0, 0x0, 0xcc693e58, 0xcc693e4c, 0xcedb65b8)
0xc011b640 default_wake_function (0x100100, 0x200200, 0xfbfc9fe4, 0x650, 0xce3a3798)
0xc011b640 default_wake_function (0x0, 0xff, 0x0, 0xfffffffc, 0xd097cd90)
0xc0133d4a kthread+0xba

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. run the test described above
2.
3.

Actual Results: mount hangs on the bug above, other mounts hang waiting for the first

Additional info:
We should study (and test) if it's safe to just remove the locking that's causing the problem in process_asts:

down_read(&ls->ls_in_recovery);
release_lkb(ls, lkb);
release_rsb(rsb);
up_read(&ls->ls_in_recovery);

or else add these to a list for release after recovery.
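For illustration only, the second option (queue the releases and run them after recovery) might look roughly like the user-space sketch below. Every name in it (sketch_ls, sketch_lkb, release_or_defer, flush_deferred) is hypothetical and is not taken from the dlm source; a plain bool stands in for ls_in_recovery, and no real locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the dlm structures; not the real definitions. */
struct sketch_lkb {
    int id;
    bool released;
    struct sketch_lkb *next_deferred;
};

struct sketch_ls {
    bool in_recovery;            /* stands in for ls_in_recovery being held */
    struct sketch_lkb *deferred; /* lkbs waiting for release after recovery */
};

/* Release now if the lockspace is not recovering, otherwise queue the lkb
 * instead of blocking. Returns true if it was released immediately. */
static bool release_or_defer(struct sketch_ls *ls, struct sketch_lkb *lkb)
{
    assert(ls != NULL && lkb != NULL);
    if (ls->in_recovery) {
        lkb->next_deferred = ls->deferred;
        ls->deferred = lkb;
        return false;
    }
    lkb->released = true;
    return true;
}

/* Called once recovery has finished: clear the flag and release everything
 * that was queued. Returns the number of lkbs released. */
static int flush_deferred(struct sketch_ls *ls)
{
    int n = 0;

    assert(ls != NULL);
    ls->in_recovery = false;
    while (ls->deferred != NULL) {
        struct sketch_lkb *lkb = ls->deferred;

        ls->deferred = lkb->next_deferred;
        lkb->next_deferred = NULL;
        lkb->released = true;
        n++;
    }
    return n;
}
```

The point of the sketch is only that the ast thread never sleeps on the recovery state; whether deferring the release is actually safe in the dlm is exactly what would need to be studied and tested.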
The ast thread was blocking on ls->ls_in_recovery just prior to that lockspace being freed. dlm_astd now skips any lkbs from lockspaces that aren't running.
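A rough user-space sketch of the "skip lkbs from lockspaces that aren't running" idea follows. It is not the actual patch: demo_ls, demo_lkb, and deliver_asts are hypothetical names, and a bool stands in for whatever state the real code checks to decide that a lockspace is running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real dlm structures differ. */
struct demo_ls {
    bool running;
};

struct demo_lkb {
    struct demo_ls *ls;
    bool ast_pending;
};

/* Walk the pending queue and deliver asts only for lkbs whose lockspace is
 * running. Entries from other lockspaces are left pending rather than
 * blocked on. Returns the number of asts delivered. */
static int deliver_asts(struct demo_lkb *queue, size_t n)
{
    int delivered = 0;

    assert(queue != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct demo_lkb *lkb = &queue[i];

        if (!lkb->ast_pending)
            continue;
        if (!lkb->ls->running)
            continue; /* skip: do not wait on a lockspace that isn't running */
        lkb->ast_pending = false;
        delivered++;
    }
    return delivered;
}
```

In this sketch a skipped entry stays queued and is picked up on a later pass once its lockspace is marked running again; how the real code treats skipped lkbs is not stated in this report.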