Description of problem:
When running loops of "ls" on one node and loops of "tar"/"untar" on another node, the "ls" blocked after more than a day.

Version-Release number of selected component (if applicable):
GFS-6.1.5-0
GFS-kernel-smp-2.6.9-49.1.centos4
GFS-kernheaders-2.6.9-49.1.centos4
dlm-1.0.0-5
dlm-kernel-smp-2.6.9-41.7.centos4
dlm-kernheaders-2.6.9-41.7.centos4
dlm-devel-1.0.0-5

How reproducible:
Rarely. The steps below do not trigger the bug every time.

Steps to Reproduce:
1. Run loops of tar/untar of a directory on one node, and run loops of "ls" on the tar'd directory on another node (a minimal stand-in for the "ls" side is sketched at the end of this comment).
2. Wait a long time.

NOTE: no node crash, no device error.

Actual results:
On the node where "ls" blocked, at the time it blocked, syslog contains:

Jul 30 11:11:09 sam39 kernel: dlm: gfsbig: process_lockqueue_reply id 1dc01bc state 0
Jul 30 13:14:39 sam39 kernel: dlm: gfsbig: cancel reply ret 0
Jul 30 13:14:39 sam39 kernel: dlm: gfsbig: process_lockqueue_reply id 2010258 state 0
Jul 30 13:14:39 sam39 kernel: lock_dlm: unlock sb_status 0 2,6de3e4 flags 0

Additional info:
I have saved the contents of /proc/cluster/*. I can provide additional info if needed.

On the blocked node, here is the output of "echo t > /proc/sysrq-trigger" for the blocked "ls" process:

Jul 31 14:57:53 sam39 kernel: ls D 0000000000000000 0 29333 1 8104 (NOTLB)
Jul 31 14:57:53 sam39 kernel: ffff8100138e1c88 0000000000000086 ffff81003f09935c ffffffff8817c284
Jul 31 14:57:53 sam39 kernel: ffff8100138e1c28 0000000000000000 0000000000000000 ffff8100138e1c28
Jul 31 14:57:53 sam39 kernel: ffff810037d0f8b0 ffff810037d0f8b0
Jul 31 14:57:53 sam39 kernel: Call Trace:<ffffffff8817c284>{:gfs:gfs_glock_dq+436} <ffffffff8019aaa6>{dput+54}
Jul 31 14:57:53 sam39 kernel: <ffffffff803a8f13>{wait_for_completion+179} <ffffffff80130e40>{default_wake_function+0}
Jul 31 14:57:53 sam39 kernel: <ffffffff80190534>{do_lookup+100} <ffffffff80130e40>{default_wake_function+0}
Jul 31 14:57:53 sam39 kernel: <ffffffff8019aaa6>{dput+54} <ffffffff8817bda9>{:gfs:glock_wait_internal+345}
Jul 31 14:57:53 sam39 kernel: <ffffffff8817c6e2>{:gfs:gfs_glock_nq+994} <ffffffff8817c9be>{:gfs:gfs_glock_nq_init+30}
Jul 31 14:57:53 sam39 kernel: <ffffffff88194c03>{:gfs:gfs_getattr+83} <ffffffff8018b6db>{vfs_getattr+59}
Jul 31 14:57:53 sam39 kernel: <ffffffff8018b83d>{vfs_lstat+61} <ffffffff8019ab27>{dput+183}
Jul 31 14:57:53 sam39 kernel: <ffffffff801901bc>{path_release+12} <ffffffff8018bc5f>{sys_newlstat+31}
Jul 31 14:57:53 sam39 kernel: <ffffffff8010dc66>{system_call+126}

After looking at the existing bugs database, it seems https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=147682 may be the same bug, but I'm unable to confirm that. (The conclusion of that bug is that no one was able to reproduce it.)
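For reference, here is a minimal C stand-in for the "ls" side of the loop, assuming the workload described above; this is not the actual test that was run, and the mount path is a made-up example. It simply loops lstat() over every entry of a directory on the GFS mount, which is the same path the hung process is blocked in (sys_newlstat -> vfs_getattr -> gfs_getattr -> gfs_glock_nq). The other node would loop tar/untar on the same directory.

/*
 * Minimal stand-in for the "ls" half of the reproducer (illustrative
 * only, not the original test; the mount path below is hypothetical).
 */
#include <stdio.h>
#include <dirent.h>
#include <sys/stat.h>

int main(void)
{
    const char *dir = "/mnt/gfsbig/testdir";   /* hypothetical GFS path */
    char path[4096];
    struct stat st;
    struct dirent *de;
    DIR *d;

    for (;;) {
        d = opendir(dir);
        if (!d) {
            perror("opendir");
            return 1;
        }
        while ((de = readdir(d)) != NULL) {
            snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
            if (lstat(path, &st) < 0)    /* same syscall ls -l ends up in */
                perror(path);
        }
        closedir(d);
    }
}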
There's a chance this has the same root cause as bz 172944. One possible cause in this case is an unlock that happens while another lock operation is still in progress on the same lock. (GFS shouldn't do that, and there is plenty of room for problems like this in the dlm if it does happen for some reason.) In that scenario, the reply to the first operation reports "cancel reply ret 0" because the dlm expects a return status of EUNLOCK (for the unlock) but gets the 0 from the first op. The reply to the unlock then reports "process_lockqueue_reply ... state 0" because the lock was already removed from the waiting list. If this is what happened, we should see a message of "unlock cancel status %d" on one of the other nodes (even though this has nothing to do with lock cancellation).

We could print some additional debug info to try to say more definitively what's going wrong. If we're successful at that, the best I think we could hope for is a work-around that detects when this happens and does something special to handle it.
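To make the suspected interleaving concrete, here is a toy user-space simulation of the scenario sketched above. It is not dlm-kernel code: the struct, field, and function names are invented, and the EUNLOCK value is a placeholder. It only shows how a single lockqueue entry shared by an in-flight request and an overlapping unlock could produce the two messages seen in the syslog.

/*
 * Toy simulation of the suspected reply mix-up; NOT dlm-kernel code,
 * all names and values below are hypothetical.
 */
#include <stdio.h>

#define EUNLOCK 0x102   /* placeholder, not the real kernel constant */

struct lkb {
    unsigned int id;
    int on_queue;    /* 0 corresponds to "state 0" in the log message */
    int expected;    /* status the next reply is expected to carry */
};

static void process_reply(struct lkb *lkb, int status)
{
    if (!lkb->on_queue) {
        /* reply arrives after the entry was taken off the waiting list */
        printf("dlm: process_lockqueue_reply id %x state 0\n", lkb->id);
        return;
    }
    if (status != lkb->expected)
        /* the unlock path expected EUNLOCK but got the first op's 0 */
        printf("dlm: cancel reply ret %d\n", status);
    lkb->on_queue = 0;
}

int main(void)
{
    /* original request is still waiting for its remote reply */
    struct lkb lkb = { .id = 0x2010258, .on_queue = 1, .expected = 0 };

    /* overlapping unlock reuses the same entry and changes what the
     * reply handler expects, before the first reply has arrived */
    lkb.expected = EUNLOCK;

    process_reply(&lkb, 0);        /* reply to the original request */
    process_reply(&lkb, EUNLOCK);  /* reply to the unlock */
    return 0;
}

Running it prints "cancel reply ret 0" followed by "process_lockqueue_reply id 2010258 state 0", the same pair (in the same order) as the 13:14:39 messages in the description. The original requester never sees its completion, which would leave the GFS glock holder waiting forever, matching the hung ls in glock_wait_internal.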
If this is still a problem for the reporter then we should have our QA group run a test like this.
(Sorry for the delay.) It never reproduced. The problem is that this behaviour is not a desirable one: the processes spend a lot of time communicating and are therefore very slow, so we haven't continued investigating in that direction. We will try as hard as possible not to have two nodes accessing the very same data at the very same time, although GFS is designed precisely for that. The bug may therefore still be there, but it hasn't happened again. Still, it would be interesting to have this kind of test run by your QA group.
Will reopen this if it becomes a problem again.