From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041111 Firefox/1.0

Description of problem:
I run "make_panic -f3" on a seven-node cluster. After running for some time, the make_panic processes on all machines stop. A lock dump shows they are all waiting on a single lock that GFS is caching on some node but isn't releasing. It appears that a blocking callback isn't being delivered to GFS for this lock. After several minutes, GFS releases the locks it's not using, which allows all the processes to get this lock, and they run fine until it happens again.

There are several cases in process_asts() where we decide to skip sending a bast. One of those may be incorrect; some debugging should tell.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. make_panic -f 3 on several nodes

Additional info:
Changes by: teigland 2005-02-16 03:53:16
Modified files:
    dlm-kernel/src : lockqueue.c
Log message:
Blocking asts were being ignored for all locks being converted, which resulted in some necessary basts being skipped. In particular, after a failed NOQUEUE conversion, gfs could be left holding a lock and getting no callback for it while others were left waiting. This changes things so that a bast message is ignored if the lock is being converted and NOQUEUE isn't set, or if the lock is being unlocked. Fixes bz 147798.

Changes by: teigland 2005-02-16 04:03:50
Modified files:
    gfs-kernel/src/dlm: lock.c
Log message:
We were ignoring blocking callbacks for locks being converted, which caused us to skip some that were necessary. Fixes bz 147798 (along with similar dlm fix)
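
For readers following the log messages, the change in the skip rule can be sketched roughly as below. This is not the code from lockqueue.c or lock.c; every type, flag, and function name here (sketch_lock, SKETCH_FLAG_NOQUEUE, skip_bast_old, skip_bast_new) is hypothetical and exists only to restate the rule from the log message. The "old" rule models only what the log states (basts ignored for all locks being converted); how the old code treated unlocks is not described, so it is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical in-flight operations on a lock, named for this sketch only. */
enum sketch_lock_op {
    SKETCH_OP_NONE,     /* lock is granted, nothing in flight */
    SKETCH_OP_CONVERT,  /* a conversion request is in flight */
    SKETCH_OP_UNLOCK    /* an unlock request is in flight */
};

/* Hypothetical stand-in for the NOQUEUE request flag. */
#define SKETCH_FLAG_NOQUEUE 0x1u

struct sketch_lock {
    enum sketch_lock_op op;  /* operation currently in flight */
    unsigned int flags;      /* flags of that operation */
};

/* Rule before the fix, as far as the log describes it:
 * a bast is ignored for every lock that is being converted. */
bool skip_bast_old(const struct sketch_lock *lk)
{
    return lk->op == SKETCH_OP_CONVERT;
}

/* Rule after the fix, as stated in the log: a bast is ignored if the
 * lock is being converted and NOQUEUE isn't set, or if the lock is
 * being unlocked. */
bool skip_bast_new(const struct sketch_lock *lk)
{
    if (lk->op == SKETCH_OP_UNLOCK)
        return true;
    if (lk->op == SKETCH_OP_CONVERT && !(lk->flags & SKETCH_FLAG_NOQUEUE))
        return true;
    return false;
}

/* Self-check for the case from the report: a NOQUEUE conversion may
 * fail and leave the holder with its old mode, so the bast must still
 * be delivered. */
void sketch_self_check(void)
{
    struct sketch_lock noqueue_convert = { SKETCH_OP_CONVERT, SKETCH_FLAG_NOQUEUE };

    assert(skip_bast_old(&noqueue_convert));   /* old rule drops the bast */
    assert(!skip_bast_new(&noqueue_convert));  /* new rule delivers it */
}
```

The NOQUEUE case is the one that matters for this bug: under the old rule the holder never hears about the waiters once its conversion fails, which matches the hang seen with make_panic.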