Description of problem:
There is already a bug in Sistina bugs, I believe, for the issue of Gulm and POSIX locks; however, I reproduced this hang using only flocks, with the tests genesis and accordion running on each of the machines in the cluster:

genesis -L flock -n 250 -d 50 -p 2
accordion -L flock -p 2 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5

Eventually the filesystems become stuck:

[root@morph-03 tmp]# ps -efFwTl | grep genesis
4 S root 5044 5044 5043 0 81 0 - 365 wait 336 0 Jan28 ? 00:00:00 genesis -L flock -n 250 -d 50 -p 2
5 D root 5049 5049 5044 0 78 0 - 368 glock_ 448 0 Jan28 ? 00:06:05 genesis -L flock -n 250 -d 50 -p 2
5 D root 5050 5050 5044 0 78 0 - 368 glock_ 448 0 Jan28 ? 00:06:02 genesis -L flock -n 250 -d 50 -p 2
4 S root 5055 5055 5054 0 84 0 - 525 wait 336 1 Jan28 ? 00:00:00 genesis -L flock -n 250 -d 50 -p 2
5 D root 5058 5058 5055 0 78 0 - 528 glock_ 448 0 Jan28 ? 00:05:08 genesis -L flock -n 250 -d 50 -p 2
5 D root 5060 5060 5055 0 78 0 - 528 glock_ 448 0 Jan28 ? 00:05:09 genesis -L flock -n 250 -d 50 -p 2
4 S root 5075 5075 5074 0 85 0 - 726 wait 332 0 Jan28 ? 00:00:00 genesis -L flock -n 250 -d 50 -p 2
5 D root 5076 5076 5075 0 78 0 - 730 glock_ 448 0 Jan28 ? 00:04:56 genesis -L flock -n 250 -d 50 -p 2
5 D root 5078 5078 5075 0 78 0 - 730 glock_ 448 0 Jan28 ? 00:04:56 genesis -L flock -n 250 -d 50 -p 2
4 S root 5086 5086 5085 0 85 0 - 843 wait 336 0 Jan28 ? 00:00:00 genesis -L flock -n 250 -d 50 -p 2
5 D root 5088 5088 5086 0 78 0 - 846 glock_ 448 0 Jan28 ? 00:05:47 genesis -L flock -n 250 -d 50 -p 2
5 D root 5090 5090 5086 0 78 0 - 846 glock_ 448 1 Jan28 ? 00:05:45 genesis -L flock -n 250 -d 50 -p 2

[root@morph-04 root]# strace df -h
.
.
.
statfs64("/mnt/gfs0", 84,

I tried to get more info from /proc/pid, but that was stuck as well.

Version-Release number of selected component (if applicable):
Gulm <CVS> (built Jan 28 2005 16:39:38) installed

How reproducible:
Sometimes
Pretty sure this doesn't have anything to do with plocks or flocks. It seems to be entirely load-based.
Adding to release blocker list
All that's required to hit this is two clients, gulm server[s] in either slm or rlm mode, and load (no clvm). Load for me is fsstresses and a couple of doios. Lighten the load (stop all fsstress or doio) and the deadlock isn't hit.
Found it. We're flooding ltpx faster than it can handle, so it locks up. A `gulm_tool getstats <deadlocked-client>:ltpx` will time out. Now to fix.....
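For anyone unfamiliar with the failure mode: when writers keep pushing data into a local connection faster than the receiver drains it, the kernel socket buffers fill and a blocking write() never returns. A minimal standalone sketch of that shape (illustration only, not Gulm source):

```c
/* Minimal sketch (NOT Gulm source): two processes that only write and
 * never read.  Once both directions of the socketpair's kernel buffers
 * fill, both blocking write() calls sleep forever -- the same shape of
 * wedge as flooding ltpx faster than it can service requests. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void)
{
    int sv[2];
    char buf[4096];

    memset(buf, 'x', sizeof(buf));
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        perror("socketpair");
        return 1;
    }

    if (fork() == 0) {              /* child: flood one direction */
        close(sv[0]);
        for (;;)
            if (write(sv[1], buf, sizeof(buf)) < 0)
                _exit(1);
    }

    close(sv[1]);                   /* parent: flood the other, never read */
    for (;;)
        if (write(sv[0], buf, sizeof(buf)) < 0)
            return 1;
}
```

Both write() calls end up sleeping in the kernel indefinitely, which is why even a `gulm_tool getstats` against the stuck ltpx times out.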
Added outgoing queues to the local connects. The bug seems to have disappeared.
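The general technique described here, sketched minimally (hypothetical names, assuming a poll()-driven loop; this is not the actual Gulm change): make the local connection nonblocking, and when the kernel won't accept a write immediately, park the data on a per-connection outgoing queue that is flushed once the socket becomes writable again.

```c
/* Minimal sketch of an outgoing queue on a nonblocking local connection
 * (hypothetical names; NOT the actual Gulm ltpx fix).  The fd is assumed
 * to have been set nonblocking beforehand, e.g.:
 *   fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);            */
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct out_buf {
    struct out_buf *next;
    size_t len, off;                /* total length, bytes already sent */
    char data[];
};

static struct out_buf *out_head, *out_tail;

/* Try to send immediately; queue whatever the kernel refuses. */
int queue_write(int fd, const void *data, size_t len)
{
    if (!out_head) {                /* nothing queued: try a direct write */
        ssize_t n = write(fd, data, len);
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;
        if (n < 0)
            n = 0;
        if ((size_t)n == len)
            return 0;
        data = (const char *)data + n;
        len -= n;
    }
    struct out_buf *b = malloc(sizeof(*b) + len);
    if (!b)
        return -1;
    b->next = NULL;
    b->len = len;
    b->off = 0;
    memcpy(b->data, data, len);
    if (out_tail)
        out_tail->next = b;
    else
        out_head = b;
    out_tail = b;
    return 0;
}

/* Call when poll() reports POLLOUT on fd: drain as much as possible. */
int flush_out_queue(int fd)
{
    while (out_head) {
        struct out_buf *b = out_head;
        ssize_t n = write(fd, b->data + b->off, b->len - b->off);
        if (n < 0)
            return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
        b->off += n;
        if (b->off < b->len)
            return 0;               /* kernel buffer full again; wait */
        out_head = b->next;
        if (!out_head)
            out_tail = NULL;
        free(b);
    }
    return 0;
}
```

With this shape, a burst of lock traffic backs up in user-space memory instead of blocking the daemon's event loop, so ltpx keeps servicing requests under load.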
Fix verified.