Description of problem:
Some people have been given the impression that setting /proc/cluster/config/dlm/lock_timeout to 0 will disable the lock_timeout feature and get rid of the ETIMEDOUT errors returned from dlm_lock. This is not the case. Setting lock_timeout to zero does exactly what it says: it sets the timeout to zero. As a result, ETIMEDOUT errors become far MORE likely, and nodes can also oops or get into a tight loop (and thus be removed from the cluster).

How reproducible:
Very easily; roughly 3 times out of 4.

Steps to Reproduce:
1. Take a 3-node cluster.
2. On all 3 nodes, create a lockspace using the dlm test program: ./dlmtest -mnl -d 100000000 &
3. On all 3 nodes, set lock_timeout to zero: echo 0 > /proc/cluster/config/dlm/lock_timeout
4. On all 3 nodes, request another lock: ./dlmtest

Actual results:
One node returns ETIMEDOUT from the lock operation, one oopses, and one gets stuck in a tight loop and has to be power cycled! Sometimes only one or two of these symptoms appear.

Expected results:
Normal locking: the nodes each get the lock in turn and then release it. It should also be possible to disable lock timeouts using this configuration variable.

Additional info:
At least one site appears to have been advised to use this setting.
Created attachment 307354 [details]
Patch to fix

This probably needs a review ... for the brackets as much as anything else. Because the timer is triggered using the lock_timeout value, we can't simply ignore it when it is zero, or it would be impossible to re-enable it! It would also disable deadlock checking for all lockspaces. So what I've done here is default the timer to 30 seconds (this only affects the deadlock checker) when lock_timeout is set to zero. 30 seconds is the compiled-in default anyway.
The fix for this is just "don't do that". The patch is not worth the trouble of integrating into 4.8 and testing.