Red Hat Bugzilla – Bug 449098
Setting lock_timeout to 0 causes all sorts of problems
Last modified: 2009-04-16 16:32:53 EDT
Description of problem:
Some people have been given the impression that setting
/proc/cluster/config/dlm/lock_timeout to 0 will disable the lock_timeout feature
and get rid of the ETIMEDOUT errors returned from dlm_locks. This is not the case.
Setting lock_timeout to zero does exactly what it says: it sets the timeout to
zero! This means that the ETIMEDOUT errors are far MORE likely, and nodes can
also oops or get into a tight loop (and thus be removed from the cluster).
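To illustrate why a zero timeout makes ETIMEDOUT more likely rather than disabling it, here is a minimal sketch of an elapsed-time check. It is not the actual DLM code: the function name, its parameters, and the use of >= in the comparison are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (NOT taken from the DLM source): reports whether a
 * lock request that started waiting at wait_start_secs has exceeded
 * lock_timeout_secs by the time now_secs is reached.
 */
bool lock_wait_expired(unsigned long now_secs,
                       unsigned long wait_start_secs,
                       unsigned long lock_timeout_secs)
{
    /* The clock is assumed not to run backwards. */
    assert(now_secs >= wait_start_secs);

    /*
     * With lock_timeout_secs == 0 this comparison is true for every
     * waiting lock, including one that has only just been queued.
     */
    return (now_secs - wait_start_secs) >= lock_timeout_secs;
}
```

Under these assumptions, lock_wait_expired(100, 100, 0) is true, while lock_wait_expired(100, 100, 30) is false: zero behaves as "expire immediately", not as "never expire".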
How reproducible:
Very easily. It happens roughly 3 times out of 4.
Steps to Reproduce:
1. Take a 3 node cluster
2. On all 3 nodes create a lockspace using dlmtest:
./dlmtest -mnl -d 100000000 &
3. On all 3 nodes set lock_timeout to zero:
echo 0 > /proc/cluster/config/dlm/lock_timeout
4. On all 3 nodes request another lock:
Actual results:
One node returns ETIMEDOUT from the lock operation, one oopses, and one gets
stuck in a tight loop and has to be power cycled! Sometimes you might only get
one or two of these symptoms.
Expected results:
Normal locking. The nodes each get the lock in turn, then release it. It should
also be possible to disable lock timeouts using this configuration variable.
Additional info:
At least one site seems to have been advised to use this setting.
Created attachment 307354
Patch to fix
This probably needs a review ... for the brackets as much as anything else.
Because the timer is triggered using the lock_timeout value we can't just
ignore it if it's zero or it would be impossible to re-enable it! It would also
disable deadlock checking for all lockspaces.
So what I've done here is to default the timer to 30 seconds (this only
affects the deadlock checker) if lock_timeout is set to zero. 30 seconds is the
compiled-in default anyway.
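The fallback described above can be sketched roughly as follows. This is not the attached patch: the function and macro names are hypothetical, and only the 30-second default and the "zero falls back to the default" rule come from the comment above.

```c
#include <assert.h>

/* The compiled-in default mentioned above; the macro name is hypothetical. */
#define DEFAULT_LOCK_TIMEOUT_SECS 30UL

/*
 * Hypothetical helper (NOT the attached patch): choose the interval used to
 * arm the periodic timer. A lock_timeout of zero falls back to the default,
 * so the timer keeps firing and the deadlock checker keeps running.
 */
unsigned long timer_interval_secs(unsigned long lock_timeout_secs)
{
    unsigned long interval = lock_timeout_secs
                                 ? lock_timeout_secs
                                 : DEFAULT_LOCK_TIMEOUT_SECS;

    /* The timer must never be armed with a zero interval. */
    assert(interval > 0);
    return interval;
}
```

With this sketch, timer_interval_secs(0) yields 30 and any non-zero setting is passed through unchanged, so writing a non-zero value later still changes the interval.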
The fix for this is just "don't do that". The patch is not worth the trouble of integrating into 4.8 and testing.