Red Hat Bugzilla – Bug 156872
Last modified: 2009-04-16 16:24:59 EDT
Description of problem:
The problem was reported as the lt_high_locks setting in cluster.ccs being
ignored. However, after a few quick greps/finds, the issue looks to me to be
caused by the compiler's type conversion in the bound_to_ulong() call, since both
ccs and gulm know about this tunable and have code to work with it.
Note that the customer is running GFS-22.214.171.124 on AMD64 and this is affecting
their production environment - when the maximum number of locks is reached,
performance is drastically reduced while the lock server requests that nodes drop
their unnecessary locks. Raising this setting would be a workaround for
that problem if this bug can be fixed.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1) Numerous messages show up in /var/log/messages on the lock_gulm MASTER:
Apr 28 10:51:51 icla2g lock_gulmd_LT000: Lock count is at 2127373 which
is more than the max 2097152. Sending Drop all req to clients
2) Performance is drastically reduced since the lock server keeps requesting nodes
to drop their unnecessary locks.
The culprit seems to be this routine, where val should have been declared (or
cast) as unsigned long?
unsigned long bound_to_ulong(int val, unsigned long min, unsigned long max)
{
   if( val < min ) return min;
   if( val > max ) return max;
   return val;
}
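To illustrate the concern about the int parameter, here is a minimal sketch. clamp_from_int mirrors the routine quoted above, with the conversion the compiler performs implicitly written out as an explicit cast. clamp_from_ulong is a hypothetical variant, not taken from the gulm source, that accepts the value as an unsigned long. With an int parameter, a negative value wraps to a very large unsigned number and is clamped to max, and anything above INT_MAX cannot be passed in at all.

```c
/* Mirrors the routine quoted in the report: val arrives as an int. */
unsigned long clamp_from_int(int val, unsigned long min, unsigned long max)
{
    /* The cast spells out what the comparison does implicitly:
     * -1 becomes ULONG_MAX, so it ends up clamped to max. */
    unsigned long v = (unsigned long)val;

    if (v < min) return min;
    if (v > max) return max;
    return v;
}

/* Hypothetical variant: the value is never narrowed to int on the way in. */
unsigned long clamp_from_ulong(unsigned long val, unsigned long min,
                               unsigned long max)
{
    if (val < min) return min;
    if (val > max) return max;
    return val;
}
```

For example, clamp_from_int(-1, 10, 100) yields 100, while clamp_from_ulong(3000000000UL, 0, 4294967295UL) passes the value through unchanged.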
It's not there, actually. ccs doesn't have a function to find a long, only int or
float values. So ccs is actually not reading the number correctly.
To fix the code, I think libccs will need to add a find-long function.
Another temporary measure the customer can take is to decrease the rate at which
drop lock requests are sent. This is lt_drop_req_rate; set it to the
number of seconds between each drop request.
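As a rough illustration of how a high-water mark and a drop-request interval could interact, here is a small sketch. The struct and function names are hypothetical and this is not the lock_gulmd implementation. It only assumes what is described above: a request goes out when the lock count exceeds the maximum, and lt_drop_req_rate is the number of seconds between requests.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical state for illustration; not the gulm data structures. */
struct drop_state {
    unsigned long high_locks;    /* stands in for lt_high_locks */
    unsigned long drop_req_rate; /* stands in for lt_drop_req_rate (seconds) */
    time_t last_sent;            /* when the last drop request went out */
};

/* Returns true when a drop request should be sent at time 'now'. */
bool should_send_drop(struct drop_state *s, unsigned long lock_count,
                      time_t now)
{
    if (lock_count <= s->high_locks)
        return false;            /* still at or below the high-water mark */
    if ((unsigned long)(now - s->last_sent) < s->drop_req_rate)
        return false;            /* too soon since the previous request */
    s->last_sent = now;
    return true;
}
```

Under these assumptions, a larger lt_drop_req_rate means fewer drop requests while the lock count stays above the limit.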
ccs in 6.0 can only find int, float, or string.
So to get a long from ccs, either we need to have a string passed in and parse
it ourselves, or we need to change the ccs libs.
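The "parse it ourselves" option could look roughly like the sketch below. parse_ulong_setting is a hypothetical helper, not part of ccs or gulm; it relies only on the standard strtoul() and assumes the setting arrives as a NUL-terminated decimal string.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a config string as an unsigned long.
 * Returns 0 and stores the value in *out on success, -1 on bad input. */
int parse_ulong_setting(const char *text, unsigned long *out)
{
    char *end;
    unsigned long v;

    if (text == NULL) return -1;
    while (isspace((unsigned char)*text)) text++;
    /* strtoul() accepts a leading '-' and negates the result; reject it. */
    if (*text == '\0' || *text == '-') return -1;

    errno = 0;
    v = strtoul(text, &end, 10);
    if (end == text) return -1;      /* no digits were consumed */
    if (*end != '\0') return -1;     /* trailing characters after the number */
    if (errno == ERANGE) return -1;  /* value does not fit in unsigned long */

    *out = v;
    return 0;
}
```

Rejecting a leading minus sign is a design choice of this sketch; a caller that wants -1 to keep meaning "maximum" would have to handle that case separately.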
Second report on GFS-6.0.2-25-i686 2.4.21-27.0.2.ELhugemem kernel. The problem
causes failover to occur.
Also, setting lt_high_locks to -1 will max it out (although if you dump the
config with either -C or SIGUSR1, it will show -1 instead of 4294967295. One
more thing to fix. wheeeee.)
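The -1 versus 4294967295 discrepancy in the dump comes down to the printf conversion used. A minimal sketch, using hypothetical helper names and assuming a 32-bit unsigned int on a two's-complement machine:

```c
#include <stdio.h>

/* Hypothetical dump helpers: format the same stored setting two ways. */
void dump_signed(char *buf, size_t len, unsigned int setting)
{
    /* Reinterpreting the value as int is what a %d-style dump shows;
     * the maximum unsigned value comes out as -1. */
    snprintf(buf, len, "%d", (int)setting);
}

void dump_unsigned(char *buf, size_t len, unsigned int setting)
{
    /* %u prints the value as it is actually stored. */
    snprintf(buf, len, "%u", setting);
}
```

Passing (unsigned int)-1 to both helpers gives "-1" from the first and "4294967295" from the second under the stated assumptions.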
Created attachment 114165
kludge around ccs's lack of find_css_long
This is a quick patch that can fix this bug. It kludges around things by
letting users specify unsigned longs as a string, i.e. lt_high_locks =
"4294967295" (which would be the maximum value).
This also changes a bunch of %d to %u in the config dump function.
Checked this into CVS.
Oh, you can still use numbers to set lt_high_locks, just in case that wasn't clear.
Without the patch, the following two settings effectively turn the HighWater
lock drop request off.
lt_high_locks = -1
lt_drop_req_rate = -1
This actually sets both values to the maximum of an unsigned integer.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.