The Raft election timer can be set from 100ms to 10min; the default is 1s. We need to determine whether this value is too low at high scale and whether we need to increase it.
The election timeout is effectively the outage time for the deployment from an HA perspective. In the worst case, the cluster cannot process transactions for that duration because the leader may have left the cluster. A timeout that is too low can trigger unnecessary leader elections, and one that is too high lengthens the possible outage. The right election timeout for any environment depends mainly on:
* Network latency between the cluster nodes -- lower is better
* Disk write speed -- faster is better
* Implementation of the heartbeat in the software (in our case ovsdb-server) -- less impact of CPU load on the heartbeat is better for overall stability. At higher scale the software can be starved of CPU, and heartbeat broadcasts can be delayed significantly.
All three of these parameters vary with the deployment environment, so a timeout value tuned in one environment might not work in another. To give a sense of an approximate timeout value for a specific scale, I think we need to test at various scale levels (500 nodes, 1000 nodes, 2000 nodes) and publish the timeout numbers along with network latency, disk speed, and CPU load on the leader node of the cluster. Most surprises generally come from the CPU load and how well the software manages to send the heartbeat.
In my experience, I have never seen this value set to more than 4 seconds for a 3-node cluster where network latency is <100ms, disk write speed is >700mbps, and average CPU load is 30-40%. But we need to evaluate similar numbers for our environment and publish them as recommendations for various scales. The current default value of 1 second is on the higher side for a default deployment, so I would recommend changing the default value during the OpenShift deployment based on the size of the deployment.
Food for thought: given that ovsdb-server allows the timeout value to be set dynamically, we can automate changing it based on the average load on the cluster leader (and network latency and disk speed).
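As a rough illustration of what such automation would have to do once it has picked a target, here is a minimal Python sketch that builds the `ovs-appctl ... cluster/change-election-timer` invocations needed to move the timer from one value to another. It rests on assumptions that are not stated in this bug: the control socket path and database name are placeholders for a typical OVN southbound setup, and the rule that a single change may at most double the current value is my understanding of ovsdb-server behaviour and should be checked against the version in use. The helper names are hypothetical, and the sketch only builds the commands; it does not run them.

```python
# Bounds taken from this report: the timer can be set from 100ms to 10min.
MIN_MS = 100
MAX_MS = 600_000


def timer_steps(current_ms, target_ms):
    """Return the successive timer values to request to reach target_ms.

    Assumption (not from this bug): one change may at most double the
    current value, so a large increase has to be applied in several steps.
    A decrease is applied in a single step.
    """
    for value in (current_ms, target_ms):
        if not MIN_MS <= value <= MAX_MS:
            raise ValueError(f"{value} ms is outside {MIN_MS}..{MAX_MS} ms")
    steps = []
    value = current_ms
    while value != target_ms:
        if target_ms > value:
            value = min(value * 2, target_ms)
        else:
            value = target_ms
        steps.append(value)
    return steps


def change_timer_commands(current_ms, target_ms,
                          ctl="/var/run/ovn/ovnsb_db.ctl",  # assumed path
                          db="OVN_Southbound"):             # assumed DB name
    """Build (but do not run) one ovs-appctl command per step."""
    return [
        ["ovs-appctl", "-t", ctl, "cluster/change-election-timer", db, str(ms)]
        for ms in timer_steps(current_ms, target_ms)
    ]


if __name__ == "__main__":
    for argv in change_timer_commands(1000, 5000):
        print(" ".join(argv))
```

Choosing the target from leader load, network latency, and disk speed is deliberately left out, since no policy for that has been agreed here.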
The following PR is under review for this bug:
Just to add some more details about the election-timer value we need to set to scale to a certain number of nodes:
<100 nodes -- election-timer=5 seconds
<200 nodes -- election-timer=10 seconds
<300 nodes -- election-timer=20 seconds
<400 nodes -- election-timer=40 seconds
<500 nodes -- election-timer=50 seconds
<600 nodes -- election-timer=60 seconds
ovsdb-server doesn't scale beyond that; it starts hitting issues where the sb-db cluster goes into a bad state and doesn't recover unless it's scaled down.
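For anyone who wants to apply the tiers above in tooling, a minimal Python sketch of the lookup follows. The function name is hypothetical, and the values are only the ones reported in this comment for this environment; they are not a general recommendation. Anything at 600 nodes or above is rejected because no working value was found at that scale.

```python
# (exclusive upper bound on node count, election-timer in seconds),
# copied from the tiers listed in this comment.
TIERS = [
    (100, 5),
    (200, 10),
    (300, 20),
    (400, 40),
    (500, 50),
    (600, 60),
]


def election_timer_ms(node_count):
    """Return the election-timer in milliseconds for a given node count."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    for limit, seconds in TIERS:
        if node_count < limit:
            return seconds * 1000
    # No data beyond this point: the sb-db cluster was seen going into a
    # bad state at larger scale.
    raise ValueError("no known working election-timer for 600+ nodes")


if __name__ == "__main__":
    for nodes in (50, 250, 599):
        print(nodes, election_timer_ms(nodes))
```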
Setting this to 4.6 so we can land the fix in master. We can decide if we need a backport after we get it in.
(In reply to Ben Bennett from comment #4)
> Setting this to 4.6 so we can land the fix in master. We can decide if we
> need a backport after we get it in.
I think it will need a backport to 4.5 for webscale. Not sure about 4.4.z.
Dcbw can confirm.
On a 4.5.7 cluster on BM, I verified that the timer has been set to 5000ms by default.
Marking VERIFIED based on comment 10. Thanks, Sai.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.