Bug 1822296 - [OVN] [SCALE] Investigate/Adjust RAFT election timer
Summary: [OVN] [SCALE] Investigate/Adjust RAFT election timer
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Anil Vishnoi
QA Contact: Mike Fiedler
URL:
Whiteboard: SDN-CI-IMPACT
Depends On:
Blocks: 1851518
Reported: 2020-04-08 16:58 UTC by Tim Rozet
Modified: 2020-10-27 15:58 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1851518
Environment:
Last Closed: 2020-10-27 15:57:47 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-network-operator pull 615 | 0 | None | closed | Bug 1822296: Expose raft (nb-db/sb-db) election-timer and ovn-controller inactivit… | 2020-09-29 17:30:13 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 15:58:14 UTC

Description Tim Rozet 2020-04-08 16:58:09 UTC
The Raft election timer can be set anywhere from 100ms to 10min; the default is 1s. We need to determine whether this value is too low at high scale and whether we need to increase it.

Comment 1 Anil Vishnoi 2020-04-16 01:52:20 UTC
From an HA perspective, the election timeout is effectively the outage time for the deployment. In the worst case, the cluster cannot process transactions for that duration because the leader may have left the cluster. A lower timeout value can lead to unnecessary leader elections, and a higher timeout value increases the possible outage time. The right election timeout for a given environment depends mainly on:
* Network latency between the cluster nodes -- lower is better
* Disk write speed -- faster is better
* Implementation of the heartbeat in the software (in our case ovsdb-server) -- the less CPU load affects the heartbeat, the better for overall stability. At higher scale the software can be starved of CPU, and broadcasting the heartbeat can be delayed significantly.
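
The trade-off described above can be illustrated with a toy model. This is only a sketch, not ovsdb-server code: the function name and the sample heartbeat gaps are hypothetical, and real Raft implementations also add randomization to the timeout. It counts how many times a follower would start an election for a given timeout, based on the gaps it observes between heartbeats.

```python
def count_elections(heartbeat_gaps_ms, election_timeout_ms):
    """Count heartbeat gaps long enough to make a follower start an election.

    Toy model: a follower starts an election whenever the time since the
    last heartbeat exceeds the election timeout.
    """
    return sum(1 for gap in heartbeat_gaps_ms if gap > election_timeout_ms)


# Hypothetical gaps (ms) seen by a follower while the leader is CPU-starved.
gaps = [300, 350, 1200, 400, 2500, 320, 4800, 310]

for timeout_ms in (1000, 5000):
    print(timeout_ms, count_elections(gaps, timeout_ms))
```

With these made-up gaps, the shorter timeout triggers elections even though the leader never left; the longer timeout avoids them, at the cost of a longer worst-case outage when the leader really is gone.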

All three of these parameters can vary with the deployment environment, so a timeout value that is right for one environment might not work for another. To give a sense of an approximate timeout value for a specific scale, I think we need to test at various scale levels (500 nodes, 1000 nodes, 2000 nodes) and publish the timeout numbers along with the network latency, disk speed, and CPU load on the leader node of the cluster. Most of the surprises generally come from the CPU load and how well the software manages to send the heartbeat.

In my experience, I have never seen this value set to more than 4 seconds for a 3-node cluster where network latency is <100ms, disk write speed is >700mbps, and average CPU load is 30-40%. But we need to evaluate similar numbers for our environment and publish them as recommendations for various scales. The current default of 1 second is on the low side for a default deployment, so I would recommend changing the default value during OpenShift deployment based on the size of the deployment.

Food for thought: given that ovsdb-server allows the timeout value to be set dynamically, we could automate changing it based on the average load on the cluster leader (and on network latency and disk speed).
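
One detail such automation would probably need to handle, sketched below: to my understanding, ovsdb-server refuses to raise the election timer to more than twice its current value in a single change, so a large increase has to be applied in steps. Treat that 2x limit as an assumption to check against the ovsdb-server version in use; the function name is hypothetical, and the range limits are the ones quoted in the description.

```python
MIN_TIMER_MS = 100       # lower bound quoted in the bug description
MAX_TIMER_MS = 600_000   # upper bound (10 min) quoted in the bug description


def election_timer_steps(current_ms, target_ms):
    """Return the successive timer values to apply to reach target_ms.

    Assumption: a single change may at most double the current value,
    so a large increase is applied as a series of smaller ones.
    """
    for value in (current_ms, target_ms):
        if not MIN_TIMER_MS <= value <= MAX_TIMER_MS:
            raise ValueError(f"timer {value} ms is outside the supported range")
    if target_ms < current_ms:
        return [target_ms]  # decreases are assumed to need only one step
    steps = []
    value = current_ms
    while value < target_ms:
        value = min(value * 2, target_ms)
        steps.append(value)
    return steps


print(election_timer_steps(1000, 5000))
```

Each value in the returned list would then be applied on the cluster leader in turn, waiting for one change to be committed before issuing the next.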

Comment 2 Anil Vishnoi 2020-05-29 16:38:46 UTC
The following PR is under review for this bug:

https://github.com/openshift/cluster-network-operator/pull/615

Comment 3 Anil Vishnoi 2020-06-03 01:31:52 UTC
Just to add some more details about the election-timer value we need to set in order to scale to a certain number of nodes:

<100 nodes -- election-timer=5 seconds
<200 nodes -- election-timer=10 seconds
<300 nodes -- election-timer=20 seconds
<400 nodes -- election-timer=40 seconds
<500 nodes -- election-timer=50 seconds
<600 nodes -- election-timer=60 seconds
ovsdb-server doesn't scale beyond that: it starts hitting issues where the sb-db cluster goes into a bad state and doesn't recover unless it's scaled down.
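
The list above can be expressed as a simple lookup. This is a minimal sketch with hypothetical names; the thresholds and timer values are copied from this comment, and node counts of 600 or more are rejected because no value is given for them.

```python
import bisect

# Exclusive upper bounds on node count and the matching election-timer
# values in seconds, taken from the list in this comment.
NODE_LIMITS = [100, 200, 300, 400, 500, 600]
TIMERS_S = [5, 10, 20, 40, 50, 60]


def election_timer_for(nodes):
    """Return the suggested election-timer in seconds for a node count."""
    if nodes < 0:
        raise ValueError("node count cannot be negative")
    index = bisect.bisect_right(NODE_LIMITS, nodes)
    if index == len(NODE_LIMITS):
        raise ValueError("no election-timer value is listed for 600+ nodes")
    return TIMERS_S[index]


print(election_timer_for(250))
```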

Comment 4 Ben Bennett 2020-06-10 13:37:34 UTC
Setting this to 4.6 so we can land the fix in master.  We can decide if we need a backport after we get it in.

Comment 5 Rashid Khan 2020-06-25 15:22:35 UTC
(In reply to Ben Bennett from comment #4)
> Setting this to 4.6 so we can land the fix in master.  We can decide if we
> need a backport after we get it in.

Hi Ben,
I think it will need a backport to 4.5 for webscale. Not sure about 4.4.z.
Dcbw can confirm.

Comment 10 Sai Sindhur Malleni 2020-09-02 20:56:19 UTC
On a 4.5.7 cluster on BM, I verified that the timer has been set to 5000ms by default.
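
A check like this could be scripted by reading the timer from the database's cluster status output. The sketch below is an assumption-heavy illustration: the sample text is abbreviated and made up rather than captured from this cluster, and the `Election timer:` line format is assumed and should be confirmed against the output of the ovsdb-server version in use.

```python
import re


def parse_election_timer_ms(status_text):
    """Extract the election timer (ms) from cluster status text, or None."""
    match = re.search(r"^Election timer:\s*(\d+)", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


# Illustrative, abbreviated sample; not real output from this bug.
sample_status = """\
Name: OVN_Southbound
Role: leader
Election timer: 5000
"""

print(parse_election_timer_ms(sample_status))
```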

Comment 12 Mike Fiedler 2020-09-04 13:04:06 UTC
Marking VERIFIED based on comment 10.   Thanks, Sai.

Comment 14 errata-xmlrpc 2020-10-27 15:57:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

