1822296 – [OVN] [SCALE] Investigate/Adjust RAFT election timer

Summary: [OVN] [SCALE] Investigate/Adjust RAFT election timer

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Anil Vishnoi
QA Contact:	Mike Fiedler
Docs Contact:
URL:
Whiteboard:	SDN-CI-IMPACT
Depends On:
Blocks:	1851518

Reported:	2020-04-08 16:58 UTC by Tim Rozet
Modified:	2020-10-27 15:58 UTC (History)
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1851518
Environment:
Last Closed:	2020-10-27 15:57:47 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 615	0	None	closed	Bug 1822296: Expose raft (nb-db/sb-db) election-timer and ovn-controller inactivit…	2020-09-29 17:30:13 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 15:58:14 UTC

Description Tim Rozet 2020-04-08 16:58:09 UTC

The RAFT election timer can be set anywhere from 100ms to 10min; the default is 1s. We need to determine whether this value is too low at high scale and whether we need to increase it.

Comment 1 Anil Vishnoi 2020-04-16 01:52:20 UTC

The election timeout is effectively the outage time for the deployment from an HA perspective. In the worst case, the cluster cannot process transactions for that duration because the leader may have left the cluster. A lower timeout can lead to unnecessary leader elections, and a higher timeout can increase the possible outage time. The right election timeout for an environment depends mainly on:
* Network latency between the cluster nodes -- lower is better
* Disk write speed -- faster is better
* Implementation of the heartbeat in the software (in our case ovsdb-server) -- the less CPU load affects the heartbeat, the better for overall stability. Higher scale can starve the software of CPU, and heartbeat broadcasts can be delayed significantly.

All three of these parameters vary with the deployment environment, so a timeout value that works in one environment might not work in another. To give a sense of an approximate timeout value for a specific scale, I think we need to test at various scale levels (500 nodes, 1000 nodes, 2000 nodes) and publish the timeout numbers along with the network latency, disk speed, and CPU load on the leader node of the cluster. Most of the surprises generally come from the CPU load and how well the software manages to send the heartbeat.

In my experience, I have never seen this value set to more than 4 seconds for a 3-node cluster where network latency is <100ms, disk write speed is >700mbps, and average CPU load is 30-40%. But we need to evaluate similar numbers for our environment and publish them as recommendations for various scales. The current default value of 1 second is on the higher side for a default deployment, so I would recommend changing the default value during the OpenShift deployment based on the size of the deployment.

Food for thought: given that ovsdb-server allows the timeout value to be set dynamically, we could automate changing it based on the average load on the cluster leader (and on network latency and disk speed).
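
As a rough illustration of what such automation would have to do, here is a minimal Python sketch that plans the timer changes needed to move from one value to another. It assumes the timer is changed with the ovsdb-server `cluster/change-election-timer` appctl command, run against the current leader, and that each call may raise the timer to at most twice its current value; both points should be checked against the ovsdb-server manual for the version in use. The function names, the control socket path, and the database name used below are hypothetical placeholders, and the sketch only builds command strings without executing anything.

```python
from typing import List

# Bounds quoted in the bug description: 100ms to 10min.
MIN_TIMER_MS = 100
MAX_TIMER_MS = 10 * 60 * 1000


def election_timer_steps(current_ms: int, target_ms: int) -> List[int]:
    """Return the timer values to apply, one per call, to reach target_ms.

    Assumption: an increase may at most double the current value per call,
    while a decrease can be applied in a single call.
    """
    for value in (current_ms, target_ms):
        if not MIN_TIMER_MS <= value <= MAX_TIMER_MS:
            raise ValueError(f"timer value {value}ms is outside the supported range")

    steps: List[int] = []
    value = current_ms
    # Double until the target is reached, capping the last step at the target.
    while value < target_ms:
        value = min(value * 2, target_ms)
        steps.append(value)
    if target_ms < current_ms:
        steps.append(target_ms)
    return steps


def change_timer_commands(steps: List[int], ctl_socket: str, database: str) -> List[str]:
    """Render one ovs-appctl invocation per step (strings only, nothing is run)."""
    return [
        f"ovs-appctl -t {ctl_socket} cluster/change-election-timer {database} {ms}"
        for ms in steps
    ]


if __name__ == "__main__":
    # Hypothetical socket path and database name, for illustration only.
    plan = election_timer_steps(1000, 5000)
    for cmd in change_timer_commands(plan, "/var/run/ovn/ovnsb_db.ctl", "OVN_Southbound"):
        print(cmd)
```

Under the doubling assumption, going from the 1s default to 5s takes three calls (2000, 4000, 5000). A real controller would also need to locate the current leader and confirm each change took effect before issuing the next one.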

Comment 2 Anil Vishnoi 2020-05-29 16:38:46 UTC

The following PR is under review for this bug:

https://github.com/openshift/cluster-network-operator/pull/615

Comment 3 Anil Vishnoi 2020-06-03 01:31:52 UTC

Just to add some more details about the election-timer value we need to set to scale to a certain number of nodes:

<100 nodes -- election-timer=5 seconds
<200 nodes -- election-timer=10 seconds
<300 nodes -- election-timer=20 seconds
<400 nodes -- election-timer=40 seconds
<500 nodes -- election-timer=50 seconds
<600 nodes -- election-timer=60 seconds
ovsdb-server doesn't scale beyond that; it starts hitting issues where the sb-db cluster goes into a bad state and doesn't recover unless it's scaled down.
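
For anyone who wants to apply the numbers above programmatically, the table can be expressed as a small lookup. This is only a sketch of the figures in this comment, not the logic in the PR; the function name is hypothetical, and treating 600 nodes or more as an error is an assumption based on the statement that ovsdb-server doesn't scale beyond that.

```python
# (exclusive upper bound on node count, election timer in seconds),
# copied from the table in this comment.
ELECTION_TIMER_BY_SCALE = [
    (100, 5),
    (200, 10),
    (300, 20),
    (400, 40),
    (500, 50),
    (600, 60),
]


def recommended_election_timer_ms(node_count: int) -> int:
    """Return the election-timer value in milliseconds for a given node count."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    for upper_bound, seconds in ELECTION_TIMER_BY_SCALE:
        if node_count < upper_bound:
            return seconds * 1000
    # Assumption: no recommendation exists at or above the last bound.
    raise ValueError(f"no recommended election timer for {node_count} nodes")


if __name__ == "__main__":
    print(recommended_election_timer_ms(250))
```

The value is returned in milliseconds because that is the unit the timer was later verified in (5000ms).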

Comment 4 Ben Bennett 2020-06-10 13:37:34 UTC

Setting this to 4.6 so we can land the fix in master.  We can decide if we need a backport after we get it in.

Comment 5 Rashid Khan 2020-06-25 15:22:35 UTC

(In reply to Ben Bennett from comment #4)
> Setting this to 4.6 so we can land the fix in master.  We can decide if we
> need a backport after we get it in.

Hi Ben,
I think it will need a backport to 4.5 for webscale. Not sure about 4.4.z.
Dcbw can confirm.

Comment 10 Sai Sindhur Malleni 2020-09-02 20:56:19 UTC

On a 4.5.7 cluster on BM, I verified that the timer has been set to 5000ms by default.

Comment 12 Mike Fiedler 2020-09-04 13:04:06 UTC

Marking VERIFIED based on comment 10.   Thanks, Sai.

Comment 14 errata-xmlrpc 2020-10-27 15:57:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196