Bug 1883662

Summary: [sbdb][raft] Tune out of the box timer to be 16sec
Product: OpenShift Container Platform
Reporter: Joe Talerico <jtaleric>
Component: Networking
Assignee: Anil Vishnoi <avishnoi>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: urgent
CC: avishnoi, dblack, dcbw, fiezzi, mifiedle, rbrattai
Version: 4.6
Target Milestone: ---
Target Release: 4.6.0
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:46:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Joe Talerico 2020-09-29 19:43:30 UTC
Description of problem:
During our scale test on a cluster of the following size:

root@ip-172-31-68-73: ~/e2e-benchmarking/workloads/network-perf # oc get services -A | wc -l
3005
root@ip-172-31-68-73: ~/e2e-benchmarking/workloads/network-perf # oc get nodes | grep Ready | wc -l
107
root@ip-172-31-68-73: ~/e2e-benchmarking/workloads/network-perf # oc get pods -A | wc -l
14738

We noticed that sbdb had many leader elections:
Name: OVN_Southbound
Cluster ID: fffd (fffd1a68-2815-450b-8d44-823e2a1c7e02)
Server ID: 6a7d (6a7d99a3-62d4-4434-938e-56081abf884c)
Address: ssl:10.0.203.77:9644
Status: cluster member
Role: follower
Term: 318
Leader: 2234
Vote: 2234
Election timer: 16000
Log: [106718, 106777]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 <-2234 <-058a ->058a
Servers:
    058a (058a at ssl:10.0.187.209:9644)
    2234 (2234 at ssl:10.0.148.212:9644)
    6a7d (6a7d at ssl:10.0.203.77:9644) (self)
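
A status dump like the one above can be compared across two points in time to estimate how often elections happen, since every new Raft term begins with an election attempt. The following is only a minimal sketch: the function names `parse_cluster_status` and `elections_between` are hypothetical, and it assumes the dump keeps the plain `Key: value` layout shown above.

```python
def parse_cluster_status(text):
    """Parse the top-level 'Key: value' lines of a cluster status dump."""
    fields = {}
    for line in text.splitlines():
        # Indented lines belong to the server list; "Servers:" has no value.
        if line.startswith(" ") or ": " not in line:
            continue
        key, value = line.split(": ", 1)
        fields[key.strip()] = value.strip()
    return fields


def elections_between(before, after):
    """Number of new Raft terms (election attempts) between two snapshots."""
    return int(after["Term"]) - int(before["Term"])


if __name__ == "__main__":
    sample = (
        "Name: OVN_Southbound\n"
        "Role: follower\n"
        "Term: 318\n"
        "Election timer: 16000\n"
        "Servers:\n"
        "    058a (058a at ssl:10.0.187.209:9644)\n"
    )
    status = parse_cluster_status(sample)
    print(status["Term"], status["Election timer"])
```

A term that keeps climbing between snapshots taken while the cluster is under load would point to repeated elections; a stable term would suggest the leader is holding.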

We tuned the sbdb election timer to 16 seconds, and sbdb stopped running through elections.

CNO should default to a 16-second election timer for sbdb instead of the 5-second timer it currently sets.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-22-130743

How reproducible:
100%

Comment 4 Ross Brattain 2020-10-06 20:15:08 UTC
Timer is set correctly on 4.6.0-0.nightly-2020-10-05-234751


ovnkube-master-lrrw2
734:    - name: OVN_NB_RAFT_ELECTION_TIMER
735-      value: "10000"
--
1070:    - name: OVN_SB_RAFT_ELECTION_TIMER
1071-      value: "16000"

ovnkube-master-thrh2
734:    - name: OVN_NB_RAFT_ELECTION_TIMER
735-      value: "10000"
--
1070:    - name: OVN_SB_RAFT_ELECTION_TIMER
1071-      value: "16000"

ovnkube-master-9t7wm
734:    - name: OVN_NB_RAFT_ELECTION_TIMER
735-      value: "10000"
--
1070:    - name: OVN_SB_RAFT_ELECTION_TIMER
1071-      value: "16000"
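
The same check can be scripted rather than read by eye. This is a rough sketch, not the method used for the verification above: `election_timers` is a hypothetical helper, and it assumes each `name:` line is followed directly by its `value:` line, as in the excerpts above.

```python
import re

# Match an election-timer env entry and the numeric value on the next line.
# Any prefix before "value:" on that line (such as grep's "735-") is ignored.
ENV_PATTERN = re.compile(
    r'name:\s*(OVN_[NS]B_RAFT_ELECTION_TIMER)\s*\n[^\n]*?value:\s*"?(\d+)"?'
)


def election_timers(manifest_text):
    """Return {env var name: timer in milliseconds} found in the text."""
    return {name: int(value) for name, value in ENV_PATTERN.findall(manifest_text)}


if __name__ == "__main__":
    excerpt = (
        '734:    - name: OVN_NB_RAFT_ELECTION_TIMER\n'
        '735-      value: "10000"\n'
        '--\n'
        '1070:    - name: OVN_SB_RAFT_ELECTION_TIMER\n'
        '1071-      value: "16000"\n'
    )
    print(election_timers(excerpt))
```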


log_ovnkube-master-thrh2
108:2020-10-06T12:52:06Z|00005|raft|INFO|Election timer changed from 1000 to 2000
109:2020-10-06T12:52:06Z|00006|raft|INFO|Election timer changed from 2000 to 4000
110:2020-10-06T12:52:06Z|00007|raft|INFO|Election timer changed from 4000 to 8000
111:2020-10-06T12:52:06Z|00008|raft|INFO|Election timer changed from 8000 to 16000
2903:2020-10-06T12:51:34Z|00005|raft|INFO|Election timer changed from 1000 to 2000
2904:2020-10-06T12:51:34Z|00006|raft|INFO|Election timer changed from 2000 to 4000
2905:2020-10-06T12:51:34Z|00007|raft|INFO|Election timer changed from 4000 to 8000
2906:2020-10-06T12:51:34Z|00008|raft|INFO|Election timer changed from 8000 to 10000

log_ovnkube-master-lrrw2
229:2020-10-06T12:51:36Z|00022|raft|INFO|Election timer changed from 10000 to 2000
230:2020-10-06T12:51:36Z|00023|raft|INFO|Election timer changed from 2000 to 4000
231:2020-10-06T12:51:36Z|00024|raft|INFO|Election timer changed from 4000 to 8000
232:2020-10-06T12:51:36Z|00025|raft|INFO|Election timer changed from 8000 to 10000
2128:2020-10-06T12:52:09Z|00022|raft|INFO|Election timer changed from 16000 to 2000
2129:2020-10-06T12:52:09Z|00023|raft|INFO|Election timer changed from 2000 to 4000
2130:2020-10-06T12:52:09Z|00024|raft|INFO|Election timer changed from 4000 to 8000
2131:2020-10-06T12:52:09Z|00025|raft|INFO|Election timer changed from 8000 to 16000

log_ovnkube-master-9t7wm
125:2020-10-06T12:52:12Z|00023|raft|INFO|Election timer changed from 16000 to 2000
126:2020-10-06T12:52:12Z|00024|raft|INFO|Election timer changed from 2000 to 4000
127:2020-10-06T12:52:12Z|00025|raft|INFO|Election timer changed from 4000 to 8000
128:2020-10-06T12:52:12Z|00026|raft|INFO|Election timer changed from 8000 to 16000
3335:2020-10-06T12:51:39Z|00023|raft|INFO|Election timer changed from 10000 to 2000
3336:2020-10-06T12:51:39Z|00024|raft|INFO|Election timer changed from 2000 to 4000
3337:2020-10-06T12:51:39Z|00025|raft|INFO|Election timer changed from 4000 to 8000
3338:2020-10-06T12:51:39Z|00026|raft|INFO|Election timer changed from 8000 to 10000
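
The stepwise increases in these logs (1000, 2000, 4000, 8000, then the target) are consistent with the ovsdb-server restriction, as I understand it, that a single election timer change cannot more than double the current value. The sketch below only reproduces that arithmetic: `timer_steps` is a hypothetical helper, not part of OVN or CNO, and the 1000 ms starting point is taken from the log lines above.

```python
def timer_steps(current_ms, target_ms):
    """List the intermediate timer values when each change may at most double it."""
    if current_ms <= 0 or target_ms < current_ms:
        raise ValueError("expected 0 < current_ms <= target_ms")
    steps = []
    while current_ms < target_ms:
        # Each step doubles the timer, capped at the requested target.
        current_ms = min(current_ms * 2, target_ms)
        steps.append(current_ms)
    return steps


if __name__ == "__main__":
    print(timer_steps(1000, 16000))  # southbound target from this bug
    print(timer_steps(1000, 10000))  # northbound target from this bug
```

Under that assumption, reaching 16000 ms and 10000 ms from 1000 ms takes four changes each, which matches the four "Election timer changed" lines per database in the ovnkube-master-thrh2 log.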

Comment 5 Anurag saxena 2020-10-08 13:26:02 UTC
@Joe, how many nodes were involved in your scale test? I guess we might need around the same number to verify this, although functionally, per Ross's comment above, it seems okay.

Comment 6 Joe Talerico 2020-10-08 13:30:17 UTC
(In reply to Anurag saxena from comment #5)
> @Joe, How many nodes were involved in your scale test? Guess we might need
> around same to verify this. Although functionally as per Ross comments
> above, it seems okay.

So, the bummer here is that I was able to cause many leader elections even with the 16-second timer... I am not sure this is the silver bullet for the issue. See https://bugzilla.redhat.com/show_bug.cgi?id=1855408#c30

Comment 7 Mike Fiedler 2020-10-08 14:44:33 UTC
Marking verified. The timer has been changed, but there may still be additional issues, per comment 6, which are tracked in bug 1855408. No need to keep this one open.

Comment 9 errata-xmlrpc 2020-10-27 16:46:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196