Bug 1883662 - [sbdb][raft] Tune out of the box timer to be 16sec
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Anil Vishnoi
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-29 19:43 UTC by Joe Talerico
Modified: 2020-10-27 16:46 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:46:29 UTC
Target Upstream Version:
Embargoed:



Links:
Github openshift/cluster-network-operator pull 812 (closed): Bug 1883662: Tune sb-db raft cluster election-timer (last updated 2020-10-26 10:38:56 UTC)
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:46:51 UTC)

Description Joe Talerico 2020-09-29 19:43:30 UTC
Description of problem:
During a scale test on a cluster of the following size:

root@ip-172-31-68-73: ~/e2e-benchmarking/workloads/network-perf # oc get services -A | wc -l
3005
root@ip-172-31-68-73: ~/e2e-benchmarking/workloads/network-perf # oc get nodes | grep Ready | wc -l
107
root@ip-172-31-68-73: ~/e2e-benchmarking/workloads/network-perf # oc get pods -A | wc -l
14738

We noticed that the sbdb (OVN_Southbound) went through many leader elections:
Name: OVN_Southbound
Cluster ID: fffd (fffd1a68-2815-450b-8d44-823e2a1c7e02)
Server ID: 6a7d (6a7d99a3-62d4-4434-938e-56081abf884c)
Address: ssl:10.0.203.77:9644
Status: cluster member
Role: follower
Term: 318
Leader: 2234
Vote: 2234
Election timer: 16000
Log: [106718, 106777]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 <-2234 <-058a ->058a
Servers:
    058a (058a at ssl:10.0.187.209:9644)
    2234 (2234 at ssl:10.0.148.212:9644)
    6a7d (6a7d at ssl:10.0.203.77:9644) (self)
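
Since each Raft election starts a new term, the Term value in this status output is a rough proxy for how many elections have taken place. A minimal sketch for pulling single fields out of the output so they can be sampled over time; the `status_field` helper and the control-socket path in the comment are assumptions for illustration, not something taken from this report:

```shell
# Print the value of one "Key: value" field from cluster/status output on stdin.
status_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }'
}

# Example with a fragment of the output above.
printf 'Role: follower\nTerm: 318\nElection timer: 16000\n' | status_field Term

# Against a live database (the socket path is an assumption and varies by install):
#   ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | status_field Term
```

Sampling Term at intervals and comparing the values would show whether elections are still happening after a timer change.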

We tuned the sbdb election timer to 16 seconds, and the sbdb stopped going through elections.

CNO should default the sbdb election timer to 16 seconds instead of the 5 seconds it currently uses.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-22-130743

How reproducible:
100%

Comment 4 Ross Brattain 2020-10-06 20:15:08 UTC
Timer is set correctly on 4.6.0-0.nightly-2020-10-05-234751


ovnkube-master-lrrw2
734:    - name: OVN_NB_RAFT_ELECTION_TIMER
735-      value: "10000"
--
1070:    - name: OVN_SB_RAFT_ELECTION_TIMER
1071-      value: "16000"

ovnkube-master-thrh2
734:    - name: OVN_NB_RAFT_ELECTION_TIMER
735-      value: "10000"
--
1070:    - name: OVN_SB_RAFT_ELECTION_TIMER
1071-      value: "16000"

ovnkube-master-9t7wm
734:    - name: OVN_NB_RAFT_ELECTION_TIMER
735-      value: "10000"
--
1070:    - name: OVN_SB_RAFT_ELECTION_TIMER
1071-      value: "16000"
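
The check above can be repeated per pod by reading the env var value straight from the pod YAML. A minimal sketch, assuming the `- name:` / `value:` layout shown in the grep output; `env_value` is a hypothetical helper, and the namespace in the comment is an assumption:

```shell
# Print the value of a named container env var from pod YAML read on stdin.
env_value() {
    awk -v name="$1" '
        # Remember that the previous line named the variable we want.
        $1 == "-" && $2 == "name:" && $3 == name { found = 1; next }
        # Only the line directly after the name is considered.
        found { if ($1 == "value:") { gsub(/"/, "", $2); print $2 } found = 0 }
    '
}

# Example with the two lines shown above.
printf '    - name: OVN_SB_RAFT_ELECTION_TIMER\n      value: "16000"\n' |
    env_value OVN_SB_RAFT_ELECTION_TIMER

# Against a live pod (the namespace is an assumption):
#   oc get pod ovnkube-master-lrrw2 -n openshift-ovn-kubernetes -o yaml | env_value OVN_SB_RAFT_ELECTION_TIMER
```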


log_ovnkube-master-thrh2
108:2020-10-06T12:52:06Z|00005|raft|INFO|Election timer changed from 1000 to 2000
109:2020-10-06T12:52:06Z|00006|raft|INFO|Election timer changed from 2000 to 4000
110:2020-10-06T12:52:06Z|00007|raft|INFO|Election timer changed from 4000 to 8000
111:2020-10-06T12:52:06Z|00008|raft|INFO|Election timer changed from 8000 to 16000
2903:2020-10-06T12:51:34Z|00005|raft|INFO|Election timer changed from 1000 to 2000
2904:2020-10-06T12:51:34Z|00006|raft|INFO|Election timer changed from 2000 to 4000
2905:2020-10-06T12:51:34Z|00007|raft|INFO|Election timer changed from 4000 to 8000
2906:2020-10-06T12:51:34Z|00008|raft|INFO|Election timer changed from 8000 to 10000

log_ovnkube-master-lrrw2
229:2020-10-06T12:51:36Z|00022|raft|INFO|Election timer changed from 10000 to 2000
230:2020-10-06T12:51:36Z|00023|raft|INFO|Election timer changed from 2000 to 4000
231:2020-10-06T12:51:36Z|00024|raft|INFO|Election timer changed from 4000 to 8000
232:2020-10-06T12:51:36Z|00025|raft|INFO|Election timer changed from 8000 to 10000
2128:2020-10-06T12:52:09Z|00022|raft|INFO|Election timer changed from 16000 to 2000
2129:2020-10-06T12:52:09Z|00023|raft|INFO|Election timer changed from 2000 to 4000
2130:2020-10-06T12:52:09Z|00024|raft|INFO|Election timer changed from 4000 to 8000
2131:2020-10-06T12:52:09Z|00025|raft|INFO|Election timer changed from 8000 to 16000

log_ovnkube-master-9t7wm
125:2020-10-06T12:52:12Z|00023|raft|INFO|Election timer changed from 16000 to 2000
126:2020-10-06T12:52:12Z|00024|raft|INFO|Election timer changed from 2000 to 4000
127:2020-10-06T12:52:12Z|00025|raft|INFO|Election timer changed from 4000 to 8000
128:2020-10-06T12:52:12Z|00026|raft|INFO|Election timer changed from 8000 to 16000
3335:2020-10-06T12:51:39Z|00023|raft|INFO|Election timer changed from 10000 to 2000
3336:2020-10-06T12:51:39Z|00024|raft|INFO|Election timer changed from 2000 to 4000
3337:2020-10-06T12:51:39Z|00025|raft|INFO|Election timer changed from 4000 to 8000
3338:2020-10-06T12:51:39Z|00026|raft|INFO|Election timer changed from 8000 to 10000
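
The stepwise changes in these logs (for example 1000, 2000, 4000, 8000, 16000) are consistent with ovsdb-server accepting an election timer change of at most double the current value, so reaching the target takes several steps. A minimal sketch of that step calculation; `election_timer_steps` is a hypothetical helper, and the `cluster/change-election-timer` call in the comment uses an assumed socket path:

```shell
# Print the successive election-timer values (ms) needed to raise the timer
# from CURRENT to TARGET when each change may at most double the current value.
election_timer_steps() {
    local current="$1" target="$2"
    while [ "$current" -lt "$target" ]; do
        local next=$((current * 2))
        # Clamp the final step so it does not overshoot the target.
        if [ "$next" -gt "$target" ]; then
            next="$target"
        fi
        echo "$next"
        current="$next"
    done
}

election_timer_steps 1000 16000   # sbdb target from this bug
election_timer_steps 1000 10000   # nbdb target seen in the logs

# Each step could then be applied on the Raft leader, for example
# (the socket path is an assumption and varies by install):
#   ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 2000
```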

Comment 5 Anurag saxena 2020-10-08 13:26:02 UTC
@Joe, how many nodes were involved in your scale test? We might need about the same number to verify this, although functionally it seems okay per Ross's comment above.

Comment 6 Joe Talerico 2020-10-08 13:30:17 UTC
(In reply to Anurag saxena from comment #5)
> @Joe, how many nodes were involved in your scale test? We might need about
> the same number to verify this, although functionally it seems okay per
> Ross's comment above.

So, the bummer here is that I was able to cause many leader elections even with the 16-second timer... I am not sure this is the silver bullet for the issue. See https://bugzilla.redhat.com/show_bug.cgi?id=1855408#c30

Comment 7 Mike Fiedler 2020-10-08 14:44:33 UTC
Marking verified. The timer has been changed, but there may still be additional issues, per comment 6, which are tracked in bug 1855408. No need to keep this one open.

Comment 9 errata-xmlrpc 2020-10-27 16:46:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

