Description of problem:

During our scale test of:

root@ip-172-31-68-73: ~/e2e-benchmarking/workloads/network-perf # oc get services -A | wc -l
3005
root@ip-172-31-68-73: ~/e2e-benchmarking/workloads/network-perf # oc get nodes | grep Ready | wc -l
107
root@ip-172-31-68-73: ~/e2e-benchmarking/workloads/network-perf # oc get pods -A | wc -l
14738

we noticed that sbdb went through many leader elections:

Name: OVN_Southbound
Cluster ID: fffd (fffd1a68-2815-450b-8d44-823e2a1c7e02)
Server ID: 6a7d (6a7d99a3-62d4-4434-938e-56081abf884c)
Address: ssl:10.0.203.77:9644
Status: cluster member
Role: follower
Term: 318
Leader: 2234
Vote: 2234

Election timer: 16000
Log: [106718, 106777]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 <-2234 <-058a ->058a
Servers:
    058a (058a at ssl:10.0.187.209:9644)
    2234 (2234 at ssl:10.0.148.212:9644)
    6a7d (6a7d at ssl:10.0.203.77:9644) (self)

We tuned the sbdb election timer to 16 seconds, and sbdb stopped running through elections. CNO should default to 16 seconds instead of the 5-second timer it currently uses.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-22-130743

How reproducible:
100%
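For anyone who wants to watch for elections while reproducing this, here is a minimal sketch that pulls a few fields out of status output shaped like the paste above. The function names (`parse_cluster_status`, `terms_elapsed`) are hypothetical, and the idea that a rising Term between two samples points to at least one election attempt in between is my reading of Raft, not something confirmed in this bug.

```python
# Sketch only: parses text shaped like the cluster status paste in this report.
# parse_cluster_status and terms_elapsed are hypothetical helper names.

SAMPLE = """\
Name: OVN_Southbound
Address: ssl:10.0.203.77:9644
Status: cluster member
Role: follower
Term: 318
Leader: 2234
Vote: 2234

Election timer: 16000
Log: [106718, 106777]
"""

WANTED = ("Role", "Term", "Leader", "Election timer")
NUMERIC = ("Term", "Election timer")


def parse_cluster_status(text):
    """Return the Role, Term, Leader and Election timer fields as a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as ssl:host:port survive.
        key, sep, value = line.strip().partition(":")
        if sep and key in WANTED:
            fields[key] = value.strip()
    for key in NUMERIC:
        if key in fields:
            fields[key] = int(fields[key])
    return fields


def terms_elapsed(before, after):
    """Difference in Raft term between two parsed samples."""
    return after["Term"] - before["Term"]


if __name__ == "__main__":
    print(parse_cluster_status(SAMPLE))
```

Sampling the status twice and calling `terms_elapsed` on the two results gives a rough signal: a difference of 0 over a test run would suggest the leader stayed stable during that window.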
PR : https://github.com/openshift/cluster-network-operator/pull/812
Timer is set correctly on 4.6.0-0.nightly-2020-10-05-234751

ovnkube-master-lrrw2
734:  - name: OVN_NB_RAFT_ELECTION_TIMER
735-    value: "10000"
--
1070:  - name: OVN_SB_RAFT_ELECTION_TIMER
1071-    value: "16000"

ovnkube-master-thrh2
734:  - name: OVN_NB_RAFT_ELECTION_TIMER
735-    value: "10000"
--
1070:  - name: OVN_SB_RAFT_ELECTION_TIMER
1071-    value: "16000"

ovnkube-master-9t7wm
734:  - name: OVN_NB_RAFT_ELECTION_TIMER
735-    value: "10000"
--
1070:  - name: OVN_SB_RAFT_ELECTION_TIMER
1071-    value: "16000"

log_ovnkube-master-thrh2
108:2020-10-06T12:52:06Z|00005|raft|INFO|Election timer changed from 1000 to 2000
109:2020-10-06T12:52:06Z|00006|raft|INFO|Election timer changed from 2000 to 4000
110:2020-10-06T12:52:06Z|00007|raft|INFO|Election timer changed from 4000 to 8000
111:2020-10-06T12:52:06Z|00008|raft|INFO|Election timer changed from 8000 to 16000
2903:2020-10-06T12:51:34Z|00005|raft|INFO|Election timer changed from 1000 to 2000
2904:2020-10-06T12:51:34Z|00006|raft|INFO|Election timer changed from 2000 to 4000
2905:2020-10-06T12:51:34Z|00007|raft|INFO|Election timer changed from 4000 to 8000
2906:2020-10-06T12:51:34Z|00008|raft|INFO|Election timer changed from 8000 to 10000

log_ovnkube-master-lrrw2
229:2020-10-06T12:51:36Z|00022|raft|INFO|Election timer changed from 10000 to 2000
230:2020-10-06T12:51:36Z|00023|raft|INFO|Election timer changed from 2000 to 4000
231:2020-10-06T12:51:36Z|00024|raft|INFO|Election timer changed from 4000 to 8000
232:2020-10-06T12:51:36Z|00025|raft|INFO|Election timer changed from 8000 to 10000
2128:2020-10-06T12:52:09Z|00022|raft|INFO|Election timer changed from 16000 to 2000
2129:2020-10-06T12:52:09Z|00023|raft|INFO|Election timer changed from 2000 to 4000
2130:2020-10-06T12:52:09Z|00024|raft|INFO|Election timer changed from 4000 to 8000
2131:2020-10-06T12:52:09Z|00025|raft|INFO|Election timer changed from 8000 to 16000

log_ovnkube-master-9t7wm
125:2020-10-06T12:52:12Z|00023|raft|INFO|Election timer changed from 16000 to 2000
126:2020-10-06T12:52:12Z|00024|raft|INFO|Election timer changed from 2000 to 4000
127:2020-10-06T12:52:12Z|00025|raft|INFO|Election timer changed from 4000 to 8000
128:2020-10-06T12:52:12Z|00026|raft|INFO|Election timer changed from 8000 to 16000
3335:2020-10-06T12:51:39Z|00023|raft|INFO|Election timer changed from 10000 to 2000
3336:2020-10-06T12:51:39Z|00024|raft|INFO|Election timer changed from 2000 to 4000
3337:2020-10-06T12:51:39Z|00025|raft|INFO|Election timer changed from 4000 to 8000
3338:2020-10-06T12:51:39Z|00026|raft|INFO|Election timer changed from 8000 to 10000
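The stepwise changes in the logs above (2000, 4000, 8000, then the target) look like the timer being raised in several steps rather than in one jump. As far as I understand, ovsdb-server refuses a single increase to more than double the current value, so a caller has to step up to the target. A minimal sketch of that stepping, assuming the doubling cap and using a hypothetical helper name (`election_timer_steps`); this is not the CNO script itself:

```python
# Sketch only: computes the sequence of timer values needed to reach a target
# when each increase may at most double the current value (assumed cap).
# election_timer_steps is a hypothetical name, not taken from CNO or OVN.

def election_timer_steps(current_ms, target_ms):
    """Return the list of intermediate timer values, ending at target_ms.

    Only increases are handled; if target_ms <= current_ms the list is empty.
    """
    if current_ms <= 0 or target_ms <= 0:
        raise ValueError("timer values must be positive milliseconds")
    steps = []
    while current_ms < target_ms:
        # Never more than double per step, and never overshoot the target.
        current_ms = min(current_ms * 2, target_ms)
        steps.append(current_ms)
    return steps


if __name__ == "__main__":
    print(election_timer_steps(1000, 16000))  # sbdb target from this bug
    print(election_timer_steps(1000, 10000))  # nbdb target from this bug
```

Starting from 1000, the two calls yield the same value sequences that appear in the log_ovnkube-master-thrh2 excerpt for the 16000 and 10000 targets.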
@Joe, how many nodes were involved in your scale test? I guess we would need about the same number to verify this, although functionally, per Ross's comments above, it seems okay.
(In reply to Anurag saxena from comment #5)
> @Joe, How many nodes were involved in your scale test? Guess we might need
> around same to verify this. Although functionally as per Ross comments
> above, it seems okay.

So, the bummer here is that I was able to cause many leader elections even with the 16 second timer... I am not sure if this is the silver bullet for the issue.

See https://bugzilla.redhat.com/show_bug.cgi?id=1855408#c30
Marking verified. The timer has been changed, but there may still be additional issues, per comment 6; those are tracked in bug 1855408. No need to keep this one open.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196