Bug 2041546

Summary: ovnkube: set election timer at RAFT cluster creation time
Product: OpenShift Container Platform Reporter: Dan Williams <dcbw>
Component: Networking Assignee: Dan Williams <dcbw>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: high CC: anbhat, rbrattai
Version: 4.10   
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-10 16:40:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Dan Williams 2022-01-17 16:46:03 UTC
OVN support added via https://bugzilla.redhat.com/show_bug.cgi?id=1831778

This change ensures that the DBs have the correct RAFT election timer from the moment the cluster is created. It closes a race in which DB leadership changes before the CNO container script is able to set the timer. The dbchecker later fixes the timer in that case, but we shouldn't have to rely on that.

QE verification: the DB container logs should show the election timer being set to the specified value immediately, like so:

++ /usr/share/ovn/scripts/ovn-ctl --help
++ grep '\--db-nb-election-timer'
+ test -n '  --db-nb-election-timer=MS OVN Northbound RAFT db election timer to use on db creation (in milliseconds)'
+ election_timer=--db-nb-election-timer=10000
+ wait 130
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.0.2 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m' --db-nb-election-timer=10000 run_nb_ovsdb
Creating cluster database /etc/ovn/ovnnb_db.db.
2022-01-17T16:52:28.726Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2022-01-17T16:52:28.728Z|00002|raft|INFO|term 2: 109914 ms timeout expired, starting election
2022-01-17T16:52:28.728Z|00003|raft|INFO|term 2: elected leader by 1+ of 1 servers
2022-01-17T16:52:28.728Z|00004|raft|INFO|local server ID is 5df2
ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory)
2022-01-17T16:52:28.732Z|00005|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.16.2
2022-01-17T16:52:28.732Z|00006|raft|INFO|Election timer changed from 1000 to 10000

That is, in the "changed from X to Y" line, Y (in milliseconds) should match the OVN_[N|S]B_RAFT_ELECTION_TIMER env var specified in the CNO's manifests/0000_70_cluster-network-operator_03_deployment.yaml, times 1000.
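
For QE, the check above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the regular expression is based solely on the log line quoted in this description, and the env var value of 10 is an assumption inferred from the 10000 ms seen in the log.

```python
import re

# Pattern for the ovsdb-server RAFT line quoted in the description, e.g.
# "...|raft|INFO|Election timer changed from 1000 to 10000"
TIMER_CHANGE = re.compile(
    r"\|raft\|INFO\|Election timer changed from (\d+) to (\d+)"
)


def final_election_timer_ms(log_text):
    """Return the 'to' value (ms) of the last timer-change line, or None."""
    matches = TIMER_CHANGE.findall(log_text)
    return int(matches[-1][1]) if matches else None


def timer_matches_env(log_text, env_value):
    """True if the logged timer (ms) equals the env var value times 1000."""
    return final_election_timer_ms(log_text) == env_value * 1000


# Log line copied from the description; env value 10 is an assumed example.
sample = (
    "2022-01-17T16:52:28.732Z|00006|raft|INFO|"
    "Election timer changed from 1000 to 10000"
)
print(timer_matches_env(sample, 10))
```

The container log text would be fed in as log_text, and the value of OVN_NB_RAFT_ELECTION_TIMER or OVN_SB_RAFT_ELECTION_TIMER from the CNO deployment manifest as env_value.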

Comment 2 Surya Seetharaman 2022-01-27 15:13:20 UTC
*** Bug 2033514 has been marked as a duplicate of this bug. ***

Comment 9 errata-xmlrpc 2022-03-10 16:40:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056