Bug 2007009

Summary: 120 node OVNK cluster is not stable after cluster-density 1000 projects
Product: OpenShift Container Platform
Reporter: Mohit Sheth <msheth>
Component: Networking
Assignee: Tim Rozet <trozet>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
CC: bpickard, trozet
Version: 4.9   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: perfscale-ovn
Last Closed: 2021-09-29 15:30:23 UTC
Type: Bug

Description Mohit Sheth 2021-09-22 19:36:32 UTC
Description of problem:
Running cluster-density with 1000 projects successfully at 120-node scale is one of the scale targets for OVN as the default SDN.
Currently the cluster is not stable when we run this test. At this point we cannot even get a must-gather pod running.

-------------------------------------------------------------------------
ovnkube-master-5scpt   5/6     CrashLoopBackOff   11 (80s ago)    5h41m
ovnkube-master-h2p9p   6/6     Running            5 (99m ago)     5h44m
ovnkube-master-lpv5m   6/6     Running            7 (94m ago)     5h46m
-------------------------------------------------------------------------

-------------------------------------------------------------------------
  Warning  Unhealthy     88m                  kubelet          Readiness probe failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
++ grep 'Leader: unknown'
++ true
+ leader_status=
  Warning  Unhealthy  88m (x2 over 88m)  kubelet  Readiness probe failed: NB DB Raft leader is unknown to the cluster node.
++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
++ grep 'Leader: unknown'
+ leader_status='Leader: unknown'
+ [[ ! -z Leader: unknown ]]
+ echo 'NB DB Raft leader is unknown to the cluster node.'
+ exit 1
-------------------------------------------------------------------------
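
For triage, the check traced above can be sketched in Python. This is only an illustration of the logic visible in the trace (run `cluster/status`, look for `Leader: unknown`, fail if found), not the actual probe script shipped in the image. The function names `leader_is_unknown` and `probe` are hypothetical, and the sample status strings in the usage note are abbreviated stand-ins rather than captured output.

```python
import subprocess

# Marker the traced probe greps for in the cluster/status output.
UNKNOWN_MARKER = "Leader: unknown"


def leader_is_unknown(status_output: str) -> bool:
    """Return True if any line of the status text reports an unknown Raft leader."""
    return any(UNKNOWN_MARKER in line for line in status_output.splitlines())


def probe(ctl: str = "/var/run/ovn/ovnnb_db.ctl", db: str = "OVN_Northbound") -> int:
    """Mirror the traced check: return 1 if the leader is unknown, else 0.

    The command line is copied from the trace; it needs ovn-appctl on the host.
    """
    result = subprocess.run(
        ["/usr/bin/ovn-appctl", "-t", ctl, "--timeout=3", "cluster/status", db],
        capture_output=True,
        text=True,
        check=False,
    )
    if leader_is_unknown(result.stdout):
        print("NB DB Raft leader is unknown to the cluster node.")
        return 1
    return 0
```

`leader_is_unknown("Leader: unknown")` returns True, while a status text without that marker returns False. Only `probe()` touches the system, so the parsing can be exercised without an OVN installation.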

-------------------------------------------------------------------------
F0922 19:01:19.303715       1 ovnkube.go:130] error when trying to initialize go-ovn NB client: couldn't initialize NBDB client: error creating SSL OVNDBClient for database OVN_Northbound at address ssl:10.0.144.235:9641,ssl:10.0.164.234:9641,ssl:10.0.197.135:9641: failed to connec
-------------------------------------------------------------------------
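
A quick way to see which of the three NB DB endpoints in the error above accepts TCP connections is sketched below. This is a rough aid under stated assumptions: `parse_remotes` and `reachable` are hypothetical helper names, only IPv4 `scheme:host:port` entries are handled, and a successful TCP connect says nothing about the SSL handshake or Raft state.

```python
import socket


def parse_remotes(remotes: str):
    """Split 'ssl:host:port,ssl:host:port' into (scheme, host, port) tuples.

    Assumes IPv4 addresses, as in the log line above.
    """
    parsed = []
    for remote in remotes.split(","):
        scheme, host, port = remote.strip().split(":")
        parsed.append((scheme, host, int(port)))
    return parsed


def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Address string copied from the ovnkube-master fatal log line.
    addrs = "ssl:10.0.144.235:9641,ssl:10.0.164.234:9641,ssl:10.0.197.135:9641"
    for scheme, host, port in parse_remotes(addrs):
        print(f"{scheme}:{host}:{port} reachable={reachable(host, port)}")
```

Run from a node or pod with access to the cluster network; the parsing step works anywhere.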


Version-Release number of selected component (if applicable):
4.9

Comment 2 Tim Rozet 2021-09-29 15:30:23 UTC
In the latest scale run, the cluster was stable after the 120-node cluster-density test. The fixes in bug 1959352 will resolve this issue.

*** This bug has been marked as a duplicate of bug 1959352 ***