Bug 2141066

Summary: Possible OVN scale issue - help needed determining the issue
Product: Red Hat Enterprise Linux Fast Datapath Reporter: Andreas Karis <akaris>
Component: ovn22.12 Assignee: OVN Team <ovnteam>
Status: CLOSED WONTFIX QA Contact: Jianlin Shi <jishi>
Severity: high Docs Contact:
Priority: high    
Version: FDP 22.L CC: ctrautma, jiji, mmichels, pjagtap, skharat
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-07-28 17:51:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Andreas Karis 2022-11-08 16:38:49 UTC
Description of problem:
This is an OVN clone/fork of https://issues.redhat.com/browse/OCPBUGS-3020
The customer case is 03334226 (data is attached there and can be retrieved with support-shell)

On the customer case, the ovn-dbs can be found in:
tmp.lxlYQSQfbN.tar.gz - latest dbs
ovnb_db.rar - dbs before the last db rebuild

The databases can be extracted from the archives and must then be converted with dos2unix before they can be analyzed locally, e.g.:
~~~
dos2unix ovnsb_db.db.ovnkube-master-vnlj5
~~~
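
To convert all extracted files in one go instead of one at a time, something like the following might work. This is only a sketch: the function name `convert_dbs` and the `*_db.db*` glob are assumptions based on the file name shown above, and it only calls dos2unix on files that still contain CRLF line endings.

```shell
# Convert every extracted OVN database file that still has CRLF line endings.
# Assumption: the files sit in one directory and match *_db.db*
# (as in ovnsb_db.db.ovnkube-master-vnlj5); adjust the glob if they do not.
convert_dbs() {
    db_dir="${1:-.}"
    cr=$(printf '\r')                          # literal carriage return
    for db_file in "$db_dir"/*_db.db*; do
        [ -f "$db_file" ] || continue          # glob matched nothing
        if grep -q "${cr}\$" "$db_file"; then  # at least one line ends in CR
            dos2unix "$db_file"
        fi
    done
}
```

Usage, assuming the archives were extracted into the current directory: `convert_dbs .`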

After converting the dbs and loading them into a container that runs the databases (https://github.com/andreaskaris/ovn-trace-container), I can have a look at them:
~~~
[root@c767284545bf /]# ovn-sbctl list Logical_Flow | grep uuid | wc -l
2022-11-08T16:35:36Z|00001|ovsdb_idl|WARN|Logical_Flow table in OVN_Southbound database lacks controller_meter column (database needs upgrade?)
2022-11-08T16:35:36Z|00002|ovsdb_idl|WARN|Logical_Flow table in OVN_Southbound database lacks tags column (database needs upgrade?)
74647
[root@c767284545bf /]# ovn-sbctl list Port_Binding | grep uuid | wc -l
2022-11-08T16:35:42Z|00001|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks additional_chassis column (database needs upgrade?)
2022-11-08T16:35:42Z|00002|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks additional_encap column (database needs upgrade?)
2022-11-08T16:35:42Z|00003|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks port_security column (database needs upgrade?)
2022-11-08T16:35:42Z|00004|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks requested_additional_chassis column (database needs upgrade?)
2022-11-08T16:35:42Z|00005|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks requested_chassis column (database needs upgrade?)
3839
[root@c767284545bf /]# ovn-sbctl list Logical_DP_Group | grep uuid | wc -l
125
[root@c767284545bf /]# ovn-sbctl list Port_Group | grep uuid | wc -l
618
[root@c767284545bf /]# 
[root@c767284545bf /]# 
[root@c767284545bf /]# ovn-sbctl list Load_Balancer | grep uuid | wc -l
933
[root@c767284545bf /]# ovn-sbctl dump-flows | wc -l
246758
~~~
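
The per-table counts above could also be collected with a small helper. This is a sketch under some assumptions: `count_records` and `sb_table_counts` are hypothetical names, ovn-sbctl is on the PATH and can reach the southbound database, and the schema warnings go to stderr (which the transcript suggests, since they do not affect the `wc -l` results). Matching `_uuid` at the start of the line, rather than any line containing "uuid", avoids overcounting if a column value happens to contain that string.

```shell
# Count records in `ovn-sbctl list <table>` output read from stdin.
# Each record starts with a line for its `_uuid` column, so anchoring the
# match at the start of the line counts exactly one line per record.
count_records() {
    grep -c '^_uuid'
}

# Print one "<table> <count>" line per table name given as an argument.
# Assumption: ovn-sbctl is installed and connected to the southbound db.
sb_table_counts() {
    for table in "$@"; do
        printf '%s %s\n' "$table" \
            "$(ovn-sbctl list "$table" 2>/dev/null | count_records)"
    done
}
```

Usage: `sb_table_counts Logical_Flow Port_Binding Logical_DP_Group Port_Group Load_Balancer`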

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Andreas Karis 2022-11-08 16:41:06 UTC
I went through the case history. On October 14th, the cluster only had ca. 130 nodes; at that point, RH support recommended an upgrade of the master nodes and they also defrag'ed the etcd db.
On October 20th, we see that the cluster is in much better shape.
Around that time (October 19th / 20th), a massive addition of nodes happens, more than doubling the cluster's size from ~130 nodes to over 300 nodes. Ever since then, the cluster could not be recovered; we therefore suspect that this is a scale issue.

Comment 4 Andreas Karis 2022-11-08 17:09:50 UTC
$ omg get clusterversion
NAME     VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
version  4.8.23   True       False        9m40s  Error while reconciling 4.8.23: some cluster operators have not yet rolled out

https://github.com/openshift/ovn-kubernetes/blob/b5183e8b7b7b9551600dea317bf5c212db0cf4e6/Dockerfile#L36
ARG ovnver=20.12.0-183.el8fdp

Comment 10 Mark Michelson 2023-07-28 17:51:18 UTC
I'm closing this since there has been no activity for over 8 months and the customer issue is closed.