Bug 2141066 - Possible OVN scale issue - help needed determining the issue
Summary: Possible OVN scale issue - help needed determining the issue
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn22.12
Version: FDP 22.L
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OVN Team
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-11-08 16:38 UTC by Andreas Karis
Modified: 2023-07-28 17:51 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-28 17:51:18 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-2440 0 None None None 2022-11-08 16:44:15 UTC

Description Andreas Karis 2022-11-08 16:38:49 UTC
Description of problem:
This is an OVN clone/fork of https://issues.redhat.com/browse/OCPBUGS-3020
The customer case is 03334226 (data is attached there and can be retrieved with support-shell)

On the customer case, the ovn-dbs can be found in:
tmp.lxlYQSQfbN.tar.gz latest dbs
ovnb_db.rar dbs before the last db rebuild

The databases can be extracted from the archives and must then be converted with dos2unix before they can be analyzed locally, e.g.:
~~~
dos2unix ovnsb_db.db.ovnkube-master-vnlj5
~~~
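If the archive contains several database files, the conversion can be looped. This is only a minimal sketch: the helper name `convert_dbs` and the glob `ovn*_db.db*` are assumptions derived from the single file name shown above, and the `tr` fallback is a cruder substitute for dos2unix (it removes every carriage return, not only those at line ends).

```shell
# Hypothetical helper: convert every extracted OVN database file in a directory.
# The glob 'ovn*_db.db*' is an assumption based on the one file name shown above.
convert_dbs() {
    dir="${1:-.}"
    for f in "$dir"/ovn*_db.db*; do
        [ -f "$f" ] || continue              # nothing matched the glob
        if command -v dos2unix >/dev/null 2>&1; then
            dos2unix "$f"
        else
            # Fallback (assumption): drop all CR bytes when dos2unix is missing.
            tmp="$f.tmp.$$"
            tr -d '\r' < "$f" > "$tmp" && mv "$tmp" "$f"
        fi
    done
}
```

Usage would be `convert_dbs /path/to/extracted/dbs`, where the path is wherever the archives were unpacked.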

After converting the dbs and loading them into a container that's running the databases, I can have a look at them (https://github.com/andreaskaris/ovn-trace-container):
~~~
[root@c767284545bf /]# ovn-sbctl list Logical_Flow | grep uuid | wc -l
2022-11-08T16:35:36Z|00001|ovsdb_idl|WARN|Logical_Flow table in OVN_Southbound database lacks controller_meter column (database needs upgrade?)
2022-11-08T16:35:36Z|00002|ovsdb_idl|WARN|Logical_Flow table in OVN_Southbound database lacks tags column (database needs upgrade?)
74647
[root@c767284545bf /]# ovn-sbctl list Port_Binding | grep uuid | wc -l
2022-11-08T16:35:42Z|00001|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks additional_chassis column (database needs upgrade?)
2022-11-08T16:35:42Z|00002|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks additional_encap column (database needs upgrade?)
2022-11-08T16:35:42Z|00003|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks port_security column (database needs upgrade?)
2022-11-08T16:35:42Z|00004|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks requested_additional_chassis column (database needs upgrade?)
2022-11-08T16:35:42Z|00005|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks requested_chassis column (database needs upgrade?)
3839
[root@c767284545bf /]# ovn-sbctl list Logical_DP_Group | grep uuid | wc -l
125
[root@c767284545bf /]# ovn-sbctl list Port_Group | grep uuid | wc -l
618
[root@c767284545bf /]# 
[root@c767284545bf /]# 
[root@c767284545bf /]# ovn-sbctl list Load_Balancer | grep uuid | wc -l
933
[root@c767284545bf /]# ovn-sbctl dump-flows | wc -l
246758
~~~
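The per-table counts above can be collected in one pass. The following is a sketch, not a tested procedure for this case: `count_rows` is a hypothetical helper, it anchors on lines starting with `_uuid` (so it may report slightly different numbers than the plain `grep uuid` used above if other columns contain the string "uuid"), and it assumes the schema warnings are written to stderr.

```shell
# Hypothetical helper: count records in `ovn-sbctl list <Table>` output on stdin.
# Each record is assumed to begin with a line starting with "_uuid".
count_rows() {
    grep -c '^_uuid' || true    # grep -c prints 0 but exits non-zero on no match
}

# Only query the database if ovn-sbctl is actually available.
if command -v ovn-sbctl >/dev/null 2>&1; then
    # Table names are the ones queried above.
    for table in Logical_Flow Port_Binding Logical_DP_Group Port_Group Load_Balancer; do
        # Discard stderr so the "database needs upgrade?" warnings do not clutter the output.
        n=$(ovn-sbctl list "$table" 2>/dev/null | count_rows)
        printf '%-20s %s\n' "$table" "$n"
    done
fi
```

This prints one line per table with its row count, which makes it easier to compare the "latest" dbs against the dbs from before the last rebuild.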


Comment 1 Andreas Karis 2022-11-08 16:41:06 UTC
I went through the case history. On October 14th, the cluster only had ca. 130 nodes; at that point, RH support recommended an upgrade of the master nodes, and they also defragmented the etcd db.
On October 20th, we see that the cluster is in much better shape.
Around that time (October 19th / 20th), a massive addition of nodes happened, more than doubling the cluster's size from ~130 nodes to over 300 nodes. Ever since then, the cluster could not be recovered; we therefore suspect that this is a scale issue.

Comment 4 Andreas Karis 2022-11-08 17:09:50 UTC
$ omg get clusterversion
NAME     VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
version  4.8.23   True       False        9m40s  Error while reconciling 4.8.23: some cluster operators have not yet rolled out

https://github.com/openshift/ovn-kubernetes/blob/b5183e8b7b7b9551600dea317bf5c212db0cf4e6/Dockerfile#L36
ARG ovnver=20.12.0-183.el8fdp

Comment 10 Mark Michelson 2023-07-28 17:51:18 UTC
I'm closing this since there has been no activity for over 8 months and the customer issue is closed.

