Description of problem:

This is an OVN clone/fork of https://issues.redhat.com/browse/OCPBUGS-3020

The customer case is 03334226 (data is attached there and can be retrieved with support-shell).

On the customer case, the ovn-dbs can be found in:
tmp.lxlYQSQfbN.tar.gz    latest dbs
ovnb_db.rar    dbs before the last db rebuild

The databases can be extracted from the archives and must then be converted with dos2unix before they can be analyzed locally, e.g.:
~~~
dos2unix ovnsb_db.db.ovnkube-master-vnlj5
~~~

After converting the dbs and loading them into a container that's running the databases, I can then have a look at them (https://github.com/andreaskaris/ovn-trace-container):
~~~
[root@c767284545bf /]# ovn-sbctl list Logical_Flow | grep uuid | wc -l
2022-11-08T16:35:36Z|00001|ovsdb_idl|WARN|Logical_Flow table in OVN_Southbound database lacks controller_meter column (database needs upgrade?)
2022-11-08T16:35:36Z|00002|ovsdb_idl|WARN|Logical_Flow table in OVN_Southbound database lacks tags column (database needs upgrade?)
74647
[root@c767284545bf /]# ovn-sbctl list Port_Binding | grep uuid | wc -l
2022-11-08T16:35:42Z|00001|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks additional_chassis column (database needs upgrade?)
2022-11-08T16:35:42Z|00002|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks additional_encap column (database needs upgrade?)
2022-11-08T16:35:42Z|00003|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks port_security column (database needs upgrade?)
2022-11-08T16:35:42Z|00004|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks requested_additional_chassis column (database needs upgrade?)
2022-11-08T16:35:42Z|00005|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks requested_chassis column (database needs upgrade?)
3839
[root@c767284545bf /]# ovn-sbctl list Logical_DP_Group | grep uuid | wc -l
125
[root@c767284545bf /]# ovn-sbctl list Port_Group | grep uuid | wc -l
618
[root@c767284545bf /]#
[root@c767284545bf /]#
[root@c767284545bf /]# ovn-sbctl list Load_Balancer | grep uuid | wc -l
933
[root@c767284545bf /]# ovn-sbctl dump-flows | wc -l
246758
~~~

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
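For anyone repeating the analysis, the per-table row counts above can be gathered in one pass. This is only a sketch, not what was run on the case: it assumes the `ovsdb_idl` schema warnings go to stderr (so `2>/dev/null` hides them), and it counts the non-empty lines of `--bare --columns=_uuid` output instead of grepping for `uuid`. The `SBCTL` variable and the `count_rows` helper are names made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: count rows per OVN Southbound table inside the analysis container.
# SBCTL is a hypothetical override so the client binary can be swapped out.
SBCTL="${SBCTL:-ovn-sbctl}"

count_rows() {
    # --bare --columns=_uuid prints one UUID per record, with blank lines
    # between records; grep -c . counts only the non-empty lines.
    # stderr is dropped on the assumption that the schema warnings land there.
    "$SBCTL" --bare --columns=_uuid list "$1" 2>/dev/null | grep -c .
}

for table in Logical_Flow Port_Binding Logical_DP_Group Port_Group Load_Balancer; do
    printf '%-20s %s\n' "$table" "$(count_rows "$table")"
done
```

Counting `_uuid` values directly avoids accidentally matching other columns or warning lines that happen to contain the string `uuid`.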
I went through the case history. On October 14th, the cluster only had ca. 130 nodes; at that point, RH support recommended an upgrade of the master nodes, and they also defragmented the etcd db. On October 20th, we see that the cluster is in much better shape. Around that time (October 19th / 20th), a massive addition of nodes happens, more than doubling the cluster's size from ~130 nodes to over 300 nodes. Ever since then, the cluster could not be recovered; we therefore expect that this is a scale issue.
$ omg get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.23    True        False         9m40s   Error while reconciling 4.8.23: some cluster operators have not yet rolled out

https://github.com/openshift/ovn-kubernetes/blob/b5183e8b7b7b9551600dea317bf5c212db0cf4e6/Dockerfile#L36
ARG ovnver=20.12.0-183.el8fdp
I'm closing this since there has been no activity for over 8 months and the customer issue is closed.