Description of problem:

When the bootstrap controller node is replaced, the OVN database cluster is left in a partitioned state. Using controller-0 (c0) as an example (full output at the bottom; IPs are from a lab system):

c0:
  Cluster ID: 0eea (0eea95da-5ff3-4719-9475-7bb479e7e07b)
  Servers:
    1143 (1143 at tcp:172.17.1.89:6643) (self) next_index=3 match_index=3

c1:
  Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)
  Servers:
    a7ec (a7ec at tcp:172.17.1.61:6643) next_index=71 match_index=70 last msg 3122 ms ago
    ba93 (ba93 at tcp:172.17.1.89:6643) next_index=71 match_index=70 last msg 79802 ms ago
    ce2b (ce2b at tcp:172.17.1.10:6643) (self) next_index=64 match_index=70

c2:
  Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)
  Servers:
    a7ec (a7ec at tcp:172.17.1.61:6643) (self)
    ba93 (ba93 at tcp:172.17.1.89:6643)
    ce2b (ce2b at tcp:172.17.1.10:6643) last msg 796 ms ago

What we can see is that c0 is in a single-node cluster of its own (Cluster ID 0eea), while c1 and c2 are still in the old cluster (Cluster ID b2c9) with a stale entry (ba93) for c0's address.

I am using the northbound database as the example, but this affects both the northbound and southbound databases in the same way.

We configure all clients (ovn_controller, ovn_meta, neutron_api, etc.) to connect via an array of all 3 servers, as illustrated below. With this partition, some requests route to the empty, partitioned cluster and fail in odd ways: the Neutron server can fail to find NB/SB resources when it happens to connect to the new single-node NB/SB cluster, and VM create/shelve/migration failures can be seen when ovn-controller on a compute node connects to this single-node cluster (that reconnection generally happens after an ovn-controller restart or a compute node reboot).

We need to avoid bootstrapping a new cluster during controller replacement. One approach would be for the node to first check whether an existing OVN Raft cluster is reachable and, if so, join it rather than bootstrapping its own; a sketch of that idea follows below.

The behavior for non-bootstrap controller (e.g. c1/c2) replacement is not yet verified, but it seems just running "cluster/kick" before deploying the new controller node might be enough. If so, this would only require documentation changes. This needs verification to confirm whether that alone is sufficient or whether other steps are required.
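For illustration, the multi-server connection strings the clients end up with look roughly like this (values assumed from the lab IPs above, with 6641/6642 as the standard NB/SB client ports; ovn_nb_connection/ovn_sb_connection are the Neutron ML2/OVN options):

```
[ovn]
ovn_nb_connection = tcp:172.17.1.10:6641,tcp:172.17.1.61:6641,tcp:172.17.1.89:6641
ovn_sb_connection = tcp:172.17.1.10:6642,tcp:172.17.1.61:6642,tcp:172.17.1.89:6642
```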
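And a minimal sketch of the check-then-join idea proposed above, assuming the lab paths/IPs; in a real deployment the peer list would come from the deployment configuration, and this is not the shipped implementation:

```bash
#!/bin/bash
# Hypothetical sketch: before first start-up of the NB DB server on a
# (re)deployed controller, probe the other configured peers for a live
# Raft cluster and join it instead of bootstrapping a new one.
DB=/var/lib/ovn/ovnnb_db.db
SCHEMA=/usr/share/ovn/ovn-nb.ovsschema
LOCAL_ADDR=tcp:172.17.1.89:6643      # this node's Raft address
PEER_IPS="172.17.1.10 172.17.1.61"   # the other controllers
RAFT_PORT=6643

# Probe the peers: a bare TCP connect is enough for a sketch; a production
# check would also confirm the remote Cluster ID via cluster/status.
live_peers=""
for ip in $PEER_IPS; do
    if timeout 2 bash -c "true >/dev/tcp/$ip/$RAFT_PORT" 2>/dev/null; then
        live_peers="$live_peers tcp:$ip:$RAFT_PORT"
    fi
done

if [ ! -e "$DB" ]; then
    if [ -n "$live_peers" ]; then
        # A cluster already exists: create a DB file that joins it
        # instead of bootstrapping a new Cluster ID.
        ovsdb-tool join-cluster "$DB" OVN_Northbound "$LOCAL_ADDR" $live_peers
    else
        # No peer reachable: this node really is the bootstrap server.
        ovsdb-tool create-cluster "$DB" "$SCHEMA" "$LOCAL_ADDR"
    fi
fi
# The DB server container is then started as usual (e.g. the
# tripleo_ovn_cluster_north_db_server unit).
```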
Version-Release number of selected component (if applicable):

[root@controller-0 ~]# podman exec ovn_cluster_north_db_server rpm -qa | egrep 'ovn|ovs|neutron'
ovn22.03-22.03.0-69.el9fdp.x86_64
rhosp-ovn-22.03-5.el9ost.noarch
ovn22.03-central-22.03.0-69.el9fdp.x86_64
rhosp-ovn-central-22.03-5.el9ost.noarch

[root@controller-0 ~]# rpm -qa | egrep 'ovn|ovs|neutron'
python3-neutronclient-7.3.0-0.20220707060727.4963c7a.el9ost.noarch
puppet-ovn-18.5.0-0.20220218021734.d496e5a.el9ost.noarch
puppet-neutron-18.5.1-0.20220714150330.3bdf311.el9ost.noarch

[root@controller-0 ~]# cat /etc/rhosp-release
Red Hat OpenStack Platform release 17.0.0 (Wallaby)

[root@controller-0 ~]# podman images | egrep 'ovn|ovs|neutron'
undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-neutron-server   17.0_20220908.1
undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-nova-novncproxy  17.0_20220908.1
undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-ovn-nb-db-server 17.0_20220908.1
undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-ovn-sb-db-server 17.0_20220908.1
undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-ovn-controller   17.0_20220908.1
undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-ovn-northd       17.0_20220908.1

How reproducible:
Every time

Steps to Reproduce:

To reproduce the single-node cluster on c0 and the 3-node cluster on c1 and c2 with a stale entry for c0, on c0 run:

```bash
podman exec ovn_cluster_north_db_server rm /var/run/ovn/ovnnb_db.db /var/lib/ovn/.ovnnb_db.db.~lock~
systemctl restart tripleo_ovn_cluster_north_db_server
sleep 3
podman exec ovn_cluster_north_db_server ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
```

To fix, on c1 or c2 run:

```bash
podman exec ovn_cluster_north_db_server ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound <server id of c0>
```

then on c0 run:

```bash
podman exec ovn_cluster_north_db_server rm /var/run/ovn/ovnnb_db.db /var/lib/ovn/.ovnnb_db.db.~lock~
podman exec ovn_cluster_north_db_server ovsdb-tool join-cluster /var/lib/ovn/ovnnb_db.db OVN_Northbound tcp:172.17.1.89:6643 tcp:172.17.1.10:6643
systemctl restart tripleo_ovn_cluster_north_db_server
```

Actual results:
The bootstrap controller is replaced and creates its own single-node cluster, partitioning itself from the original.

Expected results:
The bootstrap controller is replaced and connects to the existing OVN database cluster.
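Whether the manual fix above worked (and, once the expected behavior is implemented, whether a replaced node actually joined) can be checked by comparing the Cluster ID each member reports; all three controllers should print the same ID:

```bash
# Run on each controller: every node must report the same Cluster ID
# (b2c9... in the lab output below), with no second ID such as 0eea....
podman exec ovn_cluster_north_db_server ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound | grep 'Cluster ID'
```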
Additional info:

[root@controller-0 ~]# podman exec ovn_cluster_north_db_server rm /var/run/ovn/ovnnb_db.db /var/lib/ovn/.ovnnb_db.db.~lock~
[root@controller-0 ~]# systemctl restart tripleo_ovn_cluster_north_db_server
[root@controller-0 ~]# sleep 3
[root@controller-0 ~]# podman exec ovn_cluster_north_db_server ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
1143
Name: OVN_Northbound
Cluster ID: 0eea (0eea95da-5ff3-4719-9475-7bb479e7e07b)
Server ID: 1143 (1143c224-6a6b-4e0e-9ca1-b24390e44555)
Address: tcp:172.17.1.89:6643
Status: cluster member
Role: leader
Term: 2
Leader: self
Vote: self

Last Election started 2908 ms ago, reason: timeout
Last Election won: 2908 ms ago
Election timer: 10000
Log: [2, 4]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: <-0000 <-0000
Disconnections: 0
Servers:
    1143 (1143 at tcp:172.17.1.89:6643) (self) next_index=3 match_index=3

[root@controller-1 ~]# podman exec ovn_cluster_north_db_server ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
ce2b
Name: OVN_Northbound
Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)
Server ID: ce2b (ce2be8c7-5536-4c0f-a3a8-4ee17fc202f8)
Address: tcp:172.17.1.10:6643
Status: cluster member
Role: leader
Term: 33
Leader: self
Vote: self

Last Election started 1565452 ms ago, reason: timeout
Last Election won: 1565448 ms ago
Election timer: 10000
Log: [59, 71]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->a7ec <-a7ec ->ba93
Disconnections: 11
Servers:
    a7ec (a7ec at tcp:172.17.1.61:6643) next_index=71 match_index=70 last msg 3122 ms ago
    ba93 (ba93 at tcp:172.17.1.89:6643) next_index=71 match_index=70 last msg 79802 ms ago
    ce2b (ce2b at tcp:172.17.1.10:6643) (self) next_index=64 match_index=70

[root@controller-2 ~]# podman exec ovn_cluster_north_db_server ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
a7ec
Name: OVN_Northbound
Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)
Server ID: a7ec (a7ec9d49-2281-4811-89e1-ac50f534ad56)
Address: tcp:172.17.1.61:6643
Status: cluster member
Role: follower
Term: 33
Leader: ce2b
Vote: ce2b

Last Election started 98864902 ms ago, reason: leadership_transfer
Last Election won: 98864894 ms ago
Election timer: 10000
Log: [63, 71]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->ce2b <-ce2b ->ba93
Disconnections: 15
Servers:
    a7ec (a7ec at tcp:172.17.1.61:6643) (self)
    ba93 (ba93 at tcp:172.17.1.89:6643)
    ce2b (ce2b at tcp:172.17.1.10:6643) last msg 796 ms ago
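Since the southbound database partitions the same way, the equivalent manual recovery for it would presumably be the following. This is an assumption based on the NB steps above (container ovn_cluster_south_db_server, socket /var/run/ovn/ovnsb_db.ctl, and 6644 as the SB Raft port); verify the names and ports against the actual deployment before use:

```bash
# On c1 or c2: drop the stale SB entry for c0 (assumed SB names/port)
podman exec ovn_cluster_south_db_server ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick OVN_Southbound <server id of c0>

# On c0: remove the bootstrapped SB database and rejoin the existing cluster
podman exec ovn_cluster_south_db_server rm /var/run/ovn/ovnsb_db.db /var/lib/ovn/.ovnsb_db.db.~lock~
podman exec ovn_cluster_south_db_server ovsdb-tool join-cluster /var/lib/ovn/ovnsb_db.db OVN_Southbound tcp:172.17.1.89:6644 tcp:172.17.1.10:6644
systemctl restart tripleo_ovn_cluster_south_db_server
```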
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1.1 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:5138
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days