Just a quick update; this customer issue has a few moving parts:
- Controller replacement was done on controller-0 (the bootstrap node)
- The replacement caused the OVN Raft cluster to partition (since repaired)
- A compute node was shut down (still under investigation)
- Evacuation was attempted for around 13-15 VMs but failed because Neutron could not connect correctly to the OVN databases (still recovering; a quick connectivity check is sketched below)
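For reference, a quick way to confirm whether Neutron can actually reach the OVN databases is to read the configured connection string and probe it directly. This is only a sketch; the config path, container name, and DB address are assumptions from a typical director-deployed environment and may differ:

    # Which address is Neutron configured to use for the OVN NB database?
    # (config path is a director/TripleO assumption)
    sudo crudini --get \
      /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/ml2/ml2_conf.ini \
      ovn ovn_nb_connection
    # Probe that address directly; a healthy DB answers within the timeout
    # (container name and address are placeholders)
    sudo podman exec ovn_cluster_north_db_server \
      ovn-nbctl --db=tcp:<nb-db-address>:6641 --timeout=5 show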
We have now recovered the Raft cluster. Yatin++ discovered that controller-0 was in a single-node cluster of its own, while controller-1 and controller-2 were still in the original cluster.
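For anyone following along, the partition is visible directly in the Raft status output. A minimal sketch, assuming the clustered NB database runs in a container named ovn_cluster_north_db_server (names and socket paths may differ per deployment):

    # Run on each controller and compare: the partitioned c0 lists only
    # itself as a member, while c1 and c2 list each other
    sudo podman exec ovn_cluster_north_db_server \
      ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound

The same check applies to the Southbound database via ovnsb_db.ctl and OVN_Southbound.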
I performed an operation [1] with the customer on a remote session this morning to have the replaced c0 join the original cluster with c1 and c2. In addition, we ran the Neutron OVN DB sync tool in log mode [2]; it reported 3 required changes, so repair mode was then used [3].
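The authoritative steps are in [1]; purely to illustrate the shape of the operation (container names, file paths, and the server ID below are assumptions, not copied from [1]):

    # On the replaced c0: stop the OVN DB servers and remove the stale
    # single-node cluster databases
    sudo podman stop ovn_cluster_north_db_server ovn_cluster_south_db_server
    sudo rm /var/lib/openvswitch/ovn/ovnnb_db.db /var/lib/openvswitch/ovn/ovnsb_db.db
    # On c1: kick any stale c0 member still listed in cluster/status
    sudo podman exec ovn_cluster_north_db_server \
      ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound <stale-server-id>
    # Back on c0: start the DB servers; the service's startup logic should
    # rebuild the local DB by joining the existing cluster on c1/c2
    sudo podman start ovn_cluster_north_db_server ovn_cluster_south_db_server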
So far we suspect this is a bug. When OpenStack is deployed, c0 is the bootstrap node: a single-node cluster is created, then c1 and c2 join it as they are deployed. When c0 was replaced, rather than joining the existing cluster it formed its own, partitioning the cluster. The Raft algorithm protected us from any data loss, but as mentioned, manual intervention was required to repair it.
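The suspected split is also easy to confirm from the database files themselves, since each Raft member records the cluster ID it belongs to. Assuming the default database path (it may be /var/lib/ovn/ on other releases):

    # Run on each controller and compare: if c0 bootstrapped its own cluster,
    # it prints a different cluster ID (CID) than c1 and c2
    sudo ovsdb-tool db-cid /var/lib/openvswitch/ovn/ovnnb_db.db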
Moving forward, we will continue recovering the customer's instances now that we have confirmed Neutron (and OVN) are working correctly; a sketch of that workflow is below. Next, we will continue the investigation into why the compute node rebooted.
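For the instance recovery, the usual pattern is to list what was hosted on the affected compute and evacuate per instance; a sketch, with the hostname and UUID as placeholders:

    # List instances that were hosted on the affected compute node
    openstack server list --all-projects --host <compute-hostname> --long
    # Evacuate a single instance now that Neutron/OVN are confirmed healthy
    nova evacuate <instance-uuid>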
So far, I believe we have 2 actions pending from this rhos-prio:
- Create a bug to handle replacing the bootstrap node; I will touch base with Terry
- Create a KCS in the meantime for recovering from a Raft partition. The steps in [1] were a quick dump from Terry++, but some small changes are needed; I will do this ASAP and seek confirmation from Terry.
[1] https://docs.google.com/document/d/1KuzwcEko9feBp2Tx6w79TOK5iCaBmhDiyBO0MMH4SlA/edit
[2] podman exec -it neutron_api neutron-ovn-db-sync-util \
      --config-file /usr/share/neutron/neutron-dist.conf \
      --config-dir /usr/share/neutron/server \
      --config-file /etc/neutron/neutron.conf \
      --config-file /etc/neutron/plugin.ini \
      --config-dir /etc/neutron/conf.d/common \
      --config-dir /etc/neutron/conf.d/neutron-server \
      --ovn-neutron_sync_mode=log
[3] podman exec -it neutron_api neutron-ovn-db-sync-util \
      --config-file /usr/share/neutron/neutron-dist.conf \
      --config-dir /usr/share/neutron/server \
      --config-file /etc/neutron/neutron.conf \
      --config-file /etc/neutron/plugin.ini \
      --config-dir /etc/neutron/conf.d/common \
      --config-dir /etc/neutron/conf.d/neutron-server \
      --ovn-neutron_sync_mode=repair