Bug 2222154 - User's instance couldn't finish the unshelve operation; we have a solution that needs urgent review. [NEEDINFO]
Summary: User's instance couldn't finish the unshelve operation; we have a solution that needs urgent review.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-12 02:28 UTC by jiehuang
Modified: 2023-08-08 16:45 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-08 16:45:54 UTC
Target Upstream Version:
Embargoed:
ihrachys: needinfo? (jiehuang)




Links:
Red Hat Issue Tracker OSP-26530 (last updated 2023-07-12 02:29:01 UTC)

Comment 15 ldenny 2023-07-13 02:24:58 UTC
Just a quick update; this customer issue has a few moving parts:

- Controller replacement was done on controller-0 (the bootstrap node)
- The controller replacement caused the Raft cluster to partition (now repaired)
- The compute node was shut down (still under investigation)
- Around 13-15 VMs were evacuated, but the evacuations failed because Neutron could not connect correctly to the OVN databases (still recovering)
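
For context, a quick way to sanity-check whether Neutron can actually reach the OVN databases is to read the configured endpoints and query the NB/SB databases directly over their TCP ports (6641 and 6642 by default). This is only a sketch; the address below is a placeholder for the real OVN DB endpoint in the customer environment, and container names/paths may differ per deployment:

# Configured endpoints (plugin.ini is the same file the sync commands in [2]/[3] use)
podman exec -it neutron_api grep -E 'ovn_(nb|sb)_connection' /etc/neutron/plugin.ini
# Query the NB and SB databases directly, from wherever the OVN client tools are available
ovn-nbctl --db=tcp:<ovn-db-address>:6641 show
ovn-sbctl --db=tcp:<ovn-db-address>:6642 show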

We have now recovered the Raft cluster. Yatin++ discovered that controller-0 was in a single-node cluster of its own, while controller-1 and controller-2 remained in the original cluster.
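
For anyone hitting a similar partition, the split is visible by asking each controller's OVN DB servers for their Raft cluster status and comparing the Cluster ID and member list. A minimal sketch, assuming the clustered OVN DB containers used with Raft on OSP 17 (container names and ctl socket paths may differ per deployment):

# Run on each controller and compare the output
podman exec ovn_cluster_north_db_server ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
podman exec ovn_cluster_south_db_server ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

A partitioned node reports a different Cluster ID and a member list that does not include the other controllers.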

I performed an operation [1] with the customer on a remote session this morning to have the replaced c0 join the original cluster with c1 and c2. In addition, we ran the Neutron OVN DB sync tool in log mode [2]; it reported 3 changes required, so repair mode was then used [3].
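
The authoritative steps are in [1]. Purely as an illustration of the general approach (not the exact procedure we ran, and with container, systemd unit, and path names that are assumptions and may differ per deployment), rejoining a replaced member typically looks like: on a healthy member, kick the stale c0 entry out of the Raft cluster; on c0, stop its DB server, move the stale clustered database file aside, and start it again so it joins the existing cluster instead of serving its own:

# On a healthy controller: note c0's stale server ID from cluster/status, then remove it
podman exec ovn_cluster_north_db_server ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound <stale-server-id>
# On controller-0: stop the DB container, set the stale DB file aside, start it again
systemctl stop tripleo_ovn_cluster_north_db_server
mv /var/lib/openvswitch/ovn/ovnnb_db.db /var/lib/openvswitch/ovn/ovnnb_db.db.stale
systemctl start tripleo_ovn_cluster_north_db_server
# Repeat the same steps for the southbound database (ovnsb_db.ctl / OVN_Southbound)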

So far we suspect this is a bug: when OpenStack is deployed, c0 is the bootstrap node, a single-node cluster is created, and then c1 and c2 join as they are deployed. When c0 was replaced, rather than joining the existing cluster it formed its own, causing a partition. The Raft algorithm protected us from any data loss, but as mentioned, manual intervention was required to repair the cluster.
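
For reference on the bootstrap-versus-join distinction: with ovn-ctl, a node started with only a cluster-local address bootstraps a brand-new single-node cluster, whereas a node that is also given a cluster-remote address joins the existing cluster at that address. The invocations below are only illustrative (placeholder addresses, not how TripleO actually wires this up), but they show why a replaced c0 that comes back up acting as "the bootstrap node" ends up in its own cluster:

# Bootstrap a new single-node NB cluster (what the original c0 legitimately does at deploy time)
ovn-ctl --db-nb-addr=<c0-ip> --db-nb-cluster-local-addr=<c0-ip> start_nb_ovsdb
# Join the existing NB cluster via a surviving member (what a replacement c0 should do)
ovn-ctl --db-nb-addr=<c0-ip> --db-nb-cluster-local-addr=<c0-ip> --db-nb-cluster-remote-addr=<c1-ip> start_nb_ovsdb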

Moving forward, we will continue recovering the customer's instances now that we have confirmed Neutron (and OVN) are working correctly. Next, we will continue the investigation into why the compute node rebooted.

So far, I believe we have 2 actions pending from this rhos-prio:
- Create a bug to handle replacing the bootstrap node - I will touch base with Terry
- Create a KCS article in the meantime for recovering from a Raft partition; the steps in [1] were a quick dump from Terry++ but some small changes are needed. I will do this ASAP and seek confirmation from Terry.

[1] https://docs.google.com/document/d/1KuzwcEko9feBp2Tx6w79TOK5iCaBmhDiyBO0MMH4SlA/edit

[2] podman exec -it neutron_api neutron-ovn-db-sync-util --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-server --ovn-neutron_sync_mode=log

[3] podman exec -it neutron_api neutron-ovn-db-sync-util --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-server --ovn-neutron_sync_mode=repair

