Bug 2222154 - User's instance couldn't finish the unshelve operation; we have a solution that needs urgent review. [NEEDINFO]
Summary: User's instance couldn't finish the unshelve operation; we have a solution that needs urgent review.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-12 02:28 UTC by jiehuang
Modified: 2023-08-08 16:45 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-08 16:45:54 UTC
Target Upstream Version:
Embargoed:
ihrachys: needinfo? (jiehuang)




Links:
Red Hat Issue Tracker OSP-26530 (last updated 2023-07-12 02:29:01 UTC)

Comment 15 ldenny 2023-07-13 02:24:58 UTC
Just a quick update; this customer issue has a few moving parts:

- Controller replacement was done on controller-0 (the bootstrap node)
- The controller replacement caused the Raft cluster to partition (now repaired)
- The compute node was shut down (still under investigation)
- Around 13-15 VMs were evacuated, but the evacuations failed because Neutron could not connect correctly to the OVN databases (still recovering)
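
For context, a quick way to sanity-check whether Neutron can actually reach the OVN databases is to read the configured endpoints and query the NB/SB databases directly over their TCP ports (6641 and 6642 by default). This is only a sketch; the address below is a placeholder for the real OVN DB endpoint in the customer environment, and container names/paths may differ per deployment:

# Configured endpoints (plugin.ini is the same file the sync commands in [2]/[3] use)
podman exec -it neutron_api grep -E 'ovn_(nb|sb)_connection' /etc/neutron/plugin.ini
# Query the NB and SB databases directly, from wherever the OVN client tools are available
ovn-nbctl --db=tcp:<ovn-db-address>:6641 show
ovn-sbctl --db=tcp:<ovn-db-address>:6642 show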

We have now recovered the Raft cluster. Yatin++ discovered that controller-0 was in a single-node cluster of its own, while controller-1 and controller-2 remained in the original cluster.
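
For anyone hitting a similar partition, the split is visible by asking each controller's OVN DB servers for their Raft cluster status and comparing the Cluster ID and member list. A minimal sketch, assuming the clustered OVN DB containers used with Raft on OSP 17 (container names and ctl socket paths may differ per deployment):

# Run on each controller and compare the output
podman exec ovn_cluster_north_db_server ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
podman exec ovn_cluster_south_db_server ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

A partitioned node reports a different Cluster ID and a member list that does not include the other controllers.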

I performed an operation [1] with the customer on a remote session this morning to have the replaced c0 join the original cluster with c1 and c2. In addition, we ran the Neutron OVN DB sync tool in log mode [2]; it reported 3 changes required, so repair mode was then used [3].
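
The authoritative steps are in [1]. Purely as an illustration of the general approach (not the exact procedure we ran, and with container, systemd unit, and path names that are assumptions and may differ per deployment), rejoining a replaced member typically looks like: on a healthy member, kick the stale c0 entry out of the Raft cluster; on c0, stop its DB server, move the stale clustered database file aside, and start it again so it joins the existing cluster instead of serving its own:

# On a healthy controller: note c0's stale server ID from cluster/status, then remove it
podman exec ovn_cluster_north_db_server ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound <stale-server-id>
# On controller-0: stop the DB container, set the stale DB file aside, start it again
systemctl stop tripleo_ovn_cluster_north_db_server
mv /var/lib/openvswitch/ovn/ovnnb_db.db /var/lib/openvswitch/ovn/ovnnb_db.db.stale
systemctl start tripleo_ovn_cluster_north_db_server
# Repeat the same steps for the southbound database (ovnsb_db.ctl / OVN_Southbound)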

So far we suspect this is a bug: when OpenStack is deployed, c0 is the bootstrap node, a single-node cluster is created, and then c1 and c2 join as they are deployed. When c0 was replaced, rather than joining the existing cluster it formed its own, causing a partition. The Raft algorithm protected us from any data loss, but as mentioned, manual intervention was required to repair the cluster.
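
For reference on the bootstrap-versus-join distinction: with ovn-ctl, a node started with only a cluster-local address bootstraps a brand-new single-node cluster, whereas a node that is also given a cluster-remote address joins the existing cluster at that address. The invocations below are only illustrative (placeholder addresses, not how TripleO actually wires this up), but they show why a replaced c0 that comes back up acting as "the bootstrap node" ends up in its own cluster:

# Bootstrap a new single-node NB cluster (what the original c0 legitimately does at deploy time)
ovn-ctl --db-nb-addr=<c0-ip> --db-nb-cluster-local-addr=<c0-ip> start_nb_ovsdb
# Join the existing NB cluster via a surviving member (what a replacement c0 should do)
ovn-ctl --db-nb-addr=<c0-ip> --db-nb-cluster-local-addr=<c0-ip> --db-nb-cluster-remote-addr=<c1-ip> start_nb_ovsdb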

Moving forward, we will continue recovering the customer's instances now that we have confirmed Neutron (and OVN) are working correctly. Next, we will continue the investigation into why the compute node rebooted.

So far, I believe we have 2 actions pending from this rhos-prio:
- Create a bug to handle replacing the bootstrap node - I will touch base with Terry
- Create a KCS article in the meantime for recovering from a Raft partition; the steps in [1] were a quick dump from Terry++ but some small changes are needed. I will do this ASAP and seek confirmation from Terry.

[1] https://docs.google.com/document/d/1KuzwcEko9feBp2Tx6w79TOK5iCaBmhDiyBO0MMH4SlA/edit

[2] podman exec -it neutron_api neutron-ovn-db-sync-util --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-server --ovn-neutron_sync_mode=log

[3] podman exec -it neutron_api neutron-ovn-db-sync-util --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-server --ovn-neutron_sync_mode=repair

