1823324 – [QE-Tracker][RFE][IMP] The OVN migration code should revise with revert plan. [Neutron&NFV use cases]

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-neutron
Sub Component:
Version:	17.0 (Wallaby)
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	z2
Target Release:	17.1
Assignee:	Arnau Verdaguer
QA Contact:	Roman Safronov
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1948579 2025910
Depends On:	1792500 1818866 2144492 2216778 2222624 2223350
Blocks:	2155253 2210773

Reported:	2020-04-13 10:33 UTC by Pradipta Kumar Sahoo
Modified:	2024-01-05 11:40 UTC
CC List:	22 users
Fixed In Version:	openstack-neutron-18.6.1-1.20230221161409.94c2c92.el9ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-01-05 11:40:01 UTC
Target Upstream Version:
Embargoed:
Flags:	gurpsing: needinfo-

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	432423	None	MERGED	Skip unittests of cinder if not related to code	2022-10-25 09:27:07 UTC
OpenStack gerrit	432427	None	ABANDONED	Disable sitepackages=True for functional tox targets	2022-10-25 09:27:07 UTC
OpenStack gerrit	432430	None	ABANDONED	WIP: Remove now unused db-jobs	2022-10-25 09:27:07 UTC
OpenStack gerrit	835638	None	MERGED	Migration revert plan	2022-09-29 15:45:30 UTC
Red Hat Issue Tracker	OSP-511	None	None	None	2021-11-18 15:19:57 UTC
Red Hat Issue Tracker	RHOSPDOC-1126	None	None	None	2023-11-10 11:14:45 UTC

Description Pradipta Kumar Sahoo 2020-04-13 10:33:51 UTC

Description of problem:
In our recent experience with OVN migration in a scale lab environment [1], we noticed that the migration breaks because of the ambiguous status of the TripleO stack deployment dependencies.

Version-Release number of selected component (if applicable):
python3-networking-ovn-migration-tool-7.1.0-0.20200204065607.57ac389.el8ost.noarch
Red Hat OpenStack Platform release 16.0.1 (Train)

How reproducible:
100% reproducible in Scale lab

Steps to Reproduce:
1. After the ml2-ovn migration script breaks [1], the existing overcloud environment for all tenants, including pre-migration resources, is completely down.
2. In this state, the Neutron ml2 and conf files are overridden with OVN service parameters.
3. The Neutron openvswitch services were not cleaned up, so both OVN and OVS service containers exist after the stack update.
4. All tunnel ports are reflected in both br-tun and br-migration.
5. In this situation, the OC environment is in a deadlock state: the customer cannot restore the overcloud environment because the underlying layer is completely messed up.
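
The duplicated tunnel ports from step 4 can be confirmed with a small check. This is a minimal sketch and not part of the migration tool: it assumes the two port lists were captured on a node with `ovs-vsctl list-ports br-tun` and `ovs-vsctl list-ports br-migration`, and the function name and the sample port names are hypothetical.

```python
def ports_on_both_bridges(br_tun_output: str, br_migration_output: str) -> list[str]:
    """Return the sorted port names that appear in both bridge listings.

    Each argument is the raw text of `ovs-vsctl list-ports <bridge>`,
    which prints one port name per line.
    """
    def parse(output: str) -> set[str]:
        # Ignore blank lines and surrounding whitespace.
        return {line.strip() for line in output.splitlines() if line.strip()}

    return sorted(parse(br_tun_output) & parse(br_migration_output))


if __name__ == "__main__":
    # Hypothetical sample output, for illustration only.
    br_tun = "patch-int\nvxlan-0a000001\nvxlan-0a000002\n"
    br_migration = "vxlan-0a000001\nvxlan-0a000002\neth1\n"
    for port in ports_on_both_bridges(br_tun, br_migration):
        print(port)
```

A non-empty result means the same tunnel port is attached to both bridges, which matches the half-migrated state described above.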

Actual results:
All the overcloud tenant resources are down and not accessible. In a customer scenario, it would be a critical situation if the migration steps break and there is no way to restore ml2-ovs within a limited maintenance period.

Expected results:
The OVN migration code should be enhanced with an ml2-ovs restore plan to avoid any deadlock situation.

Additional info:
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1818866

Comment 1 Daniel Alvarez Sanchez 2020-04-14 09:26:41 UTC

I believe this is something that falls outside the migration tool. The same mechanism that we have for general updates/upgrades should come into the picture, right?
It'd be great to have input from the backups&restore team here.

Comment 2 Daniel Alvarez Sanchez 2020-04-14 09:28:54 UTC

Just to be clear, I'm talking about the revert plan. Of course the migration tool needs to be resilient enough to minimize the revert scenarios.

Comment 3 Jakub Libosvar 2020-04-14 09:32:50 UTC

Setting needinfo on Juan to get some input. We can improve our docs to mention backup&restore procedures prior to the migration.

Comment 4 Juan Badia Payno 2020-04-14 10:41:46 UTC

After talking to Daniel Alvarez and checking the BZs, I saw that the migration script updates the controllers and the computes. The Backup and Restore procedure was only tested on the control-plane.
The Backup and Restore procedure uses ReaR, which is a disaster recovery tool.

I only see a couple of options here:
1.- Try to back up the computes... which I think is going to be a long journey
2.- Back up the control plane and execute the overcloud-deploy script to update the overcloud, which ensures that the computes are configured properly.

To do a proper backup of the control plane we need to stop all the services on them, which means that there is a production disruption (Ceph, Network communication...)

If there is an environment to test it, we can test it. Furthermore, we should be able to do a proper migration and then do a restoration. (Well, not sure what changes on the computes... but the outcome after the overcloud update should be the initial environment.)

Comment 5 Korry Nguyen 2021-06-01 22:03:10 UTC

Elevating pri/sev to high, as it's listed as important for the perf and scale team.

Comment 8 Jakub Libosvar 2022-01-06 14:22:05 UTC

*** Bug 2025910 has been marked as a duplicate of this bug. ***

Comment 9 Jakub Libosvar 2022-01-06 14:51:29 UTC

*** Bug 1948579 has been marked as a duplicate of this bug. ***

Comment 19 Gurpreet Singh 2022-09-02 16:00:35 UTC

Pradipta, can we discuss the scope of the revert capability?

As of now, we are telling our customers to take a snapshot / backup and restore from the snapshot. Is automatic reversion something that we can handle in 17.1? Scope will be important (what needs to be done).

Comment 20 Pradipta Kumar Sahoo 2022-09-08 12:16:46 UTC

Hi Gurpreet,

Sure, we can discuss the revert plan. Yes, the backup/restore from the snapshot can meet the requirement.
In the past, we had an upgrade activity where we (executed by Jaison) did the OVN migration test.

I am not aware of an OSP 17.1 feature that provides a solution for automatic revert. So, please schedule a meeting for further clarity.

BR,
Pradipta

Comment 21 Gurpreet Singh 2022-09-14 15:57:30 UTC

Moving to 18.0. This will not be addressed in 17.1 and will go in as a known limitation.

Comment 26 Gurpreet Singh 2022-10-16 16:22:40 UTC

Hi Eran

We need a QA ack for this item to make it into OSP 17.1.

Regards
Gurpreet