Bug 1823324 - [RFE][IMP] The OVN migration code should revise with revert plan.
Summary: [RFE][IMP] The OVN migration code should revise with revert plan.
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 17.0 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: beta
Target Release: 17.1
Assignee: Arnau Verdaguer
QA Contact: Roman Safronov
URL:
Whiteboard:
Duplicates: 1948579 2025910
Depends On: 1792500 1818866
Blocks: 2019745 2155253
 
Reported: 2020-04-13 10:33 UTC by Pradipta Kumar Sahoo
Modified: 2023-03-21 14:56 UTC
CC: 19 users

Fixed In Version: openstack-neutron-18.6.1-1.20230221161409.94c2c92.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
gurpsing: needinfo-




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 432423 0 None MERGED Skip unittests of cinder if not related to code 2022-10-25 09:27:07 UTC
OpenStack gerrit 432427 0 None ABANDONED Disable sitepackages=True for functional tox targets 2022-10-25 09:27:07 UTC
OpenStack gerrit 432430 0 None ABANDONED WIP: Remove now unused db-jobs 2022-10-25 09:27:07 UTC
OpenStack gerrit 835638 0 None MERGED Migration revert plan 2022-09-29 15:45:30 UTC
Red Hat Issue Tracker OSP-511 0 None None None 2021-11-18 15:19:57 UTC

Description Pradipta Kumar Sahoo 2020-04-13 10:33:51 UTC
Description of problem:
In our recent experience with OVN migration activity in a scale lab environment [1], we noticed that the migration breaks due to the ambiguous status of TripleO stack deployment dependencies.

Version-Release number of selected component (if applicable):
python3-networking-ovn-migration-tool-7.1.0-0.20200204065607.57ac389.el8ost.noarch
Red Hat OpenStack Platform release 16.0.1 (Train)

How reproducible:
100% reproducible in Scale lab

Steps to Reproduce:
1. After the ML2/OVS-to-OVN migration script breaks [1], all tenant resources in the existing overcloud, including pre-migration resources, are completely down.
2. In this state, the Neutron ML2 and service configuration files have been overridden with OVN service parameters.
3. The Neutron openvswitch services were not cleaned up, so both the OVN and OVS service containers exist after the stack update.
4. All tunnel ports are reflected in both br-tun and br-migration.
5. In this situation, the overcloud environment is in a dead-lock state where the customer cannot restore it, as the underlying layer is completely messed up.
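The mixed state in steps 3-4 can be spot-checked on an overcloud node. The sketch below is a hypothetical helper, not part of the migration tool; the container names (`neutron_ovs_agent`, `ovn_controller`) are assumptions based on typical OSP 16 deployments and should be checked against the actual environment.

```shell
# Hypothetical check for the mixed state described above: both the ML2/OVS
# agent container and the OVN controller container present at the same time.
both_backends_present() {
    # $1: newline-separated container names, e.g. from
    #     sudo podman ps --format '{{.Names}}'
    printf '%s\n' "$1" | grep -q 'neutron_ovs_agent' &&
    printf '%s\n' "$1" | grep -q 'ovn_controller'
}

# On a live node you would feed it real data:
#   both_backends_present "$(sudo podman ps --format '{{.Names}}')" \
#       && echo "WARNING: both OVS and OVN backends are running"
```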


Actual results:
All the overcloud tenant resources are down and not accessible. In a customer scenario, it would be a critical situation if the migration steps break and there is no way to restore back to ML2/OVS within a limited maintenance window.

Expected results:
The OVN migration code should be enhanced with an ML2/OVS restore plan to avoid any deadlock situation.

Additional info:
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1818866

Comment 1 Daniel Alvarez Sanchez 2020-04-14 09:26:41 UTC
I believe this is something that falls outside the migration tool. The same mechanism that we have for general updates/upgrades should come into the picture, right?
It'd be great to have input from the backup & restore team here.

Comment 2 Daniel Alvarez Sanchez 2020-04-14 09:28:54 UTC
Just to be clear, I'm talking about the revert plan. Of course the migration tool needs to be resilient enough to minimize the revert scenarios.

Comment 3 Jakub Libosvar 2020-04-14 09:32:50 UTC
Setting needinfo on Juan to get some inputs. We can improve our docs to mention backup&restore procedures prior to the migration.

Comment 4 Juan Badia Payno 2020-04-14 10:41:46 UTC
After talking to Daniel Alvarez and checking the BZs, I saw that the migration script updates both the controllers and the computes, while the Backup and Restore procedure was only tested on the control plane.
The Backup and Restore procedure uses ReaR, which is a disaster recovery tool.

I only see a couple of options here:
1.- Try to back up the computes... which I think is going to be a long journey.
2.- Back up the control plane and execute the overcloud-deploy script to update the overcloud, which ensures that the computes are configured properly.

To do a proper backup of the control plane we need to stop all the services on it, which means a production disruption (Ceph, network communication...).

If there is an environment to test it, we can test it. Furthermore, we should be able to do a proper migration and then a restoration. (Well, I'm not sure what changes on the computes, but the outcome after the overcloud update should be the initial environment.)
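As a rough outline (not a tested procedure), option 2 above could look like the following. The `openstack overcloud backup` invocation is the ReaR-based control-plane backup flow from the OSP 16.x backup and restore documentation, and its exact flags should be verified against the release; the deploy script path is site-specific and purely illustrative.

```shell
# Outline only; verify each command against the release documentation.
source ~/stackrc

# ReaR-based backup of the control plane (expect service disruption
# while the backup runs, as noted above).
openstack overcloud backup --init
openstack overcloud backup

# If the migration has to be reverted: restore the controllers from the
# ReaR backup, then rerun the original ML2/OVS deploy command so that the
# computes are reconfigured to match the restored control plane.
./overcloud-deploy.sh   # site-specific deploy script (illustrative path)
```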

Comment 5 Korry Nguyen 2021-06-01 22:03:10 UTC
Elevating priority/severity to high as this is listed as important for the perf and scale team.

Comment 8 Jakub Libosvar 2022-01-06 14:22:05 UTC
*** Bug 2025910 has been marked as a duplicate of this bug. ***

Comment 9 Jakub Libosvar 2022-01-06 14:51:29 UTC
*** Bug 1948579 has been marked as a duplicate of this bug. ***

Comment 19 Gurpreet Singh 2022-09-02 16:00:35 UTC
Pradipta, can we discuss the scope of the revert capability?

As of now we are telling our customers to take a snapshot/backup and restore from the snapshot. Is automatic reversion something that we can handle in 17.1? Scope will be important (what needs to be done).

Comment 20 Pradipta Kumar Sahoo 2022-09-08 12:16:46 UTC
Hi Gurpreet,

Sure, we can discuss the revert plan. Yes, the backup/restore from the snapshot can meet the requirement.
In the past, during an upgrade activity, we (executed by Jaison) did the OVN migration test.

I am not aware of an OSP 17.1 feature that provides an automatic revert, so please schedule a meeting for further clarity.

BR,
Pradipta

Comment 21 Gurpreet Singh 2022-09-14 15:57:30 UTC
Moving to 18.0. This will not be addressed in 17.1 and will be documented as a known limitation.

Comment 26 Gurpreet Singh 2022-10-16 16:22:40 UTC
Hi Eran

We need a QA ack for this item to make it into OSP 17.1.

Regards
Gurpreet

