Bug 1823324

Summary: [QE-Tracker][RFE][IMP] The OVN migration code should revise with revert plan. [Neutron&NFV use cases]
Product: Red Hat OpenStack
Reporter: Pradipta Kumar Sahoo <psahoo>
Component: openstack-neutron
Assignee: Arnau Verdaguer <averdagu>
Status: CLOSED CURRENTRELEASE
QA Contact: Roman Safronov <rsafrono>
Severity: high
Docs Contact:
Priority: high
Version: 17.0 (Wallaby)
CC: apevec, averdagu, bmv, chrisw, dalvarez, ekuris, gurpsing, hakhande, jamsmith, jbadiapa, jlibosva, jpalanis, jschluet, konguyen, lhh, majopela, mariel, pgrist, rsafrono, scohen, skaplons, supadhya
Target Milestone: z2
Keywords: FutureFeature, TestOnly, Triaged
Target Release: 17.1
Flags: gurpsing: needinfo-
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: openstack-neutron-18.6.1-1.20230221161409.94c2c92.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-01-05 11:40:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1792500, 1818866, 2144492, 2216778, 2222624, 2223350
Bug Blocks: 2155253, 2210773

Description Pradipta Kumar Sahoo 2020-04-13 10:33:51 UTC
Description of problem:
During a recent OVN migration in a scale lab environment [1], we noticed that the migration breaks because of the ambiguous status of the TripleO stack deployment dependencies.

Version-Release number of selected component (if applicable):
python3-networking-ovn-migration-tool-7.1.0-0.20200204065607.57ac389.el8ost.noarch
Red Hat OpenStack Platform release 16.0.1 (Train)

How reproducible:
100% reproducible in Scale lab

Steps to Reproduce:
1. After the ml2-ovn migration script breaks [1], the entire existing overcloud tenant environment, including pre-migration resources, is completely down.
2. In this state, the Neutron ml2 and conf files have been overridden with OVN service parameters.
3. The Neutron openvswitch services were not cleaned up, so both OVN and OVS service containers exist after the stack update.
4. All tunnel ports are present in both br-tun and br-migration.
5. In this situation, the overcloud environment is in a deadlock state: the customer cannot restore it because the underlying layer is left in an inconsistent state.
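
The state described in steps 3 and 4 can be checked for mechanically. The following is a minimal sketch, not part of the migration tool. It assumes the container names `neutron_ovs_agent` and `ovn_controller` and the set of tunnel interface types shown below; none of these come from this report, so adjust them for the deployment. The function name `half_migrated_state` is hypothetical.

```python
from typing import Dict, Iterable, List

# Assumed names (not taken from this bug report); adjust per deployment.
OVS_AGENT_CONTAINERS = {"neutron_ovs_agent"}
OVN_CONTAINERS = {"ovn_controller"}
TUNNEL_TYPES = {"vxlan", "gre", "geneve"}


def half_migrated_state(
    containers: Iterable[str],
    interface_types: Dict[str, List[str]],
) -> Dict[str, bool]:
    """Report whether a node looks stuck between ml2-ovs and ml2-ovn.

    containers: names of the running containers on the node.
    interface_types: bridge name -> list of OVS interface types on it.
    """
    names = set(containers)
    # Step 3: OVS agent and OVN containers running side by side.
    both_backends = bool(names & OVS_AGENT_CONTAINERS) and bool(names & OVN_CONTAINERS)

    def has_tunnels(bridge: str) -> bool:
        return any(t in TUNNEL_TYPES for t in interface_types.get(bridge, []))

    # Step 4: tunnel ports present on br-tun and br-migration at the same time.
    tunnels_on_both = has_tunnels("br-tun") and has_tunnels("br-migration")
    return {
        "both_backends_running": both_backends,
        "tunnels_on_both_bridges": tunnels_on_both,
        "inconsistent": both_backends or tunnels_on_both,
    }
```

The inputs are plain data so the check stays testable. On a real node they could be collected from the container runtime's process listing and from `ovs-vsctl list-ports <bridge>` combined with `ovs-vsctl get Interface <port> type`.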


Actual results:
All the overcloud tenant resources are down and inaccessible. In a customer scenario, this would be a critical situation if the migration steps break and there is no way to restore back to ml2-ovs within a limited maintenance period.

Expected results:
The OVN migration code should be enhanced with an ml2-ovs restore plan to avoid any deadlock situation.

Additional info:
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1818866

Comment 1 Daniel Alvarez Sanchez 2020-04-14 09:26:41 UTC
I believe this is something that falls outside the migration tool. The same mechanism that we have for general updates/upgrades should come into the picture, right?
It'd be great to have input from the backup&restore team here.

Comment 2 Daniel Alvarez Sanchez 2020-04-14 09:28:54 UTC
Just to be clear, I'm talking about the revert plan. Of course the migration tool needs to be resilient enough to minimize the revert scenarios.

Comment 3 Jakub Libosvar 2020-04-14 09:32:50 UTC
Setting needinfo on Juan to get some input. We can improve our docs to mention backup&restore procedures prior to the migration.

Comment 4 Juan Badia Payno 2020-04-14 10:41:46 UTC
After talking to Daniel Alvarez and checking the BZs, I saw that the migration script updates the controllers and the computes. The Backup and Restore procedure was only tested on the control plane.
The Backup and Restore procedure uses ReaR, which is a disaster recovery tool.

I only see a couple of options here:
1.- Try to back up the computes, which I think is going to be a long journey.
2.- Back up the control plane and execute the overcloud-deploy script to update the overcloud. This ensures that the computes are configured properly.

To do a proper backup of the control plane, we need to stop all the services on the controllers, which means a production disruption (Ceph, network communication...).

If there is an environment to test it, we can test it. Furthermore, we should be able to do a proper migration and then do a restoration. (Well, I am not sure what changes on the computes, but the outcome after the overcloud update should be the initial environment.)
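
The ordering implied by option 2 (back up the control plane, migrate, and on failure restore and re-run the overcloud deploy) can be sketched as follows. This is only an illustration of the sequencing under stated assumptions: `migrate_with_revert` and its four callables are hypothetical placeholders, not functions of the migration tool or of ReaR.

```python
from typing import Callable

Step = Callable[[], None]


def migrate_with_revert(
    backup: Step,
    migrate: Step,
    restore: Step,
    redeploy: Step,
) -> bool:
    """Run the migration with a revert path.

    Returns True if the migration succeeded, False if it failed and the
    revert steps were run. All four callables are hypothetical hooks.
    """
    # If the backup itself raises, the migration is never started.
    backup()
    try:
        migrate()
    except Exception:
        # Revert: restore the control plane from the backup, then re-run
        # the overcloud deploy so the computes are reconfigured too.
        restore()
        redeploy()
        return False
    return True
```

Taking the backup before anything else means a failed backup aborts the run while the environment is still untouched. The sketch does not handle a failure inside `restore` or `redeploy`; such an exception propagates to the caller.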

Comment 5 Korry Nguyen 2021-06-01 22:03:10 UTC
Elevating pri/sev to high, as this is listed as important for the perf and scale team.

Comment 8 Jakub Libosvar 2022-01-06 14:22:05 UTC
*** Bug 2025910 has been marked as a duplicate of this bug. ***

Comment 9 Jakub Libosvar 2022-01-06 14:51:29 UTC
*** Bug 1948579 has been marked as a duplicate of this bug. ***

Comment 19 Gurpreet Singh 2022-09-02 16:00:35 UTC
Pradipta, can we discuss the scope of the revert capability?

As of now we are telling our customers to take a snapshot/backup and restore from the snapshot. Is automatic reversion something that we can handle in 17.1? The scope will be important (what needs to be done).

Comment 20 Pradipta Kumar Sahoo 2022-09-08 12:16:46 UTC
Hi Gurpreet,

Sure, we can discuss the revert plan. Yes, backup/restore from a snapshot can meet the requirement.
In the past, we had an upgrade activity where we did the OVN migration test (executed by Jaison).

I am not aware of any OSP 17.1 feature that provides automatic revert. So, please schedule a meeting for further clarity.

BR,
Pradipta

Comment 21 Gurpreet Singh 2022-09-14 15:57:30 UTC
Moving to 18.0. This will not be addressed in 17.1 and will be listed as a known limitation.

Comment 26 Gurpreet Singh 2022-10-16 16:22:40 UTC
Hi Eran

We need a QA ack for this item to make it into OSP 17.1.

Regards
Gurpreet