Description of problem:
During a recent OVN migration in our scale lab environment, we found that the migration breaks, leaving the TripleO stack deployment dependencies in an ambiguous state.
Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.0.1 (Train)
100% reproducible in the scale lab
Steps to Reproduce:
1. After the ml2/ovs-to-OVN migration script breaks, the entire overcloud tenant environment, including all pre-migration resources, is completely down.
2. In this state, the Neutron ml2 and service configuration files have already been overridden with OVN parameters.
3. The Neutron Open vSwitch services were not cleaned up, so both OVN and OVS service containers exist after the stack update.
4. All tunnel ports appear on both br-tun and br-migration.
5. The overcloud is left in a deadlocked state: the customer cannot restore the environment because the underlying layer is completely broken.
All overcloud tenant resources are down and inaccessible. For a customer this would be a critical situation: if the migration steps break, there is no way to restore back to ml2/ovs within a limited maintenance window.
The OVN migration code should be enhanced with an ml2/ovs restore plan to avoid this deadlock.
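The leftover state in steps 3 and 4 can be checked directly on an affected node. A minimal diagnostic sketch, assuming root access on the node; the bridge names br-tun and br-migration come from the report above, and `check_bridge_ports` is a hypothetical local helper, not part of any tool:

```shell
#!/bin/bash
# Diagnostic sketch for a node left behind by a broken ml2/ovs -> OVN
# migration. check_bridge_ports is a local helper defined here.

check_bridge_ports() {
  # Print the ports attached to an OVS bridge, if ovs-vsctl exists.
  bridge="$1"
  if command -v ovs-vsctl >/dev/null 2>&1; then
    echo "--- ports on $bridge ---"
    sudo ovs-vsctl list-ports "$bridge"
  else
    echo "ovs-vsctl not available; cannot inspect $bridge"
  fi
}

# Step 3 symptom: both OVS and OVN agent containers still present.
if command -v podman >/dev/null 2>&1; then
  sudo podman ps --format '{{.Names}}' | grep -Ei 'ovn|openvswitch|neutron' || true
fi

# Step 4 symptom: the same tunnel ports show up on both bridges.
check_bridge_ports br-tun
check_bridge_ports br-migration
```

If both bridge listings show the same tunnel ports and both agent container sets are running, the node is in the mixed state described above.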
I believe this falls outside the migration tool. The same mechanism we have for general updates/upgrades should come into the picture, right?
It'd be great to have input from the backup & restore team here.
Just to be clear, I'm talking about the revert plan. Of course the migration tool needs to be resilient enough to minimize the revert scenarios.
Setting needinfo on Juan to get some inputs. We can improve our docs to mention backup&restore procedures prior to the migration.
After talking to Daniel Alvarez and checking the BZs, I saw that the migration script updates both the controllers and the computes. The Backup and Restore procedure was only tested on the control plane.
The Backup and Restore procedure uses ReaR (Relax-and-Recover), which is a disaster recovery tool.
I only see a couple of options here:
1.- Try to back up the computes... which I think is going to be a long journey.
2.- Back up the control plane and execute the overcloud-deploy script to update the overcloud, so that the computes end up configured properly.
To do a proper backup of the control plane we need to stop all the services on the controllers, which means a production disruption (Ceph, network communication, ...).
If there is an environment to test this, we can test it. Furthermore, we should be able to do a proper migration and then a restoration. (I am not sure what changes on the computes, but the outcome after the overcloud update should be the initial environment.)
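Option 2 above can be sketched roughly as follows. This is an assumption-heavy outline, not a tested procedure: the deploy wrapper name `overcloud_deploy.sh` is hypothetical (substitute whatever script your environment was originally deployed with), and the ReaR recovery step is performed manually from the rescue media on each controller, not from this script:

```shell
#!/bin/bash
# Sketch of option 2: restore the controllers from a ReaR backup, then
# re-run the deploy so the computes converge back to the restored
# (ml2/ovs) state.
#
# Prerequisite (manual, per controller): boot the ReaR rescue image
# and run `rear recover` to restore the control-plane backup.

rerun_overcloud_deploy() {
  # Re-run the original deploy command WITHOUT the OVN environment
  # files, so the stack update reconfigures the computes for ml2/ovs.
  # overcloud_deploy.sh is an assumed wrapper name; substitute yours.
  if [ -x ./overcloud_deploy.sh ]; then
    ./overcloud_deploy.sh
  else
    echo "overcloud_deploy.sh not found; run your original deploy command"
    return 0
  fi
}

rerun_overcloud_deploy
```

The key design point is that the computes are never backed up: the stack update after the control-plane restore is what brings them back in line with the restored Heat stack.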
elevate pri/sev to high as it's listed as important for perf and scale team.
*** Bug 2025910 has been marked as a duplicate of this bug. ***
*** Bug 1948579 has been marked as a duplicate of this bug. ***
Pradipta, can we discuss the scope of the revert capability?
As of now we are telling our customers to take a snapshot/backup and restore from the snapshot. Is automatic reversion something we can handle in 17.1? The scope will be important (what needs to be done).
Sure, we can discuss the revert plan. Yes, backup/restore from the snapshot can meet the requirement.
In the past we had an upgrade activity where we (executed by Jaison) did the OVN migration test.
I am not aware of any OSP 17.1 feature that provides a solution for automatic revert, so please schedule a meeting for further clarity.
Moving to 18.0. This will not be addressed in 17.1 and will be documented as a known limitation.
We need a QA ack for this item to make it into OSP 17.1.