Bug 2216778

Summary: Long VM network downtime (~20 min) when running ovn migration revert playbook
Product: Red Hat OpenStack
Reporter: Roman Safronov <rsafrono>
Component: documentation
Assignee: James Smith <jamsmith>
Status: ASSIGNED
QA Contact: RHOS Documentation Team <rhos-docs>
Severity: high
Priority: high
Version: 17.1 (Wallaby)
CC: averdagu, chrisw, ekuris, gregraka, jamsmith, scohen
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Bug Blocks: 1823324

Description Roman Safronov 2023-06-22 14:16:25 UTC
Description of problem:
When the OVN migration revert is started (after the controller nodes are already back up), there is still a long network downtime before the VMs become responsive again.

Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230621.n.1
openstack-neutron-ovn-migration-tool-18.6.1-1.20230518200969.el9ost.noarch
python3-neutron-18.6.1-1.20230518200969.el9ost.noarch
ovn22.12-22.12.0-46.el9fdp.x86_64
openvswitch3.1-3.1.0-14.el9fdp.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Deploy an HA environment with the OVS neutron backend.
2. Create networks, subnets, and security groups, then spawn VMs connected to the created networks. Make sure each VM is either connected to the external network or has a floating IP (FIP) so that it is reachable from the external network.
3. Create a backup of the controller nodes so that the migration can be reverted later.
4. Migrate the neutron backend from OVS to OVN.
5. Make sure the VMs respond to ping requests from the external network.
6. Restore the controller nodes from the backup.
7. Make sure the controller nodes are restored and the VMs are still pingable.
8. Start an infinite ping process for each VM, then run the revert playbook (/usr/share/ansible/neutron-ovn-migration/playbooks/revert.yml).
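
To put a number on the downtime seen in step 8, the ping results can be timestamped and the longest gap computed. The following is only a minimal sketch, not part of the migration tooling: the helper names (ping_once, monitor, longest_outage) are hypothetical, it assumes the Linux iputils ping options -c and -W, and it probes for a bounded duration rather than indefinitely.

```python
import subprocess
import time


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to host gets a reply.

    Assumes the Linux iputils `ping` (-c count, -W reply timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(host, duration_s, interval_s=1.0):
    """Probe host for duration_s seconds; return (timestamp, reachable) pairs."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), ping_once(host)))
        time.sleep(interval_s)
    return samples


def longest_outage(samples):
    """Return the longest outage, in seconds, in time-ordered samples.

    An outage runs from the first failed probe to the next successful one.
    An outage still open at the end is counted up to the last sample.
    """
    longest = 0.0
    outage_start = None
    for timestamp, reachable in samples:
        if not reachable and outage_start is None:
            outage_start = timestamp
        elif reachable and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None and samples:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

For example, calling longest_outage(monitor("192.0.2.10", 1800)) while the revert playbook runs would report the worst outage over 30 minutes for one VM; 192.0.2.10 is a placeholder address, not one taken from this environment.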

Actual results:
The VMs do not respond to pings for about 20 minutes.

Expected results:
Network downtime is short: a few seconds, and in any case less than a minute.

Additional info:
If this long downtime is considered reasonable, we need to document it. Customers would expect a much shorter network downtime.