Bug 1659066

Summary: [OVN migration] long downtime during migration from ml2/ovs to ml2/ovn
Product: Red Hat OpenStack
Component: python-networking-ovn
Version: 14.0 (Rocky)
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: high
Reporter: Eran Kuris <ekuris>
Assignee: Lucas Alvares Gomes <lmartins>
QA Contact: Eran Kuris <ekuris>
CC: apevec, dalvarez, jlibosva, lhh, lmartins, majopela, nusiddiq, twilson
Target Milestone: z4
Target Release: 14.0 (Rocky)
Keywords: TestOnly, Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: python-networking-ovn-5.0.2-0.20190430191338.e673daf.el7ost
Last Closed: 2019-11-26 13:38:21 UTC
Type: Bug
Bug Depends On: 1694572
Attachments: logs (flags: none)

Description Eran Kuris 2018-12-13 13:55:42 UTC
Created attachment 1514063: logs

Description of problem:
During the migration from ml2/ovs to ml2/ovn, we expect some downtime for live instances. However, when measuring the downtime, we see that it lasts around 5 to 6 minutes.
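
For anyone who wants to reproduce the measurement, here is a minimal sketch of how the outage length could be estimated from a ping log. It assumes standard Linux `ping` reply lines containing `icmp_seq=N` and the default one-second interval; the function name and the sample addresses are hypothetical and are not taken from the migration job.

```python
import re

# Matches successful replies only, e.g.
# "64 bytes from 10.0.0.5: icmp_seq=12 ttl=64 time=0.4 ms"
REPLY_RE = re.compile(r"bytes from .*icmp_seq=(\d+)")


def longest_outage_seconds(ping_lines, interval_s=1.0):
    """Estimate the longest run of lost replies, in seconds.

    Assumption: one echo request is sent every `interval_s` seconds, so a
    gap of N missing sequence numbers is roughly N * interval_s of downtime.
    Sequence number wrap-around is not handled.
    """
    seqs = []
    for line in ping_lines:
        match = REPLY_RE.search(line)
        if match:
            seqs.append(int(match.group(1)))

    longest_missing = 0
    for prev, cur in zip(seqs, seqs[1:]):
        longest_missing = max(longest_missing, cur - prev - 1)
    return longest_missing * interval_s
```

Lines such as "Destination Host Unreachable" are ignored on purpose, so only gaps between successful replies count as downtime.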


https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ml2ovs-to-ovn-migration/228/artifact/.workspaces/workspace_2018-12-12_11-44-32/ovn_migration/ovn_migration/

Version-Release number of selected component (if applicable):
core_puddle: 2018-12-12.4

How reproducible:
100%

Steps to Reproduce:
1. Run the migration job and check the ping logs.

Actual results:


Expected results:


Additional info:

Comment 2 Miguel Angel Ajo 2019-02-04 23:40:51 UTC
I found the culprit.

It's the workaround introduced here: 
   https://github.com/openstack/networking-ovn/blob/47983cc61194888750f1c4cb08ff350a13914903/migration/tripleo_environment/playbooks/roles/migration/templates/clone-br-int.sh.j2#L79

The moment we delete the controller on br-int, all the OpenFlow rules are removed, which causes the downtime. We then wait 5 minutes after this script is run :-/


The workaround was introduced because of:
   https://bugzilla.redhat.com/show_bug.cgi?id=1640045



We need to figure out whether that workaround can now be removed. If it cannot, that script should:

1) Save the flows
2) Apply the workaround
3) Restore the flows
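
The three steps above could be sketched roughly as follows. This is only an illustration, not the actual change to clone-br-int.sh.j2: it assumes the workaround amounts to `ovs-vsctl del-controller` on the bridge (as described above), and the function names and the list of volatile fields stripped from the dump are my own assumptions.

```python
import re
import subprocess

# Counters and timers printed by "ovs-ofctl dump-flows" that describe the
# flow's runtime state rather than the flow itself (assumed list).
VOLATILE = re.compile(
    r"\b(?:duration|n_packets|n_bytes|idle_age|hard_age)=[^,\s]*,\s*"
)


def sanitize_flows(dump_output):
    """Turn dump-flows output into lines intended for add-flows."""
    flows = []
    for line in dump_output.splitlines():
        line = line.strip()
        # Skip blanks and the "NXST_FLOW reply (xid=...):" header line.
        if not line or " reply " in line:
            continue
        flows.append(VOLATILE.sub("", line))
    return flows


def workaround_with_flow_restore(bridge, flow_file, run=subprocess.run):
    # 1) Save the flows.
    dump = run(["ovs-ofctl", "dump-flows", bridge],
               check=True, capture_output=True, text=True).stdout
    with open(flow_file, "w") as fh:
        fh.write("\n".join(sanitize_flows(dump)) + "\n")
    # 2) Apply the workaround (assumed to be deleting the controller).
    run(["ovs-vsctl", "del-controller", bridge], check=True)
    # 3) Restore the flows.
    run(["ovs-ofctl", "add-flows", bridge, flow_file], check=True)
```

Even with this, there would probably still be a short gap between steps 2 and 3, but it should be far shorter than leaving br-int without flows for the whole 5 minute wait.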

Comment 3 Miguel Angel Ajo 2019-02-05 16:30:53 UTC
I discovered the culprit because I was recording the screen during the migration:
https://www.youtube.com/watch?v=sA7xfTpPMJc

Comment 4 Numan Siddique 2019-02-05 16:42:04 UTC
(In reply to Miguel Angel Ajo from comment #2)

I think the fix is available in OVS version 2.10.0-21 and later, so I don't think we need the workaround anymore.
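
If the workaround had to stay conditional, a version gate along these lines could decide whether to skip it. This is a simplified sketch: the function name is hypothetical, it compares only the numeric version and release prefix rather than using full RPM version comparison, and it assumes every later version also carries the fix.

```python
import re


def ovs_has_fix(version_release, minimum=(2, 10, 0, 21)):
    """Return True if an OVS version-release string is at least 2.10.0-21.

    Expects strings shaped like "2.10.0-21.el7fdp"; anything after the
    numeric release (dist tag and so on) is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", version_release)
    if not match:
        raise ValueError("unrecognized version: %r" % version_release)
    return tuple(int(part) for part in match.groups()) >= minimum
```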

Comment 5 Miguel Angel Ajo 2019-02-05 22:15:17 UTC
Correct. I verified that we don't need the workaround anymore; the migration works without it.

Comment 10 Lon Hohberger 2019-07-10 10:40:35 UTC
According to our records, this should be resolved by python-networking-ovn-5.0.2-0.20190430191338.e673daf.el7ost.  This build is available now.

Comment 12 Eran Kuris 2019-10-30 10:11:05 UTC
Can't verify; this depends on https://bugzilla.redhat.com/show_bug.cgi?id=1694572

Comment 15 Jakub Libosvar 2019-11-26 13:38:21 UTC
Closing as bug 1694572 won't make it to OSP14 before EOL.