Bug 1659066 - [OVN migration] long downtime during migration from ml2/ovs to ml2/ovn
Summary: [OVN migration] long downtime during migration from ml2/ovs to ml2/ovn
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: z4
Target Release: 14.0 (Rocky)
Assignee: Lucas Alvares Gomes
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 1694572
Blocks:
Reported: 2018-12-13 13:55 UTC by Eran Kuris
Modified: 2019-11-26 13:39 UTC

Fixed In Version: python-networking-ovn-5.0.2-0.20190430191338.e673daf.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-26 13:38:21 UTC
Target Upstream Version:
Embargoed:


Attachments
logs (234.01 KB, application/zip)
2018-12-13 13:55 UTC, Eran Kuris


Links
System            ID       Private  Priority  Status  Summary                            Last Updated
Launchpad         1814831  0        None      None    None                               2019-02-05 22:23:02 UTC
OpenStack gerrit  635063   0        None      MERGED  Fix downtime bug during migration  2019-11-25 11:25:39 UTC

Description Eran Kuris 2018-12-13 13:55:42 UTC
Created attachment 1514063 [details]
logs

Description of problem:
During the migration from ml2/ovs to ml2/ovn we expect some downtime for the live instances. However, when we measure it, the downtime is around 5 to 6 minutes.


https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ml2ovs-to-ovn-migration/228/artifact/.workspaces/workspace_2018-12-12_11-44-32/ovn_migration/ovn_migration/
Version-Release number of selected component (if applicable):
core_puddle: 2018-12-12.4

How reproducible:
100%

Steps to Reproduce:
1. Run the migration job and check the ping logs.

Actual results:


Expected results:


Additional info:

Comment 2 Miguel Angel Ajo 2019-02-04 23:40:51 UTC
I found the culprit.

It's the workaround introduced here: 
   https://github.com/openstack/networking-ovn/blob/47983cc61194888750f1c4cb08ff350a13914903/migration/tripleo_environment/playbooks/roles/migration/templates/clone-br-int.sh.j2#L79

The moment we delete the controller on br-int, all the OpenFlow rules are removed, which causes the downtime. And then we wait 5 minutes after this script is run :-/


The workaround was introduced because of:
   https://bugzilla.redhat.com/show_bug.cgi?id=1640045



We need to figure out if that workaround can now be removed, or otherwise, in that script:

1) Save the flows
2) Apply the workaround
3) Restore the flows
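
The three steps above could be sketched roughly as follows. This is only an illustration, not the code in clone-br-int.sh.j2: it assumes the workaround amounts to `ovs-vsctl del-controller` on the bridge, and the function names (`workaround_preserving_flows`, `clean_flow_dump`) and the injectable `run` parameter are hypothetical. The cleanup of the dump is modelled on what the `ovs-save` helper shipped with OVS does.

```python
import re
import subprocess
import tempfile
from typing import Callable, List

# Hypothetical type: anything that takes a command and returns its stdout.
Runner = Callable[[List[str]], str]


def _run(cmd: List[str]) -> str:
    """Run a command, raise on a non-zero exit, and return its stdout."""
    result = subprocess.run(cmd, check=True, stdout=subprocess.PIPE,
                            universal_newlines=True)
    return result.stdout


def clean_flow_dump(dump: str) -> List[str]:
    """Turn `ovs-ofctl dump-flows` output into lines `add-flows` can read."""
    flows = []
    for line in dump.splitlines():
        line = line.strip()
        # Skip blank lines and the "NXST_FLOW reply" / "OFPST_FLOW reply" header.
        if not line or "NXST_FLOW" in line or "OFPST_FLOW" in line:
            continue
        # Drop idle_age/hard_age fields, which are not valid when adding flows.
        flows.append(re.sub(r"(idle|hard)_age=[^,]*,\s*", "", line))
    return flows


def workaround_preserving_flows(bridge: str = "br-int",
                                run: Runner = _run) -> int:
    """Save flows, apply the (assumed) workaround, restore flows.

    Returns the number of flows that were saved and re-added.
    """
    # 1) Save the flows, without counters so they can be re-added later.
    flows = clean_flow_dump(
        run(["ovs-ofctl", "--no-stats", "dump-flows", bridge]))

    # 2) Apply the workaround (assumption: deleting the bridge controller).
    run(["ovs-vsctl", "del-controller", bridge])

    # 3) Restore the flows from a temporary file.
    if flows:
        with tempfile.NamedTemporaryFile("w", suffix=".flows") as tmp:
            tmp.write("\n".join(flows) + "\n")
            tmp.flush()
            run(["ovs-ofctl", "add-flows", bridge, tmp.name])
    return len(flows)
```

Calling `workaround_preserving_flows("br-int")` on a compute node would need root and a running OVS. Traffic can still drop briefly between steps 2 and 3, and a bridge restricted to newer OpenFlow versions may need an extra `-O` option on the `ovs-ofctl` calls, so treat this as a starting point only.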

Comment 3 Miguel Angel Ajo 2019-02-05 16:30:53 UTC
I discovered the culprit because I was recording the screen during the migration:
https://www.youtube.com/watch?v=sA7xfTpPMJc

Comment 4 Numan Siddique 2019-02-05 16:42:04 UTC
(In reply to Miguel Angel Ajo from comment #2)
> I found the culprit.
> 
> It's the workaround introduced here: 
>   
> https://github.com/openstack/networking-ovn/blob/
> 47983cc61194888750f1c4cb08ff350a13914903/migration/tripleo_environment/
> playbooks/roles/migration/templates/clone-br-int.sh.j2#L79
> 
> The moment we delete the controller on br-int, all the openflow rules are
> removed = downtime. And then we wait 5 minutes after this script is ran :-/
> 
> 
> The workaround was introduced because of:
>    https://bugzilla.redhat.com/show_bug.cgi?id=1640045
> 
> 
> 
> We need to figure out if that workaround can now be removed, or otherwise,
> in that script:
> 
> 1) Save the flows
> 2) Apply the workaround
> 3) Restore the flows

I think the fix is available in OVS as of version 2.10.0-21, so I don't think we need the workaround anymore.

Comment 5 Miguel Angel Ajo 2019-02-05 22:15:17 UTC
Correct. I verified that we don't need the workaround anymore; the migration works without it.

Comment 10 Lon Hohberger 2019-07-10 10:40:35 UTC
According to our records, this should be resolved by python-networking-ovn-5.0.2-0.20190430191338.e673daf.el7ost.  This build is available now.

Comment 12 Eran Kuris 2019-10-30 10:11:05 UTC
Can't verify; this depends on https://bugzilla.redhat.com/show_bug.cgi?id=1694572

Comment 15 Jakub Libosvar 2019-11-26 13:38:21 UTC
Closing as bug 1694572 won't make it to OSP14 before EOL.

