Bug 1659066

Summary: [OVN migration] long downtime during migration from ml2/ovs to ml2/ovn
Product: Red Hat OpenStack
Reporter: Eran Kuris <ekuris>
Component: python-networking-ovn
Assignee: Lucas Alvares Gomes <lmartins>
Status: CLOSED CURRENTRELEASE
QA Contact: Eran Kuris <ekuris>
Severity: urgent
Docs Contact:
Priority: high
Version: 14.0 (Rocky)
CC: apevec, dalvarez, jlibosva, lhh, lmartins, majopela, nusiddiq, twilson
Target Milestone:
Keywords: TestOnly, Triaged, ZStream
Target Release: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: python-networking-ovn-5.0.2-0.20190430191338.e673daf.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-11-26 13:38:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1694572
Bug Blocks:

Attachments:

Description	Flags
logs	none

Description Eran Kuris 2018-12-13 13:55:42 UTC

Created attachment 1514063 [details]
logs

Description of problem:
During the migration from ml2/ovs to ml2/ovn we expect some downtime for live instances, but the measured downtime is around 5 to 6 minutes.


https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ml2ovs-to-ovn-migration/228/artifact/.workspaces/workspace_2018-12-12_11-44-32/ovn_migration/ovn_migration/
Version-Release number of selected component (if applicable):
core_puddle: 2018-12-12.4

How reproducible:
100%

Steps to Reproduce:
1. Run the migration job and check the ping logs.
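
One way to read the downtime out of a ping log is to look for the longest run of missing icmp_seq values. The sketch below assumes the log contains standard Linux ping reply lines ("... bytes from ...: icmp_seq=N ...") and that ping ran with its default 1-second interval; the job's actual log format is not shown in this report, and the function name is hypothetical.

```python
import re
from typing import Iterable, Tuple

# Matches the sequence number in a standard Linux ping reply line (assumed format).
SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def longest_outage(lines: Iterable[str], interval_seconds: float = 1.0) -> Tuple[int, float]:
    """Return (missed_replies, approx_seconds) for the longest gap in icmp_seq."""
    seqs = set()
    for line in lines:
        # Only count successful replies, not error lines that also carry icmp_seq.
        if "bytes from" not in line:
            continue
        match = SEQ_RE.search(line)
        if match:
            seqs.add(int(match.group(1)))

    ordered = sorted(seqs)
    longest = 0
    for previous, current in zip(ordered, ordered[1:]):
        # Sequence numbers strictly between two replies were lost.
        longest = max(longest, current - previous - 1)
    return longest, longest * interval_seconds
```

With a 1-second interval, a gap of about 300 to 360 missing replies would correspond to the 5 to 6 minutes reported here.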

Actual results:


Expected results:


Additional info:

Comment 2 Miguel Angel Ajo 2019-02-04 23:40:51 UTC

I found the culprit.

It's the workaround introduced here: 
   https://github.com/openstack/networking-ovn/blob/47983cc61194888750f1c4cb08ff350a13914903/migration/tripleo_environment/playbooks/roles/migration/templates/clone-br-int.sh.j2#L79

The moment we delete the controller on br-int, all the OpenFlow rules are removed, which causes the downtime. We then wait 5 minutes after this script is run :-/


The workaround was introduced because of:
   https://bugzilla.redhat.com/show_bug.cgi?id=1640045



We need to figure out whether that workaround can now be removed. If it cannot, that script should:

1) Save the flows
2) Apply the workaround
3) Restore the flows
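
The save/apply/restore sequence above could be sketched roughly as follows. This is not the content of clone-br-int.sh.j2; it is a minimal illustration that assumes the workaround amounts to `ovs-vsctl del-controller br-int` (as described above), that `ovs-ofctl dump-flows --no-stats` output can be fed back through `ovs-ofctl add-flows <bridge> -` once the reply header line is dropped, and that the helper names are hypothetical. The command runner is injectable so the ordering can be checked without a running Open vSwitch.

```python
import subprocess
from typing import Callable, List, Optional

Runner = Callable[[List[str], Optional[str]], str]


def run_command(argv: List[str], stdin_text: Optional[str] = None) -> str:
    """Run a command, optionally feeding it stdin, and return its stdout."""
    result = subprocess.run(
        argv, input=stdin_text, capture_output=True, text=True, check=True
    )
    return result.stdout


def filter_flow_dump(dump: str) -> str:
    """Drop the NXST_FLOW/OFPST_FLOW reply header and blank lines from a dump."""
    kept = [
        line.strip()
        for line in dump.splitlines()
        if line.strip() and "NXST_FLOW" not in line and "OFPST_FLOW" not in line
    ]
    return "\n".join(kept) + ("\n" if kept else "")


def apply_workaround_preserving_flows(
    bridge: str = "br-int", run: Runner = run_command
) -> str:
    # 1) Save the flows (without counters, so they can be re-added).
    flows = filter_flow_dump(run(["ovs-ofctl", "dump-flows", "--no-stats", bridge], None))
    # 2) Apply the workaround (assumed here to be deleting the controller).
    run(["ovs-vsctl", "del-controller", bridge], None)
    # 3) Restore the flows from the saved text, read from stdin ("-").
    if flows:
        run(["ovs-ofctl", "add-flows", bridge, "-"], flows)
    return flows
```

Even with the flows restored immediately, there would still be a short window between steps 2 and 3 in which the bridge has no rules, so this only shortens the outage rather than eliminating it.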

Comment 3 Miguel Angel Ajo 2019-02-05 16:30:53 UTC

I discovered the culprit because I was recording the screen during the migration:
https://www.youtube.com/watch?v=sA7xfTpPMJc

Comment 4 Numan Siddique 2019-02-05 16:42:04 UTC

(In reply to Miguel Angel Ajo from comment #2)

I think the fix is available in OVS from version 2.10.0-21 onward, so I don't think we need the workaround anymore.
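
If the workaround had to be kept for older builds, the script could gate it on the installed OVS build. The sketch below is only an illustration: it assumes the package reports a "version-release" string whose release starts with a number (for example a hypothetical "2.10.0-21.el7fdp"), and it does a simple numeric comparison rather than a full RPM version comparison. The function names are hypothetical.

```python
import re
from typing import Tuple

# First build believed to contain the fix, per the comment above.
MIN_FIXED = ((2, 10, 0), 21)


def parse_version_release(text: str) -> Tuple[Tuple[int, ...], int]:
    """Split 'X.Y.Z-N[.suffix]' into ((X, Y, Z), N)."""
    match = re.match(r"^(\d+(?:\.\d+)*)-(\d+)", text)
    if not match:
        raise ValueError(f"unrecognised version-release string: {text!r}")
    version = tuple(int(part) for part in match.group(1).split("."))
    release = int(match.group(2))
    return version, release


def workaround_needed(installed: str) -> bool:
    """True when the installed build is older than the first fixed build."""
    return parse_version_release(installed) < MIN_FIXED
```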

Comment 5 Miguel Angel Ajo 2019-02-05 22:15:17 UTC

Correct. I verified that we don't need the workaround anymore; the migration works without it.

Comment 10 Lon Hohberger 2019-07-10 10:40:35 UTC

According to our records, this should be resolved by python-networking-ovn-5.0.2-0.20190430191338.e673daf.el7ost.  This build is available now.

Comment 12 Eran Kuris 2019-10-30 10:11:05 UTC

Can't verify: this depends on https://bugzilla.redhat.com/show_bug.cgi?id=1694572

Comment 15 Jakub Libosvar 2019-11-26 13:38:21 UTC

Closing as bug 1694572 won't make it to OSP14 before EOL.