Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1498035

Summary: Neutron test_trunk_subport_lifecycle failure on OVN setup
Product: Red Hat OpenStack
Component: openvswitch
Version: 12.0 (Pike)
Target Release: 12.0 (Pike)
Target Milestone: z1
Reporter: Eran Kuris <ekuris>
Assignee: Daniel Alvarez Sanchez <dalvarez>
QA Contact: Eran Kuris <ekuris>
Status: CLOSED ERRATA
Severity: high
Priority: medium
Keywords: Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
CC: apevec, chrisw, dalvarez, ekuris, jlibosva, lhh, majopela, mariel, nyechiel, rhos-maint, sclewis, srevivo
Fixed In Version: openvswitch-2.7.3-2.git20171010.el7fdp
Last Closed: 2018-01-30 20:25:08 UTC
Type: Bug
Bug Depends On: 1498432    
Attachments: log

Description Eran Kuris 2017-10-03 11:24:07 UTC
Created attachment 1333602 [details]
log

Description of problem:
The following Neutron test fails on an OVN environment:
neutron.tests.tempest.scenario.test_trunk.TrunkTest.test_trunk_subport_lifecycle

Error:
Traceback (most recent call last):
  File "/home/centos/tempest-upstream/neutron/neutron/tests/tempest/scenario/test_trunk.py", line 185, in test_trunk_subport_lifecycle
    utils.wait_until_true(
  File "/home/centos/tempest-upstream/neutron/neutron/common/utils.py", line 678, in wait_until_true
    raise WaitTimeout("Timed out after %d seconds" % timeout)
neutron.common.utils.WaitTimeout: Timed out after 60 seconds
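For context, `wait_until_true` repeatedly evaluates a predicate until it returns a truthy value or the timeout expires. A minimal sketch of that polling pattern (not neutron's exact implementation; the names and defaults here are illustrative):

```python
import time


class WaitTimeout(Exception):
    """Raised when the predicate does not become true within the timeout."""


def wait_until_true(predicate, timeout=60, sleep=1, exception=None):
    # Poll the predicate until it returns a truthy value or the deadline passes.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(sleep)
    raise exception or WaitTimeout("Timed out after %d seconds" % timeout)
```

In the failing test the predicate checks the trunk/subport status via the Neutron API, so the 60-second timeout is hit when the ports never transition to ACTIVE (or when the environment is very slow, as discussed in comment 3).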


Version-Release number of selected component (if applicable):
# rpm -qa |grep ovn
openvswitch-ovn-common-2.7.2-4.git20170719.el7fdp.x86_64
novnc-0.6.1-1.el7ost.noarch
puppet-ovn-11.3.1-0.20170825135756.c03c3ed.el7ost.noarch
openvswitch-ovn-central-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-host-2.7.2-4.git20170719.el7fdp.x86_64
openstack-nova-novncproxy-16.0.1-0.20170921091002.edd59ae.el7ost.noarch
python-networking-ovn-3.0.1-0.20170906223255.c663db6.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP12 with an OVN HA setup.
2. Verify that trunking is enabled in the configuration file.
3. Run the following Neutron test:
neutron.tests.tempest.scenario.test_trunk.TrunkTest.test_trunk_subport_lifecycle
Actual results:
The test fails.

Expected results:
The test passes.

Additional info:

Comment 1 Daniel Alvarez Sanchez 2017-10-03 13:12:27 UTC
The bug description says the test fails at [0], which means that the trunk does not transition to ACTIVE in time. While debugging the test, I could verify that the trunk does transition to ACTIVE, but the test then fails at [1] because the subports never transition to ACTIVE.

After a closer look at the OVN databases with the help of Numan, this certainly looks like a bug in ovn-controller: ovn-controller does not set the chassis column for the subports, so ovn-northd never transitions them to the 'up' state; as a result they are never bound by neutron and remain DOWN.

In this particular test, the subports are created on different (isolated) networks, and this looks like the reason why ovn-controller does not bind them to the chassis. If the subports were on the same network as the trunk port, or connected to it through a router, the test would pass.

I'll try to debug ovn-controller and come up with a patch to fix it.

[0] https://github.com/openstack/neutron/blob/master/neutron/tests/tempest/scenario/test_trunk.py#L185
[1] https://github.com/openstack/neutron/blob/master/neutron/tests/tempest/scenario/test_trunk.py#L188

Comment 2 Daniel Alvarez Sanchez 2017-10-04 09:31:23 UTC
I have debugged this further with OVS master (v2.8.90) and it works: the subports are bound to the chassis by ovn-controller, ovn-northd sets them to 'up', networking-ovn detects the change, and neutron-server then binds them so they transition to ACTIVE in neutron.

After reviewing the commits between tag 2.7.2 and master, I spotted [0] as the likely fix. I have tried this patch on top of v2.7.2 and can confirm that it fixes the issue. It has also been backported to the v2.7 branch, so it is present in v2.7.3 [1].

One way to verify that it actually works:

- Create a logical switch net0
- Create a logical switch net1
- Add two logical ports to net0: port0 and port2
- Add another logical port to net1: port1
- Set the parent_name of both port1 and port2 to port0 (visible as parent_port in the Port_Binding table).
- Bind port0 to the chassis.
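Assuming the standard ovn-nbctl/ovs-vsctl CLIs, the steps above can be sketched as follows (the VLAN tags and the vif0 interface name are illustrative):

```shell
# Create the two logical switches.
ovn-nbctl ls-add net0
ovn-nbctl ls-add net1

# port0 is the parent (trunk) port on net0.
ovn-nbctl lsp-add net0 port0
ovn-nbctl lsp-set-addresses port0 "02:ac:10:ff:01:00 50.0.0.10"

# port2 is a subport on the same network as the parent. lsp-add takes
# optional parent and tag arguments, which show up as parent_port/tag
# in the southbound Port_Binding table.
ovn-nbctl lsp-add net0 port2 port0 101
ovn-nbctl lsp-set-addresses port2 "02:ac:10:ff:01:02 50.0.0.12"

# port1 is a subport on a *different* network (the failing case here).
ovn-nbctl lsp-add net1 port1 port0 102
ovn-nbctl lsp-set-addresses port1 "02:ac:10:ff:01:01 60.0.0.11"

# On the hypervisor, bind port0 to the chassis by attaching an
# interface whose iface-id matches the logical port name.
ovs-vsctl add-port br-int vif0 -- \
    set Interface vif0 external_ids:iface-id=port0
```

With a fixed ovn-controller, `ovn-sbctl list port_binding` should then show a chassis for port1 and port2 as well, not just for port0 (compare the outputs below).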

[root@controller-0 heat-admin]# ovn-nbctl --db=tcp:172.17.1.10:6641 show net0
    switch 2614b187-e531-487a-9f2e-95ba008e7a57 (net0)
        port port2
            parent: port0
            addresses: ["02:ac:10:ff:01:02 50.0.0.12"]
        port port0
            addresses: ["02:ac:10:ff:01:00 50.0.0.10"]


[root@controller-0 heat-admin]# ovn-nbctl --db=tcp:172.17.1.10:6641 show net1
    switch 3345e5fc-a513-4a59-9650-e6bc2ab84a62 (net1)
        port port1
            parent: port0
            addresses: ["02:ac:10:ff:01:01 60.0.0.11"]



Without the fix:
----------------

[root@controller-0 ovs]# ovn-sbctl --db=tcp:172.17.1.10:6642 list port_binding | grep port1 -C 3
_uuid               : 77cac6d9-1a6e-495b-a59b-dae2c495cae3
chassis             : []
datapath            : baf94869-7b80-49ad-b358-3718af68310a
logical_port        : "port1"
mac                 : ["02:ac:10:ff:01:01 60.0.0.11"]
options             : {}
parent_port         : "port0"



With the fix:
-------------

[root@controller-0 ovs]# ovn-sbctl --db=tcp:172.17.1.10:6642 list port_binding | grep port1 -C 3
_uuid               : 77cac6d9-1a6e-495b-a59b-dae2c495cae3
chassis             : 93508252-106f-46a8-9e90-8e7aeb7acdf9
datapath            : baf94869-7b80-49ad-b358-3718af68310a
logical_port        : "port1"
mac                 : ["02:ac:10:ff:01:01 60.0.0.11"]
options             : {}
parent_port         : "port0"


Since we need OVS 2.7.3 and it is only available upstream, I have requested it in Fedora [3] and will open new BZs for OSP and RDO as well.

[0] https://github.com/openvswitch/ovs/commit/f80485a1f5a43e8d228fe81c1e9395a6bf5abb67
[1] https://github.com/openvswitch/ovs/commit/afaf4de07083863bccb1feaf01f37a59ca06a909
[2] https://github.com/openstack/neutron/blob/master/neutron/tests/tempest/scenario/test_trunk.py#L185
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1468234

Comment 3 Daniel Alvarez Sanchez 2017-10-04 09:33:39 UTC
Apart from my previous comment, which explains the actual bug that prevents this test from passing, the bug description states that it fails at an earlier step, where the trunk is supposed to transition to ACTIVE [0]. I have checked in the same environment and it indeed fails there. Increasing the timeout to 120 seconds (the default is 60) worked for me, but I don't believe this should be necessary; it looks like the environment was simply slow (maybe not enough hardware resources?):

diff --git a/neutron/tests/tempest/scenario/test_trunk.py b/neutron/tests/tempest/scenario/test_trunk.py
index a988ca8..e684730 100644
--- a/neutron/tests/tempest/scenario/test_trunk.py
+++ b/neutron/tests/tempest/scenario/test_trunk.py
@@ -182,7 +182,8 @@ class TrunkTest(base.BaseTempestTestCase):
         utils.wait_until_true(
             lambda: self._is_trunk_active(trunk1_id),
             exception=RuntimeError("Timed out waiting for trunk %s to "
-                                   "transition to ACTIVE." % trunk1_id))
+                                   "transition to ACTIVE." % trunk1_id),
+            timeout=120)
         # ensure all underlying subports transitioned to ACTIVE
         for s in subports:
             utils.wait_until_true(lambda: self._is_port_active(s['port_id']))


Before posting this patch for review, could you please confirm that the resources used for running tempest in CI are the same that have been used in this test environment?

[0] https://github.com/openstack/neutron/blob/master/neutron/tests/tempest/scenario/test_trunk.py#L185

Comment 4 Eran Kuris 2017-10-18 08:31:48 UTC
(In reply to Daniel Alvarez Sanchez from comment #3)
> Before posting this patch for review, could you please confirm that the
> resources used for running tempest in CI are the same that have been used in
> this test environment?

Yes, they are the same.

Comment 5 Miguel Angel Ajo 2017-12-04 13:26:25 UTC
I'd guess this is related to the OVSDB timeouts / keepalive.

Comment 8 Daniel Alvarez Sanchez 2018-01-15 09:59:01 UTC
This was already fixed in OVS 2.7.3.

Comment 11 Eran Kuris 2018-01-28 12:20:37 UTC
Fix verified: 
(overcloud) [stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed 
12   -p 2018-01-26.2
(overcloud) [stack@undercloud-0 ~]$ rpm -qa |grep openvswitch-2.7.3
openvswitch-2.7.3-3.git20180112.el7fdp.x86_64
python-openvswitch-2.7.3-3.git20180112.el7fdp.noarch
(overcloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.16
Last login: Sun Jan 28 12:18:36 2018 from 192.168.24.1
[heat-admin@controller-1 ~]$ sudo -i 
[root@controller-1 ~]# rpm -qa |grep openvswitch-2.7.3
python-openvswitch-2.7.3-3.git20180112.el7fdp.noarch
openvswitch-2.7.3-3.git20180112.el7fdp.x86_64

Comment 14 errata-xmlrpc 2018-01-30 20:25:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0248