Bug 1498035
| Summary: | Neutron test_trunk_subport_lifecycle failure on OVN setup |
|---|---|
| Product: | Red Hat OpenStack |
| Component: | openvswitch |
| Version: | 12.0 (Pike) |
| Target Release: | 12.0 (Pike) |
| Target Milestone: | z1 |
| Status: | CLOSED ERRATA |
| Type: | Bug |
| Severity: | high |
| Priority: | medium |
| Keywords: | Triaged, ZStream |
| Reporter: | Eran Kuris <ekuris> |
| Assignee: | Daniel Alvarez Sanchez <dalvarez> |
| QA Contact: | Eran Kuris <ekuris> |
| CC: | apevec, chrisw, dalvarez, ekuris, jlibosva, lhh, majopela, mariel, nyechiel, rhos-maint, sclewis, srevivo |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | openvswitch-2.7.3-2.git20171010.el7fdp |
| Last Closed: | 2018-01-30 20:25:08 UTC |
| Bug Depends On: | 1498432 |
The bug description says it fails at [0], which means that the trunk doesn't transition to ACTIVE in time. While debugging the test, I could verify that the trunk does transition to ACTIVE, but the test fails later at [1] because the subports don't transition to ACTIVE.

After a closer look at the OVN databases with the help of Numan, this certainly looks like a bug in ovn-controller: it won't set the chassis column for the subports, so ovn-northd won't transition them to 'up' and they won't get bound by neutron, remaining DOWN.

In this particular test, the subports are created in different (isolated) networks, and this looks like the reason why ovn-controller doesn't bind them to the chassis. If the subports were in the same network as the trunk port, or connected to it through a router, the test would pass. I'll try to debug ovn-controller and come up with a patch to fix it.

[0] https://github.com/openstack/neutron/blob/master/neutron/tests/tempest/scenario/test_trunk.py#L185
[1] https://github.com/openstack/neutron/blob/master/neutron/tests/tempest/scenario/test_trunk.py#L188

I have debugged this further with OVS master (v2.8.90) and it works: the subports are bound to the chassis by ovn-controller and ovn-northd sets them to 'up', which is detected by networking-ovn; neutron-server then binds them and they transition to ACTIVE in neutron.
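When this happens, the break in the chain can be inspected directly in the OVN databases. A hedged sketch of the checks, run against the live deployment (the DB endpoints are the ones used elsewhere in this report; `port1` stands for the subport's logical port name):

```
# Is the subport marked up in the northbound DB? (ovn-northd sets this)
ovn-nbctl --db=tcp:172.17.1.10:6641 lsp-get-up port1

# Has ovn-controller claimed it? The chassis column stays [] while unbound.
ovn-sbctl --db=tcp:172.17.1.10:6642 find Port_Binding logical_port=port1
```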
After having a look at all commits from tag 2.7.2 to master, I spotted [0] as the likely fix. I have tried this patch on top of v2.7.2 and can confirm that it fixes the issue. It has also been backported to the v2.7 branch, so it's present in v2.7.3 [1].
One way to verify that it actually works:
- Create a logical switch net0
- Create a logical switch net1
- Add two logical ports to net0: port0 and port2
- Add another logical port to net1: port1
- Set the parent_name in Port_Binding table of both port1 and port2 to port0.
- Bind the port port0 to the chassis.
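A sketch of those steps as CLI commands (the DB endpoint is the one used elsewhere in this report; the VLAN tags 42/43 and addresses are arbitrary illustrations, and instead of writing parent_name into the southbound Port_Binding table directly, this uses the northbound `lsp-add SWITCH PORT PARENT TAG` form, which ovn-northd propagates to Port_Binding.parent_port):

```
# Two isolated logical switches, a trunk parent and two child ports
ovn-nbctl --db=tcp:172.17.1.10:6641 ls-add net0
ovn-nbctl --db=tcp:172.17.1.10:6641 ls-add net1
ovn-nbctl --db=tcp:172.17.1.10:6641 lsp-add net0 port0
ovn-nbctl --db=tcp:172.17.1.10:6641 lsp-set-addresses port0 "02:ac:10:ff:01:00 50.0.0.10"
# Child ports: lsp-add SWITCH PORT PARENT TAG
ovn-nbctl --db=tcp:172.17.1.10:6641 lsp-add net0 port2 port0 42
ovn-nbctl --db=tcp:172.17.1.10:6641 lsp-add net1 port1 port0 43
# Bind port0 on the chassis: ovn-controller claims any local OVS interface
# whose iface-id matches a logical port name
ovs-vsctl add-port br-int p0 -- set Interface p0 external_ids:iface-id=port0
```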
```
[root@controller-0 heat-admin]# ovn-nbctl --db=tcp:172.17.1.10:6641 show net0
switch 2614b187-e531-487a-9f2e-95ba008e7a57 (net0)
    port port2
        parent: port0
        addresses: ["02:ac:10:ff:01:02 50.0.0.12"]
    port port0
        addresses: ["02:ac:10:ff:01:00 50.0.0.10"]
[root@controller-0 heat-admin]# ovn-nbctl --db=tcp:172.17.1.10:6641 show net1
switch 3345e5fc-a513-4a59-9650-e6bc2ab84a62 (net1)
    port port1
        parent: port0
        addresses: ["02:ac:10:ff:01:01 60.0.0.11"]
```
Without the fix:
----------------

```
[root@controller-0 ovs]# ovn-sbctl --db=tcp:172.17.1.10:6642 list port_binding | grep port1 -C 3
_uuid           : 77cac6d9-1a6e-495b-a59b-dae2c495cae3
chassis         : []
datapath        : baf94869-7b80-49ad-b358-3718af68310a
logical_port    : "port1"
mac             : ["02:ac:10:ff:01:01 60.0.0.11"]
options         : {}
parent_port     : "port0"
```
With the fix:
-------------

```
[root@controller-0 ovs]# ovn-sbctl --db=tcp:172.17.1.10:6642 list port_binding | grep port1 -C 3
_uuid           : 77cac6d9-1a6e-495b-a59b-dae2c495cae3
chassis         : 93508252-106f-46a8-9e90-8e7aeb7acdf9
datapath        : baf94869-7b80-49ad-b358-3718af68310a
logical_port    : "port1"
mac             : ["02:ac:10:ff:01:01 60.0.0.11"]
options         : {}
parent_port     : "port0"
```
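The only difference is the chassis column. As a convenience, a small parsing sketch that flags child ports ovn-controller never claimed (the sample record is embedded from the "without the fix" output above; in practice you would pipe in `ovn-sbctl --db=tcp:172.17.1.10:6642 list port_binding`):

```shell
# Flag child ports (rows with a parent_port) whose chassis column is empty.
result=$(awk -F': *' '
  /^chassis/      { chassis = $2 }
  /^logical_port/ { port = $2 }
  /^parent_port/  {
      if (chassis == "[]") print port " UNBOUND (parent " $2 ")"
      else                 print port " bound to " chassis
  }' <<'EOF'
_uuid           : 77cac6d9-1a6e-495b-a59b-dae2c495cae3
chassis         : []
datapath        : baf94869-7b80-49ad-b358-3718af68310a
logical_port    : "port1"
mac             : ["02:ac:10:ff:01:01 60.0.0.11"]
options         : {}
parent_port     : "port0"
EOF
)
echo "$result"
```

Against the sample record above this prints `"port1" UNBOUND (parent "port0")`.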
Since we need OVS 2.7.3 and it's only available upstream, I've requested it in Fedora [3] and will open a new BZ for OSP and RDO as well.
[0] https://github.com/openvswitch/ovs/commit/f80485a1f5a43e8d228fe81c1e9395a6bf5abb67
[1] https://github.com/openvswitch/ovs/commit/afaf4de07083863bccb1feaf01f37a59ca06a909
[2] https://github.com/openstack/neutron/blob/master/neutron/tests/tempest/scenario/test_trunk.py#L185
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1468234
Apart from my previous comment, which explains the actual bug that prevents this test from passing, the bug description states that it fails at an earlier step, where the trunk itself is supposed to transition to ACTIVE [0]. I have checked in the same environment and it does indeed fail there. Increasing the timeout to 120 (the default is 60) worked for me, but I don't believe this should be necessary; it looks like the environment was simply slow (maybe not enough hardware resources?):
```diff
diff --git a/neutron/tests/tempest/scenario/test_trunk.py b/neutron/tests/tempest/scenario/test_trunk.py
index a988ca8..e684730 100644
--- a/neutron/tests/tempest/scenario/test_trunk.py
+++ b/neutron/tests/tempest/scenario/test_trunk.py
@@ -182,7 +182,8 @@ class TrunkTest(base.BaseTempestTestCase):
         utils.wait_until_true(
             lambda: self._is_trunk_active(trunk1_id),
             exception=RuntimeError("Timed out waiting for trunk %s to "
-                                   "transition to ACTIVE." % trunk1_id))
+                                   "transition to ACTIVE." % trunk1_id),
+            timeout=120)
         # ensure all underlying subports transitioned to ACTIVE
         for s in subports:
             utils.wait_until_true(lambda: self._is_port_active(s['port_id']))
```
Before posting this patch for review, could you please confirm that the resources used for running tempest in CI are the same that have been used in this test environment?
[0] https://github.com/openstack/neutron/blob/master/neutron/tests/tempest/scenario/test_trunk.py#L185
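For context, `utils.wait_until_true` is essentially a polling loop with a deadline. A minimal self-contained sketch of its contract (illustrative only, not the real neutron helper) shows what the extra `timeout=120` buys:

```python
import time


class WaitTimeout(Exception):
    """Raised when the predicate never becomes true in time."""


def wait_until_true(predicate, timeout=60, sleep=1, exception=None):
    """Poll `predicate` every `sleep` seconds for up to `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            # Raise the caller-supplied exception, or a generic timeout
            raise exception or WaitTimeout("Timed out after %d seconds" % timeout)
        time.sleep(sleep)


# Example: a trunk that reaches ACTIVE on the third poll.
states = iter(["DOWN", "BUILD", "ACTIVE"])
status = {"now": None}

def _is_trunk_active():
    status["now"] = next(states, "ACTIVE")
    return status["now"] == "ACTIVE"

wait_until_true(_is_trunk_active, timeout=120, sleep=0)
print(status["now"])  # ACTIVE
```

A longer timeout only papers over a slow environment; if the port never transitions (as with the ovn-controller bug above), the loop still times out.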
(In reply to Daniel Alvarez Sanchez from comment #3)
> Before posting this patch for review, could you please confirm that the
> resources used for running tempest in CI are the same that have been used in
> this test environment?

Yes, they are the same. I'd guess this is related to the OVSDB timeouts / keepalive. This was already fixed in OVS 2.7.3.
Fix verified:

```
(overcloud) [stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed
12 -p 2018-01-26.2
(overcloud) [stack@undercloud-0 ~]$ rpm -qa | grep openvswitch-2.7.3
openvswitch-2.7.3-3.git20180112.el7fdp.x86_64
python-openvswitch-2.7.3-3.git20180112.el7fdp.noarch
(overcloud) [stack@undercloud-0 ~]$ ssh heat-admin.24.16
Last login: Sun Jan 28 12:18:36 2018 from 192.168.24.1
[heat-admin@controller-1 ~]$ sudo -i
[root@controller-1 ~]# rpm -qa | grep openvswitch-2.7.3
python-openvswitch-2.7.3-3.git20180112.el7fdp.noarch
openvswitch-2.7.3-3.git20180112.el7fdp.x86_64
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0248
Created attachment 1333602 [details]
log

Description of problem:
The following Neutron test is failing on an OVN environment:
neutron.tests.tempest.scenario.test_trunk.TrunkTest.test_trunk_subport_lifecycle

Error:

```
Traceback (most recent call last):
  File "/home/centos/tempest-upstream/neutron/neutron/tests/tempest/scenario/test_trunk.py", line 185, in test_trunk_subport_lifecycle
    utils.wait_until_true(
  File "/home/centos/tempest-upstream/neutron/neutron/common/utils.py", line 678, in wait_until_true
    raise WaitTimeout("Timed out after %d seconds" % timeout)
neutron.common.utils.WaitTimeout: Timed out after 60 seconds
```

Version-Release number of selected component (if applicable):

```
# rpm -qa | grep ovn
openvswitch-ovn-common-2.7.2-4.git20170719.el7fdp.x86_64
novnc-0.6.1-1.el7ost.noarch
puppet-ovn-11.3.1-0.20170825135756.c03c3ed.el7ost.noarch
openvswitch-ovn-central-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-host-2.7.2-4.git20170719.el7fdp.x86_64
openstack-nova-novncproxy-16.0.1-0.20170921091002.edd59ae.el7ost.noarch
python-networking-ovn-3.0.1-0.20170906223255.c663db6.el7ost.noarch
```

How reproducible: 100%

Steps to Reproduce:
1. Deploy an OSP12 OVN HA setup
2. Verify trunk is enabled in the config file
3. Run the following Neutron test: neutron.tests.tempest.scenario.test_trunk.TrunkTest.test_trunk_subport_lifecycle

Actual results: test failed
Expected results: test should pass