Bug 1691044 - Adding ports to VM trunk takes too long with OVN neutron backend
Summary: Adding ports to VM trunk takes too long with OVN neutron backend
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z8
Target Release: 13.0 (Queens)
Assignee: Lucas Alvares Gomes
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On:
Blocks: 1724557
 
Reported: 2019-03-20 17:36 UTC by Jon Uriarte
Modified: 2020-02-11 11:33 UTC
CC: 12 users

Fixed In Version: python-networking-ovn-4.0.3-8.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1724557 (view as bug list)
Environment:
Last Closed: 2019-09-03 16:53:02 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1828585 0 None None None 2019-05-10 14:29:38 UTC
Launchpad 1834637 0 None None None 2019-08-22 13:42:41 UTC
Launchpad 1841062 0 None None None 2019-08-22 13:37:52 UTC
OpenStack gerrit 658383 0 'None' MERGED Don't wait for metadata service unnecessarily 2020-12-03 10:08:01 UTC
OpenStack gerrit 676182 0 'None' MERGED Set binding profile directly from OVNTrunkDriver 2020-12-03 10:07:35 UTC
Red Hat Knowledge Base (Solution) 4373171 0 None None None 2019-08-26 17:29:35 UTC
Red Hat Product Errata RHBA-2019:2629 0 None None None 2019-09-03 16:53:22 UTC

Internal Links: 1703198

Description Jon Uriarte 2019-03-20 17:36:15 UTC
Description of problem:

OCP installation on OSP takes much more time with the OVN neutron backend than with the ML2/OVS backend.
It seems to be related to the time it now takes to add neutron subports to VM trunks, which an OCP-on-OSP installation
does frequently when pods are created in each new OCP project.

Adding 5 subports to a trunk (creating a pod in a new project) takes:
  20 seconds with ML2/OVS
  > 2 minutes with ML2/OVN


Version-Release number of selected component (if applicable):

OSP 13 2019-03-18.1 puddle
python-networking-ovn-4.0.3-3.el7ost.noarch
python-networking-ovn-metadata-agent-4.0.3-3.el7ost.noarch
openvswitch-ovn-central-2.9.0-97.el7fdp.x86_64
openvswitch-ovn-common-2.9.0-97.el7fdp.x86_64
puppet-ovn-12.4.0-2.el7ost.noarch
openvswitch-ovn-host-2.9.0-97.el7fdp.x86_64


How reproducible: 100%


Steps to Reproduce:
1. Deploy OSP with Octavia and OVN neutron backend (no DVR)
2. Run OCP 3.11 playbooks with Kuryr
3. Once OCP is deployed, ssh into master node
4. Create a new project:
[openshift@master-0 ~]$ oc new-project hello-openshift

5. Obtain the IP range assigned to the network namespace for the project
[openshift@master-0 ~]$ oc get kuryrnets 
NAME                            AGE
ns-default                      19h
ns-hello-openshift              43s
(...)

[openshift@master-0 ~]$ oc get kuryrnets ns-hello-openshift -o yaml
(...)
spec:
  netId: 72e3b67f-76c5-41e5-817f-e3a3fac2941c
  routerId: 5396d97c-f8d0-4bec-a096-0025afdb75dc
  sgId: 590cd3e3-571c-4126-a707-b50d3a64803b
  subnetCIDR: 10.11.15.0/24                                      <<<<-----------------
  subnetId: b522840e-c1d2-4565-85be-41fdb9fae0cf

6. Monitor the port status in that IP range and see how long it takes for all of them to reach ACTIVE status after the pod creation is launched:
[stack@undercloud-0 ~]$ watch "openstack port list | grep 10.11.15"

7. Deploy hello-openshift pod:
[openshift@master-0 ~]$ oc create -f https://raw.githubusercontent.com/openshift/origin/master/examples/hello-openshift/hello-pod.json

Actual results:
The 5 ports created by Kuryr need more than 2 minutes to be added to the VM trunk (and go to ACTIVE status).
They are also moved to ACTIVE one by one, not as a single operation, even though they are requested in a single call [1].

(shiftstack) [stack@undercloud-0 ~]$ openstack port list | grep 10.11.15
| 30d8ae16-04bf-48c1-91ad-b7d4b6f02712 | fa:16:3e:1a:99:08 | ip_address='10.11.15.7', subnet_id='b522840e-c1d2-4565-85be-41fdb9fae0cf'     | DOWN   |  
| 3e4ff6a6-735c-4e3c-8f41-61af0f342fe4 | fa:16:3e:c4:7f:59 | ip_address='10.11.15.17', subnet_id='b522840e-c1d2-4565-85be-41fdb9fae0cf'    | DOWN   |  
| 3ec4a554-1812-4e26-9321-c47c67271ad1 | fa:16:3e:6c:5c:38 | ip_address='10.11.15.12', subnet_id='b522840e-c1d2-4565-85be-41fdb9fae0cf'    | DOWN   |  
| 4ec67995-7fa9-43d5-b606-34f15e9c174f | fa:16:3e:8e:3e:12 | ip_address='10.11.15.5', subnet_id='b522840e-c1d2-4565-85be-41fdb9fae0cf'     | ACTIVE |
| 943270b4-6163-4306-bed7-1266249d1dbb | fa:16:3e:3c:6d:d1 | ip_address='10.11.15.14', subnet_id='b522840e-c1d2-4565-85be-41fdb9fae0cf'    | DOWN   |  
| c07b39f9-6973-48d7-aa08-ece9745c0e9e | fa:16:3e:7d:33:00 | ip_address='10.11.15.1', subnet_id='b522840e-c1d2-4565-85be-41fdb9fae0cf'     | ACTIVE |


Expected results:
With the ML2/OVS backend it takes 20 seconds, and the ports are moved to ACTIVE at the same time, not one by one.

Additional info:

[1] https://github.com/openstack/kuryr-kubernetes/blob/master/kuryr_kubernetes/controller/drivers/nested_vlan_vif.py#L95-L96

Comment 1 Daniel Alvarez Sanchez 2019-03-21 13:30:44 UTC
I see a bunch of these messages:

2019-03-21 12:01:04.077 35 DEBUG neutron.pecan_wsgi.hooks.policy_enforcement [req-ad4b7334-17d5-4bde-90c1-fb5d141af326 c13e0cc8325b47d99780b07e32c1c29e 240611d0d1c84092a8105e7e9f7ec85c - default default] Attributes excluded by policy engine: [u'binding:profile', u'binding:vif_details', u'binding:vif_type', u'binding:host_id'] _exclude_attributes_by_policy /usr/lib/python2.7/site-packages/neutron/pecan_wsgi/hooks/policy_enforcement.py:256
2019-03-21 12:01:04.079 35 DEBUG neutron.pecan_wsgi.hooks.policy_enforcement [req-ad4b7334-17d5-4bde-90c1-fb5d141af326 c13e0cc8325b47d99780b07e32c1c29e 240611d0d1c84092a8105e7e9f7ec85c - default default] Attributes excluded by policy engine: [u'binding:profile', u'binding:vif_details', u'binding:vif_type', u'binding:host_id'] _exclude_attributes_by_policy /usr/lib/python2.7/site-packages/neutron/pecan_wsgi/hooks/policy_enforcement.py:256
2019-03-21 12:01:04.081 35 DEBUG neutron.pecan_wsgi.hooks.policy_enforcement [req-ad4b7334-17d5-4bde-90c1-fb5d141af326 c13e0cc8325b47d99780b07e32c1c29e 240611d0d1c84092a8105e7e9f7ec85c - default default] Attributes excluded by policy engine: [u'binding:profile', u'binding:vif_details', u'binding:vif_type', u'binding:host_id'] _exclude_attributes_by_policy /usr/lib/python2.7/site-packages/neutron/pecan_wsgi/hooks/policy_enforcement.py:256
2019-03-21 12:01:04.083 35 DEBUG neutron.pecan_wsgi.hooks.policy_enforcement [req-ad4b7334-17d5-4bde-90c1-fb5d141af326 c13e0cc8325b47d99780b07e32c1c29e 240611d0d1c84092a8105e7e9f7ec85c - default default] Attributes excluded by policy engine: [u'binding:profile', u'binding:vif_details', u'binding:vif_type', u'binding:host_id'] _exclude_attributes_by_policy /usr/lib/python2.7/site-packages/neutron/pecan_wsgi/hooks/policy_enforcement.py:256
2019-03-21 12:01:04.084 35 DEBUG neutron.pecan_wsgi.hooks.policy_enforcement [req-ad4b7334-17d5-4bde-90c1-fb5d141af326 c13e0cc8325b47d99780b07e32c1c29e 240611d0d1c84092a8105e7e9f7ec85c - default default] Attributes excluded by policy engine: [u'binding:profile', u'binding:vif_details', u'binding:vif_type', u'binding:host_id'] _exclude_attributes_by_policy /usr/lib/python2.7/site-packages/neutron/pecan_wsgi/hooks/policy_enforcement.py:256
2019-03-21 12:01:04.086 35 DEBUG neutron.pecan_wsgi.hooks.policy_enforcement [req-ad4b7334-17d5-4bde-90c1-fb5d141af326 c13e0cc8325b47d99780b07e32c1c29e 240611d0d1c84092a8105e7e9f7ec85c - default default] Attributes excluded by policy engine: [u'binding:profile', u'binding:vif_details', u'binding:vif_type', u'binding:host_id'] _exclude_attributes_by_policy /usr/lib/python2.7/site-packages/neutron/pecan_wsgi/hooks/policy_enforcement.py:256
2019-03-21 12:01:04.088 35 DEBUG neutron.pecan_wsgi.hooks.policy_enforcement [req-ad4b7334-17d5-4bde-90c1-fb5d141af326 c13e0cc8325b47d99780b07e32c1c29e 240611d0d1c84092a8105e7e9f7ec85c - default default] Attributes excluded by policy engine: [u'binding:profile', u'binding:vif_details', u'binding:vif_type', u'binding:host_id'] _exclude_attributes_by_policy /usr/lib/python2.7/site-packages/neutron/pecan_wsgi/hooks/policy_enforcement.py:256
2019-03-21 12:01:04.090 35 DEBUG neutron.pecan_wsgi.hooks.policy_enforcement [req-ad4b7334-17d5-4bde-90c1-fb5d141af326 c13e0cc8325b47d99780b07e32c1c29e 240611d0d1c84092a8105e7e9f7ec85c - default default] Attributes excluded by policy engine: [u'binding:profile', u'binding:vif_details', u'binding:vif_type', u'binding:host_id'] _exclude_attributes_by_policy /usr/lib/python2.7/site-packages/neutron/pecan_wsgi/hooks/policy_enforcement.py:256
2019-03-21 12:01:04.092 35 DEBUG neutron.pecan_wsgi.hooks.policy_enforcement [req-ad4b7334-17d5-4bde-90c1-fb5d141af326 c13e0cc8325b47d99780b07e32c1c29e 240611d0d1c84092a8105e7e9f7ec85c - default default] Attributes excluded by policy engine: [u'binding:profile', u'binding:vif_details', u'binding:vif_type', u'binding:host_id'] _exclude_attributes_by_policy /usr/lib/python2.7/site-packages/neutron/pecan_wsgi/hooks/policy_enforcement.py:256
2019-03-21 12:01:04.094 35 DEBUG neutron.pecan_wsgi.hooks.policy_enforcement [req-ad4b7334-17d5-4bde-90c1-fb5d141af326 c13e0cc8325b47d99780b07e32c1c29e 240611d0d1c84092a8105e7e9f7ec85c - default default] Attributes excluded by policy engine: [u'binding:profile', u'binding:vif_details', u'binding:vif_type', u'binding:host_id'] _exclude_attributes_by_policy /usr/lib/python2.7/site-packages/neutron/pecan_wsgi/hooks/policy_enforcement.py:256
2019-03-21 12:01:04.096 35 DEBUG neutron.pecan_wsgi.hooks.policy_enforcement [req-ad4b7334-17d5-4bde-90c1-fb5d141af326 c13e0cc8325b47d99780b07e32c1c29e 240611d0d1c84092a8105e7e9f7ec85c - default default] Attributes excluded by policy engine: [u'binding:profile', u'binding:vif_details', u'binding:vif_type', u'binding:host_id'] _exclude_attributes_by_policy /usr/lib/python2.7/site-packages/neutron/pecan_wsgi/hooks/policy_enforcement.py:256



Which I believe is the reason for the slowness.
The binding details come from here:
https://github.com/openstack/networking-ovn/blob/6.0.0.0b1/networking_ovn/ml2/trunk_driver.py#L53
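
In that driver, each subport goes through a full core-plugin port update, roughly the pattern below (a simplified sketch, not the literal networking-ovn code; the names and shapes are approximations):

    def set_sub_ports(core_plugin, context, trunk):
        """Illustrative only: one full port update per subport."""
        for subport in trunk.sub_ports:
            # Each update_port() call runs the whole core port-update
            # machinery (mechanism drivers, notifications, policy checks
            # such as the _exclude_attributes_by_policy calls logged
            # above), so the cost grows linearly with the subport count.
            core_plugin.update_port(
                context, subport.port_id,
                {'port': {'binding:profile': {
                    'parent_name': trunk.port_id,
                    'tag': subport.segmentation_id}}})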

Comment 2 Slawek Kaplonski 2019-03-22 10:11:35 UTC
I was able to reproduce this issue on devstack with ml2/ovn with the following steps:

1. Create trunk parent port
openstack port create --network private trunk-port

2. Create trunk
openstack network trunk create --parent-port a103ed20-ea9c-46b6-a89d-4441b6d4fb72 trunk

3. Create trunked net
neutron net-create trunked-net
neutron subnet-create trunked-net 192.168.100.0/24

4. Create subports:
openstack port create --network trunked-net subport1
openstack port create --network trunked-net subport2
openstack port create --network trunked-net subport3
openstack port create --network trunked-net subport4
openstack port create --network trunked-net subport5

5. Add subports to trunk:
openstack network trunk set --subport port=b41659ba-2444-4a34-96b3-9bcd7ec89296,segmentation-type=vlan,segmentation-id=101 --subport port=a70f1dc9-8fe7-481e-9608-d3d51cdc1eac,segmentation-type=vlan,segmentation-id=102 --subport port=2b7dc21f-18ee-42ea-854c-01e4654b7f0c,segmentation-type=vlan,segmentation-id=103 --subport port=8d1e3151-992c-48d4-85ec-618449189423,segmentation-type=vlan,segmentation-id=104 --subport port=91d4d286-81d8-4101-a3c2-359e0c800a7e,segmentation-type=vlan,segmentation-id=105  9da4ef0f-72bf-4031-8c4f-5d3b739b8e5b

6. Spawn a VM connected to this trunk's parent port:

nova boot --image cirros-0.3.4-i386-disk --flavor m1.micro --nic port-id=a103ed20-ea9c-46b6-a89d-4441b6d4fb72 vm-trunks

After that, watching the status of the subports, it took a long time for all of them to switch to ACTIVE.

When I did the same steps on devstack with ml2/ovs it was much faster.

Comment 3 Slawek Kaplonski 2019-03-22 16:03:46 UTC
An update on what I did and what I found today.

I used a simple script to reproduce the issue: http://pastebin.test.redhat.com/742759

With this script I got results like:

$ bash test_trunks.sh
+------------+------------------------------------------------+
| Property   | Value                                          |
+------------+------------------------------------------------+
| ip_address | 10.0.0.31,fdd5:442b:9728:0:f816:3eff:fef8:f8ac |
| mac_addr   | fa:16:3e:f8:f8:ac                              |
| net_id     | 120fe645-30f8-478a-9446-b357550acb3a           |
| port_id    | adf3ee81-2ab1-478f-abfd-6024991af9b0           |
| port_state | DOWN                                           |
| tag        | -                                              |
+------------+------------------------------------------------+
Number of active ports: 0 ....
Number of active ports: 0 ....
Number of active ports: 1 ....
Number of active ports: 1 ....
Number of active ports: 2 ....
Number of active ports: 2 ....
Number of active ports: 2 ....
Number of active ports: 3 ....
Number of active ports: 3 ....
Number of active ports: 4 ....
Number of active ports: 4 ....
Number of active ports: 4 ....
Number of active ports: 4 ....
Number of active ports: 5 ....
Transition of ports to ACTIVE took 99.169112739 seconds

So it took almost 100 seconds for all subports to turn ACTIVE after the trunk port was connected to the instance.
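
A minimal polling sketch approximating what such a test script measures (illustrative only; it shells out to the openstack CLI and the subport IDs are placeholders):

    import subprocess
    import time

    SUBPORT_IDS = ['<subport-1-id>', '<subport-2-id>', '<subport-3-id>',
                   '<subport-4-id>', '<subport-5-id>']

    def port_status(port_id):
        # 'openstack port show -f value -c status' prints only the status.
        return subprocess.check_output(
            ['openstack', 'port', 'show', '-f', 'value', '-c', 'status',
             port_id], universal_newlines=True).strip()

    start = time.time()
    while True:
        active = sum(1 for pid in SUBPORT_IDS if port_status(pid) == 'ACTIVE')
        print('Number of active ports: %d ....' % active)
        if active == len(SUBPORT_IDS):
            break
        time.sleep(5)
    print('Transition of ports to ACTIVE took %s seconds' % (time.time() - start))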

While debugging this issue, I found that each subport spends about 20 seconds in set_port_status_up:

Mar 22 16:44:13 devstack-ovn-ubuntu-controller neutron-server[12800]: INFO networking_ovn.ml2.mech_driver [None req-93d6d785-05ee-42a4-b909-206eb9bfd983 None None] OVN reports status up for port: 824dd0b0-5e31-4b44-9718-b4e1c8dc9af5 (id of subport-3)
Mar 22 16:44:37 devstack-ovn-ubuntu-controller neutron-server[12800]: INFO networking_ovn.ml2.mech_driver [None req-93d6d785-05ee-42a4-b909-206eb9bfd983 None None] OVN reports status up for port: f0ebaacd-22ce-454c-a8ac-8dc1dc408f62 (id of subport-5)
Mar 22 16:44:53 devstack-ovn-ubuntu-controller neutron-server[12800]: INFO networking_ovn.ml2.mech_driver [None req-93d6d785-05ee-42a4-b909-206eb9bfd983 None None] OVN reports status up for port: 07c7396b-5c64-4ada-a992-fd6fca1d130a (id of subport-4)
Mar 22 16:45:10 devstack-ovn-ubuntu-controller neutron-server[12800]: INFO networking_ovn.ml2.mech_driver [None req-93d6d785-05ee-42a4-b909-206eb9bfd983 None None] OVN reports status up for port: e5406129-114a-414b-b923-869864e45fd9 (id of subport-2)
Mar 22 16:45:26 devstack-ovn-ubuntu-controller neutron-server[12800]: INFO networking_ovn.ml2.mech_driver [None req-93d6d785-05ee-42a4-b909-206eb9bfd983 None None] OVN reports status up for port: adf3ee81-2ab1-478f-abfd-6024991af9b0 (id of trunk port)
Mar 22 16:45:31 devstack-ovn-ubuntu-controller neutron-server[12800]: INFO networking_ovn.ml2.mech_driver [None req-93d6d785-05ee-42a4-b909-206eb9bfd983 None None] OVN reports status up for port: e80b876f-ee81-484a-b645-4ea616bc4ef9 (id of subport-1)

Next, I wanted to find out where we are losing time in this set_port_status_up method, so I added some "dummy" logging (a patch like http://pastebin.test.redhat.com/742768) to check where it spends most of its time. It looks like most of the time is lost in the _update_subport_host_if_needed() method, and inside that method the vast majority of the time is spent in self._plugin.update_port() - around 10 seconds per subport overall.

The second bottleneck is the provisioning_blocks.provisioning_complete() method, at around 5 seconds.

Both of those methods are not in networking-ovn but in core neutron code.

For now I don't know yet how to improve it. The first thing that comes to mind is to call set_port_status_up in parallel, but I don't know the networking-ovn mech driver well, so I have no idea what other problems that might cause or why it isn't done that way already.
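
A minimal timing-instrumentation sketch of the "dummy" logging approach described above (illustrative only; the placement inside set_port_status_up is hypothetical and shown in the comments, wrapping the two suspect calls named in this comment):

    import contextlib
    import logging
    import time

    LOG = logging.getLogger(__name__)

    @contextlib.contextmanager
    def timed(label):
        """Log how long the wrapped block of code takes."""
        start = time.time()
        try:
            yield
        finally:
            LOG.info('%s took %.2f seconds', label, time.time() - start)

    # Hypothetical placement inside set_port_status_up:
    #
    #   with timed('update_port for subport %s' % port_id):
    #       self._plugin.update_port(context, port_id, updated_port)
    #
    #   with timed('provisioning_complete for %s' % port_id):
    #       provisioning_blocks.provisioning_complete(context, port_id, ...)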

Comment 4 Daniel Alvarez Sanchez 2019-04-24 09:43:42 UTC
Lucas is working on this BZ and on a general optimization mechanism in the context of this design document:

https://docs.openstack.org/networking-ovn/latest/contributor/design/distributed_ovsdb_events.html

Comment 27 Udi Shkalim 2019-08-22 10:02:56 UTC
Verified on python-networking-ovn-4.0.3-10.el7ost.noarch

Comment 30 errata-xmlrpc 2019-09-03 16:53:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2629

