Bug 1390065 - OSPD 10 OVS DPDK deployment fails when overcloud deploy bridge mapping contains DPDK bridge and non-DPDK bridge
Summary: OSPD 10 OVS DPDK deployment fails when overcloud deploy bridge mapping contains DPDK bridge and non-DPDK bridge
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Saravanan KR
QA Contact: Maxim Babushkin
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-31 07:12 UTC by Maxim Babushkin
Modified: 2017-03-19 15:07 UTC
CC: 18 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
When using OVS-DPDK, all bridges on the Compute node should be of type ovs_user_bridge. Red Hat OpenStack Platform director does not support mixing ovs_bridge and ovs_user_bridge on the same node, as doing so degrades OVS-DPDK performance.
Clone Of:
Environment:
Last Closed: 2017-03-19 15:07:35 UTC
Target Upstream Version:
Embargoed:


Attachments
compute-node-openvswitch-agent-vm-create-fail (78.39 KB, text/plain)
2016-11-08 14:57 UTC, Saravanan KR

Description Maxim Babushkin 2016-10-31 07:12:24 UTC
Description of problem:
In RHOS 10, during a deployment with two OVS-DPDK ports, the compute node loses its IP addresses during the PostDeploySteps.
This bug refers to the following BZ, where the proper heat template configuration for a single OVS-DPDK port is discussed:
https://bugzilla.redhat.com/show_bug.cgi?id=1384562#c20

Version-Release number of selected component (if applicable):
RHOS 10
Product version: 10
Product core version: 10
Product core build: 2016-10-21.3

How reproducible:
Deploy an overcloud with two OVS-DPDK ports.
Use the following templates:
network-environment.yaml - http://pastebin.test.redhat.com/425321
controller.yaml - http://pastebin.test.redhat.com/425323
compute.yaml - http://pastebin.test.redhat.com/425324
first-boot.yaml - http://pastebin.test.redhat.com/425325
post-install.yaml - http://pastebin.test.redhat.com/425326
overcloud deploy command - http://pastebin.test.redhat.com/425327

Actual results:
During the deployment, the compute node loses its IP addresses during the PostDeploySteps.

Expected results:
The deployment with two OVS-DPDK ports should succeed.

Additional info:
When trying to restart the network service, the following error occurs:
http://pastebin.test.redhat.com/425330

Comment 2 Saravanan KR 2016-11-03 05:47:33 UTC
I have modified the overcloud deploy command to change the bridge mapping; the diff follows.

--- overcloud_deploy.orig.sh	2016-11-03 07:37:38.817539412 +0200
+++ overcloud_deploy.sh	2016-11-03 07:37:08.368675684 +0200
@@ -13,8 +13,8 @@ openstack overcloud deploy --debug \
   --ntp-server clock.redhat.com \
   --neutron-network-type vlan \
   --neutron-disable-tunneling \
-  --neutron-bridge-mappings datacentre:br-isolated,dpdk0:br-link0,dpdk1:br-link1 \
-  --neutron-network-vlan-ranges datacentre:399:399,dpdk0:423:423,dpdk1:424:424 \
+  --neutron-bridge-mappings dpdk0:br-link0,dpdk1:br-link1 \
+  --neutron-network-vlan-ranges dpdk0:423:423,dpdk1:424:424 \
   --control-scale 1 \
   --control-flavor baremetal \
   --compute-scale 1 \


Basically, I removed the bridge mapping for the tenant network, after which the deployment is successful. In the single-NIC case, datacentre is mapped to the external network (br-ex), whereas in the double-NIC case br-ex is not present due to NIC unavailability, and datacentre is mapped to br-isolated. So the number of DPDK NICs makes no difference.

But it looks like there is an issue if the tenant network bridge (br-isolated), which is a regular OVS bridge, is provided in the bridge mapping. When the bridge mapping is applied and openvswitch is restarted, all the NICs lose their IP addresses.

This has to be investigated further by OVS/neutron SMEs. Maxim, I will leave it to you to verify the change and to decide whether we should continue with this bug or open a new one for further investigation.
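
For reference, the same working mapping can also be expressed as heat parameters instead of deploy-command flags. This is only a sketch assuming the standard TripleO parameter names (NeutronBridgeMappings, NeutronNetworkVLANRanges) are used; the actual network-environment.yaml in this setup may differ:

parameter_defaults:
  # Only the DPDK bridges are listed; the non-DPDK tenant bridge
  # (br-isolated) is intentionally left out of the mapping.
  NeutronBridgeMappings: 'dpdk0:br-link0,dpdk1:br-link1'
  NeutronNetworkVLANRanges: 'dpdk0:423:423,dpdk1:424:424'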

Comment 3 Maxim Babushkin 2016-11-03 11:17:10 UTC
I just tried to deploy an environment without any DPDK config at all, where the datacentre physnet was mapped to the br-isolated bridge.
The deployment succeeded.

Comment 4 Maxim Babushkin 2016-11-04 14:19:36 UTC
As previously discussed with Saravanan, I checked the following.
I deployed the previously working scenario - a single DPDK port deployment.
In the overcloud deploy command, I changed the neutron bridge mappings to the same value as shown in Saravanan's comment #2:

--neutron-bridge-mappings datacentre:br-isolated,dpdk:br-link
instead of the regular:
--neutron-bridge-mappings datacentre:br-ex,dpdk:br-link

The result was the same as with the two-port deployment.
During the PostDeploySteps, once openvswitch is restarted, the compute node NICs lose their IP addresses.

It looks like the issue occurs once the DPDK template bridge mappings are used.

Comment 5 Saravanan KR 2016-11-07 05:19:58 UTC
(In reply to Maxim Babushkin from comment #4)
> Looks like the issue is happening once we are using dpdk templates mapping.
To be more specific:
Mapping contains DPDK bridges only - No issue
Mapping contains a DPDK bridge and a non-existent bridge - No issue
Mapping contains a DPDK bridge and a non-DPDK bridge - Issue
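
For clarity, the failing combination (the third case) is the one that includes the regular kernel OVS bridge; a sketch of it, again assuming the standard NeutronBridgeMappings parameter is used:

parameter_defaults:
  # DPDK bridges plus the regular (kernel) OVS tenant bridge br-isolated;
  # this is the combination that makes the compute node lose its IPs.
  NeutronBridgeMappings: 'datacentre:br-isolated,dpdk0:br-link0,dpdk1:br-link1'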

Comment 6 Franck Baudin 2016-11-07 08:04:56 UTC
See Slide 34 of:
https://docs.google.com/presentation/d/1FsR7dfydSfYE7l_01nDVnlOAuo7vB2XZYZF1y62M4Ck/edit#slide=id.g132a4086ba_69_0

The compute node management/infrastructure interfaces are not connected to OVS, but are regular Linux interfaces (kernel bridge/bond). By design, when using OVS-DPDK, all bridges have to be OVS-DPDK, meaning that any interface connected to OVS-DPDK cannot be shared with the kernel. Another way to understand this is that when using OVS-DPDK, we can (should?) unload the OVS kernel module.
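
To illustrate this constraint, here is a minimal, hypothetical compute nic-config fragment: the management/infrastructure interface stays as a plain kernel interface, while the data-plane bridge is a userspace (DPDK) bridge. The interface and bridge names (nic2, nic3, br-link0) are placeholders and the surrounding heat resource is omitted:

network_config:
  # Management/infrastructure traffic stays on a regular kernel interface,
  # outside of OVS-DPDK.
  - type: interface
    name: nic2
    use_dhcp: false
    # addresses / vlans omitted for brevity
  # Data-plane bridge: userspace OVS with a DPDK port; the physical NIC
  # bound to dpdk0 is owned by DPDK and no longer visible to the kernel.
  - type: ovs_user_bridge
    name: br-link0
    use_dhcp: false
    members:
      - type: ovs_dpdk_port
        name: dpdk0
        members:
          - type: interface
            name: nic3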

Comment 7 Saravanan KR 2016-11-08 14:54:55 UTC
We tried to deploy in Maxim's environment with a Linux Bridge for br-isolated, but the deployment itself fails. Maxim's environment runs everything over one interface (single-NIC network isolation). As this is not validated with Linux Bridge, I tried the same in another environment.

The new environment has 1 provisioning interface + 1 Linux Bridge + 1 OVS DPDK bridge. With this environment, the deployment is successful, but VM creation on the tenant network fails with a port binding error. We got several error messages like the one below in the neutron openvswitch agent log on the compute node.

2016-11-08 13:28:09.073 40119 ERROR neutron.agent.ovsdb.impl_idl TimeoutException: Commands [SetControllerCommand(bridge=br-int, targets=['tcp:127.0.0.1:6633'])] exceeded timeout 10 seconds post-commit


Two issues identified from this BZ:
1) Deployment fails in Maxim's env when DPDK and non-DPDK bridges are added to the bridge mapping (all openvswitch bridges)
2) VM creation fails on the tenant network with a Linux Bridge for network isolation and an OVS bridge for DPDK
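
For reference, the network-isolation side of this mixed layout might look like the following linux_bridge fragment in the compute nic-config (a sketch only, names are placeholders), alongside an ovs_user_bridge for DPDK as sketched in comment 6 above:

network_config:
  # Network isolation carried over a kernel Linux bridge instead of an
  # OVS bridge; VLANs and addresses omitted for brevity.
  - type: linux_bridge
    name: br-isolated
    use_dhcp: false
    members:
      - type: interface
        name: nic2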

Comment 8 Saravanan KR 2016-11-08 14:57:50 UTC
Created attachment 1218575 [details]
compute-node-openvswitch-agent-vm-create-fail

Comment 9 Saravanan KR 2016-11-08 15:03:40 UTC
Comment on attachment 1218575 [details]
compute-node-openvswitch-agent-vm-create-fail

The new environment has 1 provisioning interface + 1 Linux Bridge + 1 OVS DPDK bridge. With this environment, the deployment is successful, but VM creation on the tenant network fails with a port binding error. We got several error messages like the one below in the neutron openvswitch agent log on the compute node.

2016-11-08 13:28:09.073 40119 ERROR neutron.agent.ovsdb.impl_idl TimeoutException: Commands [SetControllerCommand(bridge=br-int, targets=['tcp:127.0.0.1:6633'])] exceeded timeout 10 seconds post-commit

Comment 10 Franck Baudin 2016-11-09 15:45:45 UTC
Saravanan, can you check that ovs-vswitchd is launched?

2016-11-08 13:28:09.073 40119 ERROR neutron.agent.ovsdb.impl_idl TimeoutException: Commands [SetControllerCommand(bridge=br-int, targets=['tcp:127.0.0.1:6633'])] exceeded timeout 10 seconds post-commit

=> to me it seems that it hangs/crashes

Comment 11 Saravanan KR 2016-11-09 16:13:25 UTC
(In reply to Franck Baudin from comment #10)
> Saravanan, can you check that ovs-vswitchd is launched?
> 
> 2016-11-08 13:28:09.073 40119 ERROR neutron.agent.ovsdb.impl_idl
> TimeoutException: Commands [SetControllerCommand(bridge=br-int,
> targets=['tcp:127.0.0.1:6633'])] exceeded timeout 10 seconds post-commit
> 
> => to me it seems that it hangs/crash

The ovs-vswitchd process is running and all the pid files are present, but there are error messages in the OVS log. I will attach the complete logs.

2016-11-09T12:10:20.995Z|00001|ofproto_dpif_upcall(handler1)|INFO|received packet on unassociated datapath port 0
2016-11-09T12:10:20.996Z|00016|bridge|INFO|bridge br-isol: added interface br-isol on port 65534
2016-11-09T12:10:20.997Z|00017|bridge|INFO|bridge br-isol: using datapath ID 0000ea2186ad8b42
2016-11-09T12:10:20.997Z|00018|connmgr|INFO|br-isol: added service controller "punix:/var/run/openvswitch/br-isol.mgmt"
2016-11-09T12:10:21.076Z|00019|dpif|WARN|system@ovs-system: failed to add br-isol as port: File exists
2016-11-09T12:10:21.078Z|00020|bridge|INFO|bridge br-isol: added interface br-isol on port 65534

Comment 12 Franck Baudin 2016-11-09 16:49:01 UTC
2016-11-09T12:10:21.076Z|00019|dpif|WARN|system@ovs-system: failed to add
br-isol as port: File exists

So we try to create a port that is already there. When moving a node from a regular OVS role to an OVS-DPDK role, the OVSDB has to be reset first, as we re-create all ports/bridges.

Comment 19 Yariv 2017-03-19 15:07:35 UTC
It is verified and closed.
However, mixing port types in OVS causes performance degradation and is not recommended; see the NFV configuration guide.

