Bug 1566544

Summary: [Deployment] Flat network doesn't work with DPDK in ODL
Product: Red Hat OpenStack
Component: opendaylight
Version: 12.0 (Pike)
Target Milestone: z1
Target Release: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: medium
Whiteboard: odl_deployment
Keywords: Triaged, ZStream
Reporter: jianzzha
Assignee: Victor Pickard <vpickard>
QA Contact: Itzik Brown <itbrown>
CC: aadam, itbrown, jianzzha, mkolesni, nyechiel, trozet, vpickard
Type: Bug
Last Closed: 2018-05-30 11:43:34 UTC
Attachments:
  odl-ovsdb-dump
  karaf.log

Description jianzzha 2018-04-12 13:42:47 UTC
Description of problem:
This is a follow up on https://bugzilla.redhat.com/show_bug.cgi?id=1557526

The configuration work for the flat provider network is done, and the guest can be started with a port attached to the flat provider network. However, data could not flow through the provider networks.

trafficgen1---flat1---dpdk0--VM--dpdk1---flat2----trafficgen2

Traffic from trafficgen1 failed to reach trafficgen2.

If the flat1 and flat2 networks are replaced with VLAN networks, it works fine.

So it is possible that the flow table rules on the compute node need some adjustment; comparing the installed flows for the two cases should show the difference (see the sketch below).
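
A minimal sketch for that comparison, assuming ODL's integration bridge has the default name br-int and that ovs-ofctl is available on the compute node:

    # dump the OpenFlow 1.3 flow tables that ODL programmed on the integration bridge
    ovs-ofctl -O OpenFlow13 dump-flows br-int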

Version-Release number of selected component (if applicable):


How reproducible:
every time

Steps to Reproduce:
1. see https://bugzilla.redhat.com/show_bug.cgi?id=1557526

Actual results:
trafficgen traffic failed to pass through the flat network

Expected results:
trafficgen traffic should pass through the flat network

Additional info:

Comment 1 Mike Kolesnik 2018-04-15 06:12:37 UTC
Itzik,

Can you please check if this works without DPDK?

Comment 4 Victor Pickard 2018-04-25 21:09:59 UTC
Can you please provide the following output:

openstack port show nfv1-port

"ip a", from VM console attached to nfv1-port

I have booted 2 VMs in a local setup on a flat network (without DPDK), and what I am seeing is that the VM IP does not match the IP on the nfvx-port. As a result, the VMs could not ping each other.

When I manually changed the VM IPs to match the nfvx-port IPs, the VMs can ping.
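
For reference, the manual change amounts to re-addressing the guest NIC from its console; a hypothetical sketch (the addresses are illustrative placeholders, not from this setup):

    # inside the VM: drop the self-assigned address and apply the neutron-allocated one
    sudo ip addr del 10.0.0.37/24 dev eth0
    sudo ip addr add 10.0.0.221/24 dev eth0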

I'd like to confirm if this is happening in your setup also.

Thanks,
Vic

Comment 5 jianzzha 2018-04-26 01:10:28 UTC
Vic, I have to run some OSP13 tests this week and next for the summit before I can get the info back to you; sorry about that.

Comment 6 Victor Pickard 2018-04-30 19:04:35 UTC
This appears to be either a networking-odl issue or a neutron issue.

From the neutron logs, I see the port fixed_ips is 10.0.0.221.

However, the actual VM IP is 10.0.0.37.

Networking-odl tells ODL that the IP is 10.0.0.221, so rules are installed with this IP, which doesn't match the VM IP. So ODL is being misinformed of the VM port's IP address.
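
The mismatch can be seen from the undercloud with a couple of commands; a sketch, using the port ID and VM name from this reproduction:

    # neutron's view of the port
    openstack port show 43455843-b7d2-42b5-b10a-fcbf5c913f2e -c fixed_ips
    # nova's view of the server, to compare with "ip a" inside the guest (below)
    openstack server show vm2 -c addresses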

Mike,
Can you take a look to see if this is a networking-odl or a neutron issue?


Neutron log
============
2018-04-30 19:44:55.819 25 DEBUG networking_odl.trunk.trunk_driver_v2 [req-80aa3efa-648b-47fe-bdde-c9c4dc6a3da7 4e0dd2e359324f598722bf153ecfc2d5 38c64e22945f40e4a239ffe7f4078fca - default default] networking_odl.trunk.trunk_driver_v2.OpenDaylightTrunkHandlerV2 method trunk_subports_update_status called with arguments ('port', 'after_update', <neutron.plugins.ml2.plugin.Ml2Plugin object at 0x7f827a01aa50>) {'original_port': {'status': u'DOWN', 'binding:host_id': u'', 'description': u'', 'allowed_address_pairs': [], 'tags': [], 'extra_dhcp_opts': [], 'updated_at': '2018-04-30T18:44:54Z', 'device_owner': u'', 'revision_number': 6, 'port_security_enabled': True, 'binding:profile': {}, 'fixed_ips': [{'subnet_id': u'e40f30e2-aae9-4fd0-a6e5-f9b681c7fa92', 'ip_address': u'10.0.0.221'}], 'id': u'43455843-b7d2-42b5-b10a-fcbf5c913f2e', 'security_groups': [u'76f34b19-d174-4bb2-aa70-582ab75d896d'], 'device_id': u'22969342-5baa-41bb-9a8e-ab01745d9c4d', 'name': u'', 'admin_state_up': True, 'network_id': u'9a53a3c1-63d6-465c-b173-e5750329ed82', 'tenant_id': u'3e8ebfe51fbe472eb187a12f28e2309d', 'binding:vif_details': {}, 'binding:vnic_type': u'normal', 'binding:vif_type': u'unbound', 'mac_address': u'fa:16:3e:8a:72:96', 'project_id': u'3e8ebfe51fbe472eb187a12f28e2309d', 'created_at': '2018-04-30T18:44:54Z'}, 'port': {'allowed_address_pairs': [], 'extra_dhcp_opts': [], 'updated_at': '2018-04-30T18:44:55Z', 'device_owner': u'compute:nova', 'revision_number': 7, 'binding:profile': {}, 'port_security_enabled': True, 'fixed_ips': [{'subnet_id': u'e40f30e2-aae9-4fd0-a6e5-f9b681c7fa92', 'ip_address': u'10.0.0.221'}], 'id': u'43455843-b7d2-42b5-b10a-fcbf5c913f2e', 'security_groups': [u'76f34b19-d174-4bb2-aa70-582ab75d896d'], 'binding:vif_details': {}, 'binding:vif_type': 'unbound', 'mac_address': u'fa:16:3e:8a:72:96', 'project_id': u'3e8ebfe51fbe472eb187a12f28e2309d', 'status': 'DOWN', 'binding:host_id': u'compute-1.localdomain', 'description': u'', 'tags': [], 'device_id': u'22969342-5baa-41bb-9a8e-ab01745d9c4d', 'name': u'', 'admin_state_up': True, 'network_id': u'9a53a3c1-63d6-465c-b173-e5750329ed82', 'tenant_id': u'3e8ebfe51fbe472eb187a12f28e2309d', 'created_at': '2018-04-30T18:44:54Z', 'binding:vnic_type': u'normal'}, 'context': <neutron_lib.context.Context object at 0x7f82780b2210>, 'mac_address_updated': False} wrapper /usr/lib/python2.7/site-packages/oslo_log/helpers.py:66



VM IP
=====

[root@compute-1 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 4     instance-00000017              running

[root@compute-1 ~]# virsh console 4
Connected to domain instance-00000017
Escape character is ^]

login as 'cirros' user. default password: 'cubswin:)'. use 'sudo' for root.
cirros login: cirros
Password: 
Login incorrect
cirros login: 
cirros login: cirros
Password: 
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether fa:16:3e:8a:72:96 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.37/24 brd 10.0.0.255 scope global eth0
    inet6 fe80::f816:3eff:fe8a:7296/64 scope link 
       valid_lft forever preferred_lft forever




(overcloud) [stack@undercloud-0 ~]$ openstack server show vm2
+-------------------------------------+----------------------------------------------------------+
| Field                               | Value                                                    |
+-------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                   |
| OS-EXT-AZ:availability_zone         | nova                                                     |
| OS-EXT-SRV-ATTR:host                | compute-1.localdomain                                    |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1.localdomain                                    |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000017                                        |
| OS-EXT-STS:power_state              | Running                                                  |
| OS-EXT-STS:task_state               | None                                                     |
| OS-EXT-STS:vm_state                 | active                                                   |
| OS-SRV-USG:launched_at              | 2018-04-30T18:44:58.000000                               |
| OS-SRV-USG:terminated_at            | None                                                     |
| accessIPv4                          |                                                          |
| accessIPv6                          |                                                          |
| addresses                           | nova=10.0.0.221                                          |
| config_drive                        |                                                          |
| created                             | 2018-04-30T18:44:51Z                                     |
| flavor                              | rhel (200)                                               |
| hostId                              | 06f0f1b554d0a92f137f50bc1cb808f53ff295166fb71dfeb7149507 |
| id                                  | 22969342-5baa-41bb-9a8e-ab01745d9c4d                     |
| image                               | cirros (f3b8fddd-7b1f-4c77-9033-f60674602cb3)            |
| key_name                            | admin_key                                                |
| name                                | vm2                                                      |
| progress                            | 0                                                        |
| project_id                          | 3e8ebfe51fbe472eb187a12f28e2309d                         |
| properties                          |                                                          |
| security_groups                     | name='goPacketGo'                                        |
| status                              | ACTIVE                                                   |
| updated                             | 2018-04-30T18:44:58Z                                     |
| user_id                             | 8ba6b1ec44a541fd9fb33259cf5cf628                         |
| volumes_attached                    |                                                          |
+-------------------------------------+----------------------------------------------------------+

Comment 7 Victor Pickard 2018-04-30 20:23:04 UTC
Note: This issue was observed without dpdk.

Comment 8 Mike Kolesnik 2018-05-01 05:12:35 UTC
(In reply to Victor Pickard from comment #6)
> This appears to be either a networking-odl issue or a neutron issue.

I think there's a misunderstanding here.
First, networking-odl is just a pipe, so it most certainly doesn't determine IPs.
Second, neutron itself might determine the IP if the subnet in question is marked as a DHCP enabled [1] subnet.

The question is, who gave the IP to the VM?

Also, since I see that trunk ports are somehow involved, this may or may not be related.

[1] https://developer.openstack.org/api-ref/network/v2/#list-subnets
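
Whether neutron considers the subnet DHCP-enabled can be checked directly; a sketch, using the subnet ID from the neutron log in comment 6:

    openstack subnet show e40f30e2-aae9-4fd0-a6e5-f9b681c7fa92 -c enable_dhcp
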
> 
> From the neutron logs, I see the port fixed_ips is 10.0.0.221.
> 
> However, the actual VM IP is 10.0.0.37.
> 
> Networking-odl tells ODL that the IP is 10.0.0.221, so rules are installed
> with this IP, which doesn't match the VM IP. So ODL is being misinformed
> of the VM port's IP address.
> 
> Mike,
> Can you take a look to see if this is a networking-odl or a neutron issue?
> 

It's impossible to do anything without any logs attached.

Comment 9 Victor Pickard 2018-05-01 14:48:19 UTC
I had a chat with amuller on #neutron IRC this morning. The summary is as follows:

1. When DHCP is disabled on the subnet, the VM gets an IP by some other means (external DHCP server, static IP). Neutron has no mechanism to determine the IP that is actually assigned to the VM.

2. Neutron still informs ML2 drivers of the IP that it thinks the VM has been assigned. It is up to the ML2 drivers to look at the DHCP attribute on the subnet to decide what to do. In this case, DHCP is disabled on the subnet, so the IP that neutron hands out is likely not the IP that is assigned to the VM.

3. The suggestion was to consider the above, and configure security groups and spoofing rules so that traffic is allowed for this port. Perhaps by disabling port security on this port?

This would also likely be an issue for ml2/ovs. Has that been tested with this configuration?

Given the above, I don't see how this use case ever worked in the past. Is this a new test case? 

Also, this has nothing to do with DPDK.

Going forward, can you do the following:

1. Enable DHCP on this subnet, and see if traffic flows.

2. With DHCP disabled on the subnet, disable port security for the port, and see if traffic flows. (CLI sketches for both checks follow.)
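
As a sketch, the two checks map to commands along these lines (substitute the real subnet and port IDs):

    # 1) let neutron manage addressing on the subnet
    openstack subnet set --dhcp <subnet-id>
    # 2) or, keeping DHCP off, disable security groups and port security on the port
    openstack port set --no-security-group --disable-port-security <port-id>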

Comment 10 Mike Kolesnik 2018-05-02 10:50:08 UTC
I'd also like to ask Jianzhu if this exact same scenario works with ML2/OVS?

Comment 11 Victor Pickard 2018-05-04 16:41:13 UTC
I've looked at this some more. Thanks to Andre for pointing me to this page, which states to disable port security, as suggested in my previous update:

https://access.redhat.com/solutions/2428301

The real issue here is using an external DHCP server instead of the neutron DHCP server. In order for that to work, security groups and port security have to be disabled on the port, as shown in the article above. For reference, here are the commands:

[stack@rh-director ~]$ neutron port-update --no-security-groups  fb2d64f5-3ef3-4a86-9650-c8695d42a82e
Updated port: fb2d64f5-3ef3-4a86-9650-c8695d42a82e

[stack@rh-director ~]$ neutron port-update fb2d64f5-3ef3-4a86-9650-c8695d42a82e --port-security-enabled=False
Updated port: fb2d64f5-3ef3-4a86-9650-c8695d42a82e

You may need to set the port_security extension driver in /etc/neutron/plugins/ml2/ml2_conf.ini to be able to use this feature. For example:

[ml2]
extension_drivers = port_security
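
After editing, it may be worth confirming the value took effect and restarting neutron-server; a sketch, assuming crudini is available on the host (on a containerized deployment, restart the neutron API container instead):

    crudini --get /etc/neutron/plugins/ml2/ml2_conf.ini ml2 extension_drivers
    systemctl restart neutron-server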

Comment 12 Victor Pickard 2018-05-04 16:50:05 UTC
It seems this may have been addressed for ironic, with the following patch:

https://review.openstack.org/#/c/112351/

And the corresponding blueprint:
https://blueprints.launchpad.net/ironic/+spec/support-external-dhcp

Jianzhu,
When you get a chance, please retest with port-security and security groups disabled on this port.

If using ironic, it would be good if you could also try that approach as described in the blueprint. 

I'll also ask the upstream community for input, to see if there are any other options at this time.

Comment 13 jianzzha 2018-05-07 14:38:14 UTC
(In reply to Mike Kolesnik from comment #10)
> I'd also like to ask Jianzhu if this exact same scenario works with ML2/OVS?


I think the discussion and the attempts to reproduce have deviated from the issue I reported.

In the test I have 3 provider networks. The first is a VLAN provider network; this port is for SSH access to the guest, its subnet has DHCP enabled and port security disabled, and it works: I can SSH into the guest. The other two ports are on flat networks and carry data traffic through the guest; these two ports have DHCP disabled and port security disabled. The traffic didn't flow through the guest via these two data ports. From within the guest, I didn't see the data traffic arrive on the guest NIC, so for some reason the data packets were dropped by the flow table.

We ran the exact same test with plain ML2/OVS (no ODL) and that works fine.

Also, if I replace the flat networks with VLAN provider networks, it works.

If I run wireshark, I can see the data traffic hit the compute node port, but it doesn't get to the guest.

What kind of log do you want to see?

Comment 14 jianzzha 2018-05-07 18:43:13 UTC
the flow table when the data ports using provider network:
https://gist.github.com/jianzzha/96e6aa392f21f4a6524c758c0abf6918

In this table I don't see any entry that points to output:1 or output:2 (these two ports are the data ports); there is only output:3, which is the access port.
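
A quick filter to confirm, as a sketch (assuming the flows were dumped from br-int as in the gist):

    # look for any flow that forwards to the data ports
    ovs-ofctl -O OpenFlow13 dump-flows br-int | grep -E 'output:(1|2)'
    # map OpenFlow port numbers back to interface names
    ovs-ofctl -O OpenFlow13 dump-ports-desc br-int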

I will set up a VLAN provider network and compare the flow tables.

Comment 15 Victor Pickard 2018-05-07 18:53:53 UTC
Jianzhu,
Thanks for the update. After reading your reply, I have to agree that my attempt to reproduce this issue locally was flawed.

I see now that you have the first VLAN provider network, with DHCP enabled, for SSH access, and that this part is working. I got sidetracked when I saw the IP mismatch in my local setup, hence the above discussion.

What I realize now is that I somehow missed that you have multiple provider networks, and the issue you are reporting is no data flow on the other provider networks.

I'm taking another look at this now (without dpdk).

It would be good to verify that ODL has the correct provider_mappings. Can you provide the following:

    1. output of "ovsdb-client dump" on the compute node

    2. curl -s -u admin:admin -X GET \
       http://${CONTROLLER_IP}:8081/restconf/operational/network-topology:network-topology/topology/ovsdb:1 \
       | python -m json.tool

    3. /etc/neutron/plugins/ml2/ml2_conf.ini on control and compute
    4. /etc/neutron/plugins/ml2/ml2_conf_odl.ini on control
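
On the compute node, the provider mappings ODL sees can also be checked quickly; a sketch (depending on how the deployment was configured, the mapping may live in other_config or external_ids of the Open_vSwitch table):

    ovs-vsctl get Open_vSwitch . other_config
    ovs-vsctl get Open_vSwitch . external_ids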

Comment 16 jianzzha 2018-05-07 18:54:57 UTC
as a comparison, when the data ports use vlan provider network, the flow table has entry points to the data ports (output:1 and output:2). Here is the vlan provider network flow table:
https://gist.github.com/jianzzha/bec2f0c76c069ff962ca6a6b8fa18334

Comment 17 jianzzha 2018-05-07 18:58:14 UTC
ovsdb-client dump:
https://gist.github.com/jianzzha/3536fe3562d4e83c41d5038762347433

Comment 18 jianzzha 2018-05-07 19:01:36 UTC
(In reply to jianzzha from comment #17)
> ovsdb-client dump:
> https://gist.github.com/jianzzha/3536fe3562d4e83c41d5038762347433

Actually this is for the VLAN provider network. I will post the flat network info.

Comment 20 jianzzha 2018-05-07 20:08:31 UTC
Created attachment 1432808 [details]
odl-ovsdb-dump

odl-ovsdb-dump from the controller. This file is too large to paste.

Comment 21 Victor Pickard 2018-05-07 20:15:35 UTC
Thanks,
Can you also attach karaf logs from ODL?

Comment 22 jianzzha 2018-05-07 20:56:33 UTC
Created attachment 1432828 [details]
karaf.log

odl karaf.log

Comment 23 Victor Pickard 2018-05-08 12:10:13 UTC
OK, I think I have found the issue here.

From the karaf logs, we see the dpdkvhostuserclient interface type warn log:

2018-05-07 01:58:25,848 | WARN  | n-invoker-impl-0 | OpenVSwitchUpdateCommand         | 289 - org.opendaylight.ovsdb.southbound-impl - 1.4.2.Carbon-redhat-3 | Interface type dpdkvhostuserclient not present in model

From the ovs dump, we can see this interface is indeed defined and available:

iface_types 
-----------
[dpdk, dpdkr, dpdkvhostuser, dpdkvhostuserclient,
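
This can also be confirmed directly on the compute node; a sketch:

    # list the interface types the local OVS advertises to managers
    ovs-vsctl get Open_vSwitch . iface_types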

Code inspection of ovsdb in stable/carbon (upstream) shows that this interface type is not defined, so the interface type is not set, hence the warning.

For reference: OVSDB_INTERFACE_TYPE_MAP in SouthboundConstants.java.

The dpdkvhostuserclient interface type is defined in stable/oxygen. 

Can you test with stable/oxygen based rpm for ODL, and attach the karaf logs from that run if the test still fails?

In general, I think it would be better to test with the latest ODL rpm, which should be based on upstream stable/oxygen, instead of an older rpm based on stable/carbon, as the OSP13 distribution will ship an ODL rpm based on the upstream Oxygen release. Agree?

Comment 24 jianzzha 2018-05-08 14:41:02 UTC
(In reply to Victor Pickard from comment #23)
> OK, I think I have found the issue here.
> 
> From the karaf logs, we see the dpdkvhostuserclient interface type warn log:
> 
> 2018-05-07 01:58:25,848 | WARN  | n-invoker-impl-0 |
> OpenVSwitchUpdateCommand         | 289 -
> org.opendaylight.ovsdb.southbound-impl - 1.4.2.Carbon-redhat-3 | Interface
> type dpdkvhostuserclient not present in model
> 
> From the ovs dump, we can see this interface is indeed defined and available:
> 
> iface_types 
> -----------
> [dpdk, dpdkr, dpdkvhostuser, dpdkvhostuserclient,
> 
> Code inspection of ovsdb in stable/carbon (upstream) shows that this
> interface type is not defined, so the interface type is not set, hence the
> warning.
> 
> For reference: OVSDB_INTERFACE_TYPE_MAP in SouthboundConstants.java.
> 
> The dpdkvhostuserclient interface type is defined in stable/oxygen. 
> 
> Can you test with stable/oxygen based rpm for ODL, and attach the karaf logs
> from that run if the test still fails?
> 
> In general, I think it would be better to test with the latest ODL rpm,
> which should be based on upstream stable/oxygen, instead of an older rpm
> based on stable/carbon, as the OSP13 distribution will ship an ODL rpm
> based on the upstream Oxygen release. Agree?

The ODL is running from a container; do you know how to replace that with the latest? Also, I don't understand why the VLAN provider network works but the flat one doesn't.

Comment 27 Mike Kolesnik 2018-05-09 07:54:06 UTC
Jianzhu,

Based on Vic's analysis, can you please try this scenario with the latest puddle that has ODL version 8 (Oxygen) and see if it reproduces?

Comment 30 jianzzha 2018-05-14 13:46:32 UTC
Will try version 8 once I'm done with the current OSP13 Mellanox perf evaluation.

Comment 33 Mike Kolesnik 2018-05-30 11:43:34 UTC
Closing for now, please reopen if you see this issue happen with an OSP 13 based ODL (Oxygen, 8.0.0 or later).