Bug 1566544
Summary:          [Deployment] Flat network doesn't work with DPDK in ODL
Product:          Red Hat OpenStack
Component:        opendaylight
Version:          12.0 (Pike)
Hardware:         x86_64
OS:               Linux
Status:           CLOSED INSUFFICIENT_DATA
Severity:         high
Priority:         medium
Target Milestone: z1
Target Release:   13.0 (Queens)
Reporter:         jianzzha
Assignee:         Victor Pickard <vpickard>
QA Contact:       Itzik Brown <itbrown>
CC:               aadam, itbrown, jianzzha, mkolesni, nyechiel, trozet, vpickard
Keywords:         Triaged, ZStream
Whiteboard:       odl_deployment
Type:             Bug
Last Closed:      2018-05-30 11:43:34 UTC
Attachments:      odl-ovsdb-dump, karaf.log
Description
jianzzha, 2018-04-12 13:42:47 UTC
Itzik,

Can you please check if this works without DPDK? Can you please provide the following output:

    openstack port show nfv1-port
    "ip a", from the VM console attached to nfv1-port

I have booted 2 VMs in a local setup on a flat network (without DPDK), and what I am seeing is that the VM IP does not match the IP on the nfvx-port. As a result, the VMs could not ping each other. When I manually changed the vmx IP to match the nfvx-port IP, the VMs can ping. I'd like to confirm whether this is happening in your setup also.

Thanks,
Vic

Vic, I have to run some OSP13 tests this week and next for the summit before I can get the info back to you, sorry about that.

This appears to be either a networking-odl issue or a neutron issue. From the neutron logs, I see the port fixed_ips is 10.0.0.221. However, the actual VM IP is 10.0.0.37. networking-odl tells ODL that the IP is 10.0.0.221, so rules are installed with this IP, which doesn't match the VM IP. So, ODL is being misinformed of the VM port IP address.

Mike,
Can you take a look to see if this is a networking-odl or neutron issue?
Neutron log
===========

2018-04-30 19:44:55.819 25 DEBUG networking_odl.trunk.trunk_driver_v2 [req-80aa3efa-648b-47fe-bdde-c9c4dc6a3da7 4e0dd2e359324f598722bf153ecfc2d5 38c64e22945f40e4a239ffe7f4078fca - default default] networking_odl.trunk.trunk_driver_v2.OpenDaylightTrunkHandlerV2 method trunk_subports_update_status called with arguments ('port', 'after_update', <neutron.plugins.ml2.plugin.Ml2Plugin object at 0x7f827a01aa50>) {'original_port': {'status': u'DOWN', 'binding:host_id': u'', 'description': u'', 'allowed_address_pairs': [], 'tags': [], 'extra_dhcp_opts': [], 'updated_at': '2018-04-30T18:44:54Z', 'device_owner': u'', 'revision_number': 6, 'port_security_enabled': True, 'binding:profile': {}, 'fixed_ips': [{'subnet_id': u'e40f30e2-aae9-4fd0-a6e5-f9b681c7fa92', 'ip_address': u'10.0.0.221'}], 'id': u'43455843-b7d2-42b5-b10a-fcbf5c913f2e', 'security_groups': [u'76f34b19-d174-4bb2-aa70-582ab75d896d'], 'device_id': u'22969342-5baa-41bb-9a8e-ab01745d9c4d', 'name': u'', 'admin_state_up': True, 'network_id': u'9a53a3c1-63d6-465c-b173-e5750329ed82', 'tenant_id': u'3e8ebfe51fbe472eb187a12f28e2309d', 'binding:vif_details': {}, 'binding:vnic_type': u'normal', 'binding:vif_type': u'unbound', 'mac_address': u'fa:16:3e:8a:72:96', 'project_id': u'3e8ebfe51fbe472eb187a12f28e2309d', 'created_at': '2018-04-30T18:44:54Z'}, 'port': {'allowed_address_pairs': [], 'extra_dhcp_opts': [], 'updated_at': '2018-04-30T18:44:55Z', 'device_owner': u'compute:nova', 'revision_number': 7, 'binding:profile': {}, 'port_security_enabled': True, 'fixed_ips': [{'subnet_id': u'e40f30e2-aae9-4fd0-a6e5-f9b681c7fa92', 'ip_address': u'10.0.0.221'}], 'id': u'43455843-b7d2-42b5-b10a-fcbf5c913f2e', 'security_groups': [u'76f34b19-d174-4bb2-aa70-582ab75d896d'], 'binding:vif_details': {}, 'binding:vif_type': 'unbound', 'mac_address': u'fa:16:3e:8a:72:96', 'project_id': u'3e8ebfe51fbe472eb187a12f28e2309d', 'status': 'DOWN', 'binding:host_id': u'compute-1.localdomain', 'description': u'', 'tags': [], 'device_id': u'22969342-5baa-41bb-9a8e-ab01745d9c4d', 'name': u'', 'admin_state_up': True, 'network_id': u'9a53a3c1-63d6-465c-b173-e5750329ed82', 'tenant_id': u'3e8ebfe51fbe472eb187a12f28e2309d', 'created_at': '2018-04-30T18:44:54Z', 'binding:vnic_type': u'normal'}, 'context': <neutron_lib.context.Context object at 0x7f82780b2210>, 'mac_address_updated': False} wrapper /usr/lib/python2.7/site-packages/oslo_log/helpers.py:66

VM IP
=====

[root@compute-1 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 4     instance-00000017              running

[root@compute-1 ~]# virsh console 4
Connected to domain instance-00000017
Escape character is ^]

login as 'cirros' user. default password: 'cubswin:)'. use 'sudo' for root.
cirros login: cirros
Password:
Login incorrect
cirros login: cirros
Password:
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether fa:16:3e:8a:72:96 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.37/24 brd 10.0.0.255 scope global eth0
    inet6 fe80::f816:3eff:fe8a:7296/64 scope link
       valid_lft forever preferred_lft forever

(overcloud) [stack@undercloud-0 ~]$ openstack server show vm2
+-------------------------------------+----------------------------------------------------------+
| Field                               | Value                                                    |
+-------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                   |
| OS-EXT-AZ:availability_zone         | nova                                                     |
| OS-EXT-SRV-ATTR:host                | compute-1.localdomain                                    |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1.localdomain                                    |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000017                                        |
| OS-EXT-STS:power_state              | Running                                                  |
| OS-EXT-STS:task_state               | None                                                     |
| OS-EXT-STS:vm_state                 | active                                                   |
| OS-SRV-USG:launched_at              | 2018-04-30T18:44:58.000000                               |
| OS-SRV-USG:terminated_at            | None                                                     |
| accessIPv4                          |                                                          |
| accessIPv6                          |                                                          |
| addresses                           | nova=10.0.0.221                                          |
| config_drive                        |                                                          |
| created                             | 2018-04-30T18:44:51Z                                     |
| flavor                              | rhel (200)                                               |
| hostId                              | 06f0f1b554d0a92f137f50bc1cb808f53ff295166fb71dfeb7149507 |
| id                                  | 22969342-5baa-41bb-9a8e-ab01745d9c4d                     |
| image                               | cirros (f3b8fddd-7b1f-4c77-9033-f60674602cb3)            |
| key_name                            | admin_key                                                |
| name                                | vm2                                                      |
| progress                            | 0                                                        |
| project_id                          | 3e8ebfe51fbe472eb187a12f28e2309d                         |
| properties                          |                                                          |
| security_groups                     | name='goPacketGo'                                        |
| status                              | ACTIVE                                                   |
| updated                             | 2018-04-30T18:44:58Z                                     |
| user_id                             | 8ba6b1ec44a541fd9fb33259cf5cf628                         |
| volumes_attached                    |                                                          |
+-------------------------------------+----------------------------------------------------------+

Note: This issue was observed without DPDK.

(In reply to Victor Pickard from comment #6)
> This appears to be either a networking-odl issue or neutron issue.

I think there's a misunderstanding here. First, networking-odl is just a pipe, so it most certainly doesn't determine IPs. Second, neutron itself might determine the IP if the subnet in question is marked as a DHCP-enabled [1] subnet. The question is, who gave the IP to the VM? Also, since I see trunk ports are somehow involved, this may or may not be related.

[1] https://developer.openstack.org/api-ref/network/v2/#list-subnets

> From the neutron logs, I see the port fixed_ips is 10.0.0.221.
> However, the actual VM IP is 10.0.0.37.
> Networking odl tells ODL that the IP is 10.0.0.221, so rules are installed
> with this IP, which doesn't match the VM IP. So, ODL is being mis-informed
> of the VM port IP address.
> Mike,
> Can you take a look to see if this is networking-odl or neutron issue?

It's impossible to do anything without any logs attached.

I had a chat with amuller on #neutron IRC this morning. The summary is as follows:

1. When DHCP is disabled on the subnet, the VM gets an IP by some other means (external DHCP server, static IP).
Neutron has no mechanism to determine the IP that is actually assigned to the VM.

2. Neutron still informs ml2 drivers of the IP that neutron thinks the VM has been assigned. It is up to the ml2 drivers to look at the DHCP attribute on the subnet to decide what to do. In this case, DHCP is disabled on the subnet, so the IP that neutron hands out is likely not the IP that is assigned to the VM.

3. The suggestion was to consider the above, and configure security groups and anti-spoofing rules so that traffic is allowed for this port. Perhaps by disabling port security on this port?

This would also likely be an issue for ml2/ovs. Has that been tested with this configuration? Given the above, I don't see how this use case ever worked in the past. Is this a new test case? Also, this has nothing to do with DPDK.

Going forward, can you do the following:

1. Enable DHCP on this subnet, and see if traffic flows.
2. With DHCP disabled on the subnet, disable port-security for the port, and see if traffic flows.

I'd also like to ask Jianzhu if this exact same scenario works with ML2/OVS?

I've looked at this some more. Thanks to Andre for pointing me to this page, which says to disable port security, as suggested in my previous update:

https://access.redhat.com/solutions/2428301

The real issue here is using an external DHCP server instead of the neutron DHCP server. In order for that to work, security groups and port security have to be disabled on the port, as shown in the article above.
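The decision described in points 1-2 above can be sketched as follows. This is an illustrative sketch only, not actual neutron or networking-odl code; the function name and dict shape are assumptions for the example:

```python
# Illustrative sketch (not real neutron/ml2 driver code): when enable_dhcp
# is False on a subnet, the fixed_ips value neutron reports for a port is
# just an allocation from the pool -- the VM may actually use a different
# address (static, or from an external DHCP server).

def trust_neutron_fixed_ip(subnet):
    """Return True only when neutron itself hands out the address via DHCP."""
    return bool(subnet.get("enable_dhcp"))

# The subnet from the logs above has DHCP disabled, so the reported
# 10.0.0.221 cannot be assumed to match the VM's real IP (10.0.0.37 here).
subnet = {"id": "e40f30e2-aae9-4fd0-a6e5-f9b681c7fa92", "enable_dhcp": False}
if not trust_neutron_fixed_ip(subnet):
    print("fixed_ips unreliable: disable port security on this port")
```

This mirrors why the suggested workaround is to disable port security rather than to fix the reported IP: with DHCP disabled, no driver can know the VM's real address.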
For reference, here are the commands:

[stack@rh-director ~]$ neutron port-update --no-security-groups fb2d64f5-3ef3-4a86-9650-c8695d42a82e
Updated port: fb2d64f5-3ef3-4a86-9650-c8695d42a82e
[stack@rh-director ~]$ neutron port-update fb2d64f5-3ef3-4a86-9650-c8695d42a82e --port-security-enabled=False
Updated port: fb2d64f5-3ef3-4a86-9650-c8695d42a82e

You may need to enable the port_security extension driver to be able to use this feature (/etc/neutron/plugins/ml2/ml2_conf.ini). For example:

[ml2]
extension_drivers = port_security

It seems this may have been addressed for ironic, with the following patch:

https://review.openstack.org/#/c/112351/

And the corresponding blueprint:

https://blueprints.launchpad.net/ironic/+spec/support-external-dhcp

Jianzhu,
When you get a chance, please retest with port-security and security groups disabled on this port. If using ironic, it would be good if you could also try the approach described in the blueprint. I'll also ask the upstream community for input, to see if there are any other options at this time.

(In reply to Mike Kolesnik from comment #10)
> I'd also like to ask Jianzhu if this exact same scenario works with ML2/OVS?

I think the discussion and reproduction has deviated from the issue I reported. In the test I have 3 provider networks. The first one is on a vlan provider network; this port is for ssh access to the guest, its subnet has DHCP enabled and port security disabled, and this port works: I can ssh into the guest. The other two ports are on flat networks and carry data traffic through the guest; these two ports have DHCP disabled and port security disabled. The traffic didn't flow through the guest via these two data ports. From within the guest, I didn't see the data traffic arrive on the guest NIC. So for some reason the data packets were dropped by the flow table. We had the exact same test with plain ml2/ovs (no ODL) and that works fine. Also, if I replace the flat networks with a vlan provider network, that works.
If I run wireshark, I can see the data traffic hit the compute node port, but it didn't get to the guest. What kind of log do you want to see?

Here is the flow table when the data ports use the flat provider network:

https://gist.github.com/jianzzha/96e6aa392f21f4a6524c758c0abf6918

In this table I didn't see any entry which points to output:1 or output:2 (these two ports are the data ports); there is only output:3, which is the access port. I will set up a vlan provider network and compare the flow tables.

Jianzhu,
Thanks for the update. After reading your reply, I have to agree my attempt to reproduce this issue locally was flawed. I see now that you have the first vlan provider network, with DHCP enabled, for ssh access, and this part is working. I got sidetracked when I saw the IP mismatch in my local setup, hence the above discussion. What I realize now is that I somehow missed that you have multiple provider networks, and the issue you are reporting is no data flow on the other provider networks. I'm taking another look at this now (without DPDK).

It would be good to verify that ODL has the correct provider_mappings. Can you provide the following:

1. Output of "ovsdb-client dump" on the compute node
2. curl -s -u admin:admin -X GET http://${CONTROLLER_IP}:8081/restconf/operational/network-topology:network-topology/topology/ovsdb:1 | python -m json.tool
3. /etc/neutron/plugins/ml2/ml2_conf.ini on control and compute
4. /etc/neutron/plugins/ml2/ml2_conf_odl.ini on control

As a comparison, when the data ports use a vlan provider network, the flow table has entries pointing to the data ports (output:1 and output:2).
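The manual flow-table inspection described above (looking for which output:N actions exist) can be sketched as a small script. This is an illustrative sketch, not part of the bug's tooling; the sample flow lines are made up in the style of an "ovs-ofctl dump-flows" dump:

```python
# Illustrative sketch: scan an "ovs-ofctl dump-flows"-style dump for the
# ports its flows actually forward to, as done by eye above to spot that
# the flat-network table never references output:1 or output:2.
import re

def output_ports(flow_dump):
    """Collect the set of port numbers appearing in output:N actions."""
    return {int(m) for m in re.findall(r"output:(\d+)", flow_dump)}

# Hypothetical sample mimicking the flat-network case: only the access
# port (output:3) is ever programmed.
sample = """\
cookie=0x0, table=0, priority=4,in_port=3 actions=output:3
cookie=0x0, table=17, priority=5 actions=goto_table:19
"""
ports = output_ports(sample)
missing = {1, 2} - ports  # data ports expected but not programmed
print(sorted(ports), sorted(missing))  # -> [3] [1, 2]
```

A non-empty `missing` set here corresponds to the symptom reported: traffic for the data ports is dropped because no flow ever outputs to them.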
Here is the vlan provider network flow table:

https://gist.github.com/jianzzha/bec2f0c76c069ff962ca6a6b8fa18334

ovsdb-client dump:

https://gist.github.com/jianzzha/3536fe3562d4e83c41d5038762347433

(In reply to jianzzha from comment #17)
> ovsdb-client dump:
> https://gist.github.com/jianzzha/3536fe3562d4e83c41d5038762347433

Actually this is for the vlan provider network. I will post the flat network info.

controller ml2_conf.ini: https://gist.github.com/jianzzha/f6f12995db38327644d24e7e1a1262c7
compute ml2_conf.ini: https://gist.github.com/jianzzha/296ff1978704321be4ee3a35b29a1ce5
ml2_conf_odl.ini on control: https://gist.github.com/jianzzha/b9e845aea1b9f2f1fb56e4d3a9cae32b
compute ovsdb-client dump: https://gist.github.com/jianzzha/45ff5269ac30f182a3a8ed5317842944

Created attachment 1432808 [details]
odl-ovsdb-dump
odl-ovsdb-dump from the controller. This file is too large for paste
Thanks. Can you also attach karaf logs from ODL?

Created attachment 1432828 [details]
karaf.log
odl karaf.log
OK, I think I have found the issue here.

From the karaf logs, we see the dpdkvhostuserclient interface type warning:

2018-05-07 01:58:25,848 | WARN | n-invoker-impl-0 | OpenVSwitchUpdateCommand | 289 - org.opendaylight.ovsdb.southbound-impl - 1.4.2.Carbon-redhat-3 | Interface type dpdkvhostuserclient not present in model

From the ovs dump, we can see this interface type is indeed defined and available:

iface_types
-----------
[dpdk, dpdkr, dpdkvhostuser, dpdkvhostuserclient,

Code inspection of ovsdb in stable/carbon (upstream) shows that this interface type is not defined there, so the interface type is not set, hence the warning. For reference: OVSDB_INTERFACE_TYPE_MAP in SouthboundConstants.java. The dpdkvhostuserclient interface type is defined in stable/oxygen.

Can you test with a stable/oxygen-based rpm for ODL, and attach the karaf logs from that run if the test still fails?

In general, I think it would be better to be testing with the latest ODL rpm, which should be based on upstream stable/oxygen, instead of an older rpm based on stable/carbon, since the OSP13 distribution will ship an ODL rpm based on the upstream oxygen release. Agree?

(In reply to Victor Pickard from comment #23)
> Can you test with stable/oxygen based rpm for ODL, and attach the karaf logs
> from that run if the test still fails?

The ODL is running from a container; do you know how to replace that with the latest? Also, I don't understand why the vlan provider network works but not the flat one.

Jianzhu,
Based on Vic's analysis, can you please try this scenario with the latest puddle that has ODL version 8 (Oxygen) and see if it reproduces?

I will try version 8 once I'm done with the current OSP13 Mellanox perf evaluation.

Closing for now; please reopen if you see this issue happen with an OSP 13 based ODL (Oxygen, 8.0.0 or later).
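For reference, the root-cause behavior identified in comment #23 can be sketched as follows. This is an illustrative Python sketch of the lookup logic (the real code is Java, in OVSDB_INTERFACE_TYPE_MAP in SouthboundConstants.java); the exact map contents shown here are partial, assumed stand-ins:

```python
# Illustrative sketch of ODL's ovsdb southbound interface-type lookup:
# a type missing from the model map is left unset and only a WARN is
# logged, which is why Carbon silently mishandles dpdkvhostuserclient
# ports while Oxygen (which added the type) handles them.
CARBON_TYPES = {"dpdk", "dpdkr", "dpdkvhostuser", "internal", "patch", "system"}
OXYGEN_TYPES = CARBON_TYPES | {"dpdkvhostuserclient"}

def model_type(iface_type, known_types):
    """Return the modeled interface type, or None (with a warning) if unknown."""
    if iface_type not in known_types:
        print("WARN: Interface type %s not present in model" % iface_type)
        return None
    return iface_type

model_type("dpdkvhostuserclient", CARBON_TYPES)  # warns, type left unset
model_type("dpdkvhostuserclient", OXYGEN_TYPES)  # resolves normally
```

The sketch matches the observed symptom: OVS itself advertises dpdkvhostuserclient in iface_types, but the Carbon-era model cannot represent it, so the southbound plugin drops the type on the floor.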