Description of problem:
After deploying a telco OVN-DPDK + SR-IOV deployment with DVR disabled, it appears that the guest instances are unable to download metadata from the nova metadata API service.

All networking-ovn-metadata-agent agents are running:
+--------------------------------------+------------------------------+-----------------------------------+-------------------+-------+-------+-------------------------------+
| ID                                   | Agent Type                   | Host                              | Availability Zone | Alive | State | Binary                        |
+--------------------------------------+------------------------------+-----------------------------------+-------------------+-------+-------+-------------------------------+
| 71723433-d4a9-4ee9-9d9c-3e846544e603 | NIC Switch agent             | computeovndpdksriov-1.localdomain | None              | :-)   | UP    | neutron-sriov-nic-agent       |
| e9f36491-bca9-4c3e-8be6-74793777fa29 | NIC Switch agent             | computeovndpdksriov-0.localdomain | None              | :-)   | UP    | neutron-sriov-nic-agent       |
| a15f7f3e-f8c2-408a-a9dd-68570f9797e4 | OVN Controller agent         | computeovndpdksriov-1.localdomain |                   | :-)   | UP    | ovn-controller                |
| 9e1a5e19-00fb-4229-b002-7832f90ababf | OVN Metadata agent           | computeovndpdksriov-1.localdomain |                   | :-)   | UP    | networking-ovn-metadata-agent |
| 6e4b94b6-70ce-44cf-ac0c-0869f9e146ac | OVN Controller agent         | computeovndpdksriov-0.localdomain |                   | :-)   | UP    | ovn-controller                |
| f252611c-f8c0-424e-adbc-916a7afd4fad | OVN Metadata agent           | computeovndpdksriov-0.localdomain |                   | :-)   | UP    | networking-ovn-metadata-agent |
| f4d136e4-cc7d-4941-9e6d-661c59808a81 | OVN Controller Gateway agent | controller-1.localdomain          |                   | :-)   | UP    | ovn-controller                |
| e6a989da-ed22-4246-949c-138252f33db7 | OVN Metadata agent           | controller-1.localdomain          |                   | :-)   | UP    | networking-ovn-metadata-agent |
| 54a453c8-0674-4b72-978a-eb9740d1a9c7 | OVN Controller Gateway agent | controller-0.localdomain          |                   | :-)   | UP    | ovn-controller                |
| 59417f27-d944-4220-8248-6b40862660d5 | OVN Metadata agent           | controller-0.localdomain          |                   | :-)   | UP    | networking-ovn-metadata-agent |
| d8eec782-46db-40ca-9646-ac3876543254 | OVN Controller Gateway agent | controller-2.localdomain          |                   | :-)   | UP    | ovn-controller                |
| 2d5695fd-7ed6-4041-80e2-c68ec4916e54 | OVN Metadata agent           | controller-2.localdomain          |                   | :-)   | UP    | networking-ovn-metadata-agent |
+--------------------------------------+------------------------------+-----------------------------------+-------------------+-------+-------+-------------------------------+

Inside the guest, the routing appears to be correct:
[root@localhost ~]# ip r
default via 20.10.114.254 dev eth0 proto dhcp metric 100
20.10.114.0/24 dev eth0 proto kernel scope link src 20.10.114.199 metric 100
169.254.169.254 via 20.10.114.100 dev eth0 proto dhcp metric 100

And there is network connectivity to the metadata API service:
[root@localhost ~]# ping -c 1 169.254.169.254
PING 169.254.169.254 (169.254.169.254) 56(84) bytes of data.
64 bytes from 169.254.169.254: icmp_seq=1 ttl=64 time=0.845 ms

--- 169.254.169.254 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.845/0.845/0.845/0.000 ms

But we're unable to download data from the service:
[root@localhost ~]# curl http://169.254.169.254
curl: (7) Failed connect to 169.254.169.254:80; Connection timed out

On the hypervisor we can see the created namespace:
[root@computeovndpdksriov-1 ~]# ip netns
ovnmeta-5737c80f-2b2f-4614-b6b2-94b494001256 (id: 0)

We can view the routes on the tap interface inside the namespace:
[root@computeovndpdksriov-1 ~]# ip netns exec ovnmeta-5737c80f-2b2f-4614-b6b2-94b494001256 ip r
20.10.114.0/24 dev tap5737c80f-21 proto kernel scope link src 20.10.114.100
169.254.0.0/16 dev tap5737c80f-21 proto kernel scope link src 169.254.169.254

The required tap interface is attached to br-int (dev tap5737c80f-20):
[root@computeovndpdksriov-1 ~]# ovs-vsctl show | less
    Bridge br-int
        fail_mode: secure
        datapath_type: netdev
        Port br-int
            Interface br-int
                type: internal
        Port tap5737c80f-20
            Interface tap5737c80f-20
        Port vhu17bd1131-4a
            Interface vhu17bd1131-4a
                type: dpdkvhostuserclient
                options: {vhost-server-path="/var/lib/vhost_sockets/vhu17bd1131-4a"}
        Port ovn-54a453-0
            Interface ovn-54a453-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.10.111.180"}
                bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up, state=up}
        Port ovn-d8eec7-0
            Interface ovn-d8eec7-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.10.111.138"}
                bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up, state=up}
        Port ovn-f4d136-0
            Interface ovn-f4d136-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.10.111.194"}
                bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up, state=up}
        Port patch-br-int-to-provnet-0c21d1e2-0d15-4467-bbcf-8d7d9c13549f
            Interface patch-br-int-to-provnet-0c21d1e2-0d15-4467-bbcf-8d7d9c13549f
                type: patch
                options: {peer=patch-provnet-0c21d1e2-0d15-4467-bbcf-8d7d9c13549f-to-br-int}
        Port ovn-6e4b94-0
            Interface ovn-6e4b94-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.10.111.197"}

There don't appear to be any errors in /var/log/containers/neutron/ovn-metadata-agent.log, and I'm unsure how to proceed with further debugging of this issue.

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20201021.n.0

How reproducible:
Always

Steps to Reproduce:
1. Deploy OVN-DPDK + SR-IOV setup with DVR disabled
2. Spawn guest instances
3. Attempt to access nova metadata service via network

Actual results:
Nova metadata service is unreachable

Expected results:
Nova metadata service is reachable

Additional info:
Will attach deployment templates and SOS report
Forgot to mention that our neutron networks are set with an MTU of 9000.
Together with Eran we've tried several things with no success:
1) Set ovn_emit_need_to_frag=True in the ml2 configuration.
2) Attempted to update the kernel of the overcloud nodes based on BZ#1854084.
3) Lowered the MTU of the neutron networks to 1500.
Based on https://bugzilla.redhat.com/show_bug.cgi?id=1876459, the bug is fixed in OVS and this limitation will disappear as soon as the fix is backported to OVS 2.13, right?
Hi Franck: BZ#1876459 fixes connection tracking (ct) for OVS-DPDK. The problem in this BZ is not ct but the checksum between a kernel port and a DPDK port. Here, the TCP traffic from the kernel namespace (OVN metadata) to OVS-DPDK is being dropped because the checksum is calculated incorrectly. This is fixed with an iptables mangle rule. We don't use ct for the metadata traffic. Regards.
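For readers who want to see what such a mangle rule could look like, here is a minimal Python sketch. The exact rule shipped in the advisory is not quoted in this report, so the chain (POSTROUTING), the match (-p tcp -m tcp) and the helper names checksum_fill_command and apply_checksum_fill are assumptions made for illustration. The CHECKSUM target with --checksum-fill is a standard iptables extension for the mangle table. By default the sketch only builds and prints the command; actually applying it requires root on the hypervisor.

```python
import subprocess
from typing import List


def checksum_fill_command(namespace: str) -> List[str]:
    """Build the argv for a CHECKSUM --checksum-fill rule inside `namespace`.

    Assumption: the rule goes in the mangle table's POSTROUTING chain and
    matches all TCP traffic leaving the metadata namespace.
    """
    return [
        "ip", "netns", "exec", namespace,
        "iptables", "-t", "mangle", "-A", "POSTROUTING",
        "-p", "tcp", "-m", "tcp",
        "-j", "CHECKSUM", "--checksum-fill",
    ]


def apply_checksum_fill(namespace: str, dry_run: bool = True) -> List[str]:
    """Return the command; execute it only when dry_run is False (needs root)."""
    cmd = checksum_fill_command(namespace)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Namespace name taken from the `ip netns` output in the description.
    ns = "ovnmeta-5737c80f-2b2f-4614-b6b2-94b494001256"
    print(" ".join(apply_checksum_fill(ns)))
```

Run as-is, the script prints the command for manual inspection. Calling apply_checksum_fill(ns, dry_run=False) as root would add the rule, after which repeating the curl test from the guest would show whether the checksum was the cause.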
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0817