Description of problem:

While scale testing Octavia load balancers, we noticed the compute node dropping ARP requests from the VM to the controller node when the compute node is already hosting many VMs. Neutron sets the port status to ACTIVE after the port-up event notification from OVN. We could log in to the VM (from the metadata namespace) and manually issue the arping request as well.

In Octavia, controller nodes have an o-hm0 interface with an IP address (172.24.1.216) on the lb-mgmt-net network:

[root@controller-0 ~]# ip a s o-hm0
20: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1442 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:f8:dd:30 brd ff:ff:ff:ff:ff:ff
    inet 172.24.1.216/16 brd 172.24.255.255 scope global o-hm0
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fef8:dd30/64 scope link
       valid_lft forever preferred_lft forever

After the VM boots successfully, the Octavia worker service running inside the container sends an HTTP request (to port 9443) to the VM. The VM is unable to respond to that HTTP request because ARP resolution failed.
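For reference, the manual checks described above can be reproduced from inside the amphora and from the controller. A hedged sketch (interface name, addresses, and the URL path are illustrative values taken from or assumed for this environment):

```shell
# Run inside the amphora VM (reachable via the metadata namespace or console).
# 1. Confirm ARP resolution toward the controller's o-hm0 address fails:
sudo arping -c 3 -I eth0 172.24.1.216

# 2. Confirm the amphora agent is listening on 9443, so the failure is
#    purely L2 (ARP resolution), not the service itself:
sudo ss -tlnp | grep 9443

# 3. From the controller, reproduce the worker's request with curl
#    (it will time out while the amphora cannot resolve 172.24.1.216):
curl -k --connect-timeout 10 https://172.24.19.232:9443/
```

These are standard iputils/iproute2/curl invocations; only the addresses are environment-specific.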
TCP dump inside the VM:

[cloud-user@amphora-408fff7b-2f9c-422e-861c-d26dfc72e3c0 ~]$ sudo tcpdump -vvv -n -e -i eth0 port not 22
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:36:53.341123 fa:16:3e:f8:dd:30 > fa:16:3e:66:7a:8b, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 29605, offset 0, flags [DF], proto TCP (6), length 60)
    172.24.1.216.60334 > 172.24.19.232.tungsten-https: Flags [S], cksum 0x32fb (correct), seq 2216125894, win 28040, options [mss 1402,sackOK,TS val 2567677514 ecr 0,nop,wscale 7], length 0
11:36:53.404786 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.1.216 tell 172.24.19.232, length 28
11:36:54.428785 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.1.216 tell 172.24.19.232, length 28
11:36:54.485524 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.2.99 tell 172.24.19.232, length 28
11:36:55.516768 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.2.99 tell 172.24.19.232, length 28
11:36:56.540762 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.2.99 tell 172.24.19.232, length 28
11:37:01.220435 fa:16:3e:f8:dd:30 > fa:16:3e:66:7a:8b, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 55864, offset 0, flags [DF], proto TCP (6), length 60)
    172.24.1.216.33194 > 172.24.19.232.tungsten-https: Flags [S], cksum 0x563e (correct), seq 302186453, win 28040, options [mss 1402,sackOK,TS val 2567685393 ecr 0,nop,wscale 7], length 0
11:37:01.220808 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.1.216 tell 172.24.19.232, length 28
11:37:02.236802 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.1.216 tell 172.24.19.232, length 28

Here 172.24.1.216 is the IP address of the o-hm0 interface on the controller-0 node, and 172.24.19.232 is the amphora VM's IP address.

Note: This deployment is using the core OVN RPMs provided by Dumitru: http://brew-task-repos.usersys.redhat.com/repos/scratch/dceara/ovn2.13/20.12.0/20pvt_dp_groups.el8fdp/
Dumitru logged into the environment and found the root cause of this issue.

Neutron is setting the "unknown" attribute on the controller's logical port:

port dcf3cf5a-3e48-4082-bc7f-20976df0d398 (aka octavia-health-manager-controller-0.redhat.local-listen-port)
    addresses: ["fa:16:3e:f8:dd:30 172.24.1.216", "unknown"]

Because of this, core OVN does not add a local ARP reply flow on the compute node. In the absence of the local ARP reply flow, ovs-vswitchd tries to broadcast the packet and hits the 4k resubmit limit, because this network ("lb-mgmt-net") has more than 5000 ports (i.e. amphora VMs):

2021-05-18T10:21:13.903Z|00419|ofproto_dpif_xlate(handler356)|WARN|over 4096 resubmit actions on bridge br-int while processing arp,tun_id=0x1,tun_src=0.0.0.0,tun_dst=172.17.2.47,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=64,tun_erspan_ver=0,tun_flags=df|csum|key,tun_metadata0=0x18608000,in_port=1941,vlan_tci=0x0000,dl_src=fa:16:3e:20:71:28,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.24.19.180,arp_tpa=172.24.1.216,arp_op=1,arp_sha=fa:16:3e:20:71:28,arp_tha=00:00:00:00:00:00
2021-05-18T10:22:15.629Z|00740|ofproto_dpif_xlate(handler355)|WARN|over 4096 resubmit actions on bridge br-int while processing arp,tun_id=0x1,tun_src=0.0.0.0,tun_dst=172.17.2.47,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=64,tun_erspan_ver=0,tun_flags=df|csum|key,tun_metadata0=0x149d8000,in_port=1480,vlan_tci=0x0000,dl_src=fa:16:3e:04:9f:a1,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.24.21.111,arp_tpa=172.24.0.151,arp_op=1,arp_sha=fa:16:3e:04:9f:a1,arp_tha=00:00:00:00:00:00

If Neutron removes the "unknown" attribute, OVN can add the local ARP reply flow. However, since this Neutron network is created by administrators (TripleO creates it for now), any external ports (not controlled by Neutron) that try to reach a VM on this network will definitely hit this 4k resubmit issue. So OVN has a limitation for larger broadcast domains.
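The "unknown" entry and the resulting floods can be confirmed directly. A minimal sketch, assuming access to the OVN NB DB and the compute node logs (the port UUID is the one from this report; the log path may differ per deployment):

```shell
# Show the addresses set on the health-manager listen port; an "unknown"
# entry means OVN floods the packet instead of answering ARP locally:
ovn-nbctl lsp-get-addresses dcf3cf5a-3e48-4082-bc7f-20976df0d398

# Show the port-security addresses for the same port; empty output
# indicates port security is disabled on the Neutron port:
ovn-nbctl lsp-get-port-security dcf3cf5a-3e48-4082-bc7f-20976df0d398

# On a compute node, the floods surface as resubmit-limit warnings
# in the ovs-vswitchd log:
grep 'over 4096 resubmit' /var/log/openvswitch/ovs-vswitchd.log
```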
We reached a limit (5000) in Octavia when scaling the load balancers.
Maybe this is also the reason we see high CPU usage for unknown ports on lb-mgmt-net, as reported in https://bugzilla.redhat.com/show_bug.cgi?id=1961162
As we have discussed on the email chain around this issue, there is no limitation in Octavia being hit here. 5,000 is not an Octavia limit.

As was discussed above, Dumitru (thank you) found the reason OVN was dropping the ARP requests from newly created nova VM instances.

At some number of ports in a neutron network, OVN will start dropping broadcast traffic to ports. From what we understand, a flow chain limit in OVN is being reached, after which rules for newly created ports may stop being added to the chain. This causes those new ports to be unable to communicate with any port on the network configured with port security disabled. This applies to IPv4 networks and was not tested with IPv6 (which does not use ARP).

When a port in neutron has port security disabled, the compute node ARP responder in OVN is not configured for that port (this is the "unknown" setting in ovn-nb listed above). This forces ARP requests from nova VMs to cross the OVN fabric, where they get dropped by the OVN limitation.

The workaround applied in this test environment was to add a security group and enable port security on the controller o-hm0 ports in neutron. Once that was done, new VMs booted in nova could communicate with the controllers again and new load balancers could be created.

With that said, maybe this should be closed as NOTABUG given it is a designed limit? Or maybe it should remain open so that this limit can be documented in the OVN OSP documentation?
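The workaround above amounts to re-enabling port security on the o-hm0 Neutron ports so OVN programs a local ARP responder for them. A hedged sketch using OpenStackClient (the security-group name, rule, and port-lookup pattern are illustrative, not the exact commands run in this environment):

```shell
# Find the health-manager listen port on lb-mgmt-net; the name pattern
# matches the TripleO-created port quoted in this report:
PORT_ID=$(openstack port list --network lb-mgmt-net \
    -f value -c ID -c Name | awk '/listen-port/ {print $1}')

# Create a security group permitting the traffic the controller needs
# (example rule only; the real rule set must match Octavia's requirements):
openstack security group create lb-health-mgr-sec-grp
openstack security group rule create --protocol udp --dst-port 5555 \
    lb-health-mgr-sec-grp

# Re-enable port security and attach the group; this removes the "unknown"
# entry from the OVN logical port, restoring the local ARP reply flow:
openstack port set --enable-port-security \
    --security-group lb-health-mgr-sec-grp "$PORT_ID"
```

After this, `ovn-nbctl lsp-get-addresses` on the port should no longer list "unknown".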
(In reply to Michael Johnson from comment #4)
> The workaround applied in this test environment was to add a security group
> and enable port security on the controller o-hm0 ports in neutron.

The Octavia team opened a private bug https://bugzilla.redhat.com/show_bug.cgi?id=1961845 to apply this workaround.
Hi,

The change has been made to the RHOSP 16.1 "Networking Guide." Customers can see this change here:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/networking_guide/index#con_limit-nonsecure-port-ovn_networking-concepts

Thanks,
--Greg