Bug 1961386 - Doc Limit for non-secure ports with ML2/OVN
Summary: Doc Limit for non-secure ports with ML2/OVN
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Greg Rakauskas
QA Contact: RHOS Documentation Team
URL:
Whiteboard:
Depends On:
Blocks: 2021329 2021330
 
Reported: 2021-05-17 19:23 UTC by anil venkata
Modified: 2022-05-05 09:44 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2021329 2021330
Environment:
Last Closed: 2021-11-08 21:37:14 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker OSP-3965 (last updated 2022-05-05 09:44:59 UTC)
Red Hat Knowledge Base (Solution) 6956496 (last updated 2022-05-05 09:39:54 UTC)

Description anil venkata 2021-05-17 19:23:50 UTC
Description of problem:
While scale testing Octavia load balancers, we noticed the compute node dropping ARP requests from the VM to the controller node when the compute node is already hosting many VMs.

We could see Neutron setting the port status to ACTIVE after the port-up event notification from OVN.

We were able to log in to the VM (from the metadata namespace) and also tried issuing the arping request manually.

With Octavia, the controller nodes have an o-hm0 interface with an IP address (172.24.1.216) on the lb-mgmt-net network.
[root@controller-0 ~]# ip a s o-hm0
20: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1442 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:f8:dd:30 brd ff:ff:ff:ff:ff:ff
    inet 172.24.1.216/16 brd 172.24.255.255 scope global o-hm0
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fef8:dd30/64 scope link 
       valid_lft forever preferred_lft forever
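
For reference, the Neutron port that backs o-hm0 can be looked up by its MAC address; a minimal sketch, using the MAC and network name shown above:

    # look up the Neutron port backing o-hm0 by MAC on the lb-mgmt-net network
    openstack port list --network lb-mgmt-net --mac-address fa:16:3e:f8:dd:30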

After the VM boots successfully, the Octavia worker service running inside the container sends an HTTP request (to port 9443) to the VM. Here the VM is unable to respond to the HTTP request because ARP resolution failed.

tcpdump inside the VM:
[cloud-user@amphora-408fff7b-2f9c-422e-861c-d26dfc72e3c0 ~]$ sudo tcpdump -vvv -n -e -i eth0 port not 22
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:36:53.341123 fa:16:3e:f8:dd:30 > fa:16:3e:66:7a:8b, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 29605, offset 0, flags [DF], proto TCP (6), length 60)
    172.24.1.216.60334 > 172.24.19.232.tungsten-https: Flags [S], cksum 0x32fb (correct), seq 2216125894, win 28040, options [mss 1402,sackOK,TS val 2567677514 ecr 0,nop,wscale 7], length 0
11:36:53.404786 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.1.216 tell 172.24.19.232, length 28
11:36:54.428785 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.1.216 tell 172.24.19.232, length 28
11:36:54.485524 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.2.99 tell 172.24.19.232, length 28
11:36:55.516768 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.2.99 tell 172.24.19.232, length 28
11:36:56.540762 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.2.99 tell 172.24.19.232, length 28
11:37:01.220435 fa:16:3e:f8:dd:30 > fa:16:3e:66:7a:8b, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 55864, offset 0, flags [DF], proto TCP (6), length 60)
    172.24.1.216.33194 > 172.24.19.232.tungsten-https: Flags [S], cksum 0x563e (correct), seq 302186453, win 28040, options [mss 1402,sackOK,TS val 2567685393 ecr 0,nop,wscale 7], length 0
11:37:01.220808 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.1.216 tell 172.24.19.232, length 28
11:37:02.236802 fa:16:3e:66:7a:8b > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.1.216 tell 172.24.19.232, length 28

Here 172.24.1.216 is the IP address of the o-hm0 interface on the controller-0 node, and 172.24.19.232 is the amphora VM's IP address.
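
For reference, the manual ARP check mentioned above can be repeated from inside the amphora roughly like this (a sketch; the interface name and target address are taken from the capture above):

    # ask the controller's o-hm0 address (172.24.1.216) to answer ARP on eth0
    sudo arping -I eth0 -c 3 172.24.1.216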

Note: This deployment is using the core OVN RPMs provided by Dumitru: http://brew-task-repos.usersys.redhat.com/repos/scratch/dceara/ovn2.13/20.12.0/20pvt_dp_groups.el8fdp/

Comment 1 anil venkata 2021-05-18 11:08:08 UTC
Dumitru logged into the environment and found the root cause of this issue.

Neutron is setting the "unknown" attribute on the controller's logical port:

    port dcf3cf5a-3e48-4082-bc7f-20976df0d398 (aka octavia-health-manager-controller-0.redhat.local-listen-port)                      
        addresses: ["fa:16:3e:f8:dd:30 172.24.1.216", "unknown"]

Because of this, core OVN does not add the local ARP reply flow on the compute node. In the absence of the local ARP reply flow, ovs-vswitchd tries to broadcast the packet and hits the 4k resubmit limit, because the "lb-mgmt-net" network has more than 5000 ports (that is, amphora VMs).

2021-05-18T10:21:13.903Z|00419|ofproto_dpif_xlate(handler356)|WARN|over 4096 resubmit actions on bridge br-int while processing arp,tun_id=0x1,tun_src=0.0.0.0,tun_dst=172.17.2.47,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=64,tun_erspan_ver=0,tun_flags=df|csum|key,tun_metadata0=0x18608000,in_port=1941,vlan_tci=0x0000,dl_src=fa:16:3e:20:71:28,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.24.19.180,arp_tpa=172.24.1.216,arp_op=1,arp_sha=fa:16:3e:20:71:28,arp_tha=00:00:00:00:00:00
2021-05-18T10:22:15.629Z|00740|ofproto_dpif_xlate(handler355)|WARN|over 4096 resubmit actions on bridge br-int while processing arp,tun_id=0x1,tun_src=0.0.0.0,tun_dst=172.17.2.47,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=64,tun_erspan_ver=0,tun_flags=df|csum|key,tun_metadata0=0x149d8000,in_port=1480,vlan_tci=0x0000,dl_src=fa:16:3e:04:9f:a1,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.24.21.111,arp_tpa=172.24.0.151,arp_op=1,arp_sha=fa:16:3e:04:9f:a1,arp_tha=00:00:00:00:00:00
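
For reference, a compute node that is hitting this limit can be spotted by searching the ovs-vswitchd log for the warning above (a sketch; the log path assumes the usual host location):

    # look for the 4k resubmit warning on the compute node
    grep 'resubmit actions' /var/log/openvswitch/ovs-vswitchd.log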

If Neutron removes the "unknown" attribute, OVN can add the local ARP reply flow.
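
For reference, whether a logical port carries the "unknown" address can be checked directly against the OVN northbound database (a sketch; run it wherever ovn-nbctl can reach the NB DB, for example inside the OVN database container, using the logical port name from above):

    # list the addresses on the controller's listen port; "unknown" means no local ARP responder
    ovn-nbctl lsp-get-addresses dcf3cf5a-3e48-4082-bc7f-20976df0d398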

Because this Neutron network is created by administrators (TripleO creates it for now), any external ports (not controlled by Neutron) that try to reach a VM on this network will definitely hit this 4k resubmit issue. So OVN has this limitation for larger broadcast domains.

We reached a limit (5000) in Octavia when scaling the load balancers.
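
For reference, the number of ports on the network can be counted to see how close a deployment is to this limit; a minimal sketch:

    # count the ports on lb-mgmt-net (amphora VMs, o-hm0 ports, DHCP ports, ...)
    openstack port list --network lb-mgmt-net -f value -c ID | wc -l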

Comment 2 anil venkata 2021-05-18 11:13:13 UTC
Maybe this is the reason we see high CPU usage for unknown ports on lb-mgmt-net, as reported in https://bugzilla.redhat.com/show_bug.cgi?id=1961162

Comment 4 Michael Johnson 2021-05-19 17:30:57 UTC
As we have discussed on the email chain around this issue, there is no limitation in Octavia that is being hit here. 5,000 is not an Octavia limit.

As was discussed above, Dumitru (thank you) found the reason why OVN was dropping the ARP requests from newly created nova VM instances.

At some number of ports in a neutron network, OVN will start dropping broadcast traffic to ports.
From what we understand, there is a flow chain limit in OVN that is being reached, where rules for newly created ports may stop being added to the chain. This will cause those new ports to be unable to communicate with any port configured with port security disabled on the network. This applies to IPv4 networks and was not tested with IPv6 (which does not use ARP).

When a port in neutron has port security disabled, the compute node ARP responder in OVN is not configured for the port (this is the "unknown" setting in ovn-nb listed above). This means the ARP request from nova VMs needs to cross the OVN fabric, where it gets dropped because of the OVN limitation.
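
For reference, the port security state can also be checked from the Neutron side (a sketch; <o-hm0-port-id> is a placeholder for the port found earlier):

    # port_security_enabled: False with no security groups corresponds to "unknown" in ovn-nb
    openstack port show <o-hm0-port-id> -c port_security_enabled -c security_group_ids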

The workaround applied in this test environment was to add a security group and enable port security on the controller o-hm0 ports in neutron. Once that was done, new VMs booted in nova could communicate with the controllers again and new load balancers could be created.
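
For reference, the workaround amounts to something like the following for each controller's o-hm0 port. This is a minimal sketch, not the exact commands used in the test environment; the security group name is made up, and the UDP 5555 rule (the default Octavia heartbeat port) is an assumption about what the group needs to allow:

    # create a security group for the health-manager (o-hm0) ports
    openstack security group create lb-health-mgr-sec-grp
    # allow the amphora heartbeat traffic in (assumed: UDP 5555, the Octavia default)
    openstack security group rule create --protocol udp --dst-port 5555 lb-health-mgr-sec-grp
    # attach the group and re-enable port security on the controller port
    openstack port set --security-group lb-health-mgr-sec-grp --enable-port-security <o-hm0-port-id>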

With that said, maybe this should be closed as NOTABUG given it is a designed limit? Maybe it should remain open for this limit to be documented in the OVN OSP documentation?

Comment 6 anil venkata 2021-06-09 12:57:41 UTC
(In reply to Michael Johnson from comment #4)
> As we have discussed on the email chain around this issue, there is no
> limitation in Octavia that is being hit here. 5,000 is not an Octavia limit.
> 
> As was discussed above, Dumitru (thank you), found the issue as to why OVN
> was dropping the ARP requests from newly created nova VM instances.
> 
> At some number of ports in a neutron network, OVN will start dropping
> broadcast traffic to ports.
> From what we understand, there is a flow chain limit in OVN that is being
> reached, where rules for newly created ports may stop being added to the
> chain. This will cause those new ports to be unable to communicate with any
> port configured with port security disabled on the network. This applies to
> IPv4 networks and was not tested with IPv6 (which does not use ARP).
> 
> When a port in neutron has port security disabled, the compute node ARP
> responder in OVN is not configured for the port (this is the "unknown"
> setting 
>  in ovn-nb listed above). This leads to the ARP request from nova VMs to
> need to cross the OVN fabric, and getting dropped by the OVN limitation.
> 
> The workaround applied in this test environment was to add a security group
> and enable port security on the controller o-hm0 ports in neutron.

The Octavia team opened a private bug https://bugzilla.redhat.com/show_bug.cgi?id=1961845 to apply this workaround.


> Once that
> was done, new VMs booted in nova could communicate with the controllers
> again and new load balancers could be created.
> 
> With that said, maybe this should be closed as NOTABUG given it is a
> designed limit? Maybe it should remain open for this limit to be documented
> in the OVN OSP documentation?

Comment 17 Greg Rakauskas 2021-11-08 21:37:14 UTC
Hi,

The change has been made to the RHOSP 16.1 "Networking Guide." Customers can see
this change here:

   https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/networking_guide/index#con_limit-nonsecure-port-ovn_networking-concepts

Thanks,
--Greg

