Bug 2294876

Summary: "over 4096 resubmit actions" error occurs when there are 250 neutron routers on a provider network
Product: Red Hat OpenStack
Reporter: yatanaka
Component: openstack-neutron
Assignee: Lucas Alvares Gomes <lmartins>
Status: CLOSED ERRATA
QA Contact: Bharath M V <bmv>
Severity: high
Docs Contact:
Priority: high
Version: 17.1 (Wallaby)
CC: apverma, bcafarel, bmv, chrisw, gkadam, ihrachys, i.maximets, jamsmith, lmartins, mariel, mflusche, scohen
Target Milestone: z4
Keywords: Triaged
Target Release: 17.1
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: openstack-neutron-18.6.1-17.1.20240918120815.85ff760.el9ost
Doc Type: Release Note
Doc Text:
This update adds a configuration option called `broadcast_arps_to_all_routers` to the "[ovn]" config section.

This option configures the external networks with the `broadcast-arps-to-all-routers` config option that became available in OVN 23.06. This option is enabled by default. It causes OVN to flood ARP requests to all attached ports on a network.

----
[ovn]
broadcast_arps_to_all_routers=true
----

If you disable `broadcast_arps_to_all_routers`, ARP requests are only sent to routers on a network if the target MAC address matches. ARP requests that do not match a router are only forwarded to non-router ports.
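
To check which value a given Neutron configuration file would apply, the `[ovn]` section can be read with a small helper. This is only a sketch: `effective_broadcast_arps` is a hypothetical function name, the path of the configuration file depends on the deployment and is passed in as an argument, and the fallback to `true` simply mirrors the default stated in the text above.

```shell
#!/bin/sh
# Print the broadcast_arps_to_all_routers value from the [ovn] section of an
# ini file, or "true" (the default described in the Doc Text) when unset.
# effective_broadcast_arps is a hypothetical helper name.
effective_broadcast_arps() {
    awk -F= '
        # Track whether the current section is [ovn].
        /^\[/ { in_ovn = ($0 ~ /^\[ovn\]/) }
        # Inside [ovn], remember the value with surrounding blanks removed.
        in_ovn && $1 ~ /^[ \t]*broadcast_arps_to_all_routers[ \t]*$/ {
            gsub(/[ \t]/, "", $2); value = $2
        }
        END { print (value == "" ? "true" : value) }
    ' "$1"
}
```

Calling `effective_broadcast_arps <path to the ML2 configuration file>` prints the configured value, or `true` if the option does not appear in the `[ovn]` section.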
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-11-21 09:41:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description yatanaka 2024-07-01 08:12:15 UTC
Description of problem:

In a RHOSP 17.1 + ml2/OVN environment, I created a provider network and attached 250 routers to it.
After that, I created an instance on the provider network, but it was not reachable.
ARP packets arrive at the compute node where the instance runs, but they do not reach the tapXXX interface.
The ARP packets are dropped inside Open vSwitch on the compute node.

Both ovs-vswitchd.log and "ovs-appctl ofproto/trace" show an "over 4096 resubmit actions" error.

~~~
[root@central-novacompute-2 ~]# tail /var/log/openvswitch/ovs-vswitchd.log  -n 20
2024-07-01T07:53:08.746Z|00009|ofproto_dpif_upcall(handler3)|WARN|Dropped 1 log messages in last 454 seconds (most recently, 454 seconds ago) due to excessive rate
2024-07-01T07:53:08.746Z|00010|ofproto_dpif_upcall(handler3)|WARN|Flow: arp,in_port=8,vlan_tci=0x0000,dl_src=fa:16:3e:44:b7:2b,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=10.2.1.78,arp_tpa=10.2.1.1,arp_op=1,arp_sha=fa:16:3e:44:b7:2b,arp_tha=00:00:00:00:00:00

bridge("br-int")
----------------
 0. in_port=8, priority 100
    move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23]
     -> OXM_OF_METADATA[0..23] is now 0
    move:NXM_NX_TUN_METADATA0[16..30]->NXM_NX_REG14[0..14]
     -> NXM_NX_REG14[0..14] is now 0
    move:NXM_NX_TUN_METADATA0[0..15]->NXM_NX_REG15[0..15]
     -> NXM_NX_REG15[0..15] is now 0
    resubmit(,40)
40. priority 0
    drop

Final flow: unchanged
Megaflow: recirc_id=0,eth,arp,in_port=8,dl_src=fa:16:3e:44:b7:2b
Datapath actions: drop
2024-07-01T07:54:44.483Z|00025|ofproto_dpif_xlate(handler7)|WARN|over 4096 resubmit actions on bridge br-int while processing icmp6,in_port=1,dl_vlan=104,dl_vlan_pcp=0,vlan_tci1=0x0000,dl_src=52:54:00:8b:ad:99,dl_dst=33:33:00:00:00:02,ipv6_src=fe80::8278:2869:d6ee:2276,ipv6_dst=ff02::2,ipv6_label=0xcf0bf,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=133,icmp_code=0



[root@central-novacompute-2 ~]# sudo ovs-appctl ofproto/trace br-ex in_port=enp2s0,dl_src=${GATEWAY_MAC},dl_dst=ff:ff:ff:ff:ff:ff,dl_type=0x806,arp_spa=${GATEWAY_IP},arp_tpa=10.1.1.201,arp_sha=${GATEWAY_MAC},arp_tha=ff:ff:ff:ff:ff:ff,arp_op=1,vlan_vid=4200 -generate

  :

45. ct_state=-trk,metadata=0x2, priority 5, cookie 0xa723bfb4
    set_field:0x100000000000000000000000000/0x100000000000000000000000000->xxreg0
    set_field:0x200000000000000000000000000/0x200000000000000000000000000->xxreg0
    resubmit(,46)
46. metadata=0x2, priority 0, cookie 0xf50e72ec
    resubmit(,47)
47. metadata=0x2, priority 0, cookie 0xe4b6e6ed
    resubmit(,48)
     >>>> over 4096 resubmit actions <<<<

Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-new-est-rel-rpl-inv-trk,ct_mark=0/0x1,eth,arp,in_port=1,dl_vlan=104,dl_vlan_pcp=0,dl_src=52:54:00:8b:ad:99,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=10.1.1.254,arp_tpa=10.1.1.201,arp_op=1
Datapath actions: 3,pop_vlan,9
~~~
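
To check quickly whether a compute node is hitting this limit, the log can be searched for the warning text shown above. A minimal sketch: `count_resubmit_warnings` is a hypothetical helper name, and the default log path is the one used in the `tail` command above.

```shell
#!/bin/sh
# Count ovs-vswitchd log lines that report the resubmit limit.
# count_resubmit_warnings is a hypothetical helper name.
count_resubmit_warnings() {
    log_file="${1:-/var/log/openvswitch/ovs-vswitchd.log}"
    # grep -c prints the number of matching lines (0 when there are none),
    # but exits non-zero when nothing matches, so mask the exit status.
    grep -c 'over 4096 resubmit actions' "$log_file" || true
}
```

Running `count_resubmit_warnings` with no argument reads the default log; a non-zero count suggests that some flows on that node exceeded the resubmit limit. Because ovs-vswitchd rate-limits these messages, the count is only a lower bound.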

I think this is the same issue as the one described in the following KCS, Bugzilla and documents.
These resources say that it is a known limitation in RHOSP 16.1 and 16.2.

- https://access.redhat.com/solutions/6956496
- https://bugzilla.redhat.com/show_bug.cgi?id=1961386
- https://docs.redhat.com/en/documentation/red_hat_openstack_platform/16.1/html-single/networking_guide/index#con_limit-nonsecure-port-ovn_work-ovn
- https://docs.redhat.com/en/documentation/red_hat_openstack_platform/16.2/html-single/networking_guide/index#con_limit-nonsecure-port-ovn_work-ovn

However, I couldn't find any RHOSP 17.1 document that describes the same limitation.

My question is: is this limitation still valid in RHOSP 17.1?
If so, I think the RHOSP 17.1 documentation should have the same section describing this limitation.
Also, the above documents say this issue occurs when there are more than 4000 ports, but I can reproduce the issue with only 250 routers.
I think we should mention the number of routers as well.


Version-Release number of selected component (if applicable):
RHOSP 17.1 + ml2/OVN

How reproducible:
Steps to Reproduce:
1. Build a RHOSP 17.1 + ml2/OVN environment
2. Create a provider network
~~~
openstack network create public --external --provider-network-type vlan --provider-physical-network datacentre --provider-segment 104
openstack subnet create public --network public --dhcp --allocation-pool start=10.1.1.51,end=10.1.1.250 --gateway 10.1.1.1 --subnet-range 10.1.1.0/24
~~~
3. Create 250 routers and attach them to the above provider network
~~~
for i in `seq 250`
do
  openstack router create yatanaka_router$i
  openstack router set --external-gateway public yatanaka_router$i
  openstack network create yatanaka_network$i
  openstack subnet create --network yatanaka_network$i --subnet-range 192.168.${i}.0/24 yatanaka_subnet$i
  openstack router add subnet yatanaka_router$i yatanaka_subnet$i
done
~~~
4. Create an instance on the provider network
~~~
openstack server create --network public ....
~~~
5. The instance is not reachable. ARP cannot be resolved.



Actual results:
Instances are not reachable


Expected results:
Instances are reachable


Additional info:

Comment 2 Ilya Maximets 2024-07-01 12:03:18 UTC
I wonder if OSP can set LS.other_config:broadcast-arps-to-all-routers to 'false'
on the logical switches attached to these routers.  The feature is available
since OVN v23.06, so it should be available in latest RHOSP 17.1.

Commit that adds support to OVN:
  https://github.com/ovn-org/ovn/commit/37d308a2074515834692d442475a8e05310a152d

Comment 3 Lucas Alvares Gomes 2024-07-01 12:14:39 UTC
(In reply to Ilya Maximets from comment #2)
> I wonder if OSP can set LS.other_config:broadcast-arps-to-all-routers to
> 'false'
> on the logical switches attached to these routers.  The feature is available
> since OVN v23.06, so it should be available in latest RHOSP 17.1.
> 
> Commit that adds support to OVN:
>   https://github.com/ovn-org/ovn/commit/37d308a2074515834692d442475a8e05310a152d

Thanks Ilya, I just checked the Neutron code and we do not set that option yet, but this could easily be added.

Do you think this would solve this limitation with the resubmits?

Also, should this option be set to 'false' by default, or are there cases where we want it to still be 'true', so that we would need to make it configurable in OSP?

Comment 4 Ilya Maximets 2024-07-01 12:42:21 UTC
(In reply to Lucas Alvares Gomes from comment #3)
> (In reply to Ilya Maximets from comment #2)
> > I wonder if OSP can set LS.other_config:broadcast-arps-to-all-routers to
> > 'false'
> > on the logical switches attached to these routers.  The feature is available
> > since OVN v23.06, so it should be available in latest RHOSP 17.1.
> > 
> > Commit that adds support to OVN:
> >   https://github.com/ovn-org/ovn/commit/37d308a2074515834692d442475a8e05310a152d
> 
> Thanks Ilya, I just checked the Neutron code and we do not set that option
> yet but, this could be easily added.
> 
> Do you think this would solve this limitation with the resubmits ?

It should, because we'll no longer resubmit ARP requests to all the routers.

> Also, should this option be set to 'false' by default or is there cases
> where we want it to still be 'true'. So that we would need to make it
> configurable in OSP ?

We're discussing this within OVN team.  The side effect will be that routers
will stop learning from GARPs.  So, I'm not sure if you can turn this flag
on all the routers unconditionally, if you have a use case for learning.

Comment 5 Lucas Alvares Gomes 2024-07-01 13:10:22 UTC
(In reply to Ilya Maximets from comment #4)
> (In reply to Lucas Alvares Gomes from comment #3)
> > (In reply to Ilya Maximets from comment #2)
> > > I wonder if OSP can set LS.other_config:broadcast-arps-to-all-routers to
> > > 'false'
> > > on the logical switches attached to these routers.  The feature is available
> > > since OVN v23.06, so it should be available in latest RHOSP 17.1.
> > > 
> > > Commit that adds support to OVN:
> > >   https://github.com/ovn-org/ovn/commit/37d308a2074515834692d442475a8e05310a152d
> > 
> > Thanks Ilya, I just checked the Neutron code and we do not set that option
> > yet but, this could be easily added.
> > 
> > Do you think this would solve this limitation with the resubmits ?
> 
> It should, because we'll no longer resubmit ARP requests to all the routers.
> 
> > Also, should this option be set to 'false' by default or is there cases
> > where we want it to still be 'true'. So that we would need to make it
> > configurable in OSP ?
> 
> We're discussing this within OVN team.  The side effect will be that routers
> will stop learning from GARPs.  So, I'm not sure if you can turn this flag
> on all the routers unconditionally, if you have a use case for learning.

I see, yeah, that would definitely require more discussion. Perhaps making it configurable in OSP (keeping 'true' as the default) would be a way forward.

In the meantime, as a workaround for the issue, and also to test whether this option works as intended: @Reporter, could you please set it to 'false' in the OVSDB and let us know if it works? I believe the command would be:

$ ovn-nbctl set Logical_Switch neutron-<Neutron Network UUID> other_config:broadcast-arps-to-all-routers=false
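
The command above could be wrapped so that it starts from the Neutron network name instead of the UUID. This is a rough sketch, assuming the `openstack` client is available and that `ovn-nbctl` can reach the OVN northbound database from where it runs. `NBCTL` and `disable_arp_broadcast_to_routers` are hypothetical names; `NBCTL` may need to point at a containerized `ovn-nbctl`, depending on the deployment.

```shell
#!/bin/sh
# NBCTL is a hypothetical override for how ovn-nbctl is invoked.
NBCTL="${NBCTL:-ovn-nbctl}"

# Set broadcast-arps-to-all-routers=false on the logical switch that backs
# the given Neutron network (name or ID).
# disable_arp_broadcast_to_routers is a hypothetical helper name.
disable_arp_broadcast_to_routers() {
    # Resolve the network to its UUID; stop if the lookup fails.
    net_id=$(openstack network show "$1" -f value -c id) || return 1
    # The logical switch is named "neutron-<Neutron Network UUID>".
    $NBCTL set Logical_Switch "neutron-${net_id}" \
        other_config:broadcast-arps-to-all-routers=false
}
```

For the reproducer network this would be `disable_arp_broadcast_to_routers public`. As noted in comment 4, routers on that switch stop learning from GARPs once the option is 'false'.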

Cheers,
Lucas

Comment 9 Lucas Alvares Gomes 2024-07-01 15:04:31 UTC
Forwarding the question asked by @Ihar on slack here:

Are the routers attached to networks with lots of ports with disabled port security?

If so, this looks a lot like what was discussed in this OVN ML thread [0]: when many ports have the "unknown" address (port security off), ARPs are broadcast to all of them, up to the point where this limitation in OVN is hit.

So, it would be nice to have a better understanding of what the topology looks like and how these ports are created in the customer environment.

[0] https://mail.openvswitch.org/pipermail/ovs-discuss/2020-September/050716.html

Comment 14 Ihar Hrachyshka 2024-07-01 19:57:45 UTC
@Matt, thank you for reproducing and checking the workaround!!! Great work.

Next steps for this BZ would be:

1. write a KCS for the workaround;

2. document the (approximate?) limit for the number of routers attached to an external network, similar to what we do for port-security=off ports here: https://docs.redhat.com/en/documentation/red_hat_openstack_platform/16.1/html-single/networking_guide/index#con_limit-nonsecure-port-ovn_work-ovn

3. Expose the OVN network level configuration for broadcast-arps-to-all-routers="false" in neutron (probably as a config option for ml2/ovn)

4. (Not sure if needed) Tripleo config option to set the option value (maybe this can be done with a config snippet? whatever it is, update the KCS from 1. accordingly once implemented) (In RHOSO 18, we can already provide a custom snippet to NeutronApi CR.)

There's probably some BZ cloning to do here since we'll need to patch different components (docs, neutron, maybe tripleo).

Comment 15 Ihar Hrachyshka 2024-07-01 20:04:33 UTC
Forgot to mention in the last comment: there's also a path to improve this and get rid of the router limit by changing the way learning is done on OVN side (by learning once on switch side instead of in each router pipeline.) This was discussed before in upstream, e.g. here: https://mail.openvswitch.org/pipermail/ovs-dev/2023-March/402539.html I think it's worth requesting this improvement from FDP team. (This will go into Jira, since they use it to track their work.) This would of course require some more time to get implemented, but AFAIU the team is open to consider this change, even if it's invasive and probably won't be backported to older OVN releases.

Comment 18 Lucas Alvares Gomes 2024-07-02 07:54:45 UTC
Thanks @Matt for testing the workaround and confirming that it does mitigate the issue.

Comment 22 yatanaka 2024-07-03 07:17:14 UTC
Hi Ihar Hrachyshka, thank you for your help

> 1. write a KCS for the workaround;

I've just written a KCS for this issue: https://access.redhat.com/solutions/7077367

> 2. document the (approximate?) limit for the number of routers attached to an external network

In my RHOSP 17.1.2 lab, the limit was 237. If the number of routers was greater than 237, the issue occurred.
However, the limit may vary depending on the environment. I'd prefer to state an approximate limit, like 230 or 200.
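
To see how close a network is to that range, the router gateway ports on it can be counted. This is a small sketch, assuming each router attached with `--external-gateway` has one port on the network with device owner `network:router_gateway`. `count_router_gateways` is a hypothetical helper name, and admin credentials are probably needed to see these ports.

```shell
#!/bin/sh
# Count router gateway ports on a Neutron network (name or ID).
# count_router_gateways is a hypothetical helper name.
count_router_gateways() {
    # One output line per matching port; wc -l turns that into a count.
    openstack port list --network "$1" \
        --device-owner network:router_gateway \
        -f value -c ID | wc -l
}
```

For example, `count_router_gateways public` on the reproducer environment should report roughly the number of routers created in step 3.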

Comment 43 errata-xmlrpc 2024-11-21 09:41:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHOSP 17.1.4 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:9974