Description of problem:

On a RHOSP 17.1 + ml2/OVN environment, I created a provider network and attached 250 routers to it. After that, I created an instance on the provider network, but it is not reachable.

ARP packets arrive at the compute node that hosts the instance, but they never reach the tapXXX interface. They are dropped inside Open vSwitch on the compute node. Both ovs-vswitchd.log and "ovs-appctl ofproto/trace" show an "over 4096 resubmit actions" error.

~~~
[root@central-novacompute-2 ~]# tail /var/log/openvswitch/ovs-vswitchd.log -n 20
2024-07-01T07:53:08.746Z|00009|ofproto_dpif_upcall(handler3)|WARN|Dropped 1 log messages in last 454 seconds (most recently, 454 seconds ago) due to excessive rate
2024-07-01T07:53:08.746Z|00010|ofproto_dpif_upcall(handler3)|WARN|Flow: arp,in_port=8,vlan_tci=0x0000,dl_src=fa:16:3e:44:b7:2b,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=10.2.1.78,arp_tpa=10.2.1.1,arp_op=1,arp_sha=fa:16:3e:44:b7:2b,arp_tha=00:00:00:00:00:00

bridge("br-int")
----------------
 0. in_port=8, priority 100
    move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23]
     -> OXM_OF_METADATA[0..23] is now 0
    move:NXM_NX_TUN_METADATA0[16..30]->NXM_NX_REG14[0..14]
     -> NXM_NX_REG14[0..14] is now 0
    move:NXM_NX_TUN_METADATA0[0..15]->NXM_NX_REG15[0..15]
     -> NXM_NX_REG15[0..15] is now 0
    resubmit(,40)
40. priority 0
    drop

Final flow: unchanged
Megaflow: recirc_id=0,eth,arp,in_port=8,dl_src=fa:16:3e:44:b7:2b
Datapath actions: drop
2024-07-01T07:54:44.483Z|00025|ofproto_dpif_xlate(handler7)|WARN|over 4096 resubmit actions on bridge br-int while processing icmp6,in_port=1,dl_vlan=104,dl_vlan_pcp=0,vlan_tci1=0x0000,dl_src=52:54:00:8b:ad:99,dl_dst=33:33:00:00:00:02,ipv6_src=fe80::8278:2869:d6ee:2276,ipv6_dst=ff02::2,ipv6_label=0xcf0bf,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=133,icmp_code=0

[root@central-novacompute-2 ~]# sudo ovs-appctl ofproto/trace br-ex in_port=enp2s0,dl_src=${GATEWAY_MAC},dl_dst=ff:ff:ff:ff:ff:ff,dl_type=0x806,arp_spa=${GATEWAY_IP},arp_tpa=10.1.1.201,arp_sha=${GATEWAY_MAC},arp_tha=ff:ff:ff:ff:ff:ff,arp_op=1,vlan_vid=4200 -generate
:
45. ct_state=-trk,metadata=0x2, priority 5, cookie 0xa723bfb4
    set_field:0x100000000000000000000000000/0x100000000000000000000000000->xxreg0
    set_field:0x200000000000000000000000000/0x200000000000000000000000000->xxreg0
    resubmit(,46)
46. metadata=0x2, priority 0, cookie 0xf50e72ec
    resubmit(,47)
47. metadata=0x2, priority 0, cookie 0xe4b6e6ed
    resubmit(,48)
    >>>> over 4096 resubmit actions <<<<

Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-new-est-rel-rpl-inv-trk,ct_mark=0/0x1,eth,arp,in_port=1,dl_vlan=104,dl_vlan_pcp=0,dl_src=52:54:00:8b:ad:99,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=10.1.1.254,arp_tpa=10.1.1.201,arp_op=1
Datapath actions: 3,pop_vlan,9
~~~

I think this is the same issue as the one described in the following KCS, Bugzilla and documents. These resources say that it is a known limitation in RHOSP 16.1 and 16.2.

- https://access.redhat.com/solutions/6956496
- https://bugzilla.redhat.com/show_bug.cgi?id=1961386
- https://docs.redhat.com/en/documentation/red_hat_openstack_platform/16.1/html-single/networking_guide/index#con_limit-nonsecure-port-ovn_work-ovn
- https://docs.redhat.com/en/documentation/red_hat_openstack_platform/16.2/html-single/networking_guide/index#con_limit-nonsecure-port-ovn_work-ovn

However, I couldn't find any RHOSP 17.1 document that describes the same limitation. My question is: is this limitation still valid in RHOSP 17.1? If so, I think the RHOSP 17.1 documentation should have the same section describing it.

Also, the above documents say this issue occurs when there are more than 4000 ports, but I can reproduce the issue with only 250 routers. I think we should mention the number of routers as well.

Version-Release number of selected component (if applicable):
RHOSP 17.1 + ml2/OVN

How reproducible:

Steps to Reproduce:

1. Build a RHOSP 17.1 + ml2/OVN environment

2. Create a provider network
~~~
openstack network create public --external --provider-network-type vlan --provider-physical-network datacentre --provider-segment 104
openstack subnet create public --network public --dhcp --allocation-pool start=10.1.1.51,end=10.1.1.250 --gateway 10.1.1.1 --subnet-range 10.1.1.0/24
~~~

3. Create 250 routers and attach them to the above provider network
~~~
for i in `seq 250`
do
  openstack router create yatanaka_router$i
  openstack router set --external-gateway public yatanaka_router$i
  openstack network create yatanaka_network$i
  openstack subnet create --network yatanaka_network$i --subnet-range 192.168.${i}.0/24 yatanaka_subnet$i
  openstack router add subnet yatanaka_router$i yatanaka_subnet$i
done
~~~

4. Create an instance on the provider network
~~~
openstack server create --network public ....
~~~

5. The instance is not reachable. ARP cannot be resolved.

Actual results:
Instances are not reachable

Expected results:
Instances are reachable

Additional info:
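To check whether a compute node is hitting this limit, one option is to count the matching warnings in the ovs-vswitchd log. The following is a minimal sketch, assuming the log path shown in the report; the function name `count_resubmit_warnings` is hypothetical. Because ovs-vswitchd rate-limits these messages (see the "Dropped 1 log messages" line in the output above), the count should be read as a lower bound.

```shell
# Hypothetical helper: count "over 4096 resubmit actions" warnings in a log file.
# Defaults to the ovs-vswitchd log path used in the report; pass another path to override.
count_resubmit_warnings() {
    local log_file="${1:-/var/log/openvswitch/ovs-vswitchd.log}"
    # grep -c prints the number of matching lines (0 if none match).
    grep -c 'over 4096 resubmit actions' "$log_file"
}
```

For example, `count_resubmit_warnings` run as root on the compute node prints the number of logged occurrences.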
I wonder if OSP can set LS.other_config:broadcast-arps-to-all-routers to 'false' on the logical switches attached to these routers. The feature has been available since OVN v23.06, so it should be available in the latest RHOSP 17.1.

Commit that adds support to OVN:

https://github.com/ovn-org/ovn/commit/37d308a2074515834692d442475a8e05310a152d
(In reply to Ilya Maximets from comment #2)
> I wonder if OSP can set LS.other_config:broadcast-arps-to-all-routers to
> 'false'
> on the logical switches attached to these routers. The feature is available
> since OVN v23.06, so it should be available in latest RHOSP 17.1.
>
> Commit that adds support to OVN:
>
> https://github.com/ovn-org/ovn/commit/
> 37d308a2074515834692d442475a8e05310a152d

Thanks Ilya, I just checked the Neutron code and we do not set that option yet, but this could easily be added.

Do you think this would solve this limitation with the resubmits?

Also, should this option be set to 'false' by default, or are there cases where we want it to remain 'true', so that we would need to make it configurable in OSP?
(In reply to Lucas Alvares Gomes from comment #3)
> (In reply to Ilya Maximets from comment #2)
> > I wonder if OSP can set LS.other_config:broadcast-arps-to-all-routers to
> > 'false'
> > on the logical switches attached to these routers. The feature is available
> > since OVN v23.06, so it should be available in latest RHOSP 17.1.
> >
> > Commit that adds support to OVN:
> >
> > https://github.com/ovn-org/ovn/commit/
> > 37d308a2074515834692d442475a8e05310a152d
>
> Thanks Ilya, I just checked the Neutron code and we do not set that option
> yet but, this could be easily added.
>
> Do you think this would solve this limitation with the resubmits ?

It should, because we'll no longer resubmit ARP requests to all the routers.

> Also, should this option be set to 'false' by default or is there cases
> where we want it to still be 'true'. So that we would need to make it
> configurable in OSP ?

We're discussing this within OVN team. The side effect will be that routers will stop learning from GARPs. So, I'm not sure if you can turn this flag on all the routers unconditionally, if you have a use case for learning.
(In reply to Ilya Maximets from comment #4)
> (In reply to Lucas Alvares Gomes from comment #3)
> > (In reply to Ilya Maximets from comment #2)
> > > I wonder if OSP can set LS.other_config:broadcast-arps-to-all-routers to
> > > 'false'
> > > on the logical switches attached to these routers. The feature is available
> > > since OVN v23.06, so it should be available in latest RHOSP 17.1.
> > >
> > > Commit that adds support to OVN:
> > >
> > > https://github.com/ovn-org/ovn/commit/
> > > 37d308a2074515834692d442475a8e05310a152d
> >
> > Thanks Ilya, I just checked the Neutron code and we do not set that option
> > yet but, this could be easily added.
> >
> > Do you think this would solve this limitation with the resubmits ?
>
> It should, because we'll no longer resubmit ARP requests to all the routers.
>
> > Also, should this option be set to 'false' by default or is there cases
> > where we want it to still be 'true'. So that we would need to make it
> > configurable in OSP ?
>
> We're discussing this within OVN team. The side effect will be that routers
> will stop learning from GARPs. So, I'm not sure if you can turn this flag
> on all the routers unconditionally, if you have a use case for learning.

I see, yeah, that would definitely require more discussion. Perhaps making it configurable in OSP (keeping 'true' as the default) would be a way forward for OSP.

In the meantime, as a workaround for the issue and also to test whether this option works as intended: @Reporter, could you please set it to 'false' in the OVSDB and let us know if it works?

I believe the command would be:

$ ovn-nbctl set Logical_Switch neutron-<Neutron Network UUID> other_config:broadcast-arps-to-all-routers=false

Cheers,
Lucas
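The workaround command above can be wrapped in a small helper that sets the option and then reads it back. This is only a sketch: the function name and the `OVN_NBCTL` override variable are hypothetical, and it assumes `ovn-nbctl` can reach the OVN northbound database from wherever it runs (in some deployments that means running it inside a container or through a wrapper). As noted earlier in this thread, the side effect is that routers stop learning from GARPs on that logical switch.

```shell
# Command used to talk to the OVN northbound DB; override if a wrapper is needed.
# (Assumption: plain "ovn-nbctl" works in the current shell.)
OVN_NBCTL="${OVN_NBCTL:-ovn-nbctl}"

# Hypothetical helper: stop broadcasting ARP requests to all routers on the
# logical switch that backs one Neutron network, then print the stored value.
disable_arp_broadcast_to_routers() {
    local network_uuid="$1"
    # ml2/OVN names the logical switch "neutron-<Neutron Network UUID>".
    local ls_name="neutron-${network_uuid}"
    $OVN_NBCTL set Logical_Switch "$ls_name" \
        other_config:broadcast-arps-to-all-routers=false
    # Read the option back so the change can be confirmed.
    $OVN_NBCTL get Logical_Switch "$ls_name" \
        other_config:broadcast-arps-to-all-routers
}
```

With the provider network from the reproduction steps, it could be called as `disable_arp_broadcast_to_routers "$(openstack network show public -f value -c id)"`.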
Forwarding the question asked by @Ihar on Slack here:

Are the routers attached to networks with lots of ports that have port security disabled? If so, this looks a lot like what was discussed in this OVN ML thread [0]: when there are many ports with the "unknown" address (port security off), ARPs are broadcast to the point where they hit this limitation in OVN.

So it would be nice to have a better understanding of what the topology looks like and how these ports are created in the customer environment.

[0] https://mail.openvswitch.org/pipermail/ovs-discuss/2020-September/050716.html
@Matt, thank you for reproducing and checking the workaround!!! Great work.

Next steps for this BZ would be:

1. write a KCS for the workaround;

2. document the (approximate?) limit for the number of routers attached to an external network, similar to what we do for port-security=off ports here: https://docs.redhat.com/en/documentation/red_hat_openstack_platform/16.1/html-single/networking_guide/index#con_limit-nonsecure-port-ovn_work-ovn

3. Expose the OVN network level configuration for broadcast-arps-to-all-routers="false" in neutron (probably as a config option for ml2/ovn)

4. (Not sure if needed) Tripleo config option to set the option value (maybe this can be done with a config snippet? whatever it is, update the KCS from 1. accordingly once implemented) (In RHOSO 18, we can already provide a custom snippet to NeutronApi CR.)

There's probably some BZ cloning to do here since we'll need to patch different components (docs, neutron, maybe tripleo).
Forgot to mention in the last comment: there's also a path to improve this and get rid of the router limit by changing the way learning is done on the OVN side (learning once on the switch side instead of in each router pipeline). This was discussed upstream before, e.g. here:

https://mail.openvswitch.org/pipermail/ovs-dev/2023-March/402539.html

I think it's worth requesting this improvement from the FDP team. (This will go into Jira, since they use it to track their work.) It would of course take more time to implement, but AFAIU the team is open to considering this change, even though it's invasive and probably won't be backported to older OVN releases.
Thanks @Matt for testing the workaround and confirming that it does mitigate the issue.
Hi Ihar Hrachyshka, thank you for your help.

> 1. write a KCS for the workaround;

I've just written a KCS for this issue: https://access.redhat.com/solutions/7077367

> 2. document the (approximate?) limit for the number of routers attached to an external network

In my RHOSP 17.1.2 lab, the limit was 237: when the number of routers was greater than 237, the issue occurred. However, the limit may vary between environments, so I'd prefer to state an approximate limit, such as 230 or 200.
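To see how close an environment is to that approximate limit, the router gateway ports on the external network can be counted. The following is a rough sketch, not a supported tool: the function names and the `OPENSTACK` override variable are hypothetical, the default threshold of 200 is simply the conservative figure suggested above, and it assumes admin credentials are loaded, since router gateway ports are usually not visible to regular projects.

```shell
# Command used to call the OpenStack CLI; override if a wrapper is needed.
OPENSTACK="${OPENSTACK:-openstack}"

# Hypothetical helper: count router gateway ports on an external network.
count_gateway_ports() {
    local network="$1"
    # One port ID per line; grep -c . counts the non-empty lines.
    $OPENSTACK port list --network "$network" \
        --device-owner network:router_gateway -f value -c ID | grep -c . || true
}

# Hypothetical helper: compare the count against an approximate limit.
# The default of 200 is an assumption taken from the estimate in this thread.
check_router_limit() {
    local network="$1"
    local limit="${2:-200}"
    local count
    count=$(count_gateway_ports "$network")
    if [ "$count" -gt "$limit" ]; then
        echo "WARNING: $count routers on $network (approximate limit $limit)"
        return 1
    fi
    echo "OK: $count routers on $network (approximate limit $limit)"
}
```

For the network from the reproduction steps, `check_router_limit public` would report whether the number of attached routers exceeds the chosen threshold.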
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHOSP 17.1.4 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2024:9974