Bug 1894478

Summary: [OVN2.13] arp flooding and reaching limit: ofproto_dpif_xlate|WARN|over 4096 resubmit
Product: Red Hat Enterprise Linux Fast Datapath Reporter: Dumitru Ceara <dceara>
Component: ovn2.13Assignee: Dumitru Ceara <dceara>
Status: CLOSED ERRATA QA Contact: Jianlin Shi <jishi>
Severity: urgent Docs Contact:
Priority: urgent    
Version: FDP 20.HCC: apevec, ctrautma, dceara, ggrimaux, jishi, kfida, lhh, majopela, ralongi, scohen
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovn2.13-20.09.0-10.el7fdp ovn2.13-20.09.0-10.el8fdp Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1894048
: 1894871 (view as bug list) Environment:
Last Closed: 2020-12-01 15:07:02 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1894048    
Bug Blocks: 1894871    

Comment 4 Jianlin Shi 2020-11-11 09:58:04 UTC
reproduced on 20.09.0-9 with reproducer in https://bugzilla.redhat.com/show_bug.cgi?id=1894048#c7:

[root@dell-per740-12 bz1776712_broadcast_limit]# rpm -qa | grep -E "ovn|openvswitch"                  
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.13-central-20.09.0-9.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-common-1.0-13.noarch                                         
openvswitch2.13-2.13.0-59.el7fdp.x86_64
ovn2.13-20.09.0-9.el7fdp.x86_64
ovn2.13-host-20.09.0-9.el7fdp.x86_64                                                                  
kernel-kernel-networking-openvswitch-ovn-regression-bz1776712_broadcast_limit-1.0-1.noarch            
python3-openvswitch2.13-2.13.0-59.el7fdp.x86_64

:: [ 04:47:53 ] :: [  BEGIN   ] :: Running 'ovn-nbctl --wait=hv sync'
:: [ 04:48:03 ] :: [   PASS   ] :: Command 'ovn-nbctl --wait=hv sync' (Expected 0, got 0)
:: [ 04:48:03 ] :: [  BEGIN   ] :: Running 'ip netns exec vm1 ping 172.16.0.102 -c 1'
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.                                                

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms                                         

:: [ 04:48:13 ] :: [   FAIL   ] :: Command 'ip netns exec vm1 ping 172.16.0.102 -c 1' (Expected 0, got 1)
:: [ 04:48:13 ] :: [  BEGIN   ] :: Running 'ip netns exec vm1 ping 172.16.0.100 -c 1'
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
                                                                                                      
--- 172.16.0.100 ping statistics ---                                                                  
1 packets transmitted, 0 received, 100% packet loss, time 0ms
                                                                                                      
:: [ 04:48:23 ] :: [   FAIL   ] :: Command 'ip netns exec vm1 ping 172.16.0.100 -c 1' (Expected 0, got 1)
<=== ping failed

[root@dell-per740-12 bz1776712_broadcast_limit]# grep 4096 /var/log/openvswitch/ovs-vswitchd.log
2020-11-11T09:47:59.829Z|00054|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:02:71,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.2.71,arp_tpa=172.16.2.71,arp_op=1,arp_sha=00:de:ad:ff:02:71,arp_tha=00:00:00:00:00:00 
2020-11-11T09:47:59.834Z|00056|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:00:43,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.0.43,arp_tpa=172.16.0.43,arp_op=1,arp_sha=00:de:ad:ff:00:43,arp_tha=00:00:00:00:00:00 
2020-11-11T09:47:59.838Z|00058|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:02:69,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.2.69,arp_tpa=172.16.2.69,arp_op=1,arp_sha=00:de:ad:ff:02:69,arp_tha=00:00:00:00:00:00 
2020-11-11T09:47:59.841Z|00060|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:01:39,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.1.39,arp_tpa=172.16.1.39,arp_op=1,arp_sha=00:de:ad:ff:01:39,arp_tha=00:00:00:00:00:00
2020-11-11T09:47:59.845Z|00062|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:02:74,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.2.74,arp_tpa=172.16.2.74,arp_op=1,arp_sha=00:de:ad:ff:02:74,arp_tha=00:00:00:00:00:00

<=== 4k resubmit WARN

result on 20.09.0-10:

[root@dell-per740-12 bz1776712_broadcast_limit]# rpm -qa | grep -E "ovn|openvswitch"
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.13-host-20.09.0-10.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-common-1.0-13.noarch
openvswitch2.13-2.13.0-59.el7fdp.x86_64
ovn2.13-central-20.09.0-10.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-regression-bz1776712_broadcast_limit-1.0-1.noarch
python3-openvswitch2.13-2.13.0-59.el7fdp.x86_64
ovn2.13-20.09.0-10.el7fdp.x86_6

:: [ 04:55:06 ] :: [  BEGIN   ] :: Running 'ovn-nbctl --wait=hv sync'
:: [ 04:56:06 ] :: [   PASS   ] :: Command 'ovn-nbctl --wait=hv sync' (Expected 0, got 0)
:: [ 04:56:06 ] :: [  BEGIN   ] :: Running 'ip netns exec vm1 ping 172.16.0.102 -c 1'
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.                                                
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=56.3 ms                                            

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 56.306/56.306/56.306/0.000 ms                                                  
:: [ 04:56:06 ] :: [   PASS   ] :: Command 'ip netns exec vm1 ping 172.16.0.102 -c 1' (Expected 0, got 0)
:: [ 04:56:06 ] :: [  BEGIN   ] :: Running 'ip netns exec vm1 ping 172.16.0.100 -c 1'
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=14.3 ms                                            

--- 172.16.0.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 14.312/14.312/14.312/0.000 ms
:: [ 04:56:06 ] :: [   PASS   ] :: Command 'ip netns exec vm1 ping 172.16.0.100 -c 1' (Expected 0, got 0)
<=== passed

[root@dell-per740-12 bz1776712_broadcast_limit]# grep 4096 /var/log/openvswitch/ovs-vswitchd.log      
2020-11-11T09:55:23.488Z|00055|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:03:03,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.3.3,arp_tpa=172.16.3.3,arp_op=1,arp_sha=00:de:ad:ff:03:03,arp_tha=00:00:00:00:00:00

<=== still get 4k resubmit WARN message

Comment 5 Jianlin Shi 2020-11-11 10:16:37 UTC
on rhel8 version:

[root@wsfd-advnetlab19 bz1776712_broadcast_limit]# rpm -qa | grep -E "openvswitch|ovn"                
kernel-kernel-networking-openvswitch-ovn-scenario-1.0-12.noarch
ovn2.13-central-20.09.0-10.el8fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-common-1.0-13.noarch                                         
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
kernel-kernel-networking-openvswitch-ovn-regression-bz1775778_kernel_unknown_flow-1.0-2.noarch
openvswitch2.13-2.13.0-67.el8fdp.x86_64
ovn2.13-host-20.09.0-10.el8fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-basic-1.0-31.noarch                                          
openvswitch2.11-2.11.3-73.el8fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-regression-bz1776712_broadcast_limit-1.0-1.noarch            
python3-openvswitch2.13-2.13.0-67.el8fdp.x86_64
ovn2.13-20.09.0-10.el8fdp.x86_64

:: [ 05:13:26 ] :: [  BEGIN   ] :: Running 'ovn-nbctl --wait=hv sync'
:: [ 05:14:20 ] :: [   PASS   ] :: Command 'ovn-nbctl --wait=hv sync' (Expected 0, got 0)
:: [ 05:14:20 ] :: [  BEGIN   ] :: Running 'ip netns exec vm1 ping 172.16.0.102 -c 1'
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=39.4 ms                                            

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 39.378/39.378/39.378/0.000 ms
:: [ 05:14:20 ] :: [   PASS   ] :: Command 'ip netns exec vm1 ping 172.16.0.102 -c 1' (Expected 0, got 0)
:: [ 05:14:20 ] :: [  BEGIN   ] :: Running 'ip netns exec vm1 ping 172.16.0.100 -c 1'
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=17.2 ms
                                                                                                      
--- 172.16.0.100 ping statistics ---                                                                  
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 17.218/17.218/17.218/0.000 ms
:: [ 05:14:20 ] :: [   PASS   ] :: Command 'ip netns exec vm1 ping 172.16.0.100 -c 1' (Expected 0, got 0)

<=== ping passed

[root@wsfd-advnetlab19 ~]# grep 4096 /var/log/openvswitch/ovs-vswitchd.log                            
2020-11-11T10:13:39.047Z|00001|ofproto_dpif_xlate(handler70)|WARN|over 4096 resubmit actions on bridge br-int while processing icmp6,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:00:02,ipv6_src=fe80::200:ff:fe00:1,ipv6_dst=ff02::2,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
2020-11-11T10:13:40.068Z|00001|ofproto_dpif_xlate(handler56)|WARN|over 4096 resubmit actions on bridge br-int while processing icmp6,in_port=LOCAL,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:00:02,ipv6_src=fe80::c50:caff:feed:5840,ipv6_dst=ff02::2,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
2020-11-11T10:13:41.361Z|00049|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:03:03,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.3.3,arp_tpa=172.16.3.3,arp_op=1,arp_sha=00:de:ad:ff:03:03,arp_tha=00:00:00:00:00:00   
2020-11-11T10:13:53.893Z|00001|ofproto_dpif_xlate(handler53)|WARN|over 4096 resubmit actions on bridge br-int while processing icmp6,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:00:02,ipv6_src=fe80::200:ff:fe00:1,ipv6_dst=ff02::2,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
2020-11-11T10:13:57.476Z|00001|ofproto_dpif_xlate(handler50)|WARN|over 4096 resubmit actions on bridge br-int while processing icmp6,in_port=LOCAL,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:00:02,ipv6_src=fe80::c50:caff:feed:5840,ipv6_dst=ff02::2,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
2020-11-11T10:15:24.007Z|00003|ofproto_dpif_xlate(handler53)|WARN|over 4096 resubmit actions on bridge br-int while processing icmp6,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:00:02,ipv6_src=fe80::200:ff:fe00:1,ipv6_dst=ff02::2,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0

<== also get 4k resubmit WARN message

Comment 6 Dumitru Ceara 2020-11-11 12:25:50 UTC
(In reply to Jianlin Shi from comment #5)
> on rhel8 version:
[...]
> 
> [root@wsfd-advnetlab19 ~]# grep 4096 /var/log/openvswitch/ovs-vswitchd.log  
> 
> 2020-11-11T10:13:39.047Z|00001|ofproto_dpif_xlate(handler70)|WARN|over 4096
> resubmit actions on bridge br-int while processing
> icmp6,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:
> 00:02,ipv6_src=fe80::200:ff:fe00:1,ipv6_dst=ff02::2,ipv6_label=0x00000,
> nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
> 2020-11-11T10:13:40.068Z|00001|ofproto_dpif_xlate(handler56)|WARN|over 4096
> resubmit actions on bridge br-int while processing
> icmp6,in_port=LOCAL,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:
> 00:00:02,ipv6_src=fe80::c50:caff:feed:5840,ipv6_dst=ff02::2,
> ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
> 2020-11-11T10:13:41.361Z|00049|ofproto_dpif_xlate|WARN|over 4096 resubmit
> actions on bridge br-int while processing
> arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:03:03,dl_dst=ff:ff:
> ff:ff:ff:ff,arp_spa=172.16.3.3,arp_tpa=172.16.3.3,arp_op=1,arp_sha=00:de:ad:
> ff:03:03,arp_tha=00:00:00:00:00:00   
> 2020-11-11T10:13:53.893Z|00001|ofproto_dpif_xlate(handler53)|WARN|over 4096
> resubmit actions on bridge br-int while processing
> icmp6,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:
> 00:02,ipv6_src=fe80::200:ff:fe00:1,ipv6_dst=ff02::2,ipv6_label=0x00000,
> nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
> 2020-11-11T10:13:57.476Z|00001|ofproto_dpif_xlate(handler50)|WARN|over 4096
> resubmit actions on bridge br-int while processing
> icmp6,in_port=LOCAL,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:
> 00:00:02,ipv6_src=fe80::c50:caff:feed:5840,ipv6_dst=ff02::2,
> ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
> 2020-11-11T10:15:24.007Z|00003|ofproto_dpif_xlate(handler53)|WARN|over 4096
> resubmit actions on bridge br-int while processing
> icmp6,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:
> 00:02,ipv6_src=fe80::200:ff:fe00:1,ipv6_dst=ff02::2,ipv6_label=0x00000,
> nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
> 
> <== also get 4k resubmit WARN message

The IPv6 RS packets hit the 4k resubmit limit as discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1894048#c9

The logs for ARPs correspond to GARPs originated by ovn-controller for logical router port owned IPs which happen to be sent before all logical flows and MC_FLOOD_L2 groups are installed in OVS.  This is a transient issue and shouldn't happen anymore once all flows get installed.  There's no easy way to delay ovn-controller from sending those GARPs until all flows are installed.  But this also shouldn't impact E-W, S-N traffic.

Once the flows are installed, running an ovn-trace simulating the GARPs we see that we definitely don't do 4k resubmits for GARPs:

# ovn-trace public 'inport=="public_r300" && eth.src==00:de:ad:ff:03:03 && eth.dst==ff:ff:ff:ff:ff:ff && arp && arp.op==1 &&  arp.spa == 172.16.3.3 && arp.tpa == 172.16.3.3 && arp.sha == 00:de:ad:ff:03:03'
# arp,reg14=0xfd,vlan_tci=0x0000,dl_src=00:de:ad:ff:03:03,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.3.3,arp_tpa=172.16.3.3,arp_op=1,arp_sha=00:de:ad:ff:03:03,arp_tha=00:00:00:00:00:00

ingress(dp="public", inport="public_r300")
------------------------------------------
 0. ls_in_port_sec_l2 (ovn-northd.c:4874): inport == "public_r300", priority 50, uuid 3467239b
    next;
 6. ls_in_acl_hint (ovn-northd.c:5326): !ct.trk, priority 5, uuid c938fca3
    reg0[8] = 1;
    reg0[9] = 1;
    next;
14. ls_in_arp_rsp (ovn-northd.c:6978): arp.tpa == 172.16.3.3 && arp.op == 1 && inport == "public_r300", priority 100, uuid d0eb7607
    next;
20. ls_in_l2_lkup (ovn-northd.c:6462): flags[1] == 0 && arp.op == 1 && arp.tpa == { 172.16.3.103, 172.16.3.3}, priority 80, uuid 6c02da99
    clone { outport = "public_r300"; output; };

    clone
    -----
            outport = "public_r300";
            output;
            /* omitting output because inport == outport && !flags.loopback */
    outport = "_MC_flood_l2";
    output;

multicast(dp="public", mcgroup="_MC_flood_l2")
----------------------------------------------

    egress(dp="public", inport="public_r300", outport="ln_p1")
    ----------------------------------------------------------
         4. ls_out_acl_hint (ovn-northd.c:5326): !ct.trk, priority 5, uuid 6faad367
            reg0[8] = 1;
            reg0[9] = 1;
            next;
        10. ls_out_port_sec_l2 (ovn-northd.c:4963): eth.mcast, priority 100, uuid ab8e9434
            output;
            /* output to "ln_p1", type "localnet" */

Comment 7 Jianlin Shi 2020-11-12 01:34:23 UTC
set VERIFIED per comment 6

Comment 9 errata-xmlrpc 2020-12-01 15:07:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5308