
Bug 1894048

Summary: [OVN2.11] arp flooding and reaching limit: ofproto_dpif_xlate|WARN|over 4096 resubmit
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: ggrimaux
Component: ovn2.11
Assignee: Dumitru Ceara <dceara>
Status: CLOSED ERRATA
QA Contact: Jianlin Shi <jishi>
Severity: urgent
Docs Contact:
Priority: urgent
Version: FDP 20.H
CC: apevec, ctrautma, dceara, kfida, lhh, majopela, scohen
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovn2.11-2.11.1-56.el7fdp ovn2.11-2.11.1-56.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1894478 1894872
Environment:
Last Closed: 2020-12-01 15:07:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1894478, 1894872    

Description ggrimaux 2020-11-03 12:41:14 UTC
Description of problem:
The customer is affected by floating IPs being randomly removed from instances: the ARP request sent during a DHCP renew is sometimes not processed in time, the TTL reaches 0, and the IP is therefore removed, which impacts traffic flow.

We have logs and sosreport from the affected instances.

We need your help to work around this issue.

Version-Release number of selected component (if applicable):
OSP13z12
OVN 2.11.1-55 (Hotfix version provided for another bug they encountered (https://bugzilla.redhat.com/show_bug.cgi?id=1890974))

How reproducible:
Currently it's all the time.

Steps to Reproduce:
1. You just have to wait.

Actual results:
Instances can't renew their DHCP lease because of ARP flooding that reaches the resubmit limit (ofproto_dpif_xlate|WARN|over 4096 resubmit)

Expected results:
The ARP flooding is prevented and the limit is no longer reached.

Additional info:

Comment 2 Jianlin Shi 2020-11-04 00:44:28 UTC
Hi Dumitru,

will this bug go into 20.I?

Comment 3 Dumitru Ceara 2020-11-04 08:49:09 UTC
(In reply to Jianlin Shi from comment #2)
> Hi Dumitru,
> 
> will this bug go into 20.I?

Hi Jianlin,

Probably not, the fix is still under review upstream.

Regards,
Dumitru

Comment 7 Jianlin Shi 2020-11-11 07:39:41 UTC
Create the topology as follows:
VM0 -- br-ex --- localnet-port -- pub-logical-switch -- LR1 -- VM1 (FIP1)
                                                     -- LR2 -- VM2 (FIP2)
                                                     -- LRn -- VMn (FIPn)
with the following script:

ovn-nbctl ls-add public
ovn-nbctl lsp-add public ln_p1
ovn-nbctl lsp-set-addresses ln_p1 unknown
ovn-nbctl lsp-set-type ln_p1 localnet
ovn-nbctl lsp-set-options ln_p1 network_name=nattest

i=1
for m in $(seq 0 4); do
  for n in $(seq 1 99); do
    # router r$i with its public port, internal port and floating IP
    ovn-nbctl lr-add r${i}
    ovn-nbctl lrp-add r${i} r${i}_public 00:de:ad:ff:$m:$n 172.16.$m.$n/16
    ovn-nbctl lrp-add r${i} r${i}_s${i} 00:de:ad:fe:$m:$n 173.$m.$n.1/24
    ovn-nbctl lr-nat-add r${i} dnat_and_snat 172.16.${m}.$((n+100)) 173.$m.$n.2
    ovn-nbctl set logical_router_port r${i}_public options:redirect-chassis=hv1

    # internal switch s$i
    ovn-nbctl ls-add s${i}

    # s$i - r$i
    ovn-nbctl lsp-add s${i} s${i}_r${i}
    ovn-nbctl lsp-set-type s${i}_r${i} router
    ovn-nbctl lsp-set-addresses s${i}_r${i} router
    ovn-nbctl lsp-set-options s${i}_r${i} router-port=r${i}_s${i}

    # s$i - vm$i
    ovn-nbctl lsp-add s$i vm$i
    ovn-nbctl lsp-set-addresses vm$i "00:de:ad:01:$m:$n 173.$m.$n.2"

    # public - r$i
    ovn-nbctl lsp-add public public_r${i}
    ovn-nbctl lsp-set-type public_r${i} router
    ovn-nbctl lsp-set-addresses public_r${i} router
    ovn-nbctl lsp-set-options public_r${i} router-port=r${i}_public

    # stop after 300 routers
    i=$((i + 1))
    if [ "$i" -gt 300 ]; then
      break 2
    fi
  done
done
#add host vm1
ip netns add vm1
ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal
ip link set vm1 netns vm1
ip netns exec vm1 ip link set vm1 address 00:de:ad:01:00:01
ip netns exec vm1 ip addr add 173.0.1.2/24 dev vm1
ip netns exec vm1 ip link set vm1 up
ovs-vsctl set Interface vm1 external_ids:iface-id=vm1
                
ip netns add vm2
ovs-vsctl add-port br-int vm2 -- set interface vm2 type=internal
ip link set vm2 netns vm2
ip netns exec vm2 ip link set vm2 address 00:de:ad:01:00:02
ip netns exec vm2 ip addr add 173.0.2.2/24 dev vm2
ip netns exec vm2 ip link set vm2 up
ovs-vsctl set Interface vm2 external_ids:iface-id=vm2
                
#set up provider network
ovs-vsctl add-br nat_test
ip link set nat_test up
ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=nattest:nat_test

ip netns add vm0
ovs-vsctl add-port nat_test vm0 -- set interface vm0 type=internal
ip link set vm0 netns vm0
ip netns exec vm0 ip link set vm0 address 00:00:00:00:00:01
ip netns exec vm0 ip addr add 172.16.0.100/16 dev vm0
ip netns exec vm0 ip link set vm0 up
ovs-vsctl set Interface vm0 external_ids:iface-id=vm0
ip netns exec vm1 ip route add default via 173.0.1.1
ip netns exec vm2 ip route add default via 173.0.2.1

ovn-nbctl --wait=hv sync
sleep 30
ip netns exec vm1 ping 172.16.0.102 -c 1
ip netns exec vm1 ping 172.16.0.100 -c 1
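
For reference, the loop in the script gives each router r<i> a public port address 172.16.m.n and a floating IP 172.16.m.(n+100), where m and n are derived from the loop position. The sketch below reproduces that arithmetic; the function name fip_for_router is hypothetical and not part of the test suite. It shows that 172.16.0.102, the first ping target, is the floating IP of r2, while 172.16.0.100 is the address given to vm0.

```shell
#!/bin/bash
# Hypothetical helper: map a router index i (1..300) to the floating IP
# that the setup loop assigns to it (99 routers per value of m).
fip_for_router() {
    local i=$1
    local m=$(( (i - 1) / 99 ))
    local n=$(( (i - 1) % 99 + 1 ))
    echo "172.16.${m}.$(( n + 100 ))"
}

fip_for_router 2      # prints 172.16.0.102
fip_for_router 300    # prints 172.16.3.103
```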

reproduced on 2.11.1-55:

[root@dell-per740-12 bz1776712_broadcast_limit]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.11-central-2.11.1-55.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-common-1.0-13.noarch
openvswitch2.11-2.11.3-76.el7fdp.x86_64
ovn2.11-2.11.1-55.el7fdp.x86_64
ovn2.11-host-2.11.1-55.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-regression-bz1776712_broadcast_limit-1.0-1.noarch
python-openvswitch2.11-2.11.3-76.el7fdp.x86_64

:: [ 02:17:34 ] :: [  BEGIN   ] :: Running 'ovn-nbctl --wait=hv sync'
:: [ 02:17:51 ] :: [   PASS   ] :: Command 'ovn-nbctl --wait=hv sync' (Expected 0, got 0)
:: [ 02:17:51 ] :: [  BEGIN   ] :: Running 'ip netns exec vm1 ping 172.16.0.102 -c 1'
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.                         
                                                           
--- 172.16.0.102 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
                  
:: [ 02:18:01 ] :: [   FAIL   ] :: Command 'ip netns exec vm1 ping 172.16.0.102 -c 1' (Expected 0, got 1)
:: [ 02:18:01 ] :: [  BEGIN   ] :: Running 'ip netns exec vm1 ping 172.16.0.100 -c 1'
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
                              
--- 172.16.0.100 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
                         
:: [ 02:18:11 ] :: [   FAIL   ] :: Command 'ip netns exec vm1 ping 172.16.0.100 -c 1' (Expected 0, got 1)

<=== FAIL

[root@dell-per740-12 bz1776712_broadcast_limit]# grep 4096 /var/log/openvswitch/ovs-vswitchd.log      
2020-11-11T07:16:34.346Z|00049|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:01:86,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.1.86,arp_tpa=172.16.1.86,arp_op=1,arp_sha=00:de:ad:ff:01:86,arp_tha=00:00:00:00:00:00 
2020-11-11T07:16:34.349Z|00051|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:01:87,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.1.87,arp_tpa=172.16.1.87,arp_op=1,arp_sha=00:de:ad:ff:01:87,arp_tha=00:00:00:00:00:00 
2020-11-11T07:16:34.575Z|00053|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:02:10,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.2.10,arp_tpa=172.16.2.10,arp_op=1,arp_sha=00:de:ad:ff:02:10,arp_tha=00:00:00:00:00:00 
2020-11-11T07:16:34.581Z|00055|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:02:09,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.2.9,arp_tpa=172.16.2.9,arp_op=1,arp_sha=00:de:ad:ff:02:09,arp_tha=00:00:00:00:00:00   
2020-11-11T07:16:34.584Z|00057|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:02:11,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.2.11,arp_tpa=172.16.2.11,arp_op=1,arp_sha=00:de:ad:ff:02:11,arp_tha=00:00:00:00:00:00 
2020-11-11T07:17:35.074Z|00098|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:02:81,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.2.81,arp_tpa=172.16.2.81,arp_op=1,arp_sha=00:de:ad:ff:02:81,arp_tha=00:00:00:00:00:00 
2020-11-11T07:20:54.066Z|00127|ofproto_dpif_xlate|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=00:de:ad:ff:00:01,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=172.16.0.1,arp_tpa=172.16.0.102,arp_op=1,arp_sha=00:de:ad:ff:00:01,arp_tha=00:00:00:00:00:00

<=== 4k resubmit
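
To see at a glance which kind of traffic triggers the warning, the matching log lines can be grouped by the protocol that follows "while processing". This is only a sketch that assumes the log format shown above; count_resubmit_warnings is a hypothetical helper name, and it reads the log text from standard input.

```shell
#!/bin/bash
# Hypothetical helper: count "over 4096 resubmit" warnings per protocol.
# Reads ovs-vswitchd log text on stdin, prints "<protocol> <count>" lines.
count_resubmit_warnings() {
    awk '/over 4096 resubmit actions/ {
             # Drop everything up to the flow description; its first
             # comma-separated token is the protocol (arp, icmp6, ...).
             sub(/.*while processing /, "")
             split($0, fields, ",")
             count[fields[1]]++
         }
         END { for (p in count) print p, count[p] }' | sort
}

# Usage (log path as used in this report):
#   count_resubmit_warnings < /var/log/openvswitch/ovs-vswitchd.log
```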

Verified on 2.11.1-56:

:: [ 02:38:53 ] :: [   PASS   ] :: Command 'ovn-nbctl --wait=hv sync' (Expected 0, got 0)
:: [ 02:38:53 ] :: [  BEGIN   ] :: Running 'ip netns exec vm1 ping 172.16.0.102 -c 1'
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=270 ms                                             

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 270.661/270.661/270.661/0.000 ms
:: [ 02:38:53 ] :: [   PASS   ] :: Command 'ip netns exec vm1 ping 172.16.0.102 -c 1' (Expected 0, got 0)
:: [ 02:38:53 ] :: [  BEGIN   ] :: Running 'ip netns exec vm1 ping 172.16.0.100 -c 1'
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=176 ms
                                                                                                      
--- 172.16.0.100 ping statistics ---                                                                  
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 176.110/176.110/176.110/0.000 ms

<=== pass

[root@dell-per740-12 bz1776712_broadcast_limit]# grep 4096 /var/log/openvswitch/ovs-vswitchd.log

<=== no 4k resubmit warning

:: [ 02:38:53 ] :: [   PASS   ] :: Command 'ip netns exec vm1 ping 172.16.0.100 -c 1' (Expected 0, got 0)
[root@dell-per740-12 bz1776712_broadcast_limit]# rpm -qa | grep -E "openvswitch|ovn"                  
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.11-host-2.11.1-56.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-common-1.0-13.noarch                                         
openvswitch2.11-2.11.3-76.el7fdp.x86_64
ovn2.11-central-2.11.1-56.el7fdp.x86_64                                                               
kernel-kernel-networking-openvswitch-ovn-regression-bz1776712_broadcast_limit-1.0-1.noarch            
python-openvswitch2.11-2.11.3-76.el7fdp.x86_64
ovn2.11-2.11.1-56.el7fdp.x86_64

Comment 8 Jianlin Shi 2020-11-11 09:38:09 UTC
on rhel8 version:

[root@wsfd-advnetlab19 bz1776712_broadcast_limit]# rpm -qa | grep -E "openvswitch|ovn"
python3-openvswitch2.11-2.11.3-73.el8fdp.x86_64                                                       
kernel-kernel-networking-openvswitch-ovn-scenario-1.0-12.noarch
ovn2.11-central-2.11.1-56.el8fdp.x86_64                                                               
kernel-kernel-networking-openvswitch-ovn-common-1.0-13.noarch
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
kernel-kernel-networking-openvswitch-ovn-regression-bz1775778_kernel_unknown_flow-1.0-2.noarch
kernel-kernel-networking-openvswitch-ovn-basic-1.0-31.noarch
openvswitch2.11-2.11.3-73.el8fdp.x86_64                                                               
kernel-kernel-networking-openvswitch-ovn-regression-bz1776712_broadcast_limit-1.0-1.noarch
ovn2.11-2.11.1-56.el8fdp.x86_64                                                                       
ovn2.11-host-2.11.1-56.el8fdp.x86_64

:: [ 04:35:15 ] :: [  BEGIN   ] :: Running 'ovn-nbctl --wait=hv sync'
:: [ 04:35:30 ] :: [   PASS   ] :: Command 'ovn-nbctl --wait=hv sync' (Expected 0, got 0)
:: [ 04:35:30 ] :: [  BEGIN   ] :: Running 'ip netns exec vm1 ping 172.16.0.102 -c 1'
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.                                                
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=399 ms                                             

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 398.909/398.909/398.909/0.000 ms                                               
:: [ 04:35:31 ] :: [   PASS   ] :: Command 'ip netns exec vm1 ping 172.16.0.102 -c 1' (Expected 0, got 0)
:: [ 04:35:31 ] :: [  BEGIN   ] :: Running 'ip netns exec vm1 ping 172.16.0.100 -c 1'
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=308 ms                                             

--- 172.16.0.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 308.153/308.153/308.153/0.000 ms
:: [ 04:35:31 ] :: [   PASS   ] :: Command 'ip netns exec vm1 ping 172.16.0.100 -c 1' (Expected 0, got 0)

<== ping passed

[root@wsfd-advnetlab19 ~]# grep 4096 /var/log/openvswitch/ovs-vswitchd.log                            
2020-11-11T09:35:28.867Z|00001|ofproto_dpif_xlate(handler55)|WARN|over 4096 resubmit actions on bridge br-int while processing icmp6,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:00:02,ipv6_src=fe80::200:ff:fe00:1,ipv6_dst=ff02::2,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
2020-11-11T09:35:29.381Z|00001|ofproto_dpif_xlate(handler61)|WARN|over 4096 resubmit actions on bridge br-int while processing icmp6,in_port=LOCAL,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:00:02,ipv6_src=fe80::2408:54ff:fefc:dd4b,ipv6_dst=ff02::2,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
2020-11-11T09:35:45.770Z|00001|ofproto_dpif_xlate(handler70)|WARN|over 4096 resubmit actions on bridge br-int while processing icmp6,in_port=LOCAL,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:00:02,ipv6_src=fe80::2408:54ff:fefc:dd4b,ipv6_dst=ff02::2,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
2020-11-11T09:35:45.770Z|00001|ofproto_dpif_xlate(handler56)|WARN|over 4096 resubmit actions on bridge br-int while processing icmp6,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:00:02,ipv6_src=fe80::200:ff:fe00:1,ipv6_dst=ff02::2,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0

<=== still get 4096 WARN log

Comment 9 Dumitru Ceara 2020-11-11 09:50:07 UTC
(In reply to Jianlin Shi from comment #8)
[...]             
> PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
> 64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=308 ms                   
> 
> 
> --- 172.16.0.100 ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 308.153/308.153/308.153/0.000 ms
> :: [ 04:35:31 ] :: [   PASS   ] :: Command 'ip netns exec vm1 ping
> 172.16.0.100 -c 1' (Expected 0, got 0)
> 
> <== ping passed
> 
> [root@wsfd-advnetlab19 ~]# grep 4096 /var/log/openvswitch/ovs-vswitchd.log  
> 
> 2020-11-11T09:35:28.867Z|00001|ofproto_dpif_xlate(handler55)|WARN|over 4096
> resubmit actions on bridge br-int while processing
> icmp6,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:
> 00:02,ipv6_src=fe80::200:ff:fe00:1,ipv6_dst=ff02::2,ipv6_label=0x00000,
> nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
> 2020-11-11T09:35:29.381Z|00001|ofproto_dpif_xlate(handler61)|WARN|over 4096
> resubmit actions on bridge br-int while processing
> icmp6,in_port=LOCAL,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:
> 00:00:02,ipv6_src=fe80::2408:54ff:fefc:dd4b,ipv6_dst=ff02::2,
> ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
> 2020-11-11T09:35:45.770Z|00001|ofproto_dpif_xlate(handler70)|WARN|over 4096
> resubmit actions on bridge br-int while processing
> icmp6,in_port=LOCAL,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:
> 00:00:02,ipv6_src=fe80::2408:54ff:fefc:dd4b,ipv6_dst=ff02::2,
> ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
> 2020-11-11T09:35:45.770Z|00001|ofproto_dpif_xlate(handler56)|WARN|over 4096
> resubmit actions on bridge br-int while processing
> icmp6,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:01,dl_dst=33:33:00:00:
> 00:02,ipv6_src=fe80::200:ff:fe00:1,ipv6_dst=ff02::2,ipv6_label=0x00000,
> nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0
> 
> <=== still get 4096 WARN log

These 4k resubmits here happen for IPv6 Router Solicitation packets (icmp_type=133) generated by the netns.  They have nothing to do with the IPv4 ARPs and/or ICMPv4 packets which flow fine as shown by the successful ping above.

This is however an issue, and we should probably restrict flooding of IPv6 RS (or block it completely).  This was already reported upstream at:
https://mail.openvswitch.org/pipermail/ovs-discuss/2020-September/050713.html

I think it would be better to open a separate BZ to track the IPv6 Router Solicitation 4k resubmit issue.

Thanks,
Dumitru
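
The distinction drawn in this comment can be turned into a pass/fail check: warnings triggered by ARP mean the bug is still present, while warnings triggered by IPv6 Router Solicitations (icmp_type=133) belong to the separate issue. The following is a rough sketch under that assumption; check_arp_resubmit_fixed is a hypothetical name, and the function reads the ovs-vswitchd log from standard input.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the log on stdin contains no
# ARP-triggered "over 4096 resubmit" warnings. ICMPv6 Router Solicitation
# warnings are counted and reported but do not cause a failure.
check_arp_resubmit_fixed() {
    awk '/over 4096 resubmit actions/ {
             if ($0 ~ /while processing arp,/)
                 arp++
             else if ($0 ~ /while processing icmp6,/ && $0 ~ /icmp_type=133/)
                 rs++
             else
                 other++
         }
         END {
             printf "arp=%d icmp6_rs=%d other=%d\n", arp, rs, other
             exit (arp > 0)
         }'
}

# Usage (log path as used in this report):
#   check_arp_resubmit_fixed < /var/log/openvswitch/ovs-vswitchd.log
```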

Comment 10 Jianlin Shi 2020-11-11 09:51:50 UTC
set VERIFIED per comment 9

Comment 11 Jianlin Shi 2020-11-12 01:40:49 UTC
> I think it would be better to open a separate BZ to track the IPv6 Router
> Solicitation 4k resubmit issue.

<=== Added bz1896993 to track the IPv6 RS issue.

Comment 12 ggrimaux 2020-11-16 09:38:47 UTC
I hereby confirm that the patch provided fixes the issue the customer was experiencing.

Thank you very much for the quick work on this!

Comment 14 errata-xmlrpc 2020-12-01 15:07:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn2.11 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5309