Bug 1650292

Summary: Losing a few packets when the first LACP bond member is brought up.
Product: Red Hat Enterprise Linux 7 Reporter: Andreas Karis <akaris>
Component: openvswitch Assignee: Matteo Croce <mcroce>
Status: CLOSED INSUFFICIENT_DATA QA Contact: HeKai Wang <hewang>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.6 CC: atragler, ctrautma, fhallal, hewang, kzhang, ovs-qe, qding, rkhan, tredaelli
Target Milestone: rc Flags: mcroce: needinfo? (akaris)
mcroce: needinfo? (akaris)
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-10-25 15:09:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Andreas Karis 2018-11-15 18:26:28 UTC
Description of problem:

Losing a few packets when the first LACP bond member is brought up.

Version-Release number of selected component (if applicable):

~~~
[root@overcloud-novacomputeiha-0 ~]# rpm -qa | egrep 'kernel|openvswitch'
openvswitch-ovn-host-2.9.0-56.el7fdp.x86_64
openvswitch-2.9.0-84.el7fdn.x86_64
kernel-tools-3.10.0-957.el7.x86_64
erlang-kernel-18.3.4.8-1.el7ost.x86_64
openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
kernel-3.10.0-957.el7.x86_64
openvswitch-ovn-common-2.9.0-56.el7fdp.x86_64
openstack-neutron-openvswitch-12.0.3-5.el7ost.noarch
python-openvswitch-2.9.0-56.el7fdp.noarch
kernel-tools-libs-3.10.0-957.el7.x86_64
openvswitch-ovn-central-2.9.0-56.el7fdp.x86_64
[root@overcloud-novacomputeiha-0 ~]# 
~~~

How reproducible:

Shut down the port facing the first LACP slave, e.g. on the switch. Start a ping across the bond, then bring the port back up and observe some packet loss.
I can reproduce this in software without DPDK (kernel datapath). I am fairly sure the same would happen with DPDK, although I didn't configure my system for it.

Steps to Reproduce:

~~~
ovs-vsctl add-br br1 -- set bridge br1 datapath_type=netdev
ip link add ovs-bond-if0 type veth peer name lx-bond-if0
ip link add ovs-bond-if1 type veth peer name lx-bond-if1
ip link add lx-bond0 type bond miimon 100 mode 802.3ad
ip link set dev lx-bond-if0 master lx-bond0
ip link set dev lx-bond-if1 master lx-bond0
ip link set dev lx-bond-if0 up
ip link set dev lx-bond-if1 up
ip link set dev ovs-bond-if0 up
ip link set dev ovs-bond-if1 up
ip link set dev lx-bond0 up
ovs-vsctl add-bond br1 dpdkbond1 ovs-bond-if0 ovs-bond-if1 -- set port dpdkbond1 lacp=active -- set port dpdkbond1 bond_mode=balance-tcp --  set port dpdkbond1 other-config:lacp-time=fast
ip link add name lx-bond0.905 link lx-bond0 type vlan id 905
ip link set dev lx-bond0.905 up
ip a a dev lx-bond0.905 192.168.123.10/24
ip link add veth2 type veth peer name veth3
ip netns add test2
ip link set dev veth2 netns test2
ip link set dev veth3 up
ip netns exec test2 ip link set dev lo up
ip netns exec test2 ip link set dev veth2 up
ip netns exec test2 ip a a dev veth2 192.168.123.11/24
ovs-vsctl add-port br1 veth3 tag=905
~~~

Run:
~~~
[root@overcloud-novacomputeiha-0 ~]# ip link set dev lx-bond-if0 down
~~~

Start a ping. While the ping is running, run:
~~~
[root@overcloud-novacomputeiha-0 ~]# ip link set dev lx-bond-if0 up
~~~
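To see which side is slow to converge when the member comes back, it may help to record the LACP and bond state on both ends during the flap. The following is only a sketch: `snapshot_lacp` is a hypothetical helper name, and the port and bond names are taken from the reproducer above. It relies on `ovs-appctl lacp/show`, `ovs-appctl bond/show` and the kernel bonding file under `/proc/net/bonding/`.

```shell
# Sketch: print a timestamped snapshot of LACP/bond state on both ends.
# $1 = OVS bond port, $2 = Linux bond device (defaults assume the
# names used in the reproducer; adjust for other setups).
snapshot_lacp() {
    ovs_port=${1:-dpdkbond1}
    lx_bond=${2:-lx-bond0}
    echo "=== $(date '+%H:%M:%S') ==="
    # OVS side: per-member LACP actor/partner state and bond member status
    ovs-appctl lacp/show "$ovs_port"
    ovs-appctl bond/show "$ovs_port"
    # Linux bonding side: 802.3ad aggregator and per-slave state
    cat "/proc/net/bonding/$lx_bond"
}
```

Running something like `while sleep 0.5; do snapshot_lacp; done > lacp-trace.log` in a second terminal before bringing lx-bond-if0 up would give a rough timeline to compare against the ping gap.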

Observe packet loss:
~~~
[root@overcloud-novacomputeiha-0 ~]# ping 192.168.123.11 -i 0.1
PING 192.168.123.11 (192.168.123.11) 56(84) bytes of data.
64 bytes from 192.168.123.11: icmp_seq=1 ttl=64 time=0.235 ms
64 bytes from 192.168.123.11: icmp_seq=2 ttl=64 time=0.214 ms
64 bytes from 192.168.123.11: icmp_seq=3 ttl=64 time=0.167 ms
64 bytes from 192.168.123.11: icmp_seq=4 ttl=64 time=0.170 ms
64 bytes from 192.168.123.11: icmp_seq=5 ttl=64 time=0.165 ms
64 bytes from 192.168.123.11: icmp_seq=6 ttl=64 time=0.163 ms
64 bytes from 192.168.123.11: icmp_seq=7 ttl=64 time=0.184 ms
64 bytes from 192.168.123.11: icmp_seq=8 ttl=64 time=0.165 ms
64 bytes from 192.168.123.11: icmp_seq=9 ttl=64 time=0.169 ms
64 bytes from 192.168.123.11: icmp_seq=10 ttl=64 time=0.162 ms
64 bytes from 192.168.123.11: icmp_seq=11 ttl=64 time=0.173 ms
64 bytes from 192.168.123.11: icmp_seq=12 ttl=64 time=0.200 ms
64 bytes from 192.168.123.11: icmp_seq=13 ttl=64 time=0.188 ms
64 bytes from 192.168.123.11: icmp_seq=14 ttl=64 time=0.192 ms
64 bytes from 192.168.123.11: icmp_seq=15 ttl=64 time=0.174 ms
64 bytes from 192.168.123.11: icmp_seq=16 ttl=64 time=0.209 ms
64 bytes from 192.168.123.11: icmp_seq=17 ttl=64 time=0.174 ms
64 bytes from 192.168.123.11: icmp_seq=18 ttl=64 time=0.178 ms
64 bytes from 192.168.123.11: icmp_seq=19 ttl=64 time=0.205 ms
64 bytes from 192.168.123.11: icmp_seq=20 ttl=64 time=0.188 ms
64 bytes from 192.168.123.11: icmp_seq=21 ttl=64 time=0.182 ms
64 bytes from 192.168.123.11: icmp_seq=22 ttl=64 time=0.174 ms
64 bytes from 192.168.123.11: icmp_seq=23 ttl=64 time=0.187 ms
64 bytes from 192.168.123.11: icmp_seq=24 ttl=64 time=0.251 ms
64 bytes from 192.168.123.11: icmp_seq=25 ttl=64 time=0.191 ms
64 bytes from 192.168.123.11: icmp_seq=26 ttl=64 time=0.193 ms
64 bytes from 192.168.123.11: icmp_seq=27 ttl=64 time=0.190 ms
64 bytes from 192.168.123.11: icmp_seq=28 ttl=64 time=0.194 ms
64 bytes from 192.168.123.11: icmp_seq=29 ttl=64 time=0.205 ms
64 bytes from 192.168.123.11: icmp_seq=30 ttl=64 time=0.186 ms
64 bytes from 192.168.123.11: icmp_seq=31 ttl=64 time=0.183 ms
64 bytes from 192.168.123.11: icmp_seq=32 ttl=64 time=0.211 ms
64 bytes from 192.168.123.11: icmp_seq=33 ttl=64 time=0.183 ms
64 bytes from 192.168.123.11: icmp_seq=34 ttl=64 time=0.182 ms
64 bytes from 192.168.123.11: icmp_seq=35 ttl=64 time=0.186 ms
64 bytes from 192.168.123.11: icmp_seq=36 ttl=64 time=0.191 ms
64 bytes from 192.168.123.11: icmp_seq=37 ttl=64 time=0.192 ms
64 bytes from 192.168.123.11: icmp_seq=38 ttl=64 time=0.180 ms
64 bytes from 192.168.123.11: icmp_seq=39 ttl=64 time=0.189 ms
64 bytes from 192.168.123.11: icmp_seq=40 ttl=64 time=0.174 ms
64 bytes from 192.168.123.11: icmp_seq=41 ttl=64 time=0.185 ms
64 bytes from 192.168.123.11: icmp_seq=42 ttl=64 time=0.188 ms
64 bytes from 192.168.123.11: icmp_seq=43 ttl=64 time=0.185 ms
64 bytes from 192.168.123.11: icmp_seq=44 ttl=64 time=0.183 ms
64 bytes from 192.168.123.11: icmp_seq=45 ttl=64 time=0.183 ms
64 bytes from 192.168.123.11: icmp_seq=46 ttl=64 time=0.178 ms
64 bytes from 192.168.123.11: icmp_seq=47 ttl=64 time=0.177 ms
64 bytes from 192.168.123.11: icmp_seq=48 ttl=64 time=0.186 ms
64 bytes from 192.168.123.11: icmp_seq=49 ttl=64 time=0.175 ms
64 bytes from 192.168.123.11: icmp_seq=50 ttl=64 time=0.194 ms
64 bytes from 192.168.123.11: icmp_seq=51 ttl=64 time=0.175 ms
64 bytes from 192.168.123.11: icmp_seq=52 ttl=64 time=0.185 ms
64 bytes from 192.168.123.11: icmp_seq=53 ttl=64 time=0.186 ms
64 bytes from 192.168.123.11: icmp_seq=54 ttl=64 time=0.201 ms
64 bytes from 192.168.123.11: icmp_seq=55 ttl=64 time=0.182 ms
64 bytes from 192.168.123.11: icmp_seq=56 ttl=64 time=0.175 ms
64 bytes from 192.168.123.11: icmp_seq=57 ttl=64 time=0.180 ms
64 bytes from 192.168.123.11: icmp_seq=58 ttl=64 time=0.174 ms
64 bytes from 192.168.123.11: icmp_seq=59 ttl=64 time=0.211 ms
64 bytes from 192.168.123.11: icmp_seq=60 ttl=64 time=0.177 ms
64 bytes from 192.168.123.11: icmp_seq=61 ttl=64 time=0.171 ms
64 bytes from 192.168.123.11: icmp_seq=62 ttl=64 time=0.264 ms
64 bytes from 192.168.123.11: icmp_seq=63 ttl=64 time=0.185 ms
64 bytes from 192.168.123.11: icmp_seq=64 ttl=64 time=0.188 ms
64 bytes from 192.168.123.11: icmp_seq=65 ttl=64 time=0.180 ms
64 bytes from 192.168.123.11: icmp_seq=66 ttl=64 time=0.180 ms
64 bytes from 192.168.123.11: icmp_seq=67 ttl=64 time=0.173 ms
64 bytes from 192.168.123.11: icmp_seq=68 ttl=64 time=0.174 ms
64 bytes from 192.168.123.11: icmp_seq=69 ttl=64 time=0.184 ms
64 bytes from 192.168.123.11: icmp_seq=70 ttl=64 time=0.180 ms
64 bytes from 192.168.123.11: icmp_seq=71 ttl=64 time=0.179 ms
64 bytes from 192.168.123.11: icmp_seq=72 ttl=64 time=0.170 ms
64 bytes from 192.168.123.11: icmp_seq=73 ttl=64 time=0.178 ms
64 bytes from 192.168.123.11: icmp_seq=74 ttl=64 time=0.171 ms
64 bytes from 192.168.123.11: icmp_seq=81 ttl=64 time=0.267 ms
64 bytes from 192.168.123.11: icmp_seq=82 ttl=64 time=0.187 ms
64 bytes from 192.168.123.11: icmp_seq=83 ttl=64 time=0.181 ms
64 bytes from 192.168.123.11: icmp_seq=84 ttl=64 time=0.190 ms
64 bytes from 192.168.123.11: icmp_seq=85 ttl=64 time=0.181 ms
64 bytes from 192.168.123.11: icmp_seq=86 ttl=64 time=0.172 ms
64 bytes from 192.168.123.11: icmp_seq=87 ttl=64 time=0.172 ms
64 bytes from 192.168.123.11: icmp_seq=88 ttl=64 time=0.173 ms
64 bytes from 192.168.123.11: icmp_seq=89 ttl=64 time=0.182 ms
64 bytes from 192.168.123.11: icmp_seq=90 ttl=64 time=0.180 ms
64 bytes from 192.168.123.11: icmp_seq=91 ttl=64 time=0.176 ms
^C
--- 192.168.123.11 ping statistics ---
91 packets transmitted, 85 received, 6% packet loss, time 9020ms
rtt min/avg/max/mdev = 0.162/0.186/0.267/0.019 ms
[root@overcloud-novacomputeiha-0 ~]# 
~~~
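The loss in the output above is the jump from icmp_seq=74 to icmp_seq=81, i.e. six consecutive probes (75-80) at 0.1 s intervals, which matches the 91 transmitted / 85 received summary. For longer runs, a small helper can pull such gaps out of a saved ping log. This is a minimal sketch; `find_ping_gaps` is a hypothetical name and not part of any existing tool.

```shell
# Sketch: read ping output on stdin and report gaps in icmp_seq numbers.
find_ping_gaps() {
    awk '
        match($0, /icmp_seq=[0-9]+/) {
            # "icmp_seq=" is 9 characters long; the digits follow it
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            if (prev != "" && seq > prev + 1) {
                printf "lost icmp_seq %d-%d (%d packets)\n", prev + 1, seq - 1, seq - prev - 1
            }
            prev = seq
        }
    '
}
```

For example, saving the run with `ping 192.168.123.11 -i 0.1 | tee ping.log` and then calling `find_ping_gaps < ping.log` would print `lost icmp_seq 75-80 (6 packets)` for the transcript above.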

Comment 2 Andreas Karis 2018-11-21 23:57:40 UTC
I just tried this again, and it turns out this isn't as easy to reproduce as I thought. While I could originally reproduce this with the software reproducer, the same system now shows no packet loss.