Description of problem:
We are trying to understand why VXLAN packets are getting lost when galera is being promoted. VLAN networks are not affected by this.

Version-Release number of selected component (if applicable):
OSP13.6

How reproducible:
Every time

Steps to Reproduce:
A bit more detail on our testing procedure: we have created 2 RHEL 7.5 VMs on different computes, on a VxLAN network without a qrouter, and installed iperf3 from the RHEL 7 repo.

On serverB VM: sudo iperf3 -s -p 5000
On serverA VM: sudo iperf3 -c 192.168.255.34 -p 5000 -u -t 3000 -i 1 --length 3000 -b 20M

1. sudo pcs resource disable galera-bundle --> no packet loss
2. (Wait for the DB to be fully down) --> no packet loss
3. sudo pcs resource enable galera-bundle (when galera promotes a new master) --> small amount of packet loss (1 or 2 packets). It is worth noting that we are sending about 800 UDP pps, compared to prod where it is about 12k TCP pps; measuring was more complicated with TCP and at higher packet rates.

Actual results:
2021-10-20T19:00:19.375Z|10335|connmgr|INFO|br-int<->unix#27883: 100 flow_mods in the last 0 s (100 adds)
2021-10-20T19:00:19.420Z|10336|connmgr|INFO|br-int<->unix#27886: 85 flow_mods in the last 0 s (85 adds)
2021-10-20T19:00:19.444Z|10337|connmgr|INFO|br-int<->unix#27889: 1 flow_mods in the last 0 s (1 deletes)
2021-10-20T19:00:19.470Z|10338|connmgr|INFO|br-int<->unix#27892: 1 flow_mods in the last 0 s (1 deletes)
2021-10-20T19:00:19.495Z|10339|connmgr|INFO|br-int<->unix#27895: 1 flow_mods in the last 0 s (1 deletes)
2021-10-20T19:00:19.540Z|10340|connmgr|INFO|br-int<->unix#27898: 4 flow_mods in the last 0 s (4 deletes)
2021-10-20T19:00:19.625Z|10341|connmgr|INFO|br-int<->unix#27901: 93 flow_mods in the last 0 s (93 adds)
2021-10-20T19:00:29.132Z|10342|connmgr|INFO|br-int<->tcp:127.0.0.1:6633: 2 flow_mods 10 s ago (2 deletes)

Expected results:
No impact on VXLAN traffic between compute nodes

Additional info:
It would seem that this behaviour also affects OSP13z16 in a test lab.
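As a hedged aside (the tunnel interface name is an assumption, not a value from this report), the loss window can also be inspected at a lower level on the compute node hosting the iperf3 server: capture the VXLAN traffic (UDP port 4789) while galera is re-enabled, and follow the flow_mods bursts shown above in the ovs-vswitchd log:

# tcpdump -ni eth1 -w /tmp/vxlan-promotion.pcap udp port 4789
# tail -f /var/log/openvswitch/ovs-vswitchd.log | grep flow_mods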
We also reproduced the behaviour on the fully updated QAsite1 environment (OSP13z16 + kernel 3.10.0-1160.45.1.el7).
I've updated the summary because we're seeing this on RHOSP13-z-latest too.
Hi,

I've reproduced the issue on a virtualized RHOSP cluster running on top of libvirt in VMs. On compute-0 and compute-1, prepare a RHEL7 image with iperf3. Connect both VMs to a vxlan network with DHCP but with no gateway (here bond0, which enslaves ens3).

ServerA:

[root@instack ~]# ip -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
1: lo    inet6 ::1/128 scope host \       valid_lft forever preferred_lft forever
4: bond0    inet 10.50.60.88/24 brd 10.50.60.255 scope global dynamic bond0\       valid_lft 85080sec preferred_lft 85080sec
4: bond0    inet6 fe80::f816:3eff:fede:4fb7/64 scope link \       valid_lft forever preferred_lft forever
8: bond1.2010    inet 10.0.0.1/24 brd 10.0.0.255 scope global bond1.2010\       valid_lft forever preferred_lft forever
[root@instack ~]# ip -o r
10.0.0.0/24 dev bond1.2010 proto kernel scope link src 10.0.0.1
10.50.60.0/24 dev bond0 proto kernel scope link src 10.50.60.88
169.254.169.254 via 10.50.60.2 dev bond0 proto static

ServerB:

[root@instack ~]# ip -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
1: lo    inet6 ::1/128 scope host \       valid_lft forever preferred_lft forever
4: bond0    inet 10.50.60.224/24 brd 10.50.60.255 scope global dynamic bond0\       valid_lft 85056sec preferred_lft 85056sec
4: bond0    inet6 fe80::f816:3eff:fefa:43fc/64 scope link \       valid_lft forever preferred_lft forever
8: bond1.2010    inet 10.0.0.1/24 brd 10.0.0.255 scope global bond1.2010\       valid_lft forever preferred_lft forever
[root@instack ~]# ip -o r
10.0.0.0/24 dev bond1.2010 proto kernel scope link src 10.0.0.1
10.50.60.0/24 dev bond0 proto kernel scope link src 10.50.60.224
169.254.169.254 via 10.50.60.3 dev bond0 proto static
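For reference, a minimal sketch of how such a test network and the two VMs can be created with the openstack CLI, assuming admin credentials for the provider options (the network, subnet, image, flavor and compute host names are illustrative, not taken from this environment):

$ openstack network create repro-vxlan --provider-network-type vxlan
$ openstack subnet create repro-subnet --network repro-vxlan --subnet-range 10.50.60.0/24 --dhcp --gateway none
$ openstack server create --image rhel7-iperf3 --flavor m1.small --network repro-vxlan --availability-zone nova:compute-0.localdomain serverA
$ openstack server create --image rhel7-iperf3 --flavor m1.small --network repro-vxlan --availability-zone nova:compute-1.localdomain serverB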
On serverA, I'm running this:
# iperf3 -s -p 5000

On ServerB, I'm running this:
# iperf3 -c 10.50.60.88 -p 5000 -u -t 3000 -i 1 --length 1390 -b 200M

When everything is stable in the cluster, I get zero packet loss (see the last column):

[root@instack ~]# iperf3 -s -p 5000
-----------------------------------------------------------
Server listening on 5000
-----------------------------------------------------------
Accepted connection from 10.50.60.224, port 36566
[  5] local 10.50.60.88 port 5000 connected to 10.50.60.224 port 36310
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec  22.2 MBytes   186 Mbits/sec  0.004 ms  0/16732 (0%)
[  5]   1.00-2.00   sec  23.8 MBytes   199 Mbits/sec  0.010 ms  0/17926 (0%)
[  5]   2.00-3.00   sec  23.9 MBytes   200 Mbits/sec  0.003 ms  0/18023 (0%)
[  5]   3.00-4.00   sec  23.8 MBytes   199 Mbits/sec  0.003 ms  0/17923 (0%)
[  5]   4.00-5.00   sec  23.9 MBytes   200 Mbits/sec  0.006 ms  0/18015 (0%)
[  5]   5.00-6.00   sec  23.8 MBytes   200 Mbits/sec  0.004 ms  0/17946 (0%)
[  5]   6.00-7.00   sec  23.9 MBytes   201 Mbits/sec  0.004 ms  0/18038 (0%)
[  5]   7.00-8.00   sec  23.8 MBytes   199 Mbits/sec  0.005 ms  0/17923 (0%)
[  5]   8.00-9.00   sec  23.9 MBytes   200 Mbits/sec  0.002 ms  0/17994 (0%)
[  5]   9.00-10.00  sec  23.8 MBytes   200 Mbits/sec  0.004 ms  0/17972 (0%)
[  5]  10.00-11.00  sec  24.0 MBytes   201 Mbits/sec  0.003 ms  0/18114 (0%)
[  5]  11.00-12.00  sec  23.8 MBytes   200 Mbits/sec  0.007 ms  0/17953 (0%)
[  5]  12.00-13.00  sec  23.9 MBytes   201 Mbits/sec  0.002 ms  0/18053 (0%)
[  5]  13.00-14.00  sec  23.7 MBytes   199 Mbits/sec  0.009 ms  0/17896 (0%)
[  5]  14.00-15.00  sec  23.9 MBytes   200 Mbits/sec  0.004 ms  0/18017 (0%)
[  5]  15.00-16.00  sec  23.9 MBytes   201 Mbits/sec  0.008 ms  0/18045 (0%)
[  5]  16.00-17.00  sec  23.2 MBytes   195 Mbits/sec  0.002 ms  0/17503 (0%)
[  5]  17.00-18.00  sec  24.4 MBytes   205 Mbits/sec  0.005 ms  0/18429 (0%)
While monitoring ServerA, I executed this on the cluster:
# pcs resource disable galera-bundle

Here's what I observed at ServerA's console:

[root@instack ~]# iperf3 -s -p 5000
-----------------------------------------------------------
Server listening on 5000
-----------------------------------------------------------
Accepted connection from 10.50.60.224, port 36568
[  5] local 10.50.60.88 port 5000 connected to 10.50.60.224 port 43785
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec  22.3 MBytes   187 Mbits/sec  0.002 ms  0/16820 (0%)
[  5]   1.00-2.00   sec  23.7 MBytes   199 Mbits/sec  0.001 ms  0/17895 (0%)
[  5]   2.00-3.00   sec  23.9 MBytes   200 Mbits/sec  0.002 ms  0/17993 (0%)
[  5]   3.00-4.00   sec  23.8 MBytes   200 Mbits/sec  0.008 ms  0/17943 (0%)
[  5]   4.00-5.00   sec  23.8 MBytes   200 Mbits/sec  0.002 ms  0/17986 (0%)
[  5]   5.00-6.00   sec  23.9 MBytes   200 Mbits/sec  0.003 ms  0/18002 (0%)
[  5]   6.00-7.00   sec  23.8 MBytes   200 Mbits/sec  0.004 ms  0/17988 (0%)
[  5]   7.00-8.00   sec  23.8 MBytes   200 Mbits/sec  0.011 ms  0/17965 (0%)
[  5]   8.00-9.00   sec  23.6 MBytes   198 Mbits/sec  0.005 ms  0/17827 (0%)
[  5]   9.00-10.00  sec  24.0 MBytes   202 Mbits/sec  0.001 ms  0/18135 (0%)
[  5]  10.00-11.00  sec  24.3 MBytes   204 Mbits/sec  0.014 ms  0/18305 (0%)
[  5]  11.00-12.00  sec  23.5 MBytes   197 Mbits/sec  0.006 ms  24/17746 (0.14%)  <= pcs resource disable galera-bundle
[  5]  12.00-13.00  sec  23.9 MBytes   200 Mbits/sec  0.006 ms  0/18003 (0%)
[  5]  13.00-14.00  sec  23.4 MBytes   196 Mbits/sec  0.006 ms  75/17710 (0.42%)  <= demoting one galera
[  5]  14.00-15.00  sec  24.2 MBytes   203 Mbits/sec  0.008 ms  0/18265 (0%)
[  5]  15.00-16.00  sec  23.6 MBytes   198 Mbits/sec  0.006 ms  0/17792 (0%)
[  5]  16.00-17.00  sec  24.0 MBytes   202 Mbits/sec  0.006 ms  0/18128 (0%)
[  5]  17.00-18.00  sec  23.8 MBytes   200 Mbits/sec  0.003 ms  20/17993 (0.11%)  <= demoting 2nd galera
[  5]  18.00-19.00  sec  24.0 MBytes   201 Mbits/sec  0.013 ms  0/18096 (0%)
[  5]  19.00-20.00  sec  23.7 MBytes   199 Mbits/sec  0.003 ms  0/17899 (0%)
[  5]  20.00-21.00  sec  23.7 MBytes   199 Mbits/sec  0.003 ms  0/17858 (0%)
[  5]  21.00-22.00  sec  24.2 MBytes   203 Mbits/sec  0.005 ms  0/18253 (0%)
[  5]  22.00-23.00  sec  23.9 MBytes   201 Mbits/sec  0.012 ms  0/18033 (0%)
[  5]  23.00-24.00  sec  24.2 MBytes   203 Mbits/sec  0.009 ms  0/18258 (0%)  <= demoting 3rd galera (sometimes it has no impact)
[  5]  24.00-25.00  sec  23.1 MBytes   193 Mbits/sec  0.002 ms  0/17398 (0%)
[  5]  25.00-26.00  sec  24.1 MBytes   202 Mbits/sec  0.013 ms  0/18163 (0%)
[  5]  26.00-27.00  sec  24.0 MBytes   202 Mbits/sec  0.015 ms  0/18142 (0%)
[  5]  27.00-28.00  sec  23.7 MBytes   199 Mbits/sec  0.005 ms  0/17893 (0%)
[  5]  28.00-29.00  sec  23.3 MBytes   195 Mbits/sec  0.006 ms  0/17548 (0%)
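To line up those loss bursts with the galera state transitions, the bundle can be watched from one of the controllers while the iperf3 run is in progress (a hedged example; the exact output layout depends on the pcs version):

# watch -n1 'pcs status | grep -A 6 galera-bundle'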
Currently testing at 2M; not a single packet was lost after applying the workaround.

On serverB VM:
# iperf3 -s -p 5000

On serverA VM:
# iperf3 -c 10.50.60.88 -p 5000 -u -t 3000 -i 1 --length 1390 -b 2M

iperf result:
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  5]   0.00-243.30 sec  0.00 Bytes  0.00 bits/sec  0.006 ms  0/43742 (0%)
Works well at 20M, but at 200M I got some packet loss:

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  5]   0.00-142.26 sec  0.00 Bytes  0.00 bits/sec  0.002 ms  1179/2557776 (0.046%)
One thing to note: when we pushed out these changes (a regular RHOSP config-download deploy):

neutron::agents::ml2::ovs::of_interface: ovs-ofctl      <==================
neutron::agents::ml2::ovs::ovsdb_interface: vsctl       <==================

the tenant confirmed that the vDRA VNF took a hit while those settings were being pushed, even though the controller cluster was 100% operational.
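For context, a minimal sketch of how such hieradata overrides are typically carried in a config-download deploy, assuming a dedicated environment file (the file name is illustrative and the rest of the deploy command line is elided):

$ cat > /home/stack/ovs-interface-driver.yaml <<'EOF'
parameter_defaults:
  ExtraConfig:
    neutron::agents::ml2::ovs::of_interface: ovs-ofctl
    neutron::agents::ml2::ovs::ovsdb_interface: vsctl
EOF
$ openstack overcloud deploy --templates <existing -e environment files> -e /home/stack/ovs-interface-driver.yaml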
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:5156