Bug 2016177 - [RHOSP13] Galera promoting causes vxlan packet drop
Summary: [RHOSP13] Galera promoting causes vxlan packet drop
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Importance: high urgent
Target Milestone: ---
Target Release: ---
Assignee: Slawek Kaplonski
QA Contact: Candido Campos
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-10-20 20:48 UTC by camorris@redhat.co
Modified: 2022-01-11 15:02 UTC
CC: 14 users

Fixed In Version: openstack-neutron-12.1.1-42.4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-15 15:57:28 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1948642 0 None None None 2021-10-25 08:02:48 UTC
Launchpad 1952770 0 None None None 2021-12-01 16:44:30 UTC
OpenStack gerrit 815255 0 None MERGED Don't setup bridge controller if it is already set 2021-11-23 14:40:38 UTC
OpenStack gerrit 819900 0 None NEW Do no use "--strict" for OF deletion in TRANSIENT_TABLE 2021-12-01 16:44:30 UTC
Red Hat Issue Tracker OSP-10499 0 None None None 2021-11-15 12:35:04 UTC
Red Hat Product Errata RHBA-2021:5156 0 None None None 2021-12-15 15:57:31 UTC

Description camorris@redhat.co 2021-10-20 20:48:18 UTC
Description of problem:
We are trying to understand why VXLAN packets are being lost while Galera is being promoted. VLAN networks are not affected by this.

Version-Release number of selected component (if applicable):
OSP13.6

How reproducible:
Every time

Steps to Reproduce:

A bit more info on our testing procedure: we created two RHEL 7.5 VMs on different compute nodes, on a VXLAN network without a qrouter, and installed iperf3 from the RHEL 7 repo.

On serverB VM: sudo iperf3 -s -p 5000
On serverA VM: sudo iperf3 -c 192.168.255.34 -p 5000 -u -t 3000 -i 1 --length 3000 -b 20M

1. sudo pcs resource disable galera-bundle --> no packet loss
2. (Wait for the DB to be fully down) --> No packet loss
3. sudo pcs resource enable galera-bundle 
(When galera promotes a new master) --> small amount of packet loss (1 or 2 packets). Note that we are sending about 800 UDP pps, compared to production where it is about 12k TCP pps; measuring was more complicated with TCP and at a higher packet rate. One way to timestamp the Galera transitions while the test runs is sketched below.
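
To make it easier to line the loss bursts up with the Galera state changes, a timestamped watch on the resource can be run on one controller while iperf3 is running. This is only a sketch (galera-bundle is the resource name used in the steps above, and "pcs status resources" output formatting differs between pcs/pacemaker versions):

# Run on one controller while the iperf3 test is in progress; prints a
# timestamped snapshot whenever the Galera resource state changes.
prev=""
while true; do
    cur="$(sudo pcs status resources 2>/dev/null | grep -i -A 3 galera)"
    if [ "$cur" != "$prev" ]; then
        echo "=== $(date -u '+%Y-%m-%dT%H:%M:%SZ') ==="
        echo "$cur"
        prev="$cur"
    fi
    sleep 1
done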

Actual results:
2021-10-20T19:00:19.375Z|10335|connmgr|INFO|br-int<->unix#27883: 100 flow_mods in the last 0 s (100 adds)
2021-10-20T19:00:19.420Z|10336|connmgr|INFO|br-int<->unix#27886: 85 flow_mods in the last 0 s (85 adds)
2021-10-20T19:00:19.444Z|10337|connmgr|INFO|br-int<->unix#27889: 1 flow_mods in the last 0 s (1 deletes)
2021-10-20T19:00:19.470Z|10338|connmgr|INFO|br-int<->unix#27892: 1 flow_mods in the last 0 s (1 deletes)
2021-10-20T19:00:19.495Z|10339|connmgr|INFO|br-int<->unix#27895: 1 flow_mods in the last 0 s (1 deletes)
2021-10-20T19:00:19.540Z|10340|connmgr|INFO|br-int<->unix#27898: 4 flow_mods in the last 0 s (4 deletes)
2021-10-20T19:00:19.625Z|10341|connmgr|INFO|br-int<->unix#27901: 93 flow_mods in the last 0 s (93 adds)
2021-10-20T19:00:29.132Z|10342|connmgr|INFO|br-int<->tcp:127.0.0.1:6633: 2 flow_mods 10 s ago (2 deletes)

Expected results:
No impact on VXLAN traffic between compute nodes.

Additional info:
It would seem that this behaviour is also affecting OSP13z16 in a test lab
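
For reference, one way to see the flow churn from the connmgr log above from the compute-node side is to sample the flow count on br-int while Galera is being promoted. This is a sketch, not the procedure that was actually used; "-O OpenFlow13" may or may not be needed depending on the protocols configured on br-int:

# Sample the number of OpenFlow rules on br-int once per second; a sudden
# drop followed by a jump matches the delete/add bursts seen in the log above.
while true; do
    echo "$(date -u '+%H:%M:%S')  $(sudo ovs-ofctl -O OpenFlow13 dump-flows br-int | wc -l) flows"
    sleep 1
done

# The OpenFlow controller configured on br-int (of interest because the eventual
# fix avoids re-setting the bridge controller when it is already set):
sudo ovs-vsctl get-controller br-int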

Comment 2 Vincent S. Cojot 2021-10-20 21:50:25 UTC
We also reproduced the behaviour on the fully updated QAsite1 environment (OSP13z16 + 3.10.0-1160.45.1.el7)

Comment 11 Vincent S. Cojot 2021-10-21 23:20:23 UTC
I've switched the summary because we're seeing this on RHOSP13-z-latest too

Comment 13 Vincent S. Cojot 2021-10-22 17:48:01 UTC
Hi,
I've reproduced the issue on a virtualized RHOSP cluster running on top of libvirt in VMs.

On compute-0 and compute-1, prepare a RHEL 7 image with iperf3. Connect both VMs to a VXLAN network with DHCP but no gateway (here bond0, which enslaves ens3).

ServerA:
[root@instack ~]# ip -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
1: lo    inet6 ::1/128 scope host \       valid_lft forever preferred_lft forever
4: bond0    inet 10.50.60.88/24 brd 10.50.60.255 scope global dynamic bond0\       valid_lft 85080sec preferred_lft 85080sec
4: bond0    inet6 fe80::f816:3eff:fede:4fb7/64 scope link \       valid_lft forever preferred_lft forever
8: bond1.2010    inet 10.0.0.1/24 brd 10.0.0.255 scope global bond1.2010\       valid_lft forever preferred_lft forever
[root@instack ~]# ip -o r
10.0.0.0/24 dev bond1.2010 proto kernel scope link src 10.0.0.1 
10.50.60.0/24 dev bond0 proto kernel scope link src 10.50.60.88 
169.254.169.254 via 10.50.60.2 dev bond0 proto static 

ServerB:
[root@instack ~]# ip -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
1: lo    inet6 ::1/128 scope host \       valid_lft forever preferred_lft forever
4: bond0    inet 10.50.60.224/24 brd 10.50.60.255 scope global dynamic bond0\       valid_lft 85056sec preferred_lft 85056sec
4: bond0    inet6 fe80::f816:3eff:fefa:43fc/64 scope link \       valid_lft forever preferred_lft forever
8: bond1.2010    inet 10.0.0.1/24 brd 10.0.0.255 scope global bond1.2010\       valid_lft forever preferred_lft forever
[root@instack ~]# ip -o r
10.0.0.0/24 dev bond1.2010 proto kernel scope link src 10.0.0.1 
10.50.60.0/24 dev bond0 proto kernel scope link src 10.50.60.224 
169.254.169.254 via 10.50.60.3 dev bond0 proto static
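
For reference, a test topology like the one described above (a tenant VXLAN network with DHCP but no gateway, and one VM pinned to each compute node) can be created roughly as follows. This is only a sketch; the network/subnet names, image, flavor and key name are assumptions, not the exact ones used here:

# Tenant network; with ML2/OVS and vxlan as the tenant network type this is a VXLAN network.
openstack network create vxlan-test
openstack subnet create --network vxlan-test --subnet-range 10.50.60.0/24 \
    --dhcp --gateway none vxlan-test-subnet

# One VM per compute node (the zone:host syntax requires admin credentials).
openstack server create --image rhel7 --flavor m1.small --key-name mykey \
    --network vxlan-test --availability-zone nova:compute-0 serverA
openstack server create --image rhel7 --flavor m1.small --key-name mykey \
    --network vxlan-test --availability-zone nova:compute-1 serverB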

Comment 14 Vincent S. Cojot 2021-10-22 17:50:18 UTC
On serverA, I'm running this:
# iperf3 -s -p 5000 

On ServerB, I'm running this:
# iperf3 -c 10.50.60.88 -p 5000 -u -t 3000 -i 1 --length 1390 -b 200M


When everything is stable in the cluster, I get zero packet loss (see the last column):

[root@instack ~]# iperf3 -s -p 5000 
-----------------------------------------------------------
Server listening on 5000
-----------------------------------------------------------
Accepted connection from 10.50.60.224, port 36566
[  5] local 10.50.60.88 port 5000 connected to 10.50.60.224 port 36310
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec  22.2 MBytes   186 Mbits/sec  0.004 ms  0/16732 (0%)  
[  5]   1.00-2.00   sec  23.8 MBytes   199 Mbits/sec  0.010 ms  0/17926 (0%)  
[  5]   2.00-3.00   sec  23.9 MBytes   200 Mbits/sec  0.003 ms  0/18023 (0%)  
[  5]   3.00-4.00   sec  23.8 MBytes   199 Mbits/sec  0.003 ms  0/17923 (0%)  
[  5]   4.00-5.00   sec  23.9 MBytes   200 Mbits/sec  0.006 ms  0/18015 (0%)  
[  5]   5.00-6.00   sec  23.8 MBytes   200 Mbits/sec  0.004 ms  0/17946 (0%)  
[  5]   6.00-7.00   sec  23.9 MBytes   201 Mbits/sec  0.004 ms  0/18038 (0%)  
[  5]   7.00-8.00   sec  23.8 MBytes   199 Mbits/sec  0.005 ms  0/17923 (0%)  
[  5]   8.00-9.00   sec  23.9 MBytes   200 Mbits/sec  0.002 ms  0/17994 (0%)  
[  5]   9.00-10.00  sec  23.8 MBytes   200 Mbits/sec  0.004 ms  0/17972 (0%)  
[  5]  10.00-11.00  sec  24.0 MBytes   201 Mbits/sec  0.003 ms  0/18114 (0%)  
[  5]  11.00-12.00  sec  23.8 MBytes   200 Mbits/sec  0.007 ms  0/17953 (0%)  
[  5]  12.00-13.00  sec  23.9 MBytes   201 Mbits/sec  0.002 ms  0/18053 (0%)  
[  5]  13.00-14.00  sec  23.7 MBytes   199 Mbits/sec  0.009 ms  0/17896 (0%)  
[  5]  14.00-15.00  sec  23.9 MBytes   200 Mbits/sec  0.004 ms  0/18017 (0%)  
[  5]  15.00-16.00  sec  23.9 MBytes   201 Mbits/sec  0.008 ms  0/18045 (0%)  
[  5]  16.00-17.00  sec  23.2 MBytes   195 Mbits/sec  0.002 ms  0/17503 (0%)  
[  5]  17.00-18.00  sec  24.4 MBytes   205 Mbits/sec  0.005 ms  0/18429 (0%)

Comment 15 Vincent S. Cojot 2021-10-22 17:59:36 UTC
While monitoring ServerA, I executed this on the cluster:

# pcs resource disable galera-bundle

Here's what I observed at ServerA's console:

[root@instack ~]# iperf3 -s -p 5000 
-----------------------------------------------------------
Server listening on 5000
-----------------------------------------------------------
Accepted connection from 10.50.60.224, port 36568
[  5] local 10.50.60.88 port 5000 connected to 10.50.60.224 port 43785
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec  22.3 MBytes   187 Mbits/sec  0.002 ms  0/16820 (0%)  
[  5]   1.00-2.00   sec  23.7 MBytes   199 Mbits/sec  0.001 ms  0/17895 (0%)  
[  5]   2.00-3.00   sec  23.9 MBytes   200 Mbits/sec  0.002 ms  0/17993 (0%)  
[  5]   3.00-4.00   sec  23.8 MBytes   200 Mbits/sec  0.008 ms  0/17943 (0%)  
[  5]   4.00-5.00   sec  23.8 MBytes   200 Mbits/sec  0.002 ms  0/17986 (0%)  
[  5]   5.00-6.00   sec  23.9 MBytes   200 Mbits/sec  0.003 ms  0/18002 (0%)  
[  5]   6.00-7.00   sec  23.8 MBytes   200 Mbits/sec  0.004 ms  0/17988 (0%)  
[  5]   7.00-8.00   sec  23.8 MBytes   200 Mbits/sec  0.011 ms  0/17965 (0%)  
[  5]   8.00-9.00   sec  23.6 MBytes   198 Mbits/sec  0.005 ms  0/17827 (0%)  
[  5]   9.00-10.00  sec  24.0 MBytes   202 Mbits/sec  0.001 ms  0/18135 (0%)  
[  5]  10.00-11.00  sec  24.3 MBytes   204 Mbits/sec  0.014 ms  0/18305 (0%)  
[  5]  11.00-12.00  sec  23.5 MBytes   197 Mbits/sec  0.006 ms  24/17746 (0.14%)  <= pcs resource disable galera-bundle
[  5]  12.00-13.00  sec  23.9 MBytes   200 Mbits/sec  0.006 ms  0/18003 (0%)  
[  5]  13.00-14.00  sec  23.4 MBytes   196 Mbits/sec  0.006 ms  75/17710 (0.42%)  <= demoting one galera
[  5]  14.00-15.00  sec  24.2 MBytes   203 Mbits/sec  0.008 ms  0/18265 (0%)  
[  5]  15.00-16.00  sec  23.6 MBytes   198 Mbits/sec  0.006 ms  0/17792 (0%)  
[  5]  16.00-17.00  sec  24.0 MBytes   202 Mbits/sec  0.006 ms  0/18128 (0%)  
[  5]  17.00-18.00  sec  23.8 MBytes   200 Mbits/sec  0.003 ms  20/17993 (0.11%)  <= demoting 2nd galera
[  5]  18.00-19.00  sec  24.0 MBytes   201 Mbits/sec  0.013 ms  0/18096 (0%)  
[  5]  19.00-20.00  sec  23.7 MBytes   199 Mbits/sec  0.003 ms  0/17899 (0%)  
[  5]  20.00-21.00  sec  23.7 MBytes   199 Mbits/sec  0.003 ms  0/17858 (0%)  
[  5]  21.00-22.00  sec  24.2 MBytes   203 Mbits/sec  0.005 ms  0/18253 (0%)  
[  5]  22.00-23.00  sec  23.9 MBytes   201 Mbits/sec  0.012 ms  0/18033 (0%)  
[  5]  23.00-24.00  sec  24.2 MBytes   203 Mbits/sec  0.009 ms  0/18258 (0%)  <= demoting 3rd galera (sometimes it has no impact)
[  5]  24.00-25.00  sec  23.1 MBytes   193 Mbits/sec  0.002 ms  0/17398 (0%)  
[  5]  25.00-26.00  sec  24.1 MBytes   202 Mbits/sec  0.013 ms  0/18163 (0%)  
[  5]  26.00-27.00  sec  24.0 MBytes   202 Mbits/sec  0.015 ms  0/18142 (0%)  
[  5]  27.00-28.00  sec  23.7 MBytes   199 Mbits/sec  0.005 ms  0/17893 (0%)  
[  5]  28.00-29.00  sec  23.3 MBytes   195 Mbits/sec  0.006 ms  0/17548 (0%)
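
The promote/demote markers above can also be lined up with the controllers' own view of the transitions, e.g. by pulling the Galera actions out of the pacemaker journal. A sketch (unit name and message wording vary between pacemaker versions):

# On each controller: list Galera promote/demote related messages with timestamps,
# to correlate with the iperf3 intervals that show loss.
sudo journalctl -u pacemaker --since today | grep -i galera | grep -iE 'promot|demot'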

Comment 22 Vincent S. Cojot 2021-10-28 19:24:50 UTC
Currently testing at 2M (-b 2M); not a single packet was lost after applying the workaround:

On serverB VM:
# iperf3 -s -p 5000

On serverA VM: 
# iperf3 -c 10.50.60.88 -p 5000 -u -t 3000 -i 1 --length 1390 -b 2M


iperf Result:
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  5]   0.00-243.30 sec  0.00 Bytes  0.00 bits/sec  0.006 ms  0/43742 (0%)

Comment 23 Vincent S. Cojot 2021-10-28 19:44:53 UTC
Works well at 20M, but at 200M I got some packet loss:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  5]   0.00-142.26 sec  0.00 Bytes  0.00 bits/sec  0.002 ms  1179/2557776 (0.046%)

Comment 26 Vincent S. Cojot 2021-11-23 17:19:01 UTC
One thing to note is that when we pushed out these changes (regular RHOSP config-download deploy):

    neutron::agents::ml2::ovs::of_interface: ovs-ofctl <==================
    neutron::agents::ml2::ovs::ovsdb_interface: vsctl <==================

the tenant confirmed that there had been a hit to the vDRA VNF while those settings were being pushed, even though the controller cluster was 100% operational.
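
For reference, whether a compute node's agent really ended up with those drivers can be checked in the rendered agent config. A sketch; the path assumes a containerized OSP13 deployment:

# Confirm the OpenFlow/OVSDB driver settings the agent container was rendered with.
sudo grep -E 'of_interface|ovsdb_interface' \
    /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/ml2/openvswitch_agent.ini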

Comment 43 errata-xmlrpc 2021-12-15 15:57:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:5156

