2129253 – Failovers of an HA Octavia LB may randomly result in loss of traffic to the LB

Summary: Failovers of an HA Octavia LB may randomly result in loss of traffic to the LB

Keywords:
Status:	CLOSED DUPLICATE of bug 2126055
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-octavia
Sub Component:
Version:	16.2 (Train)
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Nate Johnston
QA Contact:	Bruna Bonguardo
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-09-23 05:19 UTC by chrisbro@redhat.com
Modified:	2022-09-30 13:02 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-09-30 13:02:40 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-18898	0	None	None	None	2022-09-23 05:25:36 UTC

Description chrisbro@redhat.com 2022-09-23 05:19:21 UTC

Description of problem:
After failing over an LB, traffic to the LB may be lost. The problem can appear after a single failover or only after many consecutive failovers.

The issue occurs with the amphora provider.


Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.2.2 (Train)
Octavia Image: octavia-amphora-16.2-20220310.1.x86_64

3 x controllers 
3 x compute nodes


How reproducible:
Intermittent 

Steps to Reproduce:
1. Fail over the LB.
2. Check whether traffic still reaches the LB; if it does, fail over again.
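
A minimal sketch of how the reproduction could be automated, since the number of failovers needed varies from run to run. It assumes the `openstack` CLI is on PATH with overcloud credentials loaded, and it reuses the LB ID and LB floating IP from this report. The function names, polling interval and timeouts are illustrative choices, not part of any Octavia tooling.

```python
import http.client
import subprocess
import time
import urllib.request
from typing import Callable, Optional

LB_ID = "5af2fced-134e-4d86-9248-12e0d3c12f33"  # lb1 from this report
VIP_URL = "http://10.0.0.193:28410"             # LB floating IP and listener port


def trigger_failover(lb_id: str = LB_ID) -> None:
    """Ask Octavia to fail over the load balancer (returns before it completes)."""
    subprocess.run(["openstack", "loadbalancer", "failover", lb_id], check=True)


def wait_until_active(lb_id: str = LB_ID, poll_s: float = 5.0,
                      timeout_s: float = 900.0) -> None:
    """Poll provisioning_status until the LB reports ACTIVE again."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = subprocess.run(
            ["openstack", "loadbalancer", "show", lb_id,
             "-f", "value", "-c", "provisioning_status"],
            check=True, capture_output=True, text=True,
        ).stdout.strip()
        if status == "ACTIVE":
            return
        time.sleep(poll_s)
    raise TimeoutError(f"LB {lb_id} did not return to ACTIVE in {timeout_s}s")


def vip_responds(url: str = VIP_URL, timeout_s: float = 5.0) -> bool:
    """Return True if an HTTP GET through the LB succeeds within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        return False


def failovers_until_loss(max_failovers: int,
                         failover: Callable[[], None] = trigger_failover,
                         wait: Callable[[], None] = wait_until_active,
                         probe: Callable[[], bool] = vip_responds) -> Optional[int]:
    """Fail over repeatedly; return the attempt number at which traffic was lost."""
    for attempt in range(1, max_failovers + 1):
        failover()
        wait()
        if not probe():
            return attempt
    return None

# Example (needs a live overcloud): print(failovers_until_loss(50))
```

The three steps are passed in as callables so the loop can be exercised without a cloud, and so the probe can be swapped for the exact curl command used in this report.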

Actual results:


Expected results:


Additional info:

Loadbalancer information:
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list -f yaml
- ha_ip: 172.16.0.82
  id: ddebebe6-5784-4d79-8c9b-615571cf4e5b
  lb_network_ip: 172.24.3.216
  loadbalancer_id: 5af2fced-134e-4d86-9248-12e0d3c12f33
  role: BACKUP
  status: ALLOCATED
- ha_ip: 172.16.0.82
  id: a7ffb051-5cf7-47f6-9805-e5bdc57c819e
  lb_network_ip: 172.24.3.22
  loadbalancer_id: 5af2fced-134e-4d86-9248-12e0d3c12f33
  role: MASTER
  status: ALLOCATED
~~~
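
A small sketch for pulling the MASTER/BACKUP pair out of the amphora list, which helps when matching the captures further down to an amphora. It assumes the list was fetched with `-f json` instead of `-f yaml` and that the JSON uses the same keys as the YAML output above; the helper name is made up for this example.

```python
import json
from typing import Dict, List

# Same data as the YAML output above, in the shape `-f json` is assumed to produce.
SAMPLE_AMPHORA_LIST = """[
  {"id": "ddebebe6-5784-4d79-8c9b-615571cf4e5b",
   "loadbalancer_id": "5af2fced-134e-4d86-9248-12e0d3c12f33",
   "status": "ALLOCATED", "role": "BACKUP",
   "lb_network_ip": "172.24.3.216", "ha_ip": "172.16.0.82"},
  {"id": "a7ffb051-5cf7-47f6-9805-e5bdc57c819e",
   "loadbalancer_id": "5af2fced-134e-4d86-9248-12e0d3c12f33",
   "status": "ALLOCATED", "role": "MASTER",
   "lb_network_ip": "172.24.3.22", "ha_ip": "172.16.0.82"}
]"""


def amphorae_by_role(amphora_list_json: str, loadbalancer_id: str) -> Dict[str, List[dict]]:
    """Group the amphorae of one load balancer by their reported role."""
    by_role: Dict[str, List[dict]] = {}
    for amphora in json.loads(amphora_list_json):
        if amphora["loadbalancer_id"] != loadbalancer_id:
            continue  # skip amphorae that belong to other load balancers
        by_role.setdefault(amphora["role"], []).append(amphora)
    return by_role


pair = amphorae_by_role(SAMPLE_AMPHORA_LIST, "5af2fced-134e-4d86-9248-12e0d3c12f33")
for role, amphorae in sorted(pair.items()):
    for amphora in amphorae:
        print(role, amphora["id"], amphora["lb_network_ip"])
```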
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list -f yaml
- id: 5af2fced-134e-4d86-9248-12e0d3c12f33
  name: lb1
  project_id: 639600ae1ce84a46b984cf32d1ce2195
  provider: amphora
  provisioning_status: ACTIVE
  vip_address: 172.16.0.82
~~~
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer listener show 9fb09144-67d4-425c-bb76-5f687ea0928d -f yaml                                                                                                                                      
admin_state_up: true
connection_limit: -1
created_at: '2022-09-17T06:12:05'
default_pool_id: c21e37e3-cfcf-4e42-a38c-4f4a7337deb2
default_tls_container_ref: null
description: ''
id: 9fb09144-67d4-425c-bb76-5f687ea0928d
insert_headers: null
l7policies: ''
loadbalancers: 5af2fced-134e-4d86-9248-12e0d3c12f33
name: listener1
operating_status: ONLINE
project_id: 639600ae1ce84a46b984cf32d1ce2195
protocol: TCP
protocol_port: 28410
provisioning_status: ACTIVE
sni_container_refs: []
timeout_client_data: 50000
timeout_member_connect: 5000
timeout_member_data: 50000
timeout_tcp_inspect: 0
updated_at: '2022-09-23T04:29:59'
client_ca_tls_container_ref: null
client_authentication: NONE
client_crl_container_ref: null
allowed_cidrs: null
~~~

~~~
{
    "loadbalancer": {
        "id": "5af2fced-134e-4d86-9248-12e0d3c12f33",
        "name": "lb1",
        "operating_status": "ONLINE",
        "provisioning_status": "ACTIVE",
        "listeners": [
            {
                "id": "9fb09144-67d4-425c-bb76-5f687ea0928d",
                "name": "listener1",
                "operating_status": "ONLINE",
                "provisioning_status": "ACTIVE",
                "pools": [
                    {
                        "id": "c21e37e3-cfcf-4e42-a38c-4f4a7337deb2",
                        "name": "pool1",
                        "provisioning_status": "ACTIVE",
                        "operating_status": "ONLINE",
                        "members": [
                            {
                                "id": "d12542f7-99ac-401d-8207-ea9f23e8fd5e",
                                "name": "rhel-vm1",
                                "operating_status": "NO_MONITOR",
                                "provisioning_status": "ACTIVE",
                                "address": "172.16.0.210",
                                "protocol_port": 80
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
~~~

Traffic is sent from the director node (10.0.0.35) using the command curl -v http://10.0.0.193:28410
Everything works as expected until we hit this issue.

Curling the back-end member directly through its own FIP works:
~~~
(overcloud) [stack@undercloud-0 ~]$  curl -m5  http://10.0.0.158
This is rhel-vm1
~~~

The same curl command against the LB FIP times out:
~~~
(overcloud) [stack@undercloud-0 ~]$ curl -v -m 5  http://10.0.0.193:28410                                                                                                                                                                                  
* Rebuilt URL to: http://10.0.0.193:28410/
* Uses proxy env variable no_proxy == ',10.0.0.117,192.168.24.38'
*   Trying 10.0.0.193...
* TCP_NODELAY set
* Connection timed out after 5001 milliseconds
* Closing connection 0
curl: (28) Connection timed out after 5001 milliseconds
~~~

FIPs 

~~~
(overcloud) [stack@undercloud-0 ~]$ openstack floating ip list --floating-ip-address 10.0.0.158 -f yaml
- Fixed IP Address: 172.16.0.210
  Floating IP Address: 10.0.0.158
  Floating Network: a372faa5-156a-4642-b939-c9b7f4a6fa99
  ID: 4307da72-4e1b-4925-b458-86a646cf1b1d
  Port: 2b361e7f-ddcc-4fb6-99bc-6cc40813c50b
  Project: 639600ae1ce84a46b984cf32d1ce2195
~~~
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack floating ip list --floating-ip-address 10.0.0.193 -f yaml                                                                                                                                                    
- Fixed IP Address: 172.16.0.82
  Floating IP Address: 10.0.0.193
  Floating Network: a372faa5-156a-4642-b939-c9b7f4a6fa99
  ID: cbe57048-e6e8-4c7d-9750-de073758184e
  Port: 64cebd7f-8a52-4af8-bd44-42eeca1aedf4
  Project: 639600ae1ce84a46b984cf32d1ce2195

~~~


Note that the LB listener listens on port 28410 and passes traffic through to the backend member on port 80. (This is a lab reproduction of the customer's issue.)

We can see the traffic coming in on interface ens5 on controller-2
~~~
[root@controller-2 neutron]# tcpdump -nnei genev_sys_6081 port 28410
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
01:01:47.697111 fa:16:3e:a2:6d:a6 > fa:16:3e:7c:46:0a, ethertype IPv4 (0x0800), length 74: 10.0.0.35.54156 > 172.16.0.82.28410: Flags [S], seq 980173721, win 29200, options [mss 1460,sackOK,TS val 1799839257 ecr 0,nop,wscale 7], length 0
01:01:47.697865 fa:16:3e:aa:71:7a > 52:54:00:5f:e4:43, ethertype IPv4 (0x0800), length 74: 172.16.0.82.28410 > 10.0.0.35.54156: Flags [S.], seq 3334803320, ack 980173722, win 27800, options [mss 1402,sackOK,TS val 4283067010 ecr 1799839257,nop,wscale 4], length 0
01:01:47.698021 fa:16:3e:a2:6d:a6 > fa:16:3e:7c:46:0a, ethertype IPv4 (0x0800), length 66: 10.0.0.35.54156 > 172.16.0.82.28410: Flags [.], ack 1, win 229, options [nop,nop,TS val 1799839258 ecr 4283067010], length 0
01:01:47.698041 fa:16:3e:a2:6d:a6 > fa:16:3e:7c:46:0a, ethertype IPv4 (0x0800), length 146: 10.0.0.35.54156 > 172.16.0.82.28410: Flags [P.], seq 1:81, ack 1, win 229, options [nop,nop,TS val 1799839259 ecr 4283067010], length 80
~~~


We see the traffic happening through the geneve tunnel to compute node 1
~~~
[root@compute-1 ~]# tcpdump -nnei genev_sys_6081 port 28410
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
04:32:03.112531 fa:16:3e:a2:6d:a6 > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 10.0.0.35.51642 > 172.16.0.82.28410: Flags [S], seq 4144791301, win 29200, options [mss 1460,sackOK,TS val 1812454674 ecr 0,nop,wscale 7], length 0             
04:32:05.110182 fa:16:3e:a2:6d:a6 > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 10.0.0.35.51644 > 172.16.0.82.28410: Flags [S], seq 1955949749, win 29200, options [mss 1460,sackOK,TS val 1812456671 ecr 0,nop,wscale 7], length 0             
04:32:06.120679 fa:16:3e:a2:6d:a6 > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 10.0.0.35.51644 > 172.16.0.82.28410: Flags [S], seq 1955949749, win 29200, options [mss 1460,sackOK,TS val 1812457682 ecr 0,nop,wscale 7], length 0             
~~~


Where the amphora instances are running:

| id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip       |
| ddebebe6-5784-4d79-8c9b-615571cf4e5b | 5af2fced-134e-4d86-9248-12e0d3c12f33 | ALLOCATED | BACKUP | 172.24.3.216  | 172.16.0.82 |
| a7ffb051-5cf7-47f6-9805-e5bdc57c819e | 5af2fced-134e-4d86-9248-12e0d3c12f33 | ALLOCATED | MASTER | 172.24.3.22   | 172.16.0.82 |

BACKUP
openstack loadbalancer amphora show ddebebe6-5784-4d79-8c9b-615571cf4e5b
| compute_id           | ef9a8611-06ac-4754-b70c-e3611bea0c1f |
| OS-EXT-SRV-ATTR:host | compute-1.redhat.local               |

MASTER
| compute_id           | 306a7e9f-3266-46c5-8017-42316d2d21c5 |
| OS-EXT-SRV-ATTR:host | compute-2.redhat.local               |

We can see the traffic is hitting the BACKUP LB 
  
~~~
[root@amphora-ddebebe6-5784-4d79-8c9b-615571cf4e5b ~]# ip netns exec amphora-haproxy tcpdump -nnei eth1 
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
00:51:21.574940 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 54: 172.16.0.206 > 172.16.0.190: VRRPv2, Advertisement, vrid 1, prio 100, authtype simple, intvl 1s, length 20
00:51:21.944793 fa:16:3e:a2:6d:a6 > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 10.0.0.35.41428 > 172.16.0.82.28410: Flags [S], seq 2574929603, win 29200, options [mss 1460,sackOK,TS val 1813613458 ecr 0,nop,wscale 7], length 0
00:51:21.944823 fa:16:3e:40:b6:1c > fa:16:3e:09:61:4d, ethertype IPv4 (0x0800), length 74: 10.0.0.35.41428 > 172.16.0.82.28410: Flags [S], seq 2574929603, win 29200, options [mss 1460,sackOK,TS val 1813613458 ecr 0,nop,wscale 7], length 0
00:51:22.575411 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 54: 172.16.0.206 > 172.16.0.190: VRRPv2, Advertisement, vrid 1, prio 100, authtype simple, intvl 1s, length 20
00:51:23.576352 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 54: 172.16.0.206 > 172.16.0.190: VRRPv2, Advertisement, vrid 1, prio 100, authtype simple, intvl 1s, length 20
00:51:23.902529 fa:16:3e:40:b6:1c > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 66: 172.16.0.190.39017 > 172.16.0.206.1025: Flags [R.], seq 791843565, ack 2539806639, win 1753, options [nop,nop,TS val 2547303002 ecr 2644028076], length 0
00:51:23.902671 fa:16:3e:40:b6:1c > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 74: 172.16.0.190.39019 > 172.16.0.206.1025: Flags [S], seq 1894260233, win 28040, options [mss 1402,sackOK,TS val 2547303002 ecr 0,nop,wscale 4], length 0
00:51:23.903627 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 172.16.0.206.1025 > 172.16.0.190.39019: Flags [S.], seq 1491209716, ack 1894260234, win 27800, options [mss 1402,sackOK,TS val 2644033077 ecr 2547303002,nop,wscale 4], length 0
00:51:23.903650 fa:16:3e:40:b6:1c > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 66: 172.16.0.190.39019 > 172.16.0.206.1025: Flags [.], ack 1, win 1753, options [nop,nop,TS val 2547303003 ecr 2644033077], length 0
00:51:23.903836 fa:16:3e:40:b6:1c > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 142: 172.16.0.190.39019 > 172.16.0.206.1025: Flags [P.], seq 1:77, ack 1, win 1753, options [nop,nop,TS val 2547303003 ecr 2644033077], length 76
00:51:23.904075 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 172.16.0.206.30633 > 172.16.0.190.1025: Flags [S], seq 1351400618, win 28040, options [mss 1402,sackOK,TS val 2644033077 ecr 0,nop,wscale 4], length 0
00:51:23.904097 fa:16:3e:40:b6:1c > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 74: 172.16.0.190.1025 > 172.16.0.206.30633: Flags [S.], seq 2641093430, ack 1351400619, win 27800, options [mss 1402,sackOK,TS val 2547303004 ecr 2644033077,nop,wscale 4], length 0
00:51:23.904351 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 66: 172.16.0.206.1025 > 172.16.0.190.39019: Flags [.], ack 77, win 1738, options [nop,nop,TS val 2644033078 ecr 2547303003], length 0
00:51:23.904360 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 70: 172.16.0.206.1025 > 172.16.0.190.39019: Flags [P.], seq 1:5, ack 77, win 1738, options [nop,nop,TS val 2644033078 ecr 2547303003], length 4
00:51:23.904365 fa:16:3e:40:b6:1c > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 66: 172.16.0.190.39019 > 172.16.0.206.1025: Flags [.], ack 5, win 1753, options [nop,nop,TS val 2547303004 ecr 2644033078], length 0
00:51:23.904649 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 54: 172.16.0.206.30633 > 172.16.0.190.1025: Flags [R], seq 1351400619, win 0, length 0
00:51:23.920888 fa:16:3e:a2:6d:a6 > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 10.0.0.35.41442 > 172.16.0.82.28410: Flags [S], seq 3968766919, win 29200, options [mss 1460,sackOK,TS val 1813615434 ecr 0,nop,wscale 7], length 0
00:51:23.920907 fa:16:3e:40:b6:1c > fa:16:3e:09:61:4d, ethertype IPv4 (0x0800), length 74: 10.0.0.35.41442 > 172.16.0.82.28410: Flags [S], seq 3968766919, win 29200, options [mss 1460,sackOK,TS val 1813615434 ecr 0,nop,wscale 7], length 0
00:51:24.576012 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 54: 172.16.0.206 > 172.16.0.190: VRRPv2, Advertisement, vrid 1, prio 100, authtype simple, intvl 1s, length 20
~~~

On the MASTER amphora we see no traffic at all on port 28410.

~~~
[root@amphora-a7ffb051-5cf7-47f6-9805-e5bdc57c819e ~]# ip netns exec amphora-haproxy tcpdump -nnei eth1 port 28410
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
~~~



As a workaround, we can send an unsolicited (gratuitous) ARP for the VIP from the MASTER LB.
Traffic then starts arriving on the MASTER LB and stops on the BACKUP LB.
~~~
ip netns exec amphora-haproxy arping -c 100 -A -I eth1 -s 172.16.0.82 172.16.0.82
~~~
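
For reference, a sketch of the kind of frame this workaround puts on the wire: an unsolicited ARP reply, sent to the Ethernet broadcast address, announcing that the VIP is reachable at the sender's MAC. The MAC fa:16:3e:c0:60:2a is the one the MASTER capture in this report shows as the source of replies from 172.16.0.82. The target hardware address is filled with the sender MAC purely as an illustrative choice, and the helper names are made up for this example. The code only builds the 42 bytes; it does not send them.

```python
import socket
import struct


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_gratuitous_arp_reply(sender_mac: str, vip: str) -> bytes:
    """Build an Ethernet + ARP frame announcing that `vip` is at `sender_mac`."""
    mac = mac_to_bytes(sender_mac)
    ip = socket.inet_aton(vip)

    # Ethernet header: broadcast destination, our MAC as source, EtherType ARP.
    ethernet = b"\xff" * 6 + mac + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), 6-byte MACs, 4-byte IPs, opcode 2 = reply.
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)

    # Sender and target protocol address are both the VIP, which is what makes
    # the packet gratuitous. Target MAC set to the sender MAC is an assumption.
    arp_body = mac + ip + mac + ip
    return ethernet + arp_header + arp_body


frame = build_gratuitous_arp_reply("fa:16:3e:c0:60:2a", "172.16.0.82")
print(len(frame), frame.hex())
```

Neighbours that receive such a frame are expected to update their mapping for 172.16.0.82, which would be consistent with traffic moving to the MASTER LB right after the arping run.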

On the MASTER LB we can see the traffic again:
~~~
[root@amphora-a7ffb051-5cf7-47f6-9805-e5bdc57c819e ~]# ip netns exec amphora-haproxy tcpdump -nnei eth1 port 28410                                                                                                                                          
dropped privs to tcpdump                                                                                                                                                                                                                                    
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode                                                                                                                                                                                  
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes                                                                                                                                                                                   
01:01:12.820964 fa:16:3e:a2:6d:a6 > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 74: 10.0.0.35.57358 > 172.16.0.82.28410: Flags [S], seq 994771537, win 29200, options [mss 1460,sackOK,TS val 1814205527 ecr 0,nop,wscale 7], length 0               
01:01:12.821301 fa:16:3e:c0:60:2a > fa:16:3e:a2:6d:a6, ethertype IPv4 (0x0800), length 74: 172.16.0.82.28410 > 10.0.0.35.57358: Flags [S.], seq 2063853850, ack 994771538, win 27800, options [mss 1402,sackOK,TS val 450892544 ecr 1814205527,nop,wscale 4], length 0
01:01:12.822768 fa:16:3e:a2:6d:a6 > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 146: 10.0.0.35.57358 > 172.16.0.82.28410: Flags [P.], seq 1:81, ack 1, win 229, options [nop,nop,TS val 1814205530 ecr 450892544], length 0                         
01:01:12.822793 fa:16:3e:c0:60:2a > fa:16:3e:a2:6d:a6, ethertype IPv4 (0x0800), length 66: 172.16.0.82.28410 > 10.0.0.35.57358: Flags [.], ack 81, win 1738, options [nop,nop,TS val 450892545 ecr 1814205530], length 0                                    
01:01:12.822817 fa:16:3e:a2:6d:a6 > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 66: 10.0.0.35.57358 > 172.16.0.82.28410: Flags [.], ack 1, win 229, options [nop,nop,TS val 1814205530 ecr 450892544], length 0                                      
01:01:12.822820 fa:16:3e:c0:60:2a > fa:16:3e:a2:6d:a6, ethertype IPv4 (0x0800), length 66: 172.16.0.82.28410 > 10.0.0.35.57358: Flags [.], ack 81, win 1738, options [nop,nop,TS val 450892545 ecr 1814205530], length 0                                    
01:01:12.826233 fa:16:3e:c0:60:2a > fa:16:3e:a2:6d:a6, ethertype IPv4 (0x0800), length 343: 172.16.0.82.28410 > 10.0.0.35.57358: Flags [P.], seq 1:278, ack 81, win 1738, options [nop,nop,TS val 450892549 ecr 1814205530], length 277  
~~~
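
One way to read the captures: frames for 172.16.0.82:28410 arrived on the BACKUP amphora with destination MAC fa:16:3e:40:b6:1c while the problem was present, and on the MASTER with destination MAC fa:16:3e:c0:60:2a after the GARP. That suggests something upstream still mapped the VIP to the BACKUP's MAC. The sketch below tallies destination MACs for VIP traffic from `tcpdump -e` output. The sample lines are copied from this report with the TCP options trimmed, and the regular expression and function name are assumptions made for this example.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the link-level and IP parts of a `tcpdump -nne` line for IPv4 frames.
FRAME_RE = re.compile(
    r"(?P<src_mac>[0-9a-f:]{17}) > (?P<dst_mac>[0-9a-f:]{17}), "
    r"ethertype IPv4 \(0x0800\), length \d+: "
    r"(?P<src>[\d.]+) > (?P<dst>[\d.]+):"
)


def vip_destination_macs(lines: Iterable[str], vip: str, port: int) -> Counter:
    """Count the destination MACs of frames addressed to vip:port."""
    target = f"{vip}.{port}"  # tcpdump prints TCP endpoints as address.port
    macs: Counter = Counter()
    for line in lines:
        match = FRAME_RE.search(line)
        if match and match.group("dst") == target:
            macs[match.group("dst_mac")] += 1
    return macs


# Lines from the BACKUP amphora capture (problem present), options trimmed.
BEFORE_GARP = [
    "00:51:21.574940 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 54: 172.16.0.206 > 172.16.0.190: VRRPv2, Advertisement",
    "00:51:21.944793 fa:16:3e:a2:6d:a6 > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 10.0.0.35.41428 > 172.16.0.82.28410: Flags [S]",
    "00:51:23.920888 fa:16:3e:a2:6d:a6 > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 10.0.0.35.41442 > 172.16.0.82.28410: Flags [S]",
]

# Line from the MASTER amphora capture after the GARP, options trimmed.
AFTER_GARP = [
    "01:01:12.820964 fa:16:3e:a2:6d:a6 > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 74: 10.0.0.35.57358 > 172.16.0.82.28410: Flags [S]",
]

print("before GARP:", dict(vip_destination_macs(BEFORE_GARP, "172.16.0.82", 28410)))
print("after GARP: ", dict(vip_destination_macs(AFTER_GARP, "172.16.0.82", 28410)))
```

Running the same tally on the compute-node or controller captures could help narrow down where the stale mapping lives.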

~~~
(overcloud) [stack@undercloud-0 ~]$ curl -v -m 5  http://10.0.0.193:28410
* Rebuilt URL to: http://10.0.0.193:28410/
* Uses proxy env variable no_proxy == ',10.0.0.117,192.168.24.38'
*   Trying 10.0.0.193...
* TCP_NODELAY set
* Connected to 10.0.0.193 (10.0.0.193) port 28410 (#0)
> GET / HTTP/1.1
> Host: 10.0.0.193:28410
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Fri, 23 Sep 2022 05:03:18 GMT
< Server: Apache/2.4.37 (Red Hat Enterprise Linux)
< Last-Modified: Sat, 17 Sep 2022 06:15:49 GMT
< ETag: "11-5e8d968e94249"
< Accept-Ranges: bytes
< Content-Length: 17
< Content-Type: text/html; charset=UTF-8
< 
This is rhel-vm1
* Connection #0 to host 10.0.0.193 left intact
(overcloud) [stack@undercloud-0 ~]$ 
~~~


We need to understand how to fix this, because the LB randomly stops forwarding traffic after a failover. The issue can happen after one failover or only after many consecutive failovers.

Comment 9 Gregory Thiemonge 2022-09-30 13:02:40 UTC


*** This bug has been marked as a duplicate of bug 2126055 ***
