Description of problem:

Failing over a load balancer may result in traffic loss to the LB. The loss can appear after a single failover, or only after many consecutive failovers. The issue occurs with the amphora provider.

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.2.2 (Train)
Octavia Image: octavia-amphora-16.2-20220310.1.x86_64
3 x controllers
3 x compute nodes

How reproducible:
Intermittent

Steps to Reproduce:
1. Fail over a LB
2.
3.

Actual results:

Expected results:

Additional info:

Load balancer information:
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list -f yaml
- ha_ip: 172.16.0.82
  id: ddebebe6-5784-4d79-8c9b-615571cf4e5b
  lb_network_ip: 172.24.3.216
  loadbalancer_id: 5af2fced-134e-4d86-9248-12e0d3c12f33
  role: BACKUP
  status: ALLOCATED
- ha_ip: 172.16.0.82
  id: a7ffb051-5cf7-47f6-9805-e5bdc57c819e
  lb_network_ip: 172.24.3.22
  loadbalancer_id: 5af2fced-134e-4d86-9248-12e0d3c12f33
  role: MASTER
  status: ALLOCATED
~~~
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list -f yaml
- id: 5af2fced-134e-4d86-9248-12e0d3c12f33
  name: lb1
  project_id: 639600ae1ce84a46b984cf32d1ce2195
  provider: amphora
  provisioning_status: ACTIVE
  vip_address: 172.16.0.82
~~~
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer listener show 9fb09144-67d4-425c-bb76-5f687ea0928d -f yaml
admin_state_up: true
connection_limit: -1
created_at: '2022-09-17T06:12:05'
default_pool_id: c21e37e3-cfcf-4e42-a38c-4f4a7337deb2
default_tls_container_ref: null
description: ''
id: 9fb09144-67d4-425c-bb76-5f687ea0928d
insert_headers: null
l7policies: ''
loadbalancers: 5af2fced-134e-4d86-9248-12e0d3c12f33
name: listener1
operating_status: ONLINE
project_id: 639600ae1ce84a46b984cf32d1ce2195
protocol: TCP
protocol_port: 28410
provisioning_status: ACTIVE
sni_container_refs: []
timeout_client_data: 50000
timeout_member_connect: 5000
timeout_member_data: 50000
timeout_tcp_inspect: 0
updated_at:
'2022-09-23T04:29:59'
client_ca_tls_container_ref: null
client_authentication: NONE
client_crl_container_ref: null
allowed_cidrs: null
~~~
~~~
{
  "loadbalancer": {
    "id": "5af2fced-134e-4d86-9248-12e0d3c12f33",
    "name": "lb1",
    "operating_status": "ONLINE",
    "provisioning_status": "ACTIVE",
    "listeners": [
      {
        "id": "9fb09144-67d4-425c-bb76-5f687ea0928d",
        "name": "listener1",
        "operating_status": "ONLINE",
        "provisioning_status": "ACTIVE",
        "pools": [
          {
            "id": "c21e37e3-cfcf-4e42-a38c-4f4a7337deb2",
            "name": "pool1",
            "provisioning_status": "ACTIVE",
            "operating_status": "ONLINE",
            "members": [
              {
                "id": "d12542f7-99ac-401d-8207-ea9f23e8fd5e",
                "name": "rhel-vm1",
                "operating_status": "NO_MONITOR",
                "provisioning_status": "ACTIVE",
                "address": "172.16.0.210",
                "protocol_port": 80
              }
            ]
          }
        ]
      }
    ]
  }
}
~~~

The client traffic comes from the director node (10.0.0.35) using the command: curl -v http://10.0.0.193:28410
Everything works as expected until we hit this issue. Curling the backend member through its own FIP works perfectly:
~~~
(overcloud) [stack@undercloud-0 ~]$ curl -m5 http://10.0.0.158
This is rhel-vm1
~~~

curl command used against the LB FIP:
~~~
(overcloud) [stack@undercloud-0 ~]$ curl -v -m 5 http://10.0.0.193:28410
* Rebuilt URL to: http://10.0.0.193:28410/
* Uses proxy env variable no_proxy == ',10.0.0.117,192.168.24.38'
*   Trying 10.0.0.193...
* TCP_NODELAY set
* Connection timed out after 5001 milliseconds
* Closing connection 0
curl: (28) Connection timed out after 5001 milliseconds
~~~

FIPs:
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack floating ip list --floating-ip-address 10.0.0.158 -f yaml
- Fixed IP Address: 172.16.0.210
  Floating IP Address: 10.0.0.158
  Floating Network: a372faa5-156a-4642-b939-c9b7f4a6fa99
  ID: 4307da72-4e1b-4925-b458-86a646cf1b1d
  Port: 2b361e7f-ddcc-4fb6-99bc-6cc40813c50b
  Project: 639600ae1ce84a46b984cf32d1ce2195
~~~
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack floating ip list --floating-ip-address 10.0.0.193 -f yaml
- Fixed IP Address: 172.16.0.82
  Floating IP Address: 10.0.0.193
  Floating Network: a372faa5-156a-4642-b939-c9b7f4a6fa99
  ID: cbe57048-e6e8-4c7d-9750-de073758184e
  Port: 64cebd7f-8a52-4af8-bd44-42eeca1aedf4
  Project: 639600ae1ce84a46b984cf32d1ce2195
~~~

Do note: the LB listener listens on port 28410 and passes traffic through to the backend member on port 80. (The same issue is being reproduced in a lab to replicate the customer's environment.)
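Worth noting: the status tree shown earlier reports every node healthy, so the failure is invisible to the Octavia API. A minimal walk over such a status tree (stdlib only; the helper name and the trimmed sample data are ours, the data mirrors the report) confirms that no node reports a bad operating_status even while the VIP times out:

```python
import json

# Status tree trimmed to the fields used below; values mirror the report.
STATUS_TREE = """
{
  "loadbalancer": {
    "id": "5af2fced-134e-4d86-9248-12e0d3c12f33",
    "operating_status": "ONLINE",
    "listeners": [
      {
        "id": "9fb09144-67d4-425c-bb76-5f687ea0928d",
        "operating_status": "ONLINE",
        "pools": [
          {
            "id": "c21e37e3-cfcf-4e42-a38c-4f4a7337deb2",
            "operating_status": "ONLINE",
            "members": [
              {
                "id": "d12542f7-99ac-401d-8207-ea9f23e8fd5e",
                "operating_status": "NO_MONITOR"
              }
            ]
          }
        ]
      }
    ]
  }
}
"""

def unhealthy_nodes(tree: dict) -> list:
    """Return (kind, id, operating_status) for every node whose
    operating_status is neither ONLINE nor NO_MONITOR."""
    bad = []

    def check(kind, obj):
        if obj["operating_status"] not in ("ONLINE", "NO_MONITOR"):
            bad.append((kind, obj["id"], obj["operating_status"]))

    lb = tree["loadbalancer"]
    check("loadbalancer", lb)
    for listener in lb.get("listeners", []):
        check("listener", listener)
        for pool in listener.get("pools", []):
            check("pool", pool)
            for member in pool.get("members", []):
                check("member", member)
    return bad

print(unhealthy_nodes(json.loads(STATUS_TREE)))  # [] - everything reports healthy
```

This is exactly why the bug is hard to catch from the control plane: the data-plane black hole (traffic landing on the BACKUP amphora, shown below) never surfaces in any status field.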
We can see the traffic coming in on controller-2:
~~~
[root@controller-2 neutron]# tcpdump -nnei genev_sys_6081 port 28410
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
01:01:47.697111 fa:16:3e:a2:6d:a6 > fa:16:3e:7c:46:0a, ethertype IPv4 (0x0800), length 74: 10.0.0.35.54156 > 172.16.0.82.28410: Flags [S], seq 980173721, win 29200, options [mss 1460,sackOK,TS val 1799839257 ecr 0,nop,wscale 7], length 0
01:01:47.697865 fa:16:3e:aa:71:7a > 52:54:00:5f:e4:43, ethertype IPv4 (0x0800), length 74: 172.16.0.82.28410 > 10.0.0.35.54156: Flags [S.], seq 3334803320, ack 980173722, win 27800, options [mss 1402,sackOK,TS val 4283067010 ecr 1799839257,nop,wscale 4], length 0
01:01:47.698021 fa:16:3e:a2:6d:a6 > fa:16:3e:7c:46:0a, ethertype IPv4 (0x0800), length 66: 10.0.0.35.54156 > 172.16.0.82.28410: Flags [.], ack 1, win 229, options [nop,nop,TS val 1799839258 ecr 4283067010], length 0
01:01:47.698041 fa:16:3e:a2:6d:a6 > fa:16:3e:7c:46:0a, ethertype IPv4 (0x0800), length 146: 10.0.0.35.54156 > 172.16.0.82.28410: Flags [P.], seq 1:81, ack 1, win 229, options [nop,nop,TS val 1799839259 ecr 4283067010], length 80
~~~

We can see the traffic going through the geneve tunnel to compute-1:
~~~
[root@compute-1 ~]# tcpdump -nnei genev_sys_6081 port 28410
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
04:32:03.112531 fa:16:3e:a2:6d:a6 > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 10.0.0.35.51642 > 172.16.0.82.28410: Flags [S], seq 4144791301, win 29200, options [mss 1460,sackOK,TS val 1812454674 ecr 0,nop,wscale 7], length 0
04:32:05.110182 fa:16:3e:a2:6d:a6 > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 10.0.0.35.51644 > 172.16.0.82.28410: Flags [S], seq 1955949749, win
29200, options [mss 1460,sackOK,TS val 1812456671 ecr 0,nop,wscale 7], length 0
04:32:06.120679 fa:16:3e:a2:6d:a6 > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 10.0.0.35.51644 > 172.16.0.82.28410: Flags [S], seq 1955949749, win 29200, options [mss 1460,sackOK,TS val 1812457682 ecr 0,nop,wscale 7], length 0
~~~

Where the amphora instances are running:
~~~
| ddebebe6-5784-4d79-8c9b-615571cf4e5b | 5af2fced-134e-4d86-9248-12e0d3c12f33 | ALLOCATED | BACKUP | 172.24.3.216 | 172.16.0.82 |
| a7ffb051-5cf7-47f6-9805-e5bdc57c819e | 5af2fced-134e-4d86-9248-12e0d3c12f33 | ALLOCATED | MASTER | 172.24.3.22  | 172.16.0.82 |
~~~

BACKUP (openstack loadbalancer amphora show ddebebe6-5784-4d79-8c9b-615571cf4e5b):
~~~
| compute_id           | ef9a8611-06ac-4754-b70c-e3611bea0c1f |
| OS-EXT-SRV-ATTR:host | compute-1.redhat.local               |
~~~
MASTER:
~~~
| compute_id           | 306a7e9f-3266-46c5-8017-42316d2d21c5 |
| OS-EXT-SRV-ATTR:host | compute-2.redhat.local               |
~~~

We can see the traffic is hitting the BACKUP amphora:
~~~
[root@amphora-ddebebe6-5784-4d79-8c9b-615571cf4e5b ~]# ip netns exec amphora-haproxy tcpdump -nnei eth1
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
00:51:21.574940 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 54: 172.16.0.206 > 172.16.0.190: VRRPv2, Advertisement, vrid 1, prio 100, authtype simple, intvl 1s, length 20
00:51:21.944793 fa:16:3e:a2:6d:a6 > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 10.0.0.35.41428 > 172.16.0.82.28410: Flags [S], seq 2574929603, win 29200, options [mss 1460,sackOK,TS val 1813613458 ecr 0,nop,wscale 7], length 0
00:51:21.944823 fa:16:3e:40:b6:1c > fa:16:3e:09:61:4d, ethertype IPv4 (0x0800), length 74: 10.0.0.35.41428 > 172.16.0.82.28410: Flags [S], seq 2574929603, win 29200, options [mss 1460,sackOK,TS val 1813613458 ecr 0,nop,wscale 7], length 0
00:51:22.575411 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4
(0x0800), length 54: 172.16.0.206 > 172.16.0.190: VRRPv2, Advertisement, vrid 1, prio 100, authtype simple, intvl 1s, length 20
00:51:23.576352 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 54: 172.16.0.206 > 172.16.0.190: VRRPv2, Advertisement, vrid 1, prio 100, authtype simple, intvl 1s, length 20
00:51:23.902529 fa:16:3e:40:b6:1c > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 66: 172.16.0.190.39017 > 172.16.0.206.1025: Flags [R.], seq 791843565, ack 2539806639, win 1753, options [nop,nop,TS val 2547303002 ecr 2644028076], length 0
00:51:23.902671 fa:16:3e:40:b6:1c > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 74: 172.16.0.190.39019 > 172.16.0.206.1025: Flags [S], seq 1894260233, win 28040, options [mss 1402,sackOK,TS val 2547303002 ecr 0,nop,wscale 4], length 0
00:51:23.903627 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 172.16.0.206.1025 > 172.16.0.190.39019: Flags [S.], seq 1491209716, ack 1894260234, win 27800, options [mss 1402,sackOK,TS val 2644033077 ecr 2547303002,nop,wscale 4], length 0
00:51:23.903650 fa:16:3e:40:b6:1c > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 66: 172.16.0.190.39019 > 172.16.0.206.1025: Flags [.], ack 1, win 1753, options [nop,nop,TS val 2547303003 ecr 2644033077], length 0
00:51:23.903836 fa:16:3e:40:b6:1c > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 142: 172.16.0.190.39019 > 172.16.0.206.1025: Flags [P.], seq 1:77, ack 1, win 1753, options [nop,nop,TS val 2547303003 ecr 2644033077], length 76
00:51:23.904075 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 172.16.0.206.30633 > 172.16.0.190.1025: Flags [S], seq 1351400618, win 28040, options [mss 1402,sackOK,TS val 2644033077 ecr 0,nop,wscale 4], length 0
00:51:23.904097 fa:16:3e:40:b6:1c > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 74: 172.16.0.190.1025 > 172.16.0.206.30633: Flags [S.], seq 2641093430, ack 1351400619, win 27800, options [mss 1402,sackOK,TS val
2547303004 ecr 2644033077,nop,wscale 4], length 0
00:51:23.904351 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 66: 172.16.0.206.1025 > 172.16.0.190.39019: Flags [.], ack 77, win 1738, options [nop,nop,TS val 2644033078 ecr 2547303003], length 0
00:51:23.904360 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 70: 172.16.0.206.1025 > 172.16.0.190.39019: Flags [P.], seq 1:5, ack 77, win 1738, options [nop,nop,TS val 2644033078 ecr 2547303003], length 4
00:51:23.904365 fa:16:3e:40:b6:1c > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 66: 172.16.0.190.39019 > 172.16.0.206.1025: Flags [.], ack 5, win 1753, options [nop,nop,TS val 2547303004 ecr 2644033078], length 0
00:51:23.904649 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 54: 172.16.0.206.30633 > 172.16.0.190.1025: Flags [R], seq 1351400619, win 0, length 0
00:51:23.920888 fa:16:3e:a2:6d:a6 > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 74: 10.0.0.35.41442 > 172.16.0.82.28410: Flags [S], seq 3968766919, win 29200, options [mss 1460,sackOK,TS val 1813615434 ecr 0,nop,wscale 7], length 0
00:51:23.920907 fa:16:3e:40:b6:1c > fa:16:3e:09:61:4d, ethertype IPv4 (0x0800), length 74: 10.0.0.35.41442 > 172.16.0.82.28410: Flags [S], seq 3968766919, win 29200, options [mss 1460,sackOK,TS val 1813615434 ecr 0,nop,wscale 7], length 0
00:51:24.576012 fa:16:3e:c0:60:2a > fa:16:3e:40:b6:1c, ethertype IPv4 (0x0800), length 54: 172.16.0.206 > 172.16.0.190: VRRPv2, Advertisement, vrid 1, prio 100, authtype simple, intvl 1s, length 20
~~~

On the MASTER amphora we can NOT see any of this traffic:
~~~
[root@amphora-a7ffb051-5cf7-47f6-9805-e5bdc57c819e ~]# ip netns exec amphora-haproxy tcpdump -nnei eth1 port 28410
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
~~~

To work around the issue, we can send an unsolicited (gratuitous) ARP, a GARP, from the MASTER amphora. Traffic then starts arriving at the MASTER and stops hitting the BACKUP:
~~~
ip netns exec amphora-haproxy arping -c 100 -A -I eth1 -s 172.16.0.82 172.16.0.82
~~~

On the MASTER amphora we can see the traffic flowing again:
~~~
[root@amphora-a7ffb051-5cf7-47f6-9805-e5bdc57c819e ~]# ip netns exec amphora-haproxy tcpdump -nnei eth1 port 28410
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
01:01:12.820964 fa:16:3e:a2:6d:a6 > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 74: 10.0.0.35.57358 > 172.16.0.82.28410: Flags [S], seq 994771537, win 29200, options [mss 1460,sackOK,TS val 1814205527 ecr 0,nop,wscale 7], length 0
01:01:12.821301 fa:16:3e:c0:60:2a > fa:16:3e:a2:6d:a6, ethertype IPv4 (0x0800), length 74: 172.16.0.82.28410 > 10.0.0.35.57358: Flags [S.], seq 2063853850, ack 994771538, win 27800, options [mss 1402,sackOK,TS val 450892544 ecr 1814205527,nop,wscale 4], length 0
01:01:12.822768 fa:16:3e:a2:6d:a6 > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 146: 10.0.0.35.57358 > 172.16.0.82.28410: Flags [P.], seq 1:81, ack 1, win 229, options [nop,nop,TS val 1814205530 ecr 450892544], length 80
01:01:12.822793 fa:16:3e:c0:60:2a > fa:16:3e:a2:6d:a6, ethertype IPv4 (0x0800), length 66: 172.16.0.82.28410 > 10.0.0.35.57358: Flags [.], ack 81, win 1738, options [nop,nop,TS val 450892545 ecr 1814205530], length 0
01:01:12.822817 fa:16:3e:a2:6d:a6 > fa:16:3e:c0:60:2a, ethertype IPv4 (0x0800), length 66: 10.0.0.35.57358 > 172.16.0.82.28410: Flags [.], ack 1, win 229, options
[nop,nop,TS val 1814205530 ecr 450892544], length 0
01:01:12.822820 fa:16:3e:c0:60:2a > fa:16:3e:a2:6d:a6, ethertype IPv4 (0x0800), length 66: 172.16.0.82.28410 > 10.0.0.35.57358: Flags [.], ack 81, win 1738, options [nop,nop,TS val 450892545 ecr 1814205530], length 0
01:01:12.826233 fa:16:3e:c0:60:2a > fa:16:3e:a2:6d:a6, ethertype IPv4 (0x0800), length 343: 172.16.0.82.28410 > 10.0.0.35.57358: Flags [P.], seq 1:278, ack 81, win 1738, options [nop,nop,TS val 450892549 ecr 1814205530], length 277
~~~
~~~
(overcloud) [stack@undercloud-0 ~]$ curl -v -m 5 http://10.0.0.193:28410
* Rebuilt URL to: http://10.0.0.193:28410/
* Uses proxy env variable no_proxy == ',10.0.0.117,192.168.24.38'
*   Trying 10.0.0.193...
* TCP_NODELAY set
* Connected to 10.0.0.193 (10.0.0.193) port 28410 (#0)
> GET / HTTP/1.1
> Host: 10.0.0.193:28410
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Fri, 23 Sep 2022 05:03:18 GMT
< Server: Apache/2.4.37 (Red Hat Enterprise Linux)
< Last-Modified: Sat, 17 Sep 2022 06:15:49 GMT
< ETag: "11-5e8d968e94249"
< Accept-Ranges: bytes
< Content-Length: 17
< Content-Type: text/html; charset=UTF-8
<
This is rhel-vm1
* Connection #0 to host 10.0.0.193 left intact
(overcloud) [stack@undercloud-0 ~]$
~~~

We need to understand how to fix this issue, as the LB will randomly stop forwarding traffic after a failover. The issue can occur after one failover or only after many consecutive failovers.
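Since the failure mode is a silent connect timeout with all statuses reporting healthy, a post-failover smoke test against the VIP catches it quickly. A minimal sketch of such a probe (equivalent to the "curl -m 5" test above at the TCP-connect level; the function name is ours, not part of Octavia):

```python
import socket

def vip_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """TCP-connect probe: True if the SYN is answered before the
    timeout, False on timeout/refusal/unreachable (the symptom seen
    in this bug is a plain connect timeout)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical check after each failover, using the VIP FIP and
# listener port from this report:
# if not vip_reachable("10.0.0.193", 28410):
#     print("VIP unreachable - traffic may be black-holed on the BACKUP amphora")
```

Running such a probe in a loop while exercising failovers would pinpoint exactly which failover leaves the VIP black-holed, without waiting for client reports.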
*** This bug has been marked as a duplicate of bug 2126055 ***