Bug 1997128

Summary:	OpenShift on OpenStack creates too many members in the octavia pool
Product:	OpenShift Container Platform
Component:	Cloud Compute
Cloud Compute sub component:	OpenStack Provider
Reporter:	Gregory Thiemonge <gthiemon>
Assignee:	egarcia
QA Contact:	Jon Uriarte <juriarte>
Status:	CLOSED DUPLICATE
Severity:	medium
Priority:	unspecified
CC:	adduarte, bbonguar, egarcia, ihrachys, jveiraca, lpeer, majopela, m.andre, mbooth, mfedosin, njohnston, pprinett, scohen
Version:	4.6
Hardware:	x86_64
OS:	Linux
Clone Of:	1996756
Last Closed:	2021-09-01 08:40:01 UTC

Description Gregory Thiemonge 2021-08-24 13:18:23 UTC

+++ This bug was initially created as a clone of Bug #1996756 +++

Description of problem:
We run OpenShift (4.6) on OpenStack (16.1).
We notice that the load balancer pool contains only 2 ONLINE members:
```
$ openstack loadbalancer member list 56bcc29c-7624-4b2d-8935-229472ca2316 -c operating_status -f value | sort | uniq -c
    132 ERROR
      2 ONLINE
```
In this particular case, the loadbalancer is expected to balance the load between 2 servers, yet there are 132 useless workers in the member list.

The problem arises in case of failover:
- when OpenStack triggers a failover of the master amphora, the traffic is taken by the backup amphora until the master amphora becomes active again. When the traffic comes back to the master amphora, Octavia considers all the pool members as ONLINE and is not able to balance the load properly
- when a new loadbalancer member is added to the pool, every single pool member is marked as ONLINE until the amphora marks them as ERROR again
- in our case, we have had a load balancer in ERROR state for 2 months. We just triggered a failover of this loadbalancer: OpenShift is triggering a lot of additions/deletions of members in the pool, causing the loadbalancer to go to PENDING_UPDATE state and all the pool members to go ONLINE. This has been going on for more than 1 hour now.
- also, the loadbalancer always shows a DEGRADED operating_status because some of the pool members are in ERROR.
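
One way to observe the status flapping described above is to poll the pool and print a timestamped summary of member states. This is only a minimal sketch built on the same `openstack loadbalancer member list` command used earlier; the function names `summarize_status` and `watch_pool` and the default 10-second interval are illustrative assumptions, not part of any tool.

```shell
#!/bin/sh
# Turn a stream of operating_status values (one per line, read from stdin)
# into a one-line summary such as "ERROR=132 ONLINE=2".
summarize_status() {
    sort | uniq -c |
        awk '{ printf "%s%s=%s", sep, $2, $1; sep = " " } END { print "" }'
}

# Poll a pool until interrupted (Ctrl-C), printing one summary per interval.
# $1: pool ID or name, $2: polling interval in seconds (assumed default: 10).
watch_pool() {
    pool_id=$1
    interval=${2:-10}
    while :; do
        printf '%s ' "$(date -u +%H:%M:%S)"
        openstack loadbalancer member list "$pool_id" \
            -c operating_status -f value | summarize_status
        sleep "$interval"
    done
}
```

Running `watch_pool 56bcc29c-7624-4b2d-8935-229472ca2316 5` while triggering a failover or adding a member should make the temporary jump in ONLINE members visible.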

I think the problem is twofold:
- OpenShift should add to the loadbalancer only the workers that can actually serve the traffic
- Octavia should try its best to keep the member state after a haproxy restart (through something like haproxy server-state-file option)
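
For reference, the generic HAProxy mechanism mentioned in the second point could look roughly like the sketch below. It shows plain HAProxy state persistence, not how the Octavia amphora agent is implemented; the socket path, state file path and the `save_server_state` helper are hypothetical, and `socat` is assumed to be installed.

```shell
#!/bin/sh
# Generic HAProxy server-state persistence (NOT Octavia's actual code).
# The HAProxy configuration would need something like this
# (all paths are hypothetical):
#
#   global
#       stats socket /var/run/haproxy.sock mode 600 level admin
#       server-state-file /var/lib/haproxy/server-state
#   defaults
#       load-server-state-from-file global
#
# Dump the live server states before a reload so that the new process can
# start from the last known health status of each server.
# $1: path to the HAProxy stats socket, $2: path to the state file.
save_server_state() {
    socket=$1
    state_file=$2
    tmp="${state_file}.tmp"
    # "show servers state" is an HAProxy runtime API command; write to a
    # temporary file first so a failed dump never truncates the old state.
    printf 'show servers state\n' |
        socat stdio "UNIX-CONNECT:${socket}" > "$tmp" &&
        mv "$tmp" "$state_file"
}
```

A reload wrapper would call `save_server_state /var/run/haproxy.sock /var/lib/haproxy/server-state` immediately before reloading HAProxy.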

The Octavia behavior is very easy to reproduce: just create a load balancer with a health monitor, add a couple of pool members, and you will see the members that were in ERROR going back to ONLINE for a short while.
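
The reproduction described above could be scripted along these lines. This is a hedged sketch, not the reporter's exact procedure: the `repro-*` resource names, the TCP protocol and port 80, the health monitor timings and the member addresses are all placeholder assumptions, and the subnet ID must be supplied by the caller.

```shell
#!/bin/sh
# Block until the load balancer is ACTIVE again; give up if it goes to ERROR.
wait_active() {
    lb=$1
    while :; do
        status=$(openstack loadbalancer show "$lb" -c provisioning_status -f value)
        case $status in
            ACTIVE) return 0 ;;
            ERROR)  echo "load balancer $lb is in ERROR" >&2; return 1 ;;
        esac
        sleep 5
    done
}

# $1: ID of the subnet used for the VIP and the members (caller-supplied).
reproduce() {
    subnet=$1
    openstack loadbalancer create --name repro-lb --vip-subnet-id "$subnet" || return 1
    wait_active repro-lb || return 1
    openstack loadbalancer listener create --name repro-listener \
        --protocol TCP --protocol-port 80 repro-lb || return 1
    wait_active repro-lb || return 1
    openstack loadbalancer pool create --name repro-pool \
        --lb-algorithm ROUND_ROBIN --listener repro-listener --protocol TCP || return 1
    wait_active repro-lb || return 1
    openstack loadbalancer healthmonitor create --name repro-hm \
        --delay 5 --timeout 5 --max-retries 3 --type TCP repro-pool || return 1
    wait_active repro-lb || return 1
    # Placeholder addresses where nothing listens, so that the health
    # monitor eventually marks these members as ERROR.
    for addr in 192.0.2.10 192.0.2.11; do
        openstack loadbalancer member create --subnet-id "$subnet" \
            --address "$addr" --protocol-port 80 repro-pool || return 1
        wait_active repro-lb || return 1
    done
}
```

After `reproduce <subnet-id>` completes and the members have settled in ERROR, adding one more member with `openstack loadbalancer member create` and listing the members right away should show the previously failed members as ONLINE for a short while.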

Version-Release number of selected component (if applicable):
RHOSP 16.1

How reproducible:
OCP on OSP with Octavia

Steps to Reproduce:
1.
2.
3.

Actual results:
Loadbalancer always shows as DEGRADED.

Expected results:


Additional info:

Comment 4 egarcia 2021-08-30 16:51:46 UTC

At first glance, my suspicion is that this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1957709. @gthiemon, it's not very likely that we will be able to fix this in the in-tree cloud provider unless it's urgent, but this is something we are already aware of and want to fix once we fully integrate the out-of-tree cloud provider.

Comment 5 Gregory Thiemonge 2021-09-01 06:41:09 UTC

(In reply to egarcia from comment #4)
> At first glance, my suspicion is that this is a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1957709. @gthiemon,
> it's not very likely that we will be able to fix this in the in-tree cloud
> provider unless it's urgent, but this is something we are already aware of
> and want to fix once we fully integrate the out-of-tree cloud provider.

Forwarding the needinfo to @jveiraca, as he was the reporter of the original bug in OSP.

Comment 7 Joaquín Veira 2021-09-01 08:40:01 UTC


*** This bug has been marked as a duplicate of bug 1957709 ***

Comment 8 egarcia 2021-09-01 12:38:01 UTC

We can't tackle this until we fully adopt the out-of-tree cloud provider, so unfortunately we can't promise this until at least 4.10+.