Bug 2057007 - LB members in ERROR become ONLINE when adding new members
Summary: LB members in ERROR become ONLINE when adding new members
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: z3
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Gregory Thiemonge
QA Contact: Bruna Bonguardo
URL:
Whiteboard:
Depends On: 1996756
Blocks:
 
Reported: 2022-02-22 14:42 UTC by Gregory Thiemonge
Modified: 2022-06-22 16:04 UTC (History)
7 users

Fixed In Version: openstack-octavia-5.1.3-2.20220222165233.1bc1477.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1996756
Environment:
Last Closed: 2022-06-22 16:04:16 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 828274 0 None MERGED Preserve haproxy server states during reloads 2022-02-22 14:46:43 UTC
Red Hat Issue Tracker OSP-12817 0 None None None 2022-02-22 14:52:36 UTC
Red Hat Product Errata RHBA-2022:4793 0 None None None 2022-06-22 16:04:43 UTC

Description Gregory Thiemonge 2022-02-22 14:42:24 UTC
+++ This bug was initially created as a clone of Bug #1996756 +++

Description of problem:
We run OpenShift (4.6) on OpenStack (16.1).
We noticed that the load balancer pool member list contains only 2 ONLINE members:
```
$ openstack loadbalancer member list 56bcc29c-7624-4b2d-8935-229472ca2316 -c operating_status -f value | sort | uniq -c
    132 ERROR
      2 ONLINE
```
In this particular case, the load balancer is expected to balance the load between 2 servers, yet there are 132 useless workers in the member list.

The problem arises in case of failover:
- when OpenStack triggers a failover of the master amphora, the traffic is taken over by the backup amphora until the master amphora becomes active again. When the traffic comes back to the master amphora, Octavia considers all the pool members ONLINE and is not able to balance the load properly
- when a new load balancer member is added to the pool, every single pool member is marked as ONLINE, until the amphora marks them as ERROR again
- in our case, we have had a load balancer in ERROR state for 2 months. We just triggered a failover of this load balancer: OpenShift is triggering a lot of additions/deletions of members in the pool, causing the load balancer to go to PENDING_UPDATE state and all the pool members to go ONLINE. This has been going on for more than 1 hour now.
- also, the load balancer always shows a DEGRADED operating_status because some of the pool members are in ERROR.

I think the problem is twofold:
- OpenShift should only add to the load balancer the workers that can actually serve the traffic
- Octavia should try its best to preserve the member states across a haproxy restart (through something like the haproxy server-state-file option)
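For reference, a minimal sketch of the haproxy server-state-file mechanism mentioned above (the file and socket paths are illustrative, not Octavia's actual amphora layout):

```
# haproxy.cfg -- illustrative paths, not Octavia's real configuration
global
    # On reload, the new process reads server states dumped by the old one
    server-state-file /var/lib/haproxy/server-state

defaults
    # Use the global state file for all backends
    load-server-state-from-file global
```

The state file must be dumped from the running process before each reload, e.g. `echo "show servers state" | socat stdio /var/run/haproxy.sock > /var/lib/haproxy/server-state`; without that dump, a reloaded haproxy starts every server as UP, which is exactly the ONLINE flapping described above.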

The Octavia behavior is very easy to reproduce: just create a load balancer with a health monitor, add a couple of pool members, and you will see members that were in ERROR going back to ONLINE for a short while.
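A sketch of that reproduction with the openstack CLI (names, addresses, and the subnet ID are illustrative; 192.0.2.10 is assumed to be an address that fails health checks):

```
$ openstack loadbalancer create --name lb1 --vip-subnet-id <subnet-id>
$ openstack loadbalancer listener create --name listener1 --protocol HTTP --protocol-port 80 lb1
$ openstack loadbalancer pool create --name pool1 --listener listener1 --protocol HTTP --lb-algorithm ROUND_ROBIN
$ openstack loadbalancer healthmonitor create --delay 5 --timeout 3 --max-retries 2 --type HTTP pool1
$ # Add a member that does not answer health checks and wait for it to go to ERROR
$ openstack loadbalancer member create --address 192.0.2.10 --protocol-port 80 pool1
$ # Add a second member: the ERROR member briefly flips back to ONLINE
$ openstack loadbalancer member create --address 192.0.2.11 --protocol-port 80 pool1
$ watch openstack loadbalancer member list pool1 -c address -c operating_status
```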

Version-Release number of selected component (if applicable):
RHOSP 16.1

How reproducible:
OCP on OSP with Octavia

Steps to Reproduce:
1. Create a load balancer with a health monitor and a couple of pool members.
2. Wait for some pool members to go to ERROR.
3. Add a new pool member and watch the operating_status of the existing members.

Actual results:
The load balancer always shows as DEGRADED, and members in ERROR briefly flip back to ONLINE whenever the pool is updated.

Expected results:
Members in ERROR should stay in ERROR when new members are added; haproxy should preserve member states across reloads.

Additional info:

Comment 10 errata-xmlrpc 2022-06-22 16:04:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.3 (Train)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4793

