1835974 – [Baremetal] API VIP doesn't fail over to another master when local LB is not healthy.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Yossi Boaron
QA Contact:	Eldar Weiss
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-05-14 20:05 UTC by Yossi Boaron
Modified:	2020-07-13 17:39 UTC
CC List:	4 users
Fixed In Version:	4.5.0-0.nightly-2020-06-09-050255
Doc Type:	Bug Fix
Doc Text:	Cause: The self-hosted load balancer that distributes OCP API traffic failed on the master node serving as the API LB front end (the node that owns the API VIP). Consequence: That master node continued to own the API VIP even though its local LB was unhealthy, and as a result the OCP API was unreachable for ~10 seconds. Fix: The Keepalived check script for the API VIP now also monitors the health of the self-hosted load balancer. Result: If the local self-hosted load balancer fails on the master node holding the API VIP, the API VIP fails over to another master node, and the OCP API should not experience downtime.
Clone Of:
Environment:
Last Closed:	2020-07-13 17:39:04 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 1733	0	None	closed	Bug 1835974: Update keepalived API script to monitor also LB health	2021-01-05 12:38:35 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:39:19 UTC

Description Yossi Boaron 2020-05-14 20:05:59 UTC

Description of problem:

In the current implementation, the API VIP fails over to another master based only on the status of the local kube-api-server pod.
With this approach, the API VIP can be owned by a master node whose local LB is not healthy.

Version-Release number of the following components:
4.5.0-0.ci-2020-05-14-170026


How reproducible:

Steps to Reproduce:
1. SSH to the master node that holds the API VIP.
2. Run a script that repeatedly deletes the HAProxy LB container, like so:

while true; do
  sleep 1
  sudo crictl rm -f $(sudo crictl ps --name haproxy | awk 'FNR==2{ print $1}')
done

Actual results:

1. The API VIP is still owned by this master node although the local LB is unhealthy or not running.
2. When the OCP API is not accessible via the LB, the haproxy-monitor container deletes (after ~10 seconds) the firewall rule that redirects API traffic to the LB, so API traffic is sent directly to the local kube-api-server and is not distributed between masters.
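
One possible way to confirm the first result on the node is to look for the VIP in the output of `ip -o addr show`. This is only a sketch: the helper name `owns_vip` and the address used in the usage note are hypothetical and are not taken from this report.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the given address appears in
# `ip -o addr show` style output supplied on stdin.
owns_vip() {
    local vip="$1"
    # Fixed-string match on " <address>/" so that the dots are literal
    # and 192.168.111.5 does not match 192.168.111.50.
    grep -qF " ${vip}/"
}
```

Usage, with a placeholder VIP: `ip -o addr show | owns_vip 192.168.111.5 && echo "this node holds the API VIP"`.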


Expected results:

If the local LB is not healthy, the API VIP should fail over to another master node.
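
A minimal sketch of a check that would give the expected behavior: it succeeds only when both the local kube-api-server and the local LB respond, so a Keepalived track script built on it would release the VIP when either one fails. The two URLs (ports and `/readyz` paths) and the function names are assumptions made for illustration; they are not taken from this report or from the linked pull request.

```shell
#!/bin/bash
# Assumed endpoints; adjust to the actual deployment.
API_HEALTH_URL="${API_HEALTH_URL:-https://localhost:6443/readyz}"
LB_HEALTH_URL="${LB_HEALTH_URL:-http://localhost:50936/readyz}"

# Succeeds only if the URL answers with a non-error HTTP status within 2 seconds.
probe() {
    curl -o /dev/null -kLfs --max-time 2 "$1"
}

# Healthy only when BOTH the local API server and the local LB respond.
api_vip_check() {
    probe "$API_HEALTH_URL" && probe "$LB_HEALTH_URL"
}
```

A script used by Keepalived would end by calling `api_vip_check`, so that its exit status reflects the combined health of the API server and the LB.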

Comment 1 Kirsten Garrison 2020-05-15 17:36:36 UTC

@Yossi Can you add a severity to this BZ please?

Comment 7 errata-xmlrpc 2020-07-13 17:39:04 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
