1884420 – Keepalived stops on bootstrap too early

Summary: Keepalived stops on bootstrap too early

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Antoni Segura Puimedon
QA Contact:	Victor Voronkov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-10-01 22:34 UTC by Ben Nemec
Modified:	2020-10-27 16:47 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: To keep the API VIP on the bootstrap node until the masters' API comes up, we increased the priority of the bootstrap keepalived API VIP membership. So that the VIP can still move to the masters when the bootstrap node is asked to stay around after clustering (when its API server is already gone), we implemented a mechanism in the monitor that stops keepalived. The problem was that sometimes, during clustering, the API on the bootstrap node could go down for long enough that it looked as if it would not come back up.
Consequence: If the bootstrap kube-apiserver goes down for long enough to trigger the keepalived-monitor to stop keepalived, the deployment breaks.
Fix: Continue to check for the API server on the bootstrap node, and reload keepalived if it shows up again. If the API is gone for good, the API VIP moves to one of the masters; if it only went down for a while because of API pod restarts or resource issues, the bootstrap node reloads keepalived and reclaims the API VIP.
Result: Deployment succeeds.
Clone Of:
Environment:
Last Closed:	2020-10-27 16:47:24 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift baremetal-runtimecfg pull 102	0	None	closed	Bug 1884420: bootstrap: API shows up, start it again	2021-02-08 16:05:15 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:47:40 UTC

Description Ben Nemec 2020-10-01 22:34:11 UTC

Description of problem: There can be timing issues on the bootstrap node that result in keepalived stopping before kube-apiserver is up on the masters. When this happens, the API VIP migrates to the masters before they are ready and this causes the deployment to fail. The problem appears to be that the bootstrap kube-apiserver goes down for a period of time, and if this time is long enough to trigger the keepalived-monitor to stop keepalived, then the deployment breaks.

How reproducible: Intermittent. In some environments it happens frequently, in others rarely.
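
The stop-then-reload behaviour described in the Doc Text can be sketched as a small state machine. This is an illustrative sketch only, not the code from the linked baremetal-runtimecfg pull request: the type BootstrapMonitor, the field FailureThreshold, and the Action values are hypothetical names, and the idea of counting consecutive failed probes is an assumption about how "down for long enough" might be decided.

```go
package main

import "fmt"

// Action is what this hypothetical monitor would do to keepalived after one probe.
type Action string

const (
	NoOp   Action = "noop"
	Stop   Action = "stop"
	Reload Action = "reload"
)

// BootstrapMonitor is a hypothetical model of the keepalived-monitor on the
// bootstrap node. FailureThreshold is an assumed tunable: the number of
// consecutive failed API probes tolerated before keepalived is stopped.
type BootstrapMonitor struct {
	FailureThreshold int
	failures         int
	stopped          bool
}

// Observe records one probe of the bootstrap kube-apiserver and returns the
// action to take. Probing continues after a stop, so a later success
// triggers a reload instead of leaving keepalived down for good.
func (m *BootstrapMonitor) Observe(apiUp bool) Action {
	if apiUp {
		m.failures = 0
		if m.stopped {
			// The API came back: reload keepalived so the bootstrap node
			// can reclaim the API VIP.
			m.stopped = false
			return Reload
		}
		return NoOp
	}
	if m.stopped {
		// Already stopped; keep probing in case the API returns.
		return NoOp
	}
	m.failures++
	if m.failures >= m.FailureThreshold {
		// Down long enough to assume the bootstrap API is gone: stop
		// keepalived so the VIP can move to a master.
		m.stopped = true
		return Stop
	}
	return NoOp
}

func main() {
	m := &BootstrapMonitor{FailureThreshold: 3}
	// A temporary outage: the API disappears for four probes, then returns.
	probes := []bool{true, false, false, false, false, true, true}
	for i, up := range probes {
		fmt.Printf("probe %d: apiUp=%v -> %s\n", i, up, m.Observe(up))
	}
}
```

With the probe sequence in main, the sketch stops keepalived on the third consecutive failure and reloads it the first time the API answers again. Before the fix, the behaviour described above corresponds to a monitor that never leaves the stopped state.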

Comment 5 errata-xmlrpc 2020-10-27 16:47:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
