Bug 1884420 - Keepalived stops on bootstrap too early
Summary: Keepalived stops on bootstrap too early
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Antoni Segura Puimedon
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-01 22:34 UTC by Ben Nemec
Modified: 2020-10-27 16:47 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: To keep the API VIP on the bootstrap node until the API comes up on the masters, we increased the priority of the bootstrap node's keepalived API VIP membership. So that the VIP can still move to the masters when the bootstrap node is asked to stay around after clustering (when its API server is already gone), we implemented a mechanism in the monitor that stops keepalived. The problem was that sometimes, during clustering, the API on the bootstrap node could go down for long enough that it looked as if it would not come back up.
Consequence: If the bootstrap kube-apiserver goes down for long enough to trigger the keepalived-monitor to stop keepalived, the deployment breaks.
Fix: Continue to check for the API server on the bootstrap node, and reload keepalived if it shows up again. If the API server is gone for good, the API VIP moves to one of the masters. If it only went down for a while because of API pod restarts and resource issues, keepalived is reloaded and the bootstrap node reclaims the API VIP.
Result: Deployment succeeds.
Clone Of:
Environment:
Last Closed: 2020-10-27 16:47:24 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift baremetal-runtimecfg pull 102 | 0 | None | closed | Bug 1884420: bootstrap: API shows up, start it again | 2021-02-08 16:05:15 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:47:40 UTC

Description Ben Nemec 2020-10-01 22:34:11 UTC
Description of problem: There can be timing issues on the bootstrap node that result in keepalived stopping before kube-apiserver is up on the masters. When this happens, the API VIP migrates to the masters before they are ready and this causes the deployment to fail. The problem appears to be that the bootstrap kube-apiserver goes down for a period of time, and if this time is long enough to trigger the keepalived-monitor to stop keepalived, then the deployment breaks.

How reproducible: Intermittent. In some environments it happens frequently, in others rarely.
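
The behavior described in the Doc Text (stop keepalived when the bootstrap API looks gone, but keep probing and reload if it returns) can be sketched as a small state machine. This is only an illustration under stated assumptions, not the code from the linked baremetal-runtimecfg pull request: the names BootstrapMonitor, Observe, and FailThreshold are hypothetical, and the "consecutive failed probes" threshold is an assumed stand-in for whatever timeout the real monitor uses.

```go
package main

import "fmt"

// Action is what the (hypothetical) monitor asks of keepalived after one probe.
type Action int

const (
	NoOp   Action = iota // leave keepalived as it is
	Stop                 // stop keepalived so the API VIP can move to a master
	Reload               // reload keepalived so the bootstrap node reclaims the VIP
)

func (a Action) String() string {
	switch a {
	case Stop:
		return "stop"
	case Reload:
		return "reload"
	default:
		return "noop"
	}
}

// BootstrapMonitor tracks the bootstrap kube-apiserver health between probes.
// FailThreshold is an assumption: the number of consecutive failed probes
// after which the API server is treated as gone.
type BootstrapMonitor struct {
	FailThreshold int
	failures      int
	stopped       bool
}

// Observe records one probe result and returns the action to take.
// Unlike the pre-fix behavior described in this bug, probing continues
// after a Stop, so a returning API server triggers a Reload.
func (m *BootstrapMonitor) Observe(apiUp bool) Action {
	if apiUp {
		m.failures = 0
		if m.stopped {
			m.stopped = false
			return Reload
		}
		return NoOp
	}
	m.failures++
	if !m.stopped && m.failures >= m.FailThreshold {
		m.stopped = true
		return Stop
	}
	return NoOp
}

func main() {
	m := &BootstrapMonitor{FailThreshold: 3}
	// A temporary outage: the API drops for four probes, then comes back.
	probes := []bool{true, false, false, false, false, true, true}
	for i, up := range probes {
		fmt.Printf("probe %d: apiUp=%v action=%v\n", i, up, m.Observe(up))
	}
}
```

If the API server never comes back, Observe keeps returning noop after the single stop, which matches the case where the API VIP stays with the masters.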

Comment 5 errata-xmlrpc 2020-10-27 16:47:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

