Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1884420

Summary: Keepalived stops on bootstrap too early
Product: OpenShift Container Platform
Reporter: Ben Nemec <bnemec>
Component: Networking
Assignee: Antoni Segura Puimedon <asegurap>
Networking sub component: runtime-cfg
QA Contact: Victor Voronkov <vvoronko>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: bperkins, yboaron
Version: 4.6
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: To keep the API VIP on the bootstrap node until the API comes up on the masters, we increased the priority of the bootstrap node's keepalived API VIP membership. So that the VIP can still move to the masters when the bootstrap node is asked to stay around after clustering (when its API server is already gone), we implemented a mechanism in the monitor that stops keepalived. The problem was that sometimes, during clustering, the API on the bootstrap node could go down for long enough that it looked as though it would not come back up.
Consequence: If the bootstrap kube-apiserver goes down for long enough to trigger the keepalived-monitor to stop keepalived, the deployment breaks.
Fix: Continue to check for the API server on the bootstrap node, and reload keepalived if it shows up again. If it is gone for good, the API VIP moves to one of the masters, but if it only went down for a while because of API pod restarts and resource issues, we reload keepalived and reclaim the API VIP.
Result: Deployment succeeds.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:47:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ben Nemec 2020-10-01 22:34:11 UTC
Description of problem: Timing issues on the bootstrap node can cause keepalived to stop before kube-apiserver is up on the masters. When this happens, the API VIP migrates to the masters before they are ready, and the deployment fails. The problem appears to be that the bootstrap kube-apiserver goes down for a period of time; if that period is long enough to trigger the keepalived-monitor to stop keepalived, the deployment breaks.

How reproducible: Intermittent. In some environments it happens frequently, in others rarely.
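
The behavior described in the Fix part of the Doc Text (keep checking the bootstrap API server after keepalived has been stopped, and reload keepalived if the API comes back) can be sketched as a small state machine. This is only an illustration of the idea, not the runtime-cfg implementation: the names `MonitorState`, `step`, and `run`, the action strings, and the `failure_threshold` of 3 consecutive failed checks are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MonitorState:
    """Hypothetical monitor state; not taken from the real keepalived-monitor."""
    keepalived_running: bool = True
    consecutive_failures: int = 0


def step(state: MonitorState, api_up: bool, failure_threshold: int = 3) -> str:
    """Process one health check of the bootstrap API server.

    Returns the action taken: "stop", "reload", or "none".
    The threshold value is an assumption chosen for illustration.
    """
    if api_up:
        state.consecutive_failures = 0
        if not state.keepalived_running:
            # The API came back after keepalived was stopped:
            # reload keepalived so the bootstrap node can reclaim the API VIP.
            state.keepalived_running = True
            return "reload"
        return "none"

    state.consecutive_failures += 1
    if state.keepalived_running and state.consecutive_failures >= failure_threshold:
        # The API looks gone: stop keepalived so the VIP can move to a master.
        state.keepalived_running = False
        return "stop"
    # Keep checking even while keepalived is stopped.
    return "none"


def run(checks, failure_threshold: int = 3):
    """Feed a sequence of health-check results through the monitor."""
    state = MonitorState()
    return [step(state, up, failure_threshold) for up in checks]


if __name__ == "__main__":
    # A temporary outage: the API drops for four checks, then returns.
    print(run([True, False, False, False, False, True, True]))
```

In this sketch a temporary outage produces a "stop" followed later by a "reload", while an API server that never returns produces only the "stop", which leaves the API VIP free to move to one of the masters.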

Comment 5 errata-xmlrpc 2020-10-27 16:47:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196