Bug 1981465
| Summary: | Assisted installer waits for ready nodes on the bootstrap kube-apiserver even though it moved to one of the other masters | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Fred Rolland <frolland> |
| Component: | assisted-installer | Assignee: | Eran Cohen <ercohen> |
| assisted-installer sub component: | Installer | QA Contact: | Udi Kalifon <ukalifon> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aos-bugs, ercohen, ohochman |
| Version: | 4.7 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | OCP-Metal-v1.0.24.1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-18 17:38:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Can this be reproduced? How should we test this? In which cases does the kube-apiserver move to a new master? Maybe @ercohen can provide steps to reproduce.

This can be reproduced by delaying one of the master nodes from reaching Ready status (in Kubernetes) after the master node reboots. I think the best way to reproduce it is by killing the CNI pods on one of the master nodes until the kube-apiserver moves from the bootstrap to the other master. This allows the installation to progress but keeps the node in a NotReady status. It might also reproduce with an easier flow: just stop kubelet, disconnect the node network, stop the node, etc., but I am unsure how that will affect the kube-apiserver transition to the other master. (A small diagnostic sketch follows after these comments.)

Verified. We tested by stopping the kubelet service on one of the non-bootstrap masters right after it rebooted. We can see in the logs that the kube-apiserver did not get disconnected, and if we fix the problem within 1 hour the installation succeeds.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
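To make the reproduction easier to observe, here is a minimal diagnostic sketch in Go (not part of the assisted-installer) that polls the kube-apiserver `/readyz` endpoint on the bootstrap node, so you can see when the temporary bootstrap control plane is torn down. The `localhost:6443` address, the poll interval, and the insecure TLS setting are assumptions for a throwaway probe, not the installer's actual configuration.

```go
// readyz-probe: poll the bootstrap node's kube-apiserver /readyz endpoint
// to observe when the temporary bootstrap control plane goes away during
// the reproduction described in the comments above.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Diagnostic probe only: skip certificate verification so the
			// cluster CA bundle is not needed.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	// Assumed address: the kube-apiserver's default secure port on the
	// bootstrap node itself.
	const url = "https://localhost:6443/readyz"

	for {
		now := time.Now().Format(time.RFC3339)
		resp, err := client.Get(url)
		if err != nil {
			// Once bootkube tears down the bootstrap control plane, these
			// requests start failing.
			fmt.Printf("%s bootstrap kube-apiserver unreachable: %v\n", now, err)
		} else {
			fmt.Printf("%s /readyz returned HTTP %d\n", now, resp.StatusCode)
			resp.Body.Close()
		}
		time.Sleep(10 * time.Second)
	}
}
```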
Description of problem:

The assisted-installer on the bootstrap is waiting for 2 ready master nodes using the loopback kubeconfig (expecting the kube-apiserver to run on the bootstrap node). But it seems that the kube-apiserver moved to one of the master nodes before we had 2 ready masters.

According to the controller, ocp2-master1 was ready at 08:53:09:

    time="2021-07-08T08:53:09Z" level=info msg="Found new ready node ocp2-master1 with inventory id 02984d56-5b5f-aadb-23e4-6ab2c948b311, kubernetes id 02984d56-5b5f-aadb-23e4-6ab2c948b311, updating its status to Done"

And it matches the event on the service:

    {
      "cluster_id": "ea9b93d8-9702-41a6-8527-5247085aaa51",
      "event_time": "2021-07-08T08:53:09.481Z",
      "host_id": "02984d56-5b5f-aadb-23e4-6ab2c948b311",
      "message": "Host ocp2-master1: reached installation stage Done",
      "severity": "info"
    }

The kube-apiserver on the bootstrap went down (according to the bootkube log) at 08:50:34:

    Jul 08 08:50:34 random-hostname-1d35749c-2f11-4cbd-8ea0-7e368cbbe9ad bootkube.sh[7739]: Sending bootstrap-finished event.Tearing down temporary bootstrap control plane...

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. One of the masters takes a long time to join and become Ready
2. The kube-apiserver moves from the bootstrap to a ready master
3. The installer still uses the local (loopback) kubeconfig

Actual results:

Expected results:

Additional info:
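For illustration only, below is a minimal sketch of the kind of "wait for 2 ready masters" loop described above, built on client-go. It is not the assisted-installer's actual code; the kubeconfig path, label selector, poll interval, and readiness check are assumptions. If such a loop keeps using the loopback kubeconfig on the bootstrap node, it stops seeing the API as soon as bootkube tears down the temporary control plane, which is the situation this bug describes.

```go
// wait-for-masters: a sketch of waiting until two master nodes report Ready,
// in the spirit of the wait performed on the bootstrap node. Paths and
// intervals are assumptions, not the assisted-installer's real values.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical path; on the bootstrap node this would be the loopback
	// kubeconfig, which only works while the bootstrap kube-apiserver is up.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	for {
		nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
			LabelSelector: "node-role.kubernetes.io/master",
		})
		if err != nil {
			// If the control plane this kubeconfig points at was torn down,
			// this call keeps failing and the wait never completes.
			fmt.Println("listing master nodes failed:", err)
			time.Sleep(10 * time.Second)
			continue
		}

		ready := 0
		for _, node := range nodes.Items {
			for _, cond := range node.Status.Conditions {
				if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
					ready++
				}
			}
		}
		fmt.Printf("%d master node(s) Ready\n", ready)
		if ready >= 2 {
			fmt.Println("done waiting: 2 master nodes are Ready")
			return
		}
		time.Sleep(10 * time.Second)
	}
}
```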