Bug 1981465 - Assisted installer waits for ready nodes on bootstrap kube-apiserver though it moved to one of the other masters
Summary: Assisted installer waits for ready nodes on bootstrap kube-apiserver though it...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Eran Cohen
QA Contact: Udi Kalifon
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-12 15:48 UTC by Fred Rolland
Modified: 2021-10-18 17:38 UTC
CC: 3 users

Fixed In Version: OCP-Metal-v1.0.24.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:38:46 UTC
Target Upstream Version:
Embargoed:




Links:
  Github openshift assisted-installer pull 327 (open): "Bug 1981465: Assisted installer wait for ready master nodes on bootst…" (last updated 2021-07-13 08:40:34 UTC)
  Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:38:49 UTC)

Description Fred Rolland 2021-07-12 15:48:55 UTC
Description of problem:
The assisted-installer on the bootstrap node waits for 2 Ready master nodes using the loopback kubeconfig (expecting the kube-apiserver to run on the bootstrap node).
But it seems that the kube-apiserver moved to one of the master nodes before we had 2 Ready masters.
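For illustration, here is a minimal client-go sketch of the kind of readiness check described above. This is not the assisted-installer's actual code; the kubeconfig path and the master label selector are assumptions:

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed location of the loopback kubeconfig on the bootstrap node.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/bootstrap-secrets/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// List the master nodes and count those whose NodeReady condition is True.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/master",
	})
	if err != nil {
		panic(err)
	}
	ready := 0
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeReady && c.Status == corev1.ConditionTrue {
				ready++
			}
		}
	}
	fmt.Printf("%d of %d master nodes are Ready\n", ready, len(nodes.Items))
}

If the loopback kube-apiserver goes away while a loop like this is still waiting for the second master, the check can never succeed, which is the failure mode reported here.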



According to the controller, ocp2-master1 was ready at 08:53:09
time="2021-07-08T08:53:09Z" level=info msg="Found new ready node ocp2-master1 with inventory id 02984d56-5b5f-aadb-23e4-6ab2c948b311, kubernetes id 02984d56-5b5f-aadb-23e4-6ab2c948b311, updating its status to Done

And it matches the event on the service:
  {
    "cluster_id": "ea9b93d8-9702-41a6-8527-5247085aaa51",
    "event_time": "2021-07-08T08:53:09.481Z",
    "host_id": "02984d56-5b5f-aadb-23e4-6ab2c948b311",
    "message": "Host ocp2-master1: reached installation stage Done",
    "severity": "info"
  }

The kube-apiserver on the bootstrap went down (according to the bootkube log) at 08:50:34:
Jul 08 08:50:34 random-hostname-1d35749c-2f11-4cbd-8ea0-7e368cbbe9ad bootkube.sh[7739]: Sending bootstrap-finished event.Tearing down temporary bootstrap control plane...



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. One of the masters takes a long time to join/become Ready
2. kube-apiserver moves from the bootstrap node to a ready master
3. Installer still uses the local (loopback) kubeconfig (see the sketch after this list)
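Step 3 is where the bug bites: the installer keeps talking to the loopback endpoint after bootkube tears down the temporary control plane. Below is a minimal Go sketch of how that transition could be detected by probing the loopback kube-apiserver. The endpoint, the use of /readyz, and the skipped TLS verification are illustrative assumptions, not the installer's actual logic:

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// probeLoopbackAPI checks whether a kube-apiserver still answers /readyz on
// the bootstrap node's loopback endpoint. Once the bootstrap control plane is
// torn down, this probe starts failing even though the cluster apiserver is
// healthy on one of the masters.
func probeLoopbackAPI() bool {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Illustration only: skip certificate verification instead of
			// loading the bootstrap CA bundle.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get("https://localhost:6443/readyz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	if probeLoopbackAPI() {
		fmt.Println("bootstrap kube-apiserver still serving on localhost:6443")
	} else {
		fmt.Println("loopback endpoint down; a kubeconfig pointing at the API VIP is needed")
	}
}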

Actual results:


Expected results:


Additional info:

Comment 1 Udi Kalifon 2021-07-16 12:10:52 UTC
Can this be reproduced? How should we test this? In which cases does kube-apiserver move to a new master?

Comment 2 Fred Rolland 2021-07-20 14:12:17 UTC
Maybe @ercohen can provide steps to reproduce.

Comment 3 Eran Cohen 2021-07-21 11:47:39 UTC
This can be reproduced by delaying one of the master nodes from reaching Ready status (in Kubernetes) after the node reboots.

I think the best way to reproduce it is by killing the CNI pods on one of the master nodes until the kube-apiserver moves from the bootstrap to the other master.
This will allow the installation to progress but will keep the node in a NotReady status.

It might also reproduce with an easier flow: just stop kubelet, disconnect the node's network, stop the node, etc., but I'm unsure how that will affect the kube-apiserver's transition to the other master.
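A rough sketch of the CNI-pod-killing approach using client-go, assuming an OpenShift SDN cluster where the CNI pods run in the openshift-sdn namespace; the kubeconfig path and target node name are placeholders for the test environment:

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical admin kubeconfig for the cluster under installation.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/auth/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	const cniNamespace = "openshift-sdn" // assumption: SDN-based network type
	const targetNode = "ocp2-master1"    // the master being held back

	// Keep deleting the CNI pods scheduled on the target node so it stays NotReady
	// while the rest of the installation progresses.
	for i := 0; i < 20; i++ {
		pods, err := client.CoreV1().Pods(cniNamespace).List(context.TODO(), metav1.ListOptions{
			FieldSelector: "spec.nodeName=" + targetNode,
		})
		if err != nil {
			panic(err)
		}
		for _, p := range pods.Items {
			_ = client.CoreV1().Pods(cniNamespace).Delete(context.TODO(), p.Name, metav1.DeleteOptions{})
			fmt.Println("deleted", p.Name)
		}
		time.Sleep(30 * time.Second)
	}
}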

Comment 5 Udi Kalifon 2021-08-09 20:41:44 UTC
Verified. We tested by stopping the kubelet service on one of the non-bootstrap masters right after it rebooted. We can see in the logs that the kube-apiserver did not get disconnected, and if we fix the problem within 1 hour the installation succeeds.

Comment 8 errata-xmlrpc 2021-10-18 17:38:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

