Bug 1981465 - Assisted installer waits for ready nodes on bootstrap kube-apiserver though it moved to one of the other masters
Summary: Assisted installer waits for ready nodes on bootstrap kube-apiserver though it...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Eran Cohen
QA Contact: Udi Kalifon
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-12 15:48 UTC by Fred Rolland
Modified: 2021-10-18 17:38 UTC
CC: 3 users

Fixed In Version: OCP-Metal-v1.0.24.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:38:46 UTC
Target Upstream Version:
Embargoed:




Links:
  Github openshift assisted-installer pull 327 (open): "Bug 1981465: Assisted installer wait for ready master nodes on bootst…" (last updated 2021-07-13 08:40:34 UTC)
  Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:38:49 UTC)

Description Fred Rolland 2021-07-12 15:48:55 UTC
Description of problem:
The assisted-installer on the bootstrap node waits for 2 Ready master nodes using the loopback kubeconfig (expecting the kube-apiserver to run on the bootstrap node).
But it seems that the kube-apiserver moved to one of the master nodes before we had 2 Ready masters.
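For illustration, here is a minimal client-go sketch of the kind of readiness check described above. This is not the assisted-installer's actual code; the kubeconfig path and the master label selector are assumptions:

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed location of the loopback kubeconfig on the bootstrap node.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/bootstrap-secrets/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// List the master nodes and count those whose NodeReady condition is True.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/master",
	})
	if err != nil {
		panic(err)
	}
	ready := 0
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeReady && c.Status == corev1.ConditionTrue {
				ready++
			}
		}
	}
	fmt.Printf("%d of %d master nodes are Ready\n", ready, len(nodes.Items))
}

If the loopback kube-apiserver goes away while a loop like this is still waiting for the second master, the check can never succeed, which is the failure mode reported here.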



According to the controller, ocp2-master1 was ready at 08:53:09
time="2021-07-08T08:53:09Z" level=info msg="Found new ready node ocp2-master1 with inventory id 02984d56-5b5f-aadb-23e4-6ab2c948b311, kubernetes id 02984d56-5b5f-aadb-23e4-6ab2c948b311, updating its status to Done

And it matches the event on the service:
  {
    "cluster_id": "ea9b93d8-9702-41a6-8527-5247085aaa51",
    "event_time": "2021-07-08T08:53:09.481Z",
    "host_id": "02984d56-5b5f-aadb-23e4-6ab2c948b311",
    "message": "Host ocp2-master1: reached installation stage Done",
    "severity": "info"
  }

The kube-apiserver on the bootstrap went down (according to the bootkube log) at 08:50:34:
Jul 08 08:50:34 random-hostname-1d35749c-2f11-4cbd-8ea0-7e368cbbe9ad bootkube.sh[7739]: Sending bootstrap-finished event.Tearing down temporary bootstrap control plane...



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. One of the masters takes a long time to join/become Ready
2. kube-apiserver moves from the bootstrap node to a ready master
3. Installer still uses the local (loopback) kubeconfig (see the sketch after this list)
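Step 3 is where the bug bites: the installer keeps talking to the loopback endpoint after bootkube tears down the temporary control plane. Below is a minimal Go sketch of how that transition could be detected by probing the loopback kube-apiserver. The endpoint, the use of /readyz, and the skipped TLS verification are illustrative assumptions, not the installer's actual logic:

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// probeLoopbackAPI checks whether a kube-apiserver still answers /readyz on
// the bootstrap node's loopback endpoint. Once the bootstrap control plane is
// torn down, this probe starts failing even though the cluster apiserver is
// healthy on one of the masters.
func probeLoopbackAPI() bool {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Illustration only: skip certificate verification instead of
			// loading the bootstrap CA bundle.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get("https://localhost:6443/readyz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	if probeLoopbackAPI() {
		fmt.Println("bootstrap kube-apiserver still serving on localhost:6443")
	} else {
		fmt.Println("loopback endpoint down; a kubeconfig pointing at the API VIP is needed")
	}
}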

Actual results:


Expected results:


Additional info:

Comment 1 Udi Kalifon 2021-07-16 12:10:52 UTC
Can this be reproduced? How should we test this? In which cases does kube-apiserver move to a new master?

Comment 2 Fred Rolland 2021-07-20 14:12:17 UTC
Maybe @ercohen can provide steps to reproduce.

Comment 3 Eran Cohen 2021-07-21 11:47:39 UTC
This can be reproduced by delaying one of the master nodes from reaching Ready status (in Kubernetes) after the node reboots.

I think the best way to reproduce it is by killing the CNI pods on one of the master nodes until the kube-apiserver moves from the bootstrap to the other master.
This will allow the installation to progress but will keep the node in a NotReady status.

It might also reproduce with an easier flow: just stop kubelet, disconnect the node's network, stop the node, etc., but I'm unsure how that will affect the kube-apiserver's transition to the other master.
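A rough sketch of the CNI-pod-killing approach using client-go, assuming an OpenShift SDN cluster where the CNI pods run in the openshift-sdn namespace; the kubeconfig path and target node name are placeholders for the test environment:

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical admin kubeconfig for the cluster under installation.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/auth/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	const cniNamespace = "openshift-sdn" // assumption: SDN-based network type
	const targetNode = "ocp2-master1"    // the master being held back

	// Keep deleting the CNI pods scheduled on the target node so it stays NotReady
	// while the rest of the installation progresses.
	for i := 0; i < 20; i++ {
		pods, err := client.CoreV1().Pods(cniNamespace).List(context.TODO(), metav1.ListOptions{
			FieldSelector: "spec.nodeName=" + targetNode,
		})
		if err != nil {
			panic(err)
		}
		for _, p := range pods.Items {
			_ = client.CoreV1().Pods(cniNamespace).Delete(context.TODO(), p.Name, metav1.DeleteOptions{})
			fmt.Println("deleted", p.Name)
		}
		time.Sleep(30 * time.Second)
	}
}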

Comment 5 Udi Kalifon 2021-08-09 20:41:44 UTC
Verified. We tested by stopping the kubelet service on one of the non-bootstrap masters right after it rebooted. We can see in the logs that the kube-apiserver did not get disconnected, and if we fix the problem within 1 hour the installation succeeds.

Comment 8 errata-xmlrpc 2021-10-18 17:38:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

