Description of problem:
The assisted-installer on the bootstrap node waits for 2 ready master nodes using the loopback kubeconfig (expecting the kube-apiserver to still be running on the bootstrap node). However, the kube-apiserver moved to one of the master nodes before there were 2 ready masters.

According to the controller, ocp2-master1 became ready at 08:53:09:

time="2021-07-08T08:53:09Z" level=info msg="Found new ready node ocp2-master1 with inventory id 02984d56-5b5f-aadb-23e4-6ab2c948b311, kubernetes id 02984d56-5b5f-aadb-23e4-6ab2c948b311, updating its status to Done"

This matches the event on the service:

{
  "cluster_id": "ea9b93d8-9702-41a6-8527-5247085aaa51",
  "event_time": "2021-07-08T08:53:09.481Z",
  "host_id": "02984d56-5b5f-aadb-23e4-6ab2c948b311",
  "message": "Host ocp2-master1: reached installation stage Done",
  "severity": "info"
}

The kube-apiserver on the bootstrap went down (according to the bootkube log) at 08:50:34:

Jul 08 08:50:34 random-hostname-1d35749c-2f11-4cbd-8ea0-7e368cbbe9ad bootkube.sh[7739]: Sending bootstrap-finished event.Tearing down temporary bootstrap control plane...

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. One of the masters takes a long time to join/become ready
2. The kube-apiserver moves from the bootstrap to a ready master
3. The installer still uses the local (loopback) kubeconfig

Actual results:

Expected results:

Additional info:
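The race described above can be sketched as a minimal, self-contained Python model (this is NOT the actual assisted-installer code; function names, the polling interval, and the assumption that only one master was ready before teardown are all hypothetical, only the two timestamps come from the logs):

```python
# Model of the race: the bootstrap waits for 2 ready masters via its
# loopback kube-apiserver, but bootkube tears that apiserver down as soon
# as the control plane can pivot, before the second master is Ready.

BOOTSTRAP_APISERVER_TEARDOWN = 8 * 3600 + 50 * 60 + 34  # 08:50:34 (bootkube log)
SECOND_MASTER_READY = 8 * 3600 + 53 * 60 + 9            # 08:53:09 (controller log)

def loopback_ready_masters(now):
    """Number of ready masters visible through the loopback kubeconfig.
    After teardown the loopback endpoint no longer answers at all."""
    if now >= BOOTSTRAP_APISERVER_TEARDOWN:
        raise ConnectionError("bootstrap kube-apiserver is gone")
    return 1  # assumption: only one master was ready before teardown

def wait_for_masters(start, deadline, step=30):
    """Poll the loopback endpoint until 2 masters are ready or we time out."""
    now = start
    while now < deadline:
        try:
            if loopback_ready_masters(now) >= 2:
                return "done"
        except ConnectionError:
            pass  # keeps retrying the dead loopback endpoint forever
        now += step
    return "timed out"

print(wait_for_masters(8 * 3600 + 45 * 60, SECOND_MASTER_READY + 3600))
# -> timed out: ocp2-master1 became ready only AFTER the loopback
#    apiserver was torn down, so the wait can never observe 2 ready masters
```

The model shows why the installation stalls: once the temporary control plane is torn down, a client pinned to the loopback kubeconfig can never observe the second master becoming ready, no matter how long it retries.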
Can this be reproduced? How should we test this? In which cases does kube-apiserver move to a new master?
Maybe @ercohen can provide steps to reproduce.
This can be reproduced by delaying one of the master nodes from reaching Ready status (in Kubernetes) after it reboots. I think the best way to reproduce it is to kill the CNI pods on one of the master nodes until the kube-apiserver moves from the bootstrap to the other master. This allows the installation to progress while keeping the node in NotReady status. It might also reproduce with a simpler flow: just stop kubelet, disconnect the node's network, stop the node, etc., but I am unsure how that would affect the kube-apiserver's transition to the other master.
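The two reproduction flows above can be sketched as shell helpers. This is a rough sketch only: the node name (ocp2-master2), the SDN namespace/label (which depends on the cluster's network type), and SSH access as the core user are all assumptions, and the helpers are defined but not invoked:

```shell
#!/bin/sh
# Sketch of the reproduction flows (hypothetical node name and pod labels).
NODE=ocp2-master2   # the master we deliberately keep NotReady

repro_cni_kill() {
  # Flow 1: repeatedly kill the node's CNI pods so it stays NotReady
  # while installation proceeds and kube-apiserver leaves the bootstrap.
  # Namespace/label assume openshift-sdn; adjust for OVN-Kubernetes etc.
  while true; do
    oc -n openshift-sdn delete pod -l app=sdn \
      --field-selector "spec.nodeName=$NODE"
    sleep 10
  done
}

repro_stop_kubelet() {
  # Flow 2 (simpler, unverified effect on the apiserver transition):
  # stop kubelet on the node right after it reboots.
  ssh "core@$NODE" sudo systemctl stop kubelet
}

echo "repro helpers defined for $NODE"
```

Either way, the goal is the same: keep one master NotReady long enough for bootkube to tear down the bootstrap control plane first.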
Verified. We tested by stopping the kubelet service on one of the non-bootstrap masters right after it rebooted. The logs show that the installer did not get disconnected from the kube-apiserver, and if the problem is fixed within 1 hour the installation succeeds.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759