Bug 1981465
| Summary: | Assisted installer waits for ready nodes on the bootstrap kube-apiserver even though it moved to one of the other masters | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Fred Rolland <frolland> |
| Component: | assisted-installer | Assignee: | Eran Cohen <ercohen> |
| assisted-installer sub component: | Installer | QA Contact: | Udi Kalifon <ukalifon> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aos-bugs, ercohen, ohochman |
| Version: | 4.7 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | OCP-Metal-v1.0.24.1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-18 17:38:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Can this be reproduced? How should we test this? In which cases does the kube-apiserver move to a new master? Maybe @ercohen can provide steps to reproduce.

This can be reproduced by delaying one of the master nodes from reaching Ready status (in Kubernetes) after the master node reboots. I think the best way to reproduce it is by killing the CNI pods on one of the master nodes until the kube-apiserver moves from the bootstrap to the other master. This allows the installation to progress but keeps the node in a NotReady status. It might also reproduce with an easier flow: just stop kubelet, disconnect the node network, stop the node, etc., but I am unsure how that will affect the kube-apiserver transition to the other master. (A small diagnostic sketch follows after these comments.)

Verified. We tested by stopping the kubelet service on one of the non-bootstrap masters right after it rebooted. We can see in the logs that the kube-apiserver did not get disconnected, and if we fix the problem within 1 hour the installation succeeds.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
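To make the reproduction easier to observe, here is a minimal diagnostic sketch in Go (not part of the assisted-installer) that polls the kube-apiserver `/readyz` endpoint on the bootstrap node, so you can see when the temporary bootstrap control plane is torn down. The `localhost:6443` address, the poll interval, and the insecure TLS setting are assumptions for a throwaway probe, not the installer's actual configuration.

```go
// readyz-probe: poll the bootstrap node's kube-apiserver /readyz endpoint
// to observe when the temporary bootstrap control plane goes away during
// the reproduction described in the comments above.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Diagnostic probe only: skip certificate verification so the
			// cluster CA bundle is not needed.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	// Assumed address: the kube-apiserver's default secure port on the
	// bootstrap node itself.
	const url = "https://localhost:6443/readyz"

	for {
		now := time.Now().Format(time.RFC3339)
		resp, err := client.Get(url)
		if err != nil {
			// Once bootkube tears down the bootstrap control plane, these
			// requests start failing.
			fmt.Printf("%s bootstrap kube-apiserver unreachable: %v\n", now, err)
		} else {
			fmt.Printf("%s /readyz returned HTTP %d\n", now, resp.StatusCode)
			resp.Body.Close()
		}
		time.Sleep(10 * time.Second)
	}
}
```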
Description of problem:

The assisted-installer on the bootstrap is waiting for 2 ready master nodes using the loopback kubeconfig (expecting the kube-apiserver to run on the bootstrap node). But it seems that the kube-apiserver moved to one of the master nodes before we had 2 ready masters.

According to the controller, ocp2-master1 was ready at 08:53:09:

    time="2021-07-08T08:53:09Z" level=info msg="Found new ready node ocp2-master1 with inventory id 02984d56-5b5f-aadb-23e4-6ab2c948b311, kubernetes id 02984d56-5b5f-aadb-23e4-6ab2c948b311, updating its status to Done"

And it matches the event on the service:

    {
      "cluster_id": "ea9b93d8-9702-41a6-8527-5247085aaa51",
      "event_time": "2021-07-08T08:53:09.481Z",
      "host_id": "02984d56-5b5f-aadb-23e4-6ab2c948b311",
      "message": "Host ocp2-master1: reached installation stage Done",
      "severity": "info"
    }

The kube-apiserver on the bootstrap went down (according to the bootkube log) at 08:50:34:

    Jul 08 08:50:34 random-hostname-1d35749c-2f11-4cbd-8ea0-7e368cbbe9ad bootkube.sh[7739]: Sending bootstrap-finished event.Tearing down temporary bootstrap control plane...

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. One of the masters takes a long time to join and become Ready
2. The kube-apiserver moves from the bootstrap to a ready master
3. The installer still uses the local (loopback) kubeconfig

Actual results:

Expected results:

Additional info:
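For illustration only, below is a minimal sketch of the kind of "wait for 2 ready masters" loop described above, built on client-go. It is not the assisted-installer's actual code; the kubeconfig path, label selector, poll interval, and readiness check are assumptions. If such a loop keeps using the loopback kubeconfig on the bootstrap node, it stops seeing the API as soon as bootkube tears down the temporary control plane, which is the situation this bug describes.

```go
// wait-for-masters: a sketch of waiting until two master nodes report Ready,
// in the spirit of the wait performed on the bootstrap node. Paths and
// intervals are assumptions, not the assisted-installer's real values.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical path; on the bootstrap node this would be the loopback
	// kubeconfig, which only works while the bootstrap kube-apiserver is up.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	for {
		nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
			LabelSelector: "node-role.kubernetes.io/master",
		})
		if err != nil {
			// If the control plane this kubeconfig points at was torn down,
			// this call keeps failing and the wait never completes.
			fmt.Println("listing master nodes failed:", err)
			time.Sleep(10 * time.Second)
			continue
		}

		ready := 0
		for _, node := range nodes.Items {
			for _, cond := range node.Status.Conditions {
				if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
					ready++
				}
			}
		}
		fmt.Printf("%d master node(s) Ready\n", ready)
		if ready >= 2 {
			fmt.Println("done waiting: 2 master nodes are Ready")
			return
		}
		time.Sleep(10 * time.Second)
	}
}
```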