Description of problem:

Some CI jobs are failing because no apiserver pod is available during the installation phase. According to the installer team, once all the terraform scripts have been executed, the installer checks whether it can get a response from the apiserver; if it does, the bootstrap node is removed (https://coreos.slack.com/archives/C68TNFWA2/p1655194740110649). In the failing CI jobs we hit a case where only one apiserver pod is running when the bootstrap node is removed, and that apiserver pod also receives a shutdown signal. Responses from the apiserver only resume after that pod restarts, and the other apiserver pods come up ~6-7 minutes later.

The following CI jobs failed for this reason:

(1) https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1550035368093421568
(2) https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1549737257605271552
(3) https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1547763014369808384

Version-Release number of selected component (if applicable):
4.10.22, 4.10.23, 4.10.24

How reproducible:
Unable to reproduce locally; observed only in the CI job failures above.

Steps to Reproduce:
1.
2.
3.
Actual results:

Showing results only for (1) https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1550035368093421568

CI job failure reason:

[sig-arch] events should not repeat pathologically 0s
{ 1 events happened too frequently
event happened 25 times, something is wrong: ns/openshift-marketplace pod/marketplace-operator-6fc4c9d8f-nrz7j node/ip-10-0-251-212.us-west-2.compute.internal - reason/ProbeError Readiness probe error: Get "http://10.130.0.16:8080/healthz": dial tcp 10.130.0.16:8080: connect: connection refused
body: }

Further investigation of the logs of pod marketplace-operator-6fc4c9d8f-nrz7j turned up the following messages:

2022-07-21 08:59:15 E0721 08:59:15.018356 1 leaderelection.go:330] error retrieving resource lock openshift-marketplace/marketplace-operator-lock: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-marketplace/configmaps/marketplace-operator-lock": dial tcp 172.30.0.1:443: i/o timeout
2022-07-21 08:59:45 I0721 08:59:45.018153 1 leaderelection.go:283] failed to renew lease openshift-marketplace/marketplace-operator-lock: timed out waiting for the condition
2022-07-21 08:59:45 E0721 08:59:45.018226 1 leaderelection.go:306] Failed to release lock: resource name may not be empty
2022-07-21 08:59:45 time="2022-07-21T08:59:45Z" level=warning msg="leader election lost for marketplace-operator-6fc4c9d8f-nrz7j identity"

The lock was reacquired after ~3 minutes:

2022-07-21 09:02:04 I0721 09:02:04.315133 1 leaderelection.go:248] attempting to acquire leader lease openshift-marketplace/marketplace-operator-lock...
2022-07-21 09:02:04 I0721 09:02:04.322037 1 leaderelection.go:258] successfully acquired lease openshift-marketplace/marketplace-operator-lock

This happened during the installation phase, when only a single apiserver pod was up in the openshift-kube-apiserver namespace.
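The "failed to renew lease ... timed out waiting for the condition" message above matches how client-go leader election behaves: the lock holder keeps retrying the renew call until its renew deadline elapses, then must give up leadership. A minimal sketch of that behavior (an assumed simplification in Python, not the real k8s.io/client-go code; the function name and timing values are hypothetical):

```python
import time

def try_renew_until_deadline(renew_deadline, retry_period, renew_fn):
    """Retry renew_fn until it succeeds or renew_deadline elapses.

    Returns True if the lease was renewed (leadership kept),
    False if the deadline passed (leadership lost)."""
    deadline = time.monotonic() + renew_deadline
    while time.monotonic() < deadline:
        if renew_fn():
            return True           # renew succeeded, leadership kept
        time.sleep(retry_period)  # apiserver unreachable, retry
    return False                  # "failed to renew lease ... timed out"

# During the outage every renew request to the apiserver fails, so once
# the outage lasts longer than the renew deadline, leadership is lost:
lost_leadership = not try_renew_until_deadline(
    renew_deadline=0.05, retry_period=0.01, renew_fn=lambda: False)
```

This illustrates why an apiserver outage of even a few minutes is enough to make every leader-elected operator in the cluster drop and later re-acquire its lock.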
That apiserver pod received a shutdown signal:

2022-07-21 08:57:43 I0721 08:57:43.768118 1 cmd.go:97] Received SIGTERM or SIGINT signal, shutting down controller.

The logs in the openshift-ovn-kubernetes namespace were also checked to determine when the apiserver pod IP addresses were added to and removed from the service default/kubernetes. Initially, the first apiserver pod (10.0.251.212) on a master node is added at 08:56:56; the existing endpoint (10.0.81.233) is the apiserver pod running on the bootstrap node:

2022-07-21 08:56:56 I0721 08:56:56.844739 1 kube.go:317] Adding slice kubernetes endpoints: [10.0.251.212], port: 6443
2022-07-21 08:56:56 I0721 08:56:56.844748 1 kube.go:317] Adding slice kubernetes endpoints: [10.0.81.233], port: 6443
2022-07-21 08:56:56 I0721 08:56:56.844754 1 kube.go:333] LB Endpoints for default/kubernetes are: [10.0.251.212 10.0.81.233] / [] on port: 6443

Then the apiserver pod running on the first master node is removed at 08:57:43, when it receives the shutdown signal, leaving only the bootstrap node's apiserver pod. The bootstrap node itself is removed shortly afterwards, since the installer had already received a response from the apiserver pod on the first master node:

2022-07-21 08:57:43 I0721 08:57:43.786267 1 kube.go:317] Adding slice kubernetes endpoints: [10.0.81.233], port: 6443
2022-07-21 08:57:43 I0721 08:57:43.786277 1 kube.go:333] LB Endpoints for default/kubernetes are: [10.0.81.233] / [] on port: 6443

During this window all apiserver pods are down and all requests fail, which is why the lease on the lock could not be renewed. The same pattern appears in (2) and (3).

Expected results:
At least one apiserver pod should remain available throughout the installation phase.

Additional info:
Slack threads regarding the issue:
1. https://coreos.slack.com/archives/CB48XQ4KZ/p1654684714336479
2. https://coreos.slack.com/archives/C68TNFWA2/p1655194740110649
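The endpoint transitions described under "Actual results" can be summarized in a short sketch (an assumed simplification of the sequence, not actual ovn-kubernetes code; only the IPs and timestamps come from the logs):

```python
# Hypothetical model of the LB endpoint set for the default/kubernetes service.
endpoints = set()

# 08:56:56 - first master apiserver and bootstrap apiserver are both endpoints
endpoints |= {"10.0.251.212", "10.0.81.233"}

# 08:57:43 - the master apiserver receives SIGTERM and its endpoint is removed
endpoints -= {"10.0.251.212"}

# shortly after - bootstrap teardown removes the last remaining endpoint,
# while the master apiserver is still restarting
endpoints -= {"10.0.81.233"}

# Nothing is left to serve 172.30.0.1:443, so every request (including the
# marketplace operator's lease renewal) fails until the master apiserver is back.
assert len(endpoints) == 0
```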