Bug 2112022 - During installation phase apiserver becomes unavailable
Summary: During installation phase apiserver becomes unavailable
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Abu Kashem
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-07-28 16:25 UTC by Arkadeep Sen
Modified: 2022-08-22 12:21 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-22 12:21:10 UTC
Target Upstream Version:
Embargoed:



Description Arkadeep Sen 2022-07-28 16:25:54 UTC
Description of problem:
Some CI jobs are failing because the apiserver pods become unavailable during the installation phase. According to the installer team, once all the terraform scripts have been executed, the installer checks whether it can get a response from the apiserver, and if it gets one, the bootstrap node is removed (https://coreos.slack.com/archives/C68TNFWA2/p1655194740110649). In the failing CI jobs we hit a case where only one apiserver pod is left when the bootstrap node is removed, and that apiserver pod also receives a shutdown signal. Responses from the apiserver resume only after that pod restarts, and the other apiserver pods come up only after ~6-7 minutes.
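
For illustration only, here is a minimal Go sketch of the kind of check described above; this is not the installer's actual code, and the API URL, the /readyz path, and the timeouts are assumptions. The point is that a single successful response is enough to satisfy such a check, even though it says nothing about how many apiserver pods are serving:

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// waitForAPIServer polls the given apiserver URL until a single request to
// /readyz succeeds or the overall timeout expires.
func waitForAPIServer(apiURL string, timeout time.Duration) error {
	client := &http.Client{
		Timeout: 5 * time.Second,
		// Certificate verification is skipped only to keep the sketch short.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := client.Get(apiURL + "/readyz")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				// One successful response satisfies this check; it does not
				// guarantee that more than one apiserver pod is serving.
				return nil
			}
		}
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("apiserver did not respond within %s", timeout)
}

func main() {
	// api.example.cluster is a placeholder, not a real cluster address.
	if err := waitForAPIServer("https://api.example.cluster:6443", 30*time.Minute); err != nil {
		panic(err)
	}
	fmt.Println("apiserver responded; bootstrap teardown could proceed")
}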

The following CI jobs have failed for this reason:
(1) https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1550035368093421568
(2) https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1549737257605271552
(3) https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1547763014369808384


Version-Release number of selected component (if applicable): 4.10.22, 4.10.23, 4.10.24


How reproducible: Unable to reproduce the issue manually; it was only seen in CI job failures.


Steps to Reproduce:
1.
2.
3.

Actual results:
Showing results only for (1) https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1550035368093421568

CI job failure reason:
: [sig-arch] events should not repeat pathologically 	0s
{  1 events happened too frequently

event happened 25 times, something is wrong: ns/openshift-marketplace pod/marketplace-operator-6fc4c9d8f-nrz7j node/ip-10-0-251-212.us-west-2.compute.internal - reason/ProbeError Readiness probe error: Get "http://10.130.0.16:8080/healthz": dial tcp 10.130.0.16:8080: connect: connection refused
body: 
}

Further investigation of the logs of the pod marketplace-operator-6fc4c9d8f-nrz7j turned up the following messages:
2022-07-21 08:59:15	E0721 08:59:15.018356       1 leaderelection.go:330] error retrieving resource lock openshift-marketplace/marketplace-operator-lock: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-marketplace/configmaps/marketplace-operator-lock": dial tcp 172.30.0.1:443: i/o timeout
2022-07-21 08:59:45	I0721 08:59:45.018153       1 leaderelection.go:283] failed to renew lease openshift-marketplace/marketplace-operator-lock: timed out waiting for the condition
2022-07-21 08:59:45	E0721 08:59:45.018226       1 leaderelection.go:306] Failed to release lock: resource name may not be empty
2022-07-21 08:59:45	time="2022-07-21T08:59:45Z" level=warning msg="leader election lost for marketplace-operator-6fc4c9d8f-nrz7j identity"

The lock was reacquired only after ~3 minutes:
2022-07-21 09:02:04	I0721 09:02:04.315133       1 leaderelection.go:248] attempting to acquire leader lease openshift-marketplace/marketplace-operator-lock...
2022-07-21 09:02:04	I0721 09:02:04.322037       1 leaderelection.go:258] successfully acquired lease openshift-marketplace/marketplace-operator-lock
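
For context, this matches the general behaviour of client-go leader election: if the lock object cannot be updated before the renew deadline expires (for example because no apiserver endpoint is reachable), the holder gives up leadership, producing exactly the "failed to renew lease" and "leader election lost" messages above. The following is a generic client-go sketch, not the marketplace-operator's actual code; the lock type, identity, and timing values are assumptions (the logs show the operator using a ConfigMap-backed lock):

package main

import (
	"context"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "example-identity" is a placeholder; a real operator derives this from
	// the pod name. A Lease lock is used here for simplicity, whereas the
	// logs above show a ConfigMap-backed lock.
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"openshift-marketplace", "marketplace-operator-lock",
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: "example-identity"},
	)
	if err != nil {
		klog.Fatal(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // assumed values, not taken from the operator
		RenewDeadline: 10 * time.Second, // a renewal must succeed within this window
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { klog.Info("started leading") },
			// Called when renewal fails, e.g. because no apiserver is reachable.
			OnStoppedLeading: func() { klog.Warning("leader election lost") },
		},
	})
}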

This happened during the installation phase, and at that time only a single apiserver pod was up in the openshift-kube-apiserver namespace. That apiserver pod then received a shutdown signal:
2022-07-21 08:57:43	I0721 08:57:43.768118       1 cmd.go:97] Received SIGTERM or SIGINT signal, shutting down controller.

The logs in the openshift-ovn-kubernetes namespace were also checked to see when apiserver pod IP addresses were added to and removed from the default/kubernetes service. Initially, the first apiserver pod on a master node (10.0.251.212) is added at 08:56:56; the already-existing apiserver pod (10.0.81.233) is the one running on the bootstrap node:
2022-07-21 08:56:56	I0721 08:56:56.844739       1 kube.go:317] Adding slice kubernetes endpoints: [10.0.251.212], port: 6443
2022-07-21 08:56:56	I0721 08:56:56.844748       1 kube.go:317] Adding slice kubernetes endpoints: [10.0.81.233], port: 6443
2022-07-21 08:56:56	I0721 08:56:56.844754       1 kube.go:333] LB Endpoints for default/kubernetes are: [10.0.251.212 10.0.81.233] / [] on port: 6443

Then the apiserver pod on the first master node is removed at 08:57:43, when it receives the shutdown signal. The only remaining apiserver pod is the one running on the bootstrap node, and the bootstrap node is removed shortly afterwards because the installer has already received its response from the apiserver pod on the first master node:
2022-07-21 08:57:43	I0721 08:57:43.786267       1 kube.go:317] Adding slice kubernetes endpoints: [10.0.81.233], port: 6443
2022-07-21 08:57:43	I0721 08:57:43.786277       1 kube.go:333] LB Endpoints for default/kubernetes are: [10.0.81.233] / [] on port: 6443

During this period all apiserver pods are down and all requests fail, which is why the lease on the lock could not be renewed.
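
As a diagnostic aid, the same endpoint information that ovn-kubernetes logs can be read directly from the EndpointSlices of the default/kubernetes service. A minimal sketch, assuming in-cluster credentials:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The EndpointSlices of the default/kubernetes service list the apiserver
	// IPs that are currently registered, i.e. the same set ovn-kubernetes logs
	// as "LB Endpoints for default/kubernetes".
	slices, err := client.DiscoveryV1().EndpointSlices("default").List(context.Background(),
		metav1.ListOptions{LabelSelector: "kubernetes.io/service-name=kubernetes"})
	if err != nil {
		panic(err)
	}
	for _, slice := range slices.Items {
		for _, ep := range slice.Endpoints {
			ready := ep.Conditions.Ready != nil && *ep.Conditions.Ready
			fmt.Printf("apiserver endpoint %v ready=%v\n", ep.Addresses, ready)
		}
	}
}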

A similar pattern appears in both (2) and (3).

Expected results:
At least one apiserver pod should be available during the installation phase.


Additional info:
Slack threads regarding the issue:
1. https://coreos.slack.com/archives/CB48XQ4KZ/p1654684714336479
2. https://coreos.slack.com/archives/C68TNFWA2/p1655194740110649

