Bug 1958974 - kube-scheduler-recovery-controller is reported as crashlooping in 4.8 on about 8% of multiple types of runs
Keywords:
Status: CLOSED DUPLICATE of bug 1948311
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-10 14:14 UTC by Clayton Coleman
Modified: 2021-05-14 12:56 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-14 12:56:11 UTC
Target Upstream Version:
Embargoed:


Attachments: None

Description Clayton Coleman 2021-05-10 14:14:19 UTC
In 4.8 an alert is firing during multiple types of runs (not just upgrade)

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]

fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: May  6 20:03:09.938: Unexpected alerts fired or pending after the test run:

alert KubePodCrashLooping fired for 60 seconds with labels: {container="kube-scheduler-recovery-controller", endpoint="https-main", job="kube-state-metrics", namespace="openshift-kube-scheduler", pod="openshift-kube-scheduler-ci-op-5w9r9bh3-f23e1-v72l9-master-2", service="kube-state-metrics", severity="warning"}

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1390379317086654464

https://search.ci.openshift.org/?search=KubePodCrashLooping+fired.*kube-scheduler-recovery-controller&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

It appears to be happening about 2-10% of runs across a wide range of job types.

In the specific run I see:

E0506 19:23:59.655311       1 leaderelection.go:325] error retrieving resource lock openshift-kube-controller-manager/cert-recovery-controller-lock: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cert-recovery-controller-lock?timeout=35s": dial tcp [::1]:6443: connect: connection refused
E0506 19:23:59.771404       1 csrcontroller.go:146] key failed with : Get "https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager-operator/configmaps/csr-signer-ca": dial tcp [::1]:6443: connect: connection refused
I0506 19:24:04.653352       1 leaderelection.go:278] failed to renew lease openshift-kube-controller-manager/cert-recovery-controller-lock: timed out waiting for the condition
E0506 19:24:04.653502       1 leaderelection.go:301] Failed to release lock: resource name may not be empty
W0506 19:24:04.653534       1 leaderelection.go:75] leader election lost

and 

E0506 19:31:07.032797       1 leaderelection.go:325] error retrieving resource lock openshift-kube-controller-manager/cert-recovery-controller-lock: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cert-recovery-controller-lock?timeout=35s": dial tcp [::1]:6443: connect: connection refused
E0506 19:31:07.292056       1 csrcontroller.go:146] key failed with : Get "https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager-operator/configmaps/csr-signer-ca": dial tcp [::1]:6443: connect: connection refused
I0506 19:31:12.029354       1 leaderelection.go:278] failed to renew lease openshift-kube-controller-manager/cert-recovery-controller-lock: timed out waiting for the condition
E0506 19:31:12.029631       1 leaderelection.go:301] Failed to release lock: resource name may not be empty
W0506 19:31:12.029693       1 leaderelection.go:75] leader election lost

These errors occur near the end of the install period. The apiserver is rolling out during this time as well.
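
For context, here is a minimal sketch of the mechanism behind those restarts, assuming the recovery controller follows the standard client-go leader election pattern (the lock name matches the logs above; the durations, identity, and other details are illustrative, not the operator's actual code). When the lease cannot be renewed because localhost:6443 refuses connections, OnStoppedLeading fires and the process exits; the kubelet then restarts the container, and that restart is what kube-state-metrics reports and KubePodCrashLooping counts:

package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// ConfigMap-based lock matching the one named in the logs above.
	// The POD_NAME identity is an illustrative assumption.
	lock, err := resourcelock.New(
		resourcelock.ConfigMapsResourceLock,
		"openshift-kube-controller-manager",
		"cert-recovery-controller-lock",
		client.CoreV1(),
		client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	)
	if err != nil {
		klog.Fatal(err)
	}

	// Durations are illustrative only, not the operator's real settings.
	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 60 * time.Second,
		RenewDeadline: 35 * time.Second,
		RetryPeriod:   5 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Run the controller loops until the context is cancelled.
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Reached when the renew deadline passes while the local
				// apiserver is unreachable; exiting here is what the kubelet
				// sees as a container restart.
				klog.Warning("leader election lost")
				os.Exit(0)
			},
		},
	})
}

So the restarts are expected behavior while the apiserver is briefly unreachable; the open question is whether the alert should tolerate them.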

The alert starts firing between 19:36 and 19:37, which is about 6m after the start of the test run (around 19:30).

In general, we shouldn't fire alerts during normal operations while the apiserver is rolling out. So if KubePodCrashLooping is going to fire during a normal apiserver rollout because of the recovery controller, we may just need to tune KubePodCrashLooping to take longer to fire. The alert currently requires more than one restart in two separate 10m intervals, which might not be the right calculation and may be too aggressive.
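
For reference, the rule under discussion has roughly this shape. This is a sketch in the style of the upstream kubernetes-mixin rule, not necessarily the exact expression, window, or for: duration shipped in 4.8:

- alert: KubePodCrashLooping
  # Fires when at least one restart was observed in the trailing 10m window...
  expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[10m]) > 0
  # ...and that condition keeps holding for another 10m, which is why restarts
  # in two separate 10m intervals are needed to trip it.
  for: 10m
  labels:
    severity: warning

Lengthening the for: duration, or requiring more than one restart within the window, would be the "take longer to fire" tuning suggested above.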

Comment 1 Maciej Szulik 2021-05-12 08:47:00 UTC
Jan, you started looking into this problem.

Comment 2 Jan Chaloupka 2021-05-14 12:56:11 UTC

*** This bug has been marked as a duplicate of bug 1948311 ***

