Bug 1960278 - alert KubePodCrashLooping: kube-controller-manager-recovery-controller
Summary: alert KubePodCrashLooping: kube-controller-manager-recovery-controller
Keywords:
Status: CLOSED DUPLICATE of bug 1948311
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-13 13:51 UTC by Petr Muller
Modified: 2021-05-17 08:53 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
test: openshift-tests.[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
Last Closed: 2021-05-17 08:53:58 UTC
Target Upstream Version:
Embargoed:



Description Petr Muller 2021-05-13 13:51:58 UTC
This is popping up quite often in CI:


fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: May 13 11:05:34.817: Unexpected alerts fired or pending after the test run:

alert KubePodCrashLooping fired for 150 seconds with labels: {container="kube-controller-manager-recovery-controller", endpoint="https-main", job="kube-state-metrics", namespace="openshift-kube-controller-manager", pod="kube-controller-manager-ci-op-01m0zl8v-36a8b-sgwvs-master-2", service="kube-state-metrics", severity="warning"}
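For context, the [sig-instrumentation][Late] check essentially queries the in-cluster Prometheus for any ALERTS series (other than Watchdog and AlertmanagerReceiversNotConfigured) that were pending or firing during the run. Below is a rough sketch of that kind of query using the Prometheus Go client; the address, auth handling, and exact PromQL are illustrative assumptions, not the actual openshift-tests code:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Address is illustrative; the real test reaches the in-cluster
	// Prometheus through a route with a service account token.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Count how long each unexpected alert was pending or firing over the
	// last hour, similar in spirit to the "fired for 150 seconds" message.
	query := `count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured",alertstate=~"pending|firing"}[1h])`

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}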

Search:
https://search.ci.openshift.org/?search=alert+KubePodCrashLooping+fired+for+.*+seconds+with+labels%3A.*kube-controller-manager-recovery-controller&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Examples:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1392781781551288320
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1392796057418600448
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp/1391133786674040832

Comment 1 Petr Muller 2021-05-13 13:56:59 UTC
Not 100% sure whether this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1958974 "kube-scheduler-recovery-controller is reported as crashlooping in 4.8 on about 8% of multiple types of runs" or not. The crashlooping component seems to be different, so I'm leaving this as a separate bug for now.

Comment 2 Maciej Szulik 2021-05-17 08:53:58 UTC
Yeah, this looks very similar judging by the pod logs:

2021-05-08T21:28:45.715477293Z E0508 21:28:45.715424       1 csrcontroller.go:146] key failed with : Get "https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager-operator/configmaps/csr-signer-ca": dial tcp [::1]:6443: connect: connection refused
2021-05-08T21:28:50.393085920Z I0508 21:28:50.393009       1 leaderelection.go:278] failed to renew lease openshift-kube-controller-manager/cert-recovery-controller-lock: timed out waiting for the condition
2021-05-08T21:28:50.393179604Z E0508 21:28:50.393155       1 leaderelection.go:301] Failed to release lock: resource name may not be empty
2021-05-08T21:28:50.393209737Z W0508 21:28:50.393186       1 leaderelection.go:75] leader election lost
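For reference, those log lines follow the usual client-go leader-election pattern: when the lease against the (briefly unreachable) apiserver cannot be renewed, the OnStoppedLeading callback fires and the process exits, the kubelet restarts the container, and repeated restarts are what KubePodCrashLooping counts. Here is a minimal sketch assuming the standard k8s.io/client-go/tools/leaderelection package; the identity and durations are illustrative, not the actual cert-recovery controller code:

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Lease name/namespace mirror the log lines above; identity is illustrative.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "cert-recovery-controller-lock",
			Namespace: "openshift-kube-controller-manager",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 60 * time.Second,
		RenewDeadline: 35 * time.Second,
		RetryPeriod:   10 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Run the controller loops while we hold the lease.
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Renewal failed (e.g. apiserver connection refused during a
				// revision rollout): give up and exit so the kubelet restarts
				// the container. Frequent restarts trip KubePodCrashLooping.
				klog.Warning("leader election lost")
				os.Exit(0)
			},
		},
	})
}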

Since the investigation there is still ongoing, I'll close this one as a duplicate of the other bug.

*** This bug has been marked as a duplicate of bug 1948311 ***

