Bug 1847646

Summary: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured: KubePodCrashLooping: openshift-kube-controller-manager
Product: OpenShift Container Platform
Reporter: Abu Kashem <akashem>
Component: kube-controller-manager
Assignee: Tomáš Nožička <tnozicka>
Status: CLOSED ERRATA
QA Contact: RamaKasturi <knarra>
Severity: medium
Priority: medium
Version: 4.5
CC: aos-bugs, knarra, mfojtik
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-10-27 16:07:36 UTC

Description Abu Kashem 2020-06-16 18:28:37 UTC
Description of problem:

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel] 1m27s
fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"cluster-policy-controller\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.11:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-controller-manager\",\"pod\":\"kube-controller-manager-ci-op-bdd1t18l-2aad9-l7gf7-master-0\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1592323459.738,\"8\"]},{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"cluster-policy-controller\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.11:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-controller-manager\",\"pod\":\"kube-controller-manager-ci-op-bdd1t18l-2aad9-l7gf7-master-1\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1592323459.738,\"8\"]},{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"cluster-policy-controller\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.11:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-controller-manager\",\"pod\":\"kube-controller-manager-ci-op-bdd1t18l-2aad9-l7gf7-master-2\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1592323459.738,\"12\"]}]",
        },
    }
to be empty


https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1272906558837100544
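
For reference, the conformance test fails whenever that PromQL query returns any series. Below is a minimal sketch of running the same check by hand against the cluster's Prometheus; it is not the origin test itself, the PROM_URL/PROM_TOKEN environment variables are assumptions you supply, and TLS verification is skipped only to keep the sketch short.

package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
)

func main() {
	promURL := os.Getenv("PROM_URL")  // assumption: the openshift-monitoring Prometheus route URL
	token := os.Getenv("PROM_TOKEN")  // assumption: a bearer token with access to the monitoring stack

	// Same expression the [Late] alert test evaluates.
	query := `count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh",alertstate="firing",severity!="info"}[2h]) >= 1`

	req, err := http.NewRequest("GET", promURL+"/api/v1/query?query="+url.QueryEscape(query), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	// Demo only: a real check should trust the cluster CA instead of skipping verification.
	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var body struct {
		Data struct {
			Result []json.RawMessage `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		panic(err)
	}

	if len(body.Data.Result) == 0 {
		fmt.Println("no unexpected firing alerts")
		return
	}
	fmt.Printf("unexpected firing alerts (%d series):\n", len(body.Data.Result))
	for _, r := range body.Data.Result {
		fmt.Println(string(r))
	}
	os.Exit(1)
}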



Version-Release number of selected component (if applicable):
4.5

Comment 1 Abu Kashem 2020-06-16 19:26:31 UTC
Log from another job: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1272836348180434944


Jun 16 11:00:11.679 E ns/openshift-kube-apiserver pod/kube-apiserver-ci-op-9sqtdx6b-2aad9-r8zbs-master-1 node/ci-op-9sqtdx6b-2aad9-r8zbs-master-1 container/setup init container exited with code 124 (Error): ................................................................................
Jun 16 11:00:14.676 E ns/openshift-kube-controller-manager pod/kube-controller-manager-ci-op-9sqtdx6b-2aad9-r8zbs-master-1 node/ci-op-9sqtdx6b-2aad9-r8zbs-master-1 container/cluster-policy-controller container exited with code 255 (Error): ?allowWatchBookmarks=true&resourceVersion=25979&timeout=9m14s&timeoutSeconds=554&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.857118       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1beta1.Ingress: Get https://localhost:6443/apis/extensions/v1beta1/ingresses?allowWatchBookmarks=true&resourceVersion=15009&timeout=6m57s&timeoutSeconds=417&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.858224       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.DeploymentConfig: Get https://localhost:6443/apis/apps.openshift.io/v1/deploymentconfigs?allowWatchBookmarks=true&resourceVersion=24983&timeout=6m0s&timeoutSeconds=360&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.859232       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1beta1.Event: Get https://localhost:6443/apis/events.k8s.io/v1beta1/events?allowWatchBookmarks=true&resourceVersion=26035&timeout=8m51s&timeoutSeconds=531&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.860365       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1beta1.Ingress: Get https://localhost:6443/apis/networking.k8s.io/v1beta1/ingresses?allowWatchBookmarks=true&resourceVersion=15009&timeout=8m41s&timeoutSeconds=521&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.861385       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.Service: Get https://localhost:6443/api/v1/services?allowWatchBookmarks=true&resourceVersion=18248&timeout=7m10s&timeoutSeconds=430&watch=true: dial tcp [::1]:6443: connect: connection refused\nI0616 11:00:14.586952       1 leaderelection.go:277] failed to renew lease openshift-kube-controller-manager/cluster-policy-controller: timed out waiting for the condition\nF0616 11:00:14.587014       1 policy_controller.go:94] leaderelection lost\nI0616 11:00:14.587033       1 reconciliation_controller.go:154] Shutting down ClusterQuotaReconcilationController\n


The kube-apiserver on localhost was rolling out and kube-controller-manager got "connection refused" errors. Maybe this is what's causing the alert to fire?

Comment 2 Tomáš Nožička 2020-06-18 09:18:28 UTC
I looked into it yesterday. One restart was caused by lost leader election because the local apiserver was down too long; the other was the cluster-policy-controller not waiting correctly for the port to become available again. Both would reconcile over time. I will try to allocate some time to fix this, but it doesn't look fatal.
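
For illustration only (this is not the actual cluster-policy-controller code), "waiting correctly for the port to be available again" amounts to something like polling the local apiserver endpoint until a TCP connection succeeds, instead of giving up and crash-looping while the apiserver rolls out:

package main

import (
	"fmt"
	"net"
	"time"
)

// waitForAPIServer dials addr repeatedly until it answers or the timeout
// expires. A controller restarting its watches could call this before
// resuming work while the local kube-apiserver is being replaced.
func waitForAPIServer(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			conn.Close()
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("apiserver at %s not reachable within %s", addr, timeout)
}

func main() {
	if err := waitForAPIServer("localhost:6443", 5*time.Minute); err != nil {
		fmt.Println(err)
	}
}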

Comment 4 Tomáš Nožička 2020-07-30 16:04:49 UTC
KCM now uses the internal load balancer, so this should be fixed.
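
As an illustration of what that change means in practice (not the operator's actual code): the controller's client now targets the cluster-wide internal load balancer endpoint, api-int.<cluster-domain>:6443, rather than https://localhost:6443, so an apiserver rollout on the local master no longer severs the connection. The hostname, token path, and insecure TLS below are placeholders.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg := &rest.Config{
		// Placeholder cluster domain; real clusters expose api-int.<cluster-domain>.
		Host:            "https://api-int.mycluster.example.com:6443",
		BearerTokenFile: "/var/run/secrets/kubernetes.io/serviceaccount/token",
		TLSClientConfig: rest.TLSClientConfig{Insecure: true}, // demo only; use the cluster CA in practice
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ns, err := cs.CoreV1().Namespaces().Get(context.TODO(), "openshift-kube-controller-manager", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("reached the API through the internal LB; namespace:", ns.Name)
}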

Comment 7 RamaKasturi 2020-08-12 09:28:27 UTC
Moving the bug to verified state with the payload below, as I have not seen any KubePodCrashLooping alerts firing.

[ramakasturinarra@dhcp35-60 ~]$ oc version
Client Version: 4.6.0-202008120152.p0-ddbae76
Server Version: 4.6.0-0.nightly-2020-08-11-040013
Kubernetes Version: v1.19.0-rc.2+5241b27-dirty

Below are the steps followed to verify the same (a sketch of automating the pod checks follows the list):
==================================================
1) Log in to the Prometheus console and click on the Firing tab.
2) On the master holding the KCM leader lease, bring down the local kube-apiserver pod and check whether the local KCM fails or keeps working.
3) To bring it down, run mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /home/kube-apiserver-pod.yaml and confirm with crictl that the containers were terminated.
4) Also check oc get pods -n openshift-kube-apiserver to see that the apiserver pod on the master where the manifest was moved is no longer present.
5) Now check the Prometheus console to confirm there is no firing KubePodCrashLooping alert and that KCM keeps running without any issues.
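
A minimal sketch, assuming a valid kubeconfig in the default location, of automating the pod checks from steps 4 and 5; the master node name is a placeholder:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Uses ~/.kube/config, matching the oc session used in the steps above.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Placeholder: the master whose kube-apiserver manifest was moved away.
	const drainedMaster = "master-0"

	// Step 4: the static kube-apiserver pod should be gone from that node.
	apiPods, err := cs.CoreV1().Pods("openshift-kube-apiserver").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + drainedMaster,
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("kube-apiserver pods left on %s: %d\n", drainedMaster, len(apiPods.Items))

	// Step 5 (partial): KCM pods should still be Running on all masters.
	kcmPods, err := cs.CoreV1().Pods("openshift-kube-controller-manager").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range kcmPods.Items {
		fmt.Printf("%s: %s\n", p.Name, p.Status.Phase)
	}
}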

Comment 9 errata-xmlrpc 2020-10-27 16:07:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196