Description of problem:

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel] (1m27s)

fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"cluster-policy-controller\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.11:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-controller-manager\",\"pod\":\"kube-controller-manager-ci-op-bdd1t18l-2aad9-l7gf7-master-0\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1592323459.738,\"8\"]},{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"cluster-policy-controller\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.11:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-controller-manager\",\"pod\":\"kube-controller-manager-ci-op-bdd1t18l-2aad9-l7gf7-master-1\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1592323459.738,\"8\"]},{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"cluster-policy-controller\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.11:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-controller-manager\",\"pod\":\"kube-controller-manager-ci-op-bdd1t18l-2aad9-l7gf7-master-2\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1592323459.738,\"12\"]}]",
        },
    }
to be empty

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1272906558837100544

Version-Release number of selected component (if applicable):
4.5
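For reference, the failing check can be re-run by hand against the in-cluster Prometheus. This is a minimal sketch, not the test's own code: the prometheus-k8s route and service-account token approach are assumptions, while the query itself is copied from the failure output above. Any result rows mean the test would fail; an empty result means only the allowed alerts fired.

  TOKEN=$(oc -n openshift-monitoring sa get-token prometheus-k8s)
  HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
  # Query the Prometheus HTTP API for any disallowed firing alerts over the last 2h.
  curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
    --data-urlencode 'query=count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh",alertstate="firing",severity!="info"}[2h]) >= 1'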
Log from another job:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1272836348180434944

Jun 16 11:00:11.679 E ns/openshift-kube-apiserver pod/kube-apiserver-ci-op-9sqtdx6b-2aad9-r8zbs-master-1 node/ci-op-9sqtdx6b-2aad9-r8zbs-master-1 container/setup init container exited with code 124 (Error): ................................................................................

Jun 16 11:00:14.676 E ns/openshift-kube-controller-manager pod/kube-controller-manager-ci-op-9sqtdx6b-2aad9-r8zbs-master-1 node/ci-op-9sqtdx6b-2aad9-r8zbs-master-1 container/cluster-policy-controller container exited with code 255 (Error): ?allowWatchBookmarks=true&resourceVersion=25979&timeout=9m14s&timeoutSeconds=554&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.857118 1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1beta1.Ingress: Get https://localhost:6443/apis/extensions/v1beta1/ingresses?allowWatchBookmarks=true&resourceVersion=15009&timeout=6m57s&timeoutSeconds=417&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.858224 1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.DeploymentConfig: Get https://localhost:6443/apis/apps.openshift.io/v1/deploymentconfigs?allowWatchBookmarks=true&resourceVersion=24983&timeout=6m0s&timeoutSeconds=360&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.859232 1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1beta1.Event: Get https://localhost:6443/apis/events.k8s.io/v1beta1/events?allowWatchBookmarks=true&resourceVersion=26035&timeout=8m51s&timeoutSeconds=531&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.860365 1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1beta1.Ingress: Get https://localhost:6443/apis/networking.k8s.io/v1beta1/ingresses?allowWatchBookmarks=true&resourceVersion=15009&timeout=8m41s&timeoutSeconds=521&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.861385 1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.Service: Get https://localhost:6443/api/v1/services?allowWatchBookmarks=true&resourceVersion=18248&timeout=7m10s&timeoutSeconds=430&watch=true: dial tcp [::1]:6443: connect: connection refused\nI0616 11:00:14.586952 1 leaderelection.go:277] failed to renew lease openshift-kube-controller-manager/cluster-policy-controller: timed out waiting for the condition\nF0616 11:00:14.587014 1 policy_controller.go:94] leaderelection lost\nI0616 11:00:14.587033 1 reconciliation_controller.go:154] Shutting down ClusterQuotaReconcilationController\n

The kube-apiserver on localhost was rolling out and the kube-controller-manager got "connection refused" errors. Could that be what is causing the alert to fire?
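If the same symptom shows up elsewhere, the restart cause can usually be confirmed from the previous container log on the affected master. A sketch, assuming cluster-admin access; the pod name below is a placeholder for the KCM pod of that master:

  oc -n openshift-kube-controller-manager get pods
  # Look for the "connection refused" and "leaderelection lost" messages seen above.
  oc -n openshift-kube-controller-manager logs <kube-controller-manager-pod> \
    -c cluster-policy-controller --previous | grep -E 'connection refused|leaderelection'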
I looked into it yesterday. One restart was caused by a lost leader election because the local apiserver was down for too long; the other was the cluster policy controller not waiting correctly for the port to become available again. Both reconcile over time. I will try to allocate some time to fix this, but it doesn't look fatal.
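For context, KubePodCrashLooping is driven by the kube-state-metrics restart counter (the exact alert expression may differ per release). One way to see whether the restarts have settled, rather than still accumulating, is a rate query like the one below; an empty result suggests the containers have stopped crash looping and the alert should resolve on its own:

  rate(kube_pod_container_status_restarts_total{namespace="openshift-kube-controller-manager",container="cluster-policy-controller"}[15m]) > 0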
KCM now uses the internal load balancer, so this should be fixed.
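One way to sanity-check this on a live cluster is to look at the server URL in the controller manager's kubeconfig. The configmap name used below is an assumption (list the configmaps first if it differs by release); the expectation is an internal load-balancer address (api-int) rather than https://localhost:6443:

  # Find the kubeconfig configmap; names may vary by release.
  oc -n openshift-kube-controller-manager get configmaps | grep -i kubeconfig
  # Inspect the server URL in the kubeconfig.
  oc -n openshift-kube-controller-manager get configmap controller-manager-kubeconfig \
    -o jsonpath='{.data.kubeconfig}' | grep 'server:'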
Moving the bug to verified state with the payload below, as I have not seen any alerts firing for KubePodCrashLooping.

[ramakasturinarra@dhcp35-60 ~]$ oc version
Client Version: 4.6.0-202008120152.p0-ddbae76
Server Version: 4.6.0-0.nightly-2020-08-11-040013
Kubernetes Version: v1.19.0-rc.2+5241b27-dirty

Below are the steps followed to verify the same (a condensed command sketch follows the list):
==================================================
1) Log in to the Prometheus console and open the Firing tab.
2) On a master that holds the KCM leader, bring down the local kube-apiserver pod and check whether the local KCM fails or keeps working.
3) To bring it down, run mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /home/kube-apiserver-pod.yaml and confirm with crictl that the containers were terminated.
4) Also run oc get pods -n openshift-kube-apiserver to confirm that the apiserver pod on the master where the manifest was moved is no longer present.
5) Check the Prometheus console to confirm there is no firing alert for KubePodCrashLooping, and that KCM keeps running without any issues.
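A condensed sketch of the flow above; the restore step at the end is an assumption, simply reversing the mv from step 3:

  # On the master holding the KCM leader:
  mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /home/kube-apiserver-pod.yaml
  crictl ps | grep kube-apiserver          # containers should be terminating/gone

  # From a workstation with cluster-admin:
  oc get pods -n openshift-kube-apiserver            # pod on that master should disappear
  oc get pods -n openshift-kube-controller-manager   # KCM pods should stay Running, no new restarts

  # In the Prometheus console (Alerts -> Firing), confirm KubePodCrashLooping is not firing.
  # Restore the kube-apiserver afterwards:
  mv /home/kube-apiserver-pod.yaml /etc/kubernetes/manifests/kube-apiserver-pod.yaml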
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196