Bug 1847646 - [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured: KubePodCrashLooping: openshift-kube-controller-manager
Summary: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured: KubePodCrashLooping: openshift-kube-controller-manager
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Tomáš Nožička
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-16 18:28 UTC by Abu Kashem
Modified: 2020-10-27 16:07 UTC
CC List: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:07:36 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:07:56 UTC

Description Abu Kashem 2020-06-16 18:28:37 UTC
Description of problem:

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]	1m27s
fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"cluster-policy-controller\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.11:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-controller-manager\",\"pod\":\"kube-controller-manager-ci-op-bdd1t18l-2aad9-l7gf7-master-0\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1592323459.738,\"8\"]},{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"cluster-policy-controller\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.11:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-controller-manager\",\"pod\":\"kube-controller-manager-ci-op-bdd1t18l-2aad9-l7gf7-master-1\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1592323459.738,\"8\"]},{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"cluster-policy-controller\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.11:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-controller-manager\",\"pod\":\"kube-controller-manager-ci-op-bdd1t18l-2aad9-l7gf7-master-2\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1592323459.738,\"12\"]}]",
        },
    }
to be empty


https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1272906558837100544
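For anyone re-checking this by hand, here is a minimal sketch of running the test's PromQL query against the in-cluster Prometheus. The prometheus-k8s route in openshift-monitoring, bearer-token access via oc, and the use of jq for readability are assumptions about a typical cluster, not something the test itself does:

# Assumption: logged-in oc client with permission to query Prometheus.
TOKEN="$(oc whoami -t)"
PROM_HOST="$(oc get route prometheus-k8s -n openshift-monitoring -o jsonpath='{.spec.host}')"
QUERY='count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh",alertstate="firing",severity!="info"}[2h]) >= 1'
# Query the Prometheus HTTP API; an empty result list means no unexpected firing alerts.
curl -skG -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode "query=${QUERY}" \
  "https://${PROM_HOST}/api/v1/query" | jq '.data.result'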



Version-Release number of selected component (if applicable):
4.5

Comment 1 Abu Kashem 2020-06-16 19:26:31 UTC
Log from another job: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1272836348180434944


Jun 16 11:00:11.679 E ns/openshift-kube-apiserver pod/kube-apiserver-ci-op-9sqtdx6b-2aad9-r8zbs-master-1 node/ci-op-9sqtdx6b-2aad9-r8zbs-master-1 container/setup init container exited with code 124 (Error): ................................................................................
Jun 16 11:00:14.676 E ns/openshift-kube-controller-manager pod/kube-controller-manager-ci-op-9sqtdx6b-2aad9-r8zbs-master-1 node/ci-op-9sqtdx6b-2aad9-r8zbs-master-1 container/cluster-policy-controller container exited with code 255 (Error): ?allowWatchBookmarks=true&resourceVersion=25979&timeout=9m14s&timeoutSeconds=554&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.857118       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1beta1.Ingress: Get https://localhost:6443/apis/extensions/v1beta1/ingresses?allowWatchBookmarks=true&resourceVersion=15009&timeout=6m57s&timeoutSeconds=417&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.858224       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.DeploymentConfig: Get https://localhost:6443/apis/apps.openshift.io/v1/deploymentconfigs?allowWatchBookmarks=true&resourceVersion=24983&timeout=6m0s&timeoutSeconds=360&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.859232       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1beta1.Event: Get https://localhost:6443/apis/events.k8s.io/v1beta1/events?allowWatchBookmarks=true&resourceVersion=26035&timeout=8m51s&timeoutSeconds=531&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.860365       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1beta1.Ingress: Get https://localhost:6443/apis/networking.k8s.io/v1beta1/ingresses?allowWatchBookmarks=true&resourceVersion=15009&timeout=8m41s&timeoutSeconds=521&watch=true: dial tcp [::1]:6443: connect: connection refused\nE0616 11:00:13.861385       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.Service: Get https://localhost:6443/api/v1/services?allowWatchBookmarks=true&resourceVersion=18248&timeout=7m10s&timeoutSeconds=430&watch=true: dial tcp [::1]:6443: connect: connection refused\nI0616 11:00:14.586952       1 leaderelection.go:277] failed to renew lease openshift-kube-controller-manager/cluster-policy-controller: timed out waiting for the condition\nF0616 11:00:14.587014       1 policy_controller.go:94] leaderelection lost\nI0616 11:00:14.587033       1 reconciliation_controller.go:154] Shutting down ClusterQuotaReconcilationController\n


The kube-apiserver on localhost was rolling out and the kube-controller-manager got "connection refused" errors. Maybe this is what caused the alert to fire?
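A quick way to correlate the restarts with the apiserver rollout is to look at the restart counts and last termination state of the cluster-policy-controller containers; a sketch, assuming a logged-in oc client (the pod name below is only a placeholder):

# Restart counts per container in the KCM static pods.
oc get pods -n openshift-kube-controller-manager \
  -o custom-columns='POD:.metadata.name,CONTAINERS:.status.containerStatuses[*].name,RESTARTS:.status.containerStatuses[*].restartCount'
# Why the cluster-policy-controller container last exited (replace the placeholder pod name).
oc get pod kube-controller-manager-<master-node> -n openshift-kube-controller-manager \
  -o jsonpath='{.status.containerStatuses[?(@.name=="cluster-policy-controller")].lastState.terminated}'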

Comment 2 Tomáš Nožička 2020-06-18 09:18:28 UTC
I looked into it yesterday. One restart was caused by lost leader election because the local apiserver was down too long; the other was the cluster policy controller not waiting correctly for the port to be available again. Both would reconcile over time. I will try to allocate some time to fix this, but it doesn't look fatal.
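For reference, a sketch of inspecting the leader-election lock named in the Comment 1 log ("openshift-kube-controller-manager/cluster-policy-controller"). Whether the lock is a ConfigMap annotation or a Lease object depends on the release, so both forms are shown here as assumptions:

# ConfigMap-based lock (older releases): the current holder is recorded in an annotation.
oc get configmap cluster-policy-controller -n openshift-kube-controller-manager \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
# Lease-based lock (newer releases), if present.
oc get lease cluster-policy-controller -n openshift-kube-controller-manager -o yaml 2>/dev/null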

Comment 4 Tomáš Nožička 2020-07-30 16:04:49 UTC
KCM now uses the internal load balancer, so this should be fixed.
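One way to sanity-check this on a cluster is to confirm the controller-manager kubeconfig now points at the internal load balancer rather than https://localhost:6443. The configmap name below is an assumption about where the static pod's kubeconfig is published; adjust it if it differs on your cluster:

# Assumption: the kubeconfig is published in a configmap named controller-manager-kubeconfig.
oc get configmap controller-manager-kubeconfig -n openshift-kube-controller-manager \
  -o jsonpath='{.data.kubeconfig}' | grep 'server:'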

Comment 7 RamaKasturi 2020-08-12 09:28:27 UTC
Moving the bug to the verified state with the payload below, as I have not seen any KubePodCrashLooping alerts firing.

[ramakasturinarra@dhcp35-60 ~]$ oc version
Client Version: 4.6.0-202008120152.p0-ddbae76
Server Version: 4.6.0-0.nightly-2020-08-11-040013
Kubernetes Version: v1.19.0-rc.2+5241b27-dirty

Below are the steps followed to verify the same (see the shell sketch after this list):
==================================================
1) Log in to the Prometheus console and click on the Firing tab.
2) On the master that holds the KCM leader, bring down the local kube-apiserver pod and check whether the local KCM fails or keeps working.
3) To bring it down, run mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /home/kube-apiserver-pod.yaml, then use crictl to confirm that the kube-apiserver containers were terminated.
4) Also check oc get pods -n openshift-kube-apiserver to confirm that the apiserver pod on the master node where the manifest was moved is no longer present.
5) Now check the Prometheus console to confirm that no KubePodCrashLooping alert is in the firing state and that KCM keeps running without any issues.
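The steps above, roughly as shell commands. The paths and namespaces come from the steps themselves; everything else (running from a shell on the master, restoring the manifest afterwards) is an assumption about a typical session:

# Step 3: on the chosen master, stop the local kube-apiserver by moving its
# static pod manifest aside, then confirm its containers were terminated.
mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /home/kube-apiserver-pod.yaml
crictl ps --name kube-apiserver

# Step 4: from a workstation, confirm the apiserver pod on that master is gone.
oc get pods -n openshift-kube-apiserver -o wide

# Step 5: confirm KCM keeps running while watching for KubePodCrashLooping in the console.
oc get pods -n openshift-kube-controller-manager -o wide

# Afterwards, restore the manifest so the kube-apiserver comes back.
mv /home/kube-apiserver-pod.yaml /etc/kubernetes/manifests/kube-apiserver-pod.yaml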

Comment 9 errata-xmlrpc 2020-10-27 16:07:36 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

