Bug 1821697

Summary: KubeletPlegDurationHigh alert reported in firing state in e2e-gcp-4.5 CI test run
Product: OpenShift Container Platform
Reporter: Sinny Kumari <skumari>
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: medium
Priority: medium
Version: 4.5
CC: aos-bugs, jokerman, mpatel, skolicha
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Environment: test: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]
Last Closed: 2020-05-20 18:52:29 UTC
Type: Bug

Description Sinny Kumari 2020-04-07 12:31:31 UTC
Seeing this failure in the release-openshift-origin-installer-e2e-gcp-4.5 CI tests, as reported on https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-blocking#release-openshift-origin-installer-e2e-gcp-4.5&sort-by-flakiness=

"[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]"

Example of a failing job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1261

Error message from one of the failing jobs:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:167]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubeletPlegDurationHigh\",\"alertstate\":\"firing\",\"instance\":\"10.0.0.5:10250\",\"node\":\"ci-op-snzkl-m-0.c.openshift-gce-devel-ci.internal\",\"quantile\":\"0.99\",\"severity\":\"warning\"},\"value\":[1586250781.73,\"1\"]}]",
        },
    }
to be empty
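
For anyone triaging a similar failure, the PromQL expression in the error output can be rerun directly against the cluster's Prometheus to see exactly which alerts fired during the window. Below is a minimal sketch, not part of the openshift/origin test suite; the route URL and bearer token are placeholders that must be replaced with values from the cluster under investigation.

// Minimal triage sketch (hypothetical, not the e2e test code): rerun the
// same PromQL expression the test evaluates and print any alerts that
// fired in the last two hours.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// bearerTransport injects a service-account token into every request.
type bearerTransport struct {
	token string
	next  http.RoundTripper
}

func (b *bearerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	req = req.Clone(req.Context())
	req.Header.Set("Authorization", "Bearer "+b.token)
	return b.next.RoundTrip(req)
}

func main() {
	client, err := api.NewClient(api.Config{
		// Placeholder route; look it up with `oc -n openshift-monitoring get route prometheus-k8s`.
		Address:      "https://prometheus-k8s-openshift-monitoring.apps.example.com",
		RoundTripper: &bearerTransport{token: "REPLACE-WITH-TOKEN", next: api.DefaultRoundTripper},
	})
	if err != nil {
		panic(err)
	}

	// Same expression the e2e test evaluates.
	query := `count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh",alertstate="firing",severity!="info"}[2h]) >= 1`

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	result, warnings, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// Each returned sample is an alert (with its labels) that fired in the window.
	fmt.Println(result)
}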

Comment 1 Ryan Phillips 2020-05-11 17:10:52 UTC
It looks like the etcd-operator may want to ignore the transient "client connection is closing" event when reporting BootstrapTeardownDegraded:

Apr 07 09:12:25.637 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nBootstrapTeardownDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
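
A hedged sketch of that suggestion follows; the helper names are hypothetical and this is not the actual cluster-etcd-operator code. The idea is that the gRPC Canceled error produced while the bootstrap etcd client connection is being torn down could be treated as transient rather than surfaced in the Degraded condition message.

// Hypothetical illustration of the suggestion above, not the actual
// cluster-etcd-operator implementation.
package operator

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isTransientTeardownError reports whether err is the expected gRPC Canceled
// error seen while the bootstrap etcd client connection is being closed
// ("rpc error: code = Canceled desc = grpc: the client connection is closing").
func isTransientTeardownError(err error) bool {
	return status.Code(err) == codes.Canceled
}

// degradedMessageFor is a hypothetical helper: it returns an empty message
// (i.e. not degraded) for transient teardown errors and the error text
// otherwise, so the condition does not flap during bootstrap teardown.
func degradedMessageFor(err error) string {
	if err == nil || isTransientTeardownError(err) {
		return ""
	}
	return "BootstrapTeardownDegraded: " + err.Error()
}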

Comment 2 Dan Mace 2020-05-20 18:52:29 UTC
The cited run exhibits no obvious etcd problems, so I'm closing this as a duplicate of bug 1743911, which tracks KubeAPILatencyHigh issues that _could_ be related to etcd.

*** This bug has been marked as a duplicate of bug 1743911 ***