Bug 1821697

Summary: KubeletPlegDurationHigh alert reported in firing state in e2e-gcp-4.5 CI test run
Product: OpenShift Container Platform
Reporter: Sinny Kumari <skumari>
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: medium
Priority: medium
Version: 4.5
CC: aos-bugs, jokerman, mpatel, skolicha
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Environment: test: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]
Last Closed: 2020-05-20 18:52:29 UTC
Type: Bug

Description Sinny Kumari 2020-04-07 12:31:31 UTC
Seeing this failure in the release-openshift-origin-installer-e2e-gcp-4.5 CI tests, as reported on https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-blocking#release-openshift-origin-installer-e2e-gcp-4.5&sort-by-flakiness=

"[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]"

Example of a failing job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1261

Error message from one of the failing jobs:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:167]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubeletPlegDurationHigh\",\"alertstate\":\"firing\",\"instance\":\"10.0.0.5:10250\",\"node\":\"ci-op-snzkl-m-0.c.openshift-gce-devel-ci.internal\",\"quantile\":\"0.99\",\"severity\":\"warning\"},\"value\":[1586250781.73,\"1\"]}]",
        },
    }
to be empty
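
For anyone triaging a similar failure, the PromQL expression in the error output can be rerun directly against the cluster's Prometheus to see exactly which alerts fired during the window. Below is a minimal sketch, not part of the openshift/origin test suite; the route URL and bearer token are placeholders that must be replaced with values from the cluster under investigation.

// Minimal triage sketch (hypothetical, not the e2e test code): rerun the
// same PromQL expression the test evaluates and print any alerts that
// fired in the last two hours.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// bearerTransport injects a service-account token into every request.
type bearerTransport struct {
	token string
	next  http.RoundTripper
}

func (b *bearerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	req = req.Clone(req.Context())
	req.Header.Set("Authorization", "Bearer "+b.token)
	return b.next.RoundTrip(req)
}

func main() {
	client, err := api.NewClient(api.Config{
		// Placeholder route; look it up with `oc -n openshift-monitoring get route prometheus-k8s`.
		Address:      "https://prometheus-k8s-openshift-monitoring.apps.example.com",
		RoundTripper: &bearerTransport{token: "REPLACE-WITH-TOKEN", next: api.DefaultRoundTripper},
	})
	if err != nil {
		panic(err)
	}

	// Same expression the e2e test evaluates.
	query := `count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh",alertstate="firing",severity!="info"}[2h]) >= 1`

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	result, warnings, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// Each returned sample is an alert (with its labels) that fired in the window.
	fmt.Println(result)
}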

Comment 1 Ryan Phillips 2020-05-11 17:10:52 UTC
It looks like the etcd-operator may want to ignore the transient "client connection is closing" event when reporting BootstrapTeardownDegraded:

Apr 07 09:12:25.637 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nBootstrapTeardownDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
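
A hedged sketch of that suggestion follows; the helper names are hypothetical and this is not the actual cluster-etcd-operator code. The idea is that the gRPC Canceled error produced while the bootstrap etcd client connection is being torn down could be treated as transient rather than surfaced in the Degraded condition message.

// Hypothetical illustration of the suggestion above, not the actual
// cluster-etcd-operator implementation.
package operator

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isTransientTeardownError reports whether err is the expected gRPC Canceled
// error seen while the bootstrap etcd client connection is being closed
// ("rpc error: code = Canceled desc = grpc: the client connection is closing").
func isTransientTeardownError(err error) bool {
	return status.Code(err) == codes.Canceled
}

// degradedMessageFor is a hypothetical helper: it returns an empty message
// (i.e. not degraded) for transient teardown errors and the error text
// otherwise, so the condition does not flap during bootstrap teardown.
func degradedMessageFor(err error) string {
	if err == nil || isTransientTeardownError(err) {
		return ""
	}
	return "BootstrapTeardownDegraded: " + err.Error()
}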

Comment 2 Dan Mace 2020-05-20 18:52:29 UTC
The cited run exhibits no obvious etcd problems, so I'm closing this as a duplicate of bug 1743911, which tracks KubeAPILatencyHigh issues that _could_ be related to etcd.

*** This bug has been marked as a duplicate of bug 1743911 ***