Bug 1821697 - KubeletPlegDurationHigh alert reported in firing state in e2e-gcp-4.5 CI test runs
Summary: KubeletPlegDurationHigh alert reported in firing state in e2e-gcp-4.5 CI test runs
Keywords:
Status: CLOSED DUPLICATE of bug 1743911
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-07 12:31 UTC by Sinny Kumari
Modified: 2020-05-20 19:54 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
test: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]
Last Closed: 2020-05-20 18:52:29 UTC
Target Upstream Version:
Embargoed:



Description Sinny Kumari 2020-04-07 12:31:31 UTC
Seeing this in release-openshift-origin-installer-e2e-gcp-4.5 CI test runs, as reported by https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-blocking#release-openshift-origin-installer-e2e-gcp-4.5&sort-by-flakiness=

"[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]"

Example of a failing job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1261

Error message from one of the failing jobs:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:167]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubeletPlegDurationHigh\",\"alertstate\":\"firing\",\"instance\":\"10.0.0.5:10250\",\"node\":\"ci-op-snzkl-m-0.c.openshift-gce-devel-ci.internal\",\"quantile\":\"0.99\",\"severity\":\"warning\"},\"value\":[1586250781.73,\"1\"]}]",
        },
    }
to be empty
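
For reference, the same exclusion query can be re-run by hand against the cluster's Prometheus. The sketch below is not the origin test code; it assumes the prometheus/client_golang API client and an endpoint reachable at localhost:9090 (e.g. via a port-forward to prometheus-k8s), and it omits the bearer-token authentication a real cluster would require.

// Minimal sketch, not taken from openshift/origin: evaluate the test's
// exclusion query against a Prometheus endpoint using client_golang.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed address; in-cluster prometheus-k8s normally requires auth,
	// which is omitted here for brevity.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// The query the conformance test evaluates: any alert other than the
	// allowed ones that was firing at some point in the last two hours.
	query := `count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh",alertstate="firing",severity!="info"}[2h]) >= 1`

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// A non-empty result vector is exactly what causes the test failure above.
	fmt.Println(result)
}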

Comment 1 Ryan Phillips 2020-05-11 17:10:52 UTC
It looks like the etcd-operator may want to ignore the closing event on BootstrapTeardownDegraded:

Apr 07 09:12:25.637 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nBootstrapTeardownDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
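
A hypothetical illustration of the kind of check that suggestion implies (not taken from the cluster-etcd-operator source): treat the gRPC Canceled error from the closing client connection as benign when computing the degraded condition.

// Hypothetical sketch only; not the actual cluster-etcd-operator code.
package etcdutil

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isBenignTeardownError reports whether an error seen while tearing down the
// bootstrap etcd member can be ignored instead of setting
// BootstrapTeardownDegraded.
func isBenignTeardownError(err error) bool {
	if err == nil {
		return true
	}
	// "rpc error: code = Canceled desc = grpc: the client connection is closing"
	// surfaces as codes.Canceled once the client connection is shut down.
	return status.Code(err) == codes.Canceled
}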

Comment 2 Dan Mace 2020-05-20 18:52:29 UTC
The cited run exhibits no obvious etcd problems, so I'm closing this as a duplicate of bug 1743911, which tracks KubeAPILatencyHigh issues that _could_ be related to etcd.

*** This bug has been marked as a duplicate of bug 1743911 ***

