Bug 1821697 - KubeletPlegDurationHigh alert reported in firing state in e2e-gcp-4.5 CI test runs
Summary: KubeletPlegDurationHigh alert reported in firing state in e2e-gcp-4.5 CI test runs
Keywords:
Status: CLOSED DUPLICATE of bug 1743911
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-07 12:31 UTC by Sinny Kumari
Modified: 2020-05-20 19:54 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
test: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]
Last Closed: 2020-05-20 18:52:29 UTC
Target Upstream Version:
Embargoed:



Description Sinny Kumari 2020-04-07 12:31:31 UTC
Seeing this in release-openshift-origin-installer-e2e-gcp-4.5 CI test runs, as reported by https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-blocking#release-openshift-origin-installer-e2e-gcp-4.5&sort-by-flakiness=

"[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]"

Example of a failing job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1261

Error message from one of the failing jobs:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:167]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubeletPlegDurationHigh\",\"alertstate\":\"firing\",\"instance\":\"10.0.0.5:10250\",\"node\":\"ci-op-snzkl-m-0.c.openshift-gce-devel-ci.internal\",\"quantile\":\"0.99\",\"severity\":\"warning\"},\"value\":[1586250781.73,\"1\"]}]",
        },
    }
to be empty
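
For reference, the same exclusion query can be re-run by hand against the cluster's Prometheus. The sketch below is not the origin test code; it assumes the prometheus/client_golang API client and an endpoint reachable at localhost:9090 (e.g. via a port-forward to prometheus-k8s), and it omits the bearer-token authentication a real cluster would require.

// Minimal sketch, not taken from openshift/origin: evaluate the test's
// exclusion query against a Prometheus endpoint using client_golang.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed address; in-cluster prometheus-k8s normally requires auth,
	// which is omitted here for brevity.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// The query the conformance test evaluates: any alert other than the
	// allowed ones that was firing at some point in the last two hours.
	query := `count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh",alertstate="firing",severity!="info"}[2h]) >= 1`

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// A non-empty result vector is exactly what causes the test failure above.
	fmt.Println(result)
}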

Comment 1 Ryan Phillips 2020-05-11 17:10:52 UTC
It looks like the etcd-operator may want to ignore the closing event on BootstrapTeardownDegraded:

Apr 07 09:12:25.637 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nBootstrapTeardownDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
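
A hypothetical illustration of the kind of check that suggestion implies (not taken from the cluster-etcd-operator source): treat the gRPC Canceled error from the closing client connection as benign when computing the degraded condition.

// Hypothetical sketch only; not the actual cluster-etcd-operator code.
package etcdutil

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isBenignTeardownError reports whether an error seen while tearing down the
// bootstrap etcd member can be ignored instead of setting
// BootstrapTeardownDegraded.
func isBenignTeardownError(err error) bool {
	if err == nil {
		return true
	}
	// "rpc error: code = Canceled desc = grpc: the client connection is closing"
	// surfaces as codes.Canceled once the client connection is shut down.
	return status.Code(err) == codes.Canceled
}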

Comment 2 Dan Mace 2020-05-20 18:52:29 UTC
The cited run exhibits no obvious etcd problems, so I'm closing this as a duplicate of bug 1743911, which tracks KubeAPILatencyHigh issues that _could_ be related to etcd.

*** This bug has been marked as a duplicate of bug 1743911 ***

