Bug 1836886 - [4.3] Spurious TargetDown alerts for healthy pods
Summary: [4.3] Spurious TargetDown alerts for healthy pods
Keywords:
Status: CLOSED DUPLICATE of bug 1836887
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.z
Assignee: Standa Laznicka
QA Contact: scheng
URL:
Whiteboard:
Depends On: 1779438
Blocks:
 
Reported: 2020-05-18 12:29 UTC by Standa Laznicka
Modified: 2020-05-18 12:32 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-18 12:32:28 UTC
Target Upstream Version:
Embargoed:


Description Standa Laznicka 2020-05-18 12:29:47 UTC
This bug was initially created as a copy of Bug #1779438

I am copying this bug because: 



Examples from 4.3 promotion jobs [1]:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:134]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|UsingDeprecatedAPIExtensionsV1Beta1\",alertstate=\"firing\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|UsingDeprecatedAPIExtensionsV1Beta1\",alertstate=\"firing\"} >= 1 had reported incorrect results: ALERTS{alertname=\"TargetDown\", alertstate=\"firing\", job=\"metrics\", namespace=\"openshift-kube-controller-manager-operator\", service=\"metrics\", severity=\"warning\"} => 1 @[1575338031.017]",
        },
    }
to be empty
...
failed: (7m2s) 2019-12-03T01:53:53 "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]"
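
For readers unfamiliar with the query in the failure above, here is a minimal Python sketch of what its label matchers select. It is not the test's actual implementation (that lives in the Go file named in the failure); the function name and the sample alert list are hypothetical, and only the exclusion pattern and the TargetDown labels come from the output above. Prometheus regex matchers are fully anchored, which the sketch mimics with `re.fullmatch`.

```python
import re

# Exclusion pattern copied from the alertname!~"..." matcher in the query.
EXCLUDED = "Watchdog|UsingDeprecatedAPIExtensionsV1Beta1"


def unexpected_firing(alerts, excluded=EXCLUDED):
    """Return alerts that are firing and whose name is not excluded.

    Prometheus anchors regex matchers at both ends, so fullmatch is
    used here instead of search.
    """
    pattern = re.compile(excluded)
    return [
        a for a in alerts
        if a.get("alertstate") == "firing"
        and not pattern.fullmatch(a.get("alertname", ""))
    ]


# Hypothetical sample set; "ExamplePending" is an invented alert name.
alerts = [
    {"alertname": "Watchdog", "alertstate": "firing"},
    {
        "alertname": "TargetDown",
        "alertstate": "firing",
        "job": "metrics",
        "namespace": "openshift-kube-controller-manager-operator",
        "service": "metrics",
        "severity": "warning",
    },
    {"alertname": "ExamplePending", "alertstate": "pending"},
]

names = [a["alertname"] for a in unexpected_firing(alerts)]
print(names)  # ['TargetDown']
```

The test expects this result to be empty, so a single firing TargetDown series is enough to fail it.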

But [2] shows the pod has been running since 1:20Z and was still running at gather time, so this is probably a metrics-gathering problem rather than a pod failure. Similar issue in [3]:

  ALERTS{alertname=\"TargetDown\", alertstate=\"firing\", job=\"metrics\", namespace=\"openshift-console-operator\", service=\"metrics\", severity=\"warning\"}

despite a healthy console operator [4]. Hit this 15 times today (1% of all e2e failures) [5]. Four of those (6% of failures in those jobs) were in 4.3 release jobs [6].
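
As a sanity check on the timeline in the first example, the alert sample's Unix timestamp (`@[1575338031.017]`) can be converted to UTC and compared with the pod start time from [2]. A minimal sketch: the report gives the pod start only as "1:20Z", so the date and the seconds used below are assumptions.

```python
from datetime import datetime, timezone

# Evaluation timestamp taken from the failure output: "=> 1 @[1575338031.017]".
sample_ts = 1575338031.017
sample_time = datetime.fromtimestamp(sample_ts, tz=timezone.utc)

# Assumption: "1:20Z" refers to 2019-12-03 (the day of the test failure),
# with the unknown seconds taken as :00.
pod_start = datetime(2019, 12, 3, 1, 20, tzinfo=timezone.utc)

# How long the pod had been up when the alert sample was recorded.
running_for = sample_time - pod_start

print(sample_time.strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2019-12-03T01:53:51Z
print(int(running_for.total_seconds() // 60), "minutes")  # 33 minutes
```

Under those assumptions the alert was still firing roughly half an hour after the pod started, about two seconds before the test reported its failure at 01:53:53.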

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/205
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/205/artifacts/e2e-gcp/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-11-22-122829-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-kube-controller-manager-operator/pods/kube-controller-manager-operator-6c984f44df-vf9bx/kube-controller-manager-operator-6c984f44df-vf9bx.yaml
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/209
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/209/artifacts/e2e-gcp/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-11-22-122829-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-console-operator/pods/console-operator-75548dd7b4-w9pg6/console-operator-75548dd7b4-w9pg6.yaml
[5]: https://search.svc.ci.openshift.org/chart?search=TargetDown.*firing
[6]: https://search.svc.ci.openshift.org/chart?name=release-openshift-ocp-installer-.*4.3$&search=TargetDown.*firing

Comment 1 Standa Laznicka 2020-05-18 12:32:28 UTC
I was unaware that the cherry-pick bot is now capable of cloning BZs, nice touch

*** This bug has been marked as a duplicate of bug 1836887 ***

