Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1891068

Summary:	[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] failing due to TargetDown alert from kube-scheduler
Product:	OpenShift Container Platform
Reporter:	Victor Pickard <vpickard>
Component:	kube-scheduler
Assignee:	Mike Dame <mdame>
Status:	CLOSED ERRATA
QA Contact:	RamaKasturi <knarra>
Severity:	medium
Priority:	urgent
Version:	4.7
CC:	aos-bugs, fabian, jluhrsen, mfojtik, sbonazzo, wking
Keywords:	Reopened
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	LifecycleReset
Environment:	[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]
Type:	Bug
Last Closed:	2021-02-24 15:27:53 UTC
Bug Blocks:	1907329

Description Victor Pickard 2020-10-23 17:53:51 UTC

The following test is failing frequently in CI:
[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]

See the search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-instrumentation%5C%5D+Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured+%5C%5BEarly%5C%5D


Link to flaky job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.7/1319569479503450112


fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"TargetDown\",\"alertstate\":\"firing\",\"job\":\"scheduler\",\"namespace\":\"openshift-kube-scheduler\",\"service\":\"scheduler\",\"severity\":\"warning\"},\"value\":[1603470079.646,\"1\"]}]",
        },
    }
to be empty
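The test runs the PromQL query shown above, which excludes Watchdog, AlertmanagerReceiversNotConfigured, PrometheusRemoteWriteDesiredShards and any `severity="info"` alert; any firing alert it returns fails the test. A minimal Go sketch of pulling the offending alert out of the reported result, assuming the embedded payload is a standard Prometheus instant-vector JSON array. The names `vectorSample` and `unexpectedAlerts` are hypothetical and are not taken from `test/extended/util/prometheus/helpers.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sample is the result vector copied from the failure output above.
const sample = `[{"metric":{"__name__":"ALERTS","alertname":"TargetDown","alertstate":"firing","job":"scheduler","namespace":"openshift-kube-scheduler","service":"scheduler","severity":"warning"},"value":[1603470079.646,"1"]}]`

// vectorSample (hypothetical) models one element of an instant-vector result.
type vectorSample struct {
	Metric map[string]string `json:"metric"`
	// Value is [unix timestamp (decoded as float64), sample value (decoded as string)].
	Value [2]interface{} `json:"value"`
}

// unexpectedAlerts (hypothetical) decodes the result and returns one
// "alertname namespace=... job=..." line per firing alert.
func unexpectedAlerts(raw string) ([]string, error) {
	var samples []vectorSample
	if err := json.Unmarshal([]byte(raw), &samples); err != nil {
		return nil, err
	}
	var out []string
	for _, s := range samples {
		out = append(out, fmt.Sprintf("%s namespace=%s job=%s",
			s.Metric["alertname"], s.Metric["namespace"], s.Metric["job"]))
	}
	return out, nil
}

func main() {
	alerts, err := unexpectedAlerts(sample)
	if err != nil {
		panic(err)
	}
	for _, a := range alerts {
		fmt.Println(a)
	}
}
```

Run against the output above, this prints a single line identifying `TargetDown` for the `scheduler` job in `openshift-kube-scheduler`, which is the alert the rest of this thread tracks.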

Comment 2 Michal Fojtik 2020-11-22 18:12:08 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 3 Mike Dame 2020-12-04 18:03:53 UTC

Looking at recent failures, this doesn't seem to be specific to the scheduler (controller-manager and API server show similar failures), so I'm closing this BZ since a couple of others are already open for the same flake.

*** This bug has been marked as a duplicate of bug 1872874 ***

Comment 4 Fabian von Feilitzsch 2021-01-12 20:35:54 UTC

We're seeing this pop up on promotion jobs; it failed three times in a row.

Failing job: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.7/1349041960047874048

It's getting the TargetDown alert from the scheduler.

Comment 5 Michal Fojtik 2021-01-12 20:38:51 UTC

The LifecycleStale keyword was removed because the needinfo? flag was reset and the bug got commented on recently.
The bug assignee was notified.

Comment 6 Fabian von Feilitzsch 2021-01-12 21:03:39 UTC

Actually, it looks like we're seeing this a lot more frequently:

https://search.ci.openshift.org/?search=query+failed%3A+ALERTS.*TargetDown.*openshift-kube-scheduler&maxAge=48h&context=1&type=bug%2Bjunit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job

Bumping to urgent.

Comment 8 jamo luhrsen 2021-01-14 23:09:33 UTC

*** Bug 1915912 has been marked as a duplicate of this bug. ***

Comment 9 RamaKasturi 2021-01-19 12:32:52 UTC

Checked the search results linked below and saw no failures reported in the last 4 days, so marking the bug verified.

https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-instrumentation%5C%5D+Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured+%5C%5BEarly%5C%5D

Comment 12 errata-xmlrpc 2021-02-24 15:27:53 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633