Bug 1852919
| Summary: | [sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Corey Daley <cdaley> |
| Component: | Monitoring | Assignee: | Sergiusz Urbaniak <surbania> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.5 | CC: | alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pgier, pkrupa, scuppett, spasquie, surbania, surya |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | [sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics |
| Last Closed: | 2020-08-21 13:45:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Corey Daley
2020-07-01 15:09:07 UTC
Looking at a few of the failed CI jobs, the "[sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics" test was only one of many tests that failed. Do you have an example of a job where the test failure is isolated? This one only has a few test failures: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_sriov-network-operator/273/pull-ci-openshift-sriov-network-operator-master-e2e-aws/1278219205161783296

I doubt you will find one that ONLY has that test failing.

I found a couple of test runs that seem to have only this test failing, for example: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-gcp/99/pull-ci-openshift-cluster-api-provider-gcp-master-e2e-gcp/1278246498538098688

However, this test actually checks several different queries, and which queries/sub-tests fail is not consistent. For example, some runs fail on the number of etcd instances:

"promQL query: instance:etcd_object_counts:sum > 0 had reported incorrect results"

And some fail on the cluster infrastructure provider and/or feature set:

"promQL query: cluster_infrastructure_provider{type!=\"\"} had reported incorrect results:\n[]"

I'm also wondering why most/all of them seem to have "Run #0: Failed" and "Run #1: Passed".

Lowering priority to medium, since a lot of the failures appear to just be flakes where the test fails one time but passes the second time.

Assigning to test infra team because this seems to be a flaky test and not caused by a failure in the monitoring components.

> I'm also wondering why most/all of them seem to have "Run #0: Failed" and "Run #1: Passed"

The platform will re-try a test N times to weed out flakes. The e2e framework only fails the job if the test fails all N times.

> Assigning to test infra team because this seems to be a flaky test and not caused by a failure in the monitoring components.

The Test Infrastructure component is not the correct place for flaky tests. The monitoring team owns their tests and is responsible for ensuring they are not flaky.

Found 2 other recent CI runs where the test is reported as flaky:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/683/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1279256870099357696
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/686/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1280128937162182656

As noted by Paul in comment 14, the test doesn't always fail on the same metric, so it might be a question of timing (maybe the test runs too early). I'm lowering the severity to medium (assuming this is what you wanted to do, Paul).

Setting target release to the current development version (4.6) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.
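For reference, the failing checks can be reproduced by hand against the cluster's Prometheus HTTP API (`/api/v1/query`). The sketch below is not the e2e test code; the route URL and token handling are assumptions (e.g. the prometheus-k8s route in openshift-monitoring and a token with cluster-monitoring-view access). It runs the queries quoted above and prints how many series each returns:

```go
// Minimal sketch: run the failing PromQL queries against the cluster
// Prometheus query API. PROM_URL and PROM_TOKEN are assumed to be set by the
// caller (route hostname and bearer token); this is not the e2e test itself.
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
)

func main() {
	promURL := os.Getenv("PROM_URL") // assumed, e.g. https://prometheus-k8s-openshift-monitoring.apps.<cluster-domain>
	token := os.Getenv("PROM_TOKEN") // assumed bearer token for the query API

	// Queries quoted in the failure messages above; on a healthy cluster each
	// should return at least one series.
	queries := []string{
		`instance:etcd_object_counts:sum > 0`,
		`cluster_infrastructure_provider{type!=""}`,
	}

	// The route's serving cert may not be in the local trust store; skip
	// verification for this ad-hoc check only.
	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}}

	for _, q := range queries {
		req, err := http.NewRequest("GET", promURL+"/api/v1/query?query="+url.QueryEscape(q), nil)
		if err != nil {
			fmt.Printf("%s: %v\n", q, err)
			continue
		}
		req.Header.Set("Authorization", "Bearer "+token)
		resp, err := client.Do(req)
		if err != nil {
			fmt.Printf("%s: request failed: %v\n", q, err)
			continue
		}
		// Decode only the fields needed to count the returned series.
		var body struct {
			Status string `json:"status"`
			Data   struct {
				Result []json.RawMessage `json:"result"`
			} `json:"data"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
			fmt.Printf("%s: decode failed: %v\n", q, err)
		} else {
			fmt.Printf("%s: status=%s, %d series\n", q, body.Status, len(body.Data.Result))
		}
		resp.Body.Close()
	}
}
```

An empty result for either query right after install would match the flakes described above.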
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/9989/rehearse-9989-pull-ci-openshift-origin-master-e2e-aws-fips/1281441174669758464

The error message says specifically that the 'cluster_installer{type!="",invoker!=""}' query didn't return any data. This is confirmed by looking at the attached Prometheus data dump: for some unknown reason, the Prometheus servers couldn't scrape metrics from the cluster-version-operator pod. The pod was reported as down (the TargetDown alert fired for service="cluster-version-operator") even though it was running and listening on the address/port expected by Prometheus.

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/401/pull-ci-openshift-cluster-version-operator-master-e2e/1281445783433908224

This is a flaky test: it failed on the first run and succeeded on the second. In the first run, it failed because 'cluster_infrastructure_provider{type!=""}' returned no data. The test reported the failure at 05:24:12, and the Prometheus data dump shows that the metric only appeared at 05:24:19. Here we can say that the test ran too early.

@sur: This should have been fixed by the prometheus-operator bump revert, but please recheck that the failures no longer happen.

Removing NEEDINFO as it seems all the necessary information is in the ticket.
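One possible mitigation on the test side for the "ran too early" case would be to poll the query until it returns data instead of asserting on a single early sample. This is a sketch only; queryPrometheus and waitForSamples are hypothetical names, not the e2e framework's actual API:

```go
// Sketch: retry a PromQL query until it returns at least one sample or a
// timeout expires, covering windows like the one above where the metric only
// showed up ~7 seconds after the test had already failed.
package main

import (
	"fmt"
	"time"
)

// queryPrometheus is a hypothetical stand-in for whatever the suite uses to
// hit /api/v1/query; assume it returns the number of series in data.result.
func queryPrometheus(query string) (int, error) {
	// ... issue the query and count data.result entries ...
	return 0, nil
}

// waitForSamples polls the query at the given interval until it returns data
// or the timeout is reached.
func waitForSamples(query string, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		n, err := queryPrometheus(query)
		if err == nil && n > 0 {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("promQL query %q still empty after %s (last error: %v)", query, timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	err := waitForSamples(`cluster_infrastructure_provider{type!=""}`, 5*time.Second, 2*time.Minute)
	fmt.Println(err)
}
```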