Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1852919

Summary: [sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.5
Reporter: Corey Daley <cdaley>
Assignee: Sergiusz Urbaniak <surbania>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pgier, pkrupa, scuppett, spasquie, surbania, surya
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: If docs needed, set a value
Environment: [sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics
Last Closed: 2020-08-21 13:45:05 UTC

Comment 2 Simon Pasquier 2020-07-01 16:16:39 UTC
Looking at a few of the failed CI jobs, the "[sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics" test was only one of many tests that failed. Do you have an example of a job where this test failure is isolated?

Comment 4 Paul Gier 2020-07-01 16:53:13 UTC
I found a couple of test runs that seem to have only this test failing, for example:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-gcp/99/pull-ci-openshift-cluster-api-provider-gcp-master-e2e-gcp/1278246498538098688

However, this test actually checks several different queries, and it isn't consistent which queries/sub-tests fail. For example, some runs fail on the number of etcd instances:
"promQL query: instance:etcd_object_counts:sum > 0 had reported incorrect results"

And some fail on the cluster infrastructure provider and/or feature set:
"promQL query: cluster_infrastructure_provider{type!=\"\"} had reported incorrect results:\n[]"

I'm also wondering why most/all of them seem to have "Run #0: Failed" and "Run #1: Passed".

Comment 5 Paul Gier 2020-07-01 17:29:51 UTC
Lowering priority to medium, since many of the failures appear to be flakes where the test fails once but passes on the second run.

Comment 6 Paul Gier 2020-07-01 20:11:13 UTC
Assigning to test infra team because this seems to be a flaky test and not caused by a failure in the monitoring components.

Comment 7 Steve Kuznetsov 2020-07-06 14:51:06 UTC
> I'm also wondering why most/all of them see to have "Run #0: Failed" and "Run #1: Passed"

The platform will re-try a test N times to weed out flakes. The e2e framework only fails the job if the test fails all N times.
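
A minimal sketch of that retry semantic, with illustrative names only (not the actual e2e framework code):

package main

import (
	"errors"
	"fmt"
)

// runWithRetries mimics the behavior described above: the job only fails if
// the test fails all n attempts; a pass on a retry marks the run as a flake
// ("Run #0: Failed", "Run #1: Passed"). Illustrative, not the real framework.
func runWithRetries(n int, test func() error) (flaky bool, err error) {
	for i := 0; i < n; i++ {
		if err = test(); err == nil {
			return i > 0, nil // passed; a flake if an earlier attempt failed
		}
	}
	return false, err // failed every attempt: hard failure
}

func main() {
	attempt := 0
	flaky, err := runWithRetries(2, func() error {
		attempt++
		if attempt == 1 {
			return errors.New("promQL query returned no data") // first run fails
		}
		return nil // second run passes
	})
	fmt.Println("flaky:", flaky, "err:", err) // flaky: true err: <nil>
}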

> Assigning to test infra team because this seems to be a flaky test and not caused by a failure in the monitoring components.

The Test Infrastructure component is not the correct place for flaky tests. The monitoring team owns their tests and is responsible for ensuring they are not flaky.

Comment 8 Simon Pasquier 2020-07-06 16:11:24 UTC
Found 2 other recent CI runs where the test is reported as flaky:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/683/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1279256870099357696
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/686/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1280128937162182656

As noted by Paul in comment 4, the test doesn't always fail on the same metric, so it might be a question of timing (maybe the test runs too early).

I'm lowering the severity to medium (assuming this is what you wanted to do, Paul).

Comment 9 Stephen Cuppett 2020-07-06 16:44:09 UTC
Setting target release to current development version (4.6) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

Comment 11 Simon Pasquier 2020-07-10 13:08:30 UTC
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/9989/rehearse-9989-pull-ci-openshift-origin-master-e2e-aws-fips/1281441174669758464

The error message says specifically that the 'cluster_installer{type!="",invoker!=""}' query didn't return any data. This is confirmed by the attached Prometheus data dump: for some unknown reason, the Prometheus servers couldn't scrape metrics from the cluster-version-operator pod. The pod was reported as down (the TargetDown alert fired for service="cluster-version-operator") even though it was running and listening on the address/port Prometheus expected.
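
One way to inspect this kind of scrape failure is Prometheus' /api/v1/targets endpoint, which reports each target's health and last scrape error. A minimal sketch, again assuming an API reachable at $PROM_URL (a hypothetical environment variable, as in the earlier sketch):

// targets.go - sketch: list scrape targets that Prometheus reports as not up.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

type targetsResponse struct {
	Data struct {
		ActiveTargets []struct {
			Labels    map[string]string `json:"labels"`
			ScrapeURL string            `json:"scrapeUrl"`
			Health    string            `json:"health"` // "up", "down", or "unknown"
			LastError string            `json:"lastError"`
		} `json:"activeTargets"`
	} `json:"data"`
}

func main() {
	resp, err := http.Get(os.Getenv("PROM_URL") + "/api/v1/targets")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var tr targetsResponse
	if err := json.NewDecoder(resp.Body).Decode(&tr); err != nil {
		panic(err)
	}
	for _, t := range tr.Data.ActiveTargets {
		if t.Health != "up" {
			// In the run above, the cluster-version-operator target showed up
			// here as down even though the pod was running on the expected address.
			fmt.Printf("%s (%s): %s - %s\n", t.Labels["job"], t.ScrapeURL, t.Health, t.LastError)
		}
	}
}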

Comment 12 Simon Pasquier 2020-07-10 13:17:18 UTC
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/401/pull-ci-openshift-cluster-version-operator-master-e2e/1281445783433908224

This is a flaky test: it failed on the first run and succeeded on the second.

In the first run, it fails because 'cluster_infrastructure_provider{type!=""}' returned no data. The test reported the failure at 05:24:12 and the Prometheus data dump shows that the metric only appeared at 05:24:19. Here we can say that the test ran too early.
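
If the root cause is timing, the usual mitigation is to poll the query until a deadline rather than asserting once; a 7-second startup gap like the one above would then be absorbed. A minimal standard-library sketch (the real test in openshift/origin may handle this differently):

package main

import (
	"fmt"
	"time"
)

// waitForQuery polls hasData until it reports results or the deadline passes.
// Sketch only, to illustrate absorbing a startup race like the one above.
func waitForQuery(timeout, interval time.Duration, hasData func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		ok, err := hasData()
		if ok && err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("query still empty after %s (last error: %v)", timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	start := time.Now()
	err := waitForQuery(30*time.Second, time.Second, func() (bool, error) {
		// Stand-in for an instant query such as cluster_infrastructure_provider{type!=""};
		// here the "metric" only appears 7 seconds in, as in the CI run above.
		return time.Since(start) > 7*time.Second, nil
	})
	fmt.Println("result:", err) // nil: the poll outlasted the race
}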

Comment 13 Sergiusz Urbaniak 2020-07-29 12:50:03 UTC
@sur: This should have been fixed by the prometheus-operator bump revert, but please recheck that the failures no longer happen.

Comment 19 Pawel Krupa 2020-08-21 08:03:42 UTC
Removing NEEDINFO as it seems all necessary information is in the ticket.