Bug 1852919
| Summary: | [sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Corey Daley <cdaley> |
| Component: | Monitoring | Assignee: | Sergiusz Urbaniak <surbania> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.5 | CC: | alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pgier, pkrupa, scuppett, spasquie, surbania, surya |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | [sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics |
| Last Closed: | 2020-08-21 13:45:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Corey Daley
2020-07-01 15:09:07 UTC
Looking at a few of the failed CI jobs, the "[sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics" test was only one of many tests that failed. Do you have an example of a job where the test failure is isolated? This one only has a few test failures: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_sriov-network-operator/273/pull-ci-openshift-sriov-network-operator-master-e2e-aws/1278219205161783296

I doubt you will find one that ONLY has that test failing.

I found a couple of test runs that seem to have only this test failing, for example: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-gcp/99/pull-ci-openshift-cluster-api-provider-gcp-master-e2e-gcp/1278246498538098688

However, this test actually checks several different queries, and which queries/sub-tests fail is not consistent. For example, some runs fail on the number of etcd instances:

"promQL query: instance:etcd_object_counts:sum > 0 had reported incorrect results"

And some fail on the cluster infrastructure provider and/or feature set:

"promQL query: cluster_infrastructure_provider{type!=\"\"} had reported incorrect results:\n[]"

I'm also wondering why most/all of them seem to have "Run #0: Failed" and "Run #1: Passed".

Lowering priority to medium, since a lot of the failures appear to just be flakes where the test fails one time but passes the second time.

Assigning to test infra team because this seems to be a flaky test and not caused by a failure in the monitoring components.

> I'm also wondering why most/all of them seem to have "Run #0: Failed" and "Run #1: Passed"

The platform will re-try a test N times to weed out flakes. The e2e framework only fails the job if the test fails all N times.

> Assigning to test infra team because this seems to be a flaky test and not caused by a failure in the monitoring components.

The Test Infrastructure component is not the correct place for flaky tests. The monitoring team owns their tests and is responsible for ensuring they are not flaky.

Found 2 other recent CI runs where the test is reported as flaky:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/683/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1279256870099357696
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/686/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1280128937162182656

As noted by Paul in comment 14, the test doesn't always fail on the same metric, so it might be a question of timing (maybe the test runs too early). I'm lowering the severity to medium (assuming this is what you wanted to do, Paul).

Setting target release to the current development version (4.6) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.
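For reference, the failing checks can be reproduced by hand against the cluster's Prometheus HTTP API (`/api/v1/query`). The sketch below is not the e2e test code; the route URL and token handling are assumptions (e.g. the prometheus-k8s route in openshift-monitoring and a token with cluster-monitoring-view access). It runs the queries quoted above and prints how many series each returns:

```go
// Minimal sketch: run the failing PromQL queries against the cluster
// Prometheus query API. PROM_URL and PROM_TOKEN are assumed to be set by the
// caller (route hostname and bearer token); this is not the e2e test itself.
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
)

func main() {
	promURL := os.Getenv("PROM_URL") // assumed, e.g. https://prometheus-k8s-openshift-monitoring.apps.<cluster-domain>
	token := os.Getenv("PROM_TOKEN") // assumed bearer token for the query API

	// Queries quoted in the failure messages above; on a healthy cluster each
	// should return at least one series.
	queries := []string{
		`instance:etcd_object_counts:sum > 0`,
		`cluster_infrastructure_provider{type!=""}`,
	}

	// The route's serving cert may not be in the local trust store; skip
	// verification for this ad-hoc check only.
	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}}

	for _, q := range queries {
		req, err := http.NewRequest("GET", promURL+"/api/v1/query?query="+url.QueryEscape(q), nil)
		if err != nil {
			fmt.Printf("%s: %v\n", q, err)
			continue
		}
		req.Header.Set("Authorization", "Bearer "+token)
		resp, err := client.Do(req)
		if err != nil {
			fmt.Printf("%s: request failed: %v\n", q, err)
			continue
		}
		// Decode only the fields needed to count the returned series.
		var body struct {
			Status string `json:"status"`
			Data   struct {
				Result []json.RawMessage `json:"result"`
			} `json:"data"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
			fmt.Printf("%s: decode failed: %v\n", q, err)
		} else {
			fmt.Printf("%s: status=%s, %d series\n", q, body.Status, len(body.Data.Result))
		}
		resp.Body.Close()
	}
}
```

An empty result for either query right after install would match the flakes described above.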
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/9989/rehearse-9989-pull-ci-openshift-origin-master-e2e-aws-fips/1281441174669758464

The error message says specifically that the 'cluster_installer{type!="",invoker!=""}' query didn't return any data. This is confirmed by looking at the attached Prometheus data dump: for some unknown reason, the Prometheus servers couldn't scrape metrics from the cluster-version-operator pod. The pod was reported as down (the TargetDown alert fired for service="cluster-version-operator") even though it was running and listening on the address/port expected by Prometheus.

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/401/pull-ci-openshift-cluster-version-operator-master-e2e/1281445783433908224

This is a flaky test: it failed on the first run and succeeded on the second. In the first run, it failed because 'cluster_infrastructure_provider{type!=""}' returned no data. The test reported the failure at 05:24:12, and the Prometheus data dump shows that the metric only appeared at 05:24:19. Here we can say that the test ran too early.

@sur: This should have been fixed by the prometheus-operator bump revert, but please recheck that the failures no longer happen.

Removing NEEDINFO as it seems all the necessary information is in the ticket.
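One possible mitigation on the test side for the "ran too early" case would be to poll the query until it returns data instead of asserting on a single early sample. This is a sketch only; queryPrometheus and waitForSamples are hypothetical names, not the e2e framework's actual API:

```go
// Sketch: retry a PromQL query until it returns at least one sample or a
// timeout expires, covering windows like the one above where the metric only
// showed up ~7 seconds after the test had already failed.
package main

import (
	"fmt"
	"time"
)

// queryPrometheus is a hypothetical stand-in for whatever the suite uses to
// hit /api/v1/query; assume it returns the number of series in data.result.
func queryPrometheus(query string) (int, error) {
	// ... issue the query and count data.result entries ...
	return 0, nil
}

// waitForSamples polls the query at the given interval until it returns data
// or the timeout is reached.
func waitForSamples(query string, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		n, err := queryPrometheus(query)
		if err == nil && n > 0 {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("promQL query %q still empty after %s (last error: %v)", query, timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	err := waitForSamples(`cluster_infrastructure_provider{type!=""}`, 5*time.Second, 2*time.Minute)
	fmt.Println(err)
}
```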