Bug 1982795

Summary: [sig-instrumentation] Prometheus when installed on the cluster shouldn't have failing rules evaluation [Suite:openshift/conformance/parallel]
Product: OpenShift Container Platform
Reporter: Micah Abbott <miabbott>
Component: Monitoring
Assignee: Simon Pasquier <spasquie>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.6
CC: alegrand, amuller, anpicker, aos-bugs, erooth, kakkoyun, mfojtik, miabbott, pkrupa, sippy
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: tag-ci
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment: [sig-instrumentation] Prometheus when installed on the cluster shouldn't have failing rules evaluation [Suite:openshift/conformance/parallel]
Last Closed: 2021-07-22 07:42:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Micah Abbott 2021-07-15 17:58:32 UTC
test:
[sig-instrumentation] Prometheus when installed on the cluster shouldn't have failing rules evaluation [Suite:openshift/conformance/parallel]

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?search=%5C%5Bsig-instrumentation%5C%5D+Prometheus+when+installed+on+the+cluster+shouldn%27t+have+failing+rules+evaluation+%5C%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5C%5D&maxAge=168h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job


Example failing job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/5011/pull-ci-openshift-installer-release-4.6-e2e-aws-workers-rhel7/1415291024997093376

```

[AfterEach] [sig-instrumentation] Prometheus
  github.com/openshift/origin/test/extended/util/client.go:133
STEP: Collecting events from namespace "e2e-test-prometheus-ckplh".
STEP: Found 7 events.
Jul 14 13:57:28.310: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for execpod7pnlz: { } Scheduled: Successfully assigned e2e-test-prometheus-ckplh/execpod7pnlz to ip-10-0-161-238.ec2.internal
Jul 14 13:57:28.310: INFO: At 2021-07-14 13:56:32 +0000 UTC - event for execpod7pnlz: {multus } AddedInterface: Add eth0 [10.131.2.247/23]
Jul 14 13:57:28.310: INFO: At 2021-07-14 13:56:32 +0000 UTC - event for execpod7pnlz: {kubelet ip-10-0-161-238.ec2.internal} Pulling: Pulling image "ubi8/ubi"
Jul 14 13:57:28.310: INFO: At 2021-07-14 13:56:33 +0000 UTC - event for execpod7pnlz: {kubelet ip-10-0-161-238.ec2.internal} Pulled: Successfully pulled image "ubi8/ubi" in 508.083082ms
Jul 14 13:57:28.310: INFO: At 2021-07-14 13:56:33 +0000 UTC - event for execpod7pnlz: {kubelet ip-10-0-161-238.ec2.internal} Created: Created container agnhost-pause
Jul 14 13:57:28.310: INFO: At 2021-07-14 13:56:33 +0000 UTC - event for execpod7pnlz: {kubelet ip-10-0-161-238.ec2.internal} Started: Started container agnhost-pause
Jul 14 13:57:28.310: INFO: At 2021-07-14 13:57:28 +0000 UTC - event for execpod7pnlz: {kubelet ip-10-0-161-238.ec2.internal} Killing: Stopping container agnhost-pause
Jul 14 13:57:28.347: INFO: POD           NODE                          PHASE    GRACE  CONDITIONS
Jul 14 13:57:28.348: INFO: execpod7pnlz  ip-10-0-161-238.ec2.internal  Running  1s     [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-07-14 13:56:29 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2021-07-14 13:56:33 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2021-07-14 13:56:33 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-07-14 13:56:29 +0000 UTC  }]
Jul 14 13:57:28.348: INFO: 
Jul 14 13:57:28.424: INFO: skipping dumping cluster info - cluster too large
Jul 14 13:57:28.518: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-prometheus-ckplh-user}, err: <nil>
Jul 14 13:57:28.585: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-prometheus-ckplh}, err: <nil>
Jul 14 13:57:28.671: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  T7TFfthgTFqu4dNaQcQ7jwAAAAAAAAAA}, err: <nil>
[AfterEach] [sig-instrumentation] Prometheus
  github.com/openshift/origin/test/extended/util/client.go:134
Jul 14 13:57:28.671: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-test-prometheus-ckplh" for this suite.
Jul 14 13:57:28.793: INFO: Running AfterSuite actions on all nodes
Jul 14 13:57:28.793: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "prometheus_rule_evaluation_failures_total >= 1": {
            s: "promQL query: prometheus_rule_evaluation_failures_total >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"prometheus_rule_evaluation_failures_total\",\"container\":\"prometheus-proxy\",\"endpoint\":\"web\",\"instance\":\"10.128.4.11:9091\",\"job\":\"prometheus-k8s\",\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-0\",\"rule_group\":\"/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-prometheus-k8s-rules.yaml;node.rules\",\"service\":\"prometheus-k8s\"},\"value\":[1626271038.146,\"4\"]}]",
        },
    }
to be empty
```
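
The failure message above embeds the raw PromQL result as escaped JSON, which is hard to read in the log. The following is a minimal sketch, not part of openshift/origin, that unpacks that payload to show which rule group failed and how often. The helper name `summarize_failures` is hypothetical, and the sketch assumes the `rule_group` label has the form `<rule file path>;<group name>`, as it appears in this log.

```python
import json

# Instant-vector result copied (unescaped) from the failure message in the log.
RESULT = (
    '[{"metric":{"__name__":"prometheus_rule_evaluation_failures_total",'
    '"container":"prometheus-proxy","endpoint":"web",'
    '"instance":"10.128.4.11:9091","job":"prometheus-k8s",'
    '"namespace":"openshift-monitoring","pod":"prometheus-k8s-0",'
    '"rule_group":"/etc/prometheus/rules/prometheus-k8s-rulefiles-0/'
    'openshift-monitoring-prometheus-k8s-rules.yaml;node.rules",'
    '"service":"prometheus-k8s"},"value":[1626271038.146,"4"]}]'
)


def summarize_failures(raw):
    """Return one (pod, rule_file, group, failures) tuple per sample.

    Hypothetical helper for reading the log; assumes the rule_group
    label is "<rule file path>;<group name>".
    """
    rows = []
    for sample in json.loads(raw):
        metric = sample["metric"]
        rule_file, _, group = metric["rule_group"].partition(";")
        # Each sample value is a [timestamp, "value-as-string"] pair.
        _timestamp, value = sample["value"]
        rows.append((metric["pod"], rule_file, group, int(float(value))))
    return rows


for pod, rule_file, group, failures in summarize_failures(RESULT):
    print(f"{pod}: group {group!r} in {rule_file} had {failures} failed evaluation(s)")
```

For this job, the single sample points at the `node.rules` group on `prometheus-k8s-0`, with a counter value of 4 at the time of the query.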

Comment 1 Simon Pasquier 2021-07-22 07:42:06 UTC

*** This bug has been marked as a duplicate of bug 1908655 ***