Bug 2006561 - [sig-instrumentation] Prometheus when installed on the cluster shouldn't have failing rules evaluation [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
Summary: [sig-instrumentation] Prometheus when installed on the cluster shouldn't have failing rules evaluation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Haoyu Sun
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 2011798
 
Reported: 2021-09-21 22:05 UTC by Ben Parees
Modified: 2023-01-06 11:28 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-canary=all
Last Closed: 2022-03-10 16:12:32 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1401 0 None open Bug 2006561: Prometheus when installed on the cluster shouldn't have failing rules evaluation 2021-10-04 16:22:44 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:13:04 UTC

Description Ben Parees 2021-09-21 22:05:47 UTC
[sig-instrumentation] Prometheus when installed on the cluster shouldn't have failing rules evaluation [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

is failing frequently in CI, see:
https://search.ci.openshift.org/?search=+Prometheus+when+installed+on+the+cluster+shouldn%27t+have+failing+rules+evaluation&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

(responsible for 1.44% of failures in the last 2 days)


Failed in the canary job here:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-canary/1440401339979927552
(has failed a few times in canary testing)

failure output:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:486]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "promQL query returned unexpected results:\nincrease(prometheus_rule_evaluation_failures_total[1h4m7s]) >= 1\n[\n  {\n    \"metric\": {\n      \"container\": \"prometheus-proxy\",\n      \"endpoint\": \"web\",\n      \"instance\": \"10.128.3.245:9091\",\n      \"job\": \"prometheus-k8s\",\n      \"namespace\": \"openshift-monitoring\",\n      \"pod\": \"prometheus-k8s-1\",\n      \"prometheus\": \"openshift-monitoring/k8s\",\n      \"rule_group\": \"/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules\",\n      \"service\": \"prometheus-k8s\"\n    },\n    \"value\": [\n      1632259998.636,\n      \"6.333333333333334\"\n    ]\n  },\n  {\n    \"metric\": {\n      \"container\": \"prometheus-proxy\",\n      \"endpoint\": \"web\",\n      \"instance\": \"10.129.2.12:9091\",\n      \"job\": \"prometheus-k8s\",\n      \"namespace\": \"openshift-monitoring\",\n      \"pod\": \"prometheus-k8s-0\",\n      \"prometheus\": \"openshift-monitoring/k8s\",\n      \"rule_group\": \"/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules\",\n      \"service\": \"prometheus-k8s\"\n    },\n    \"value\": [\n      1632259998.636,\n      \"1.0064771653543307\"\n    ]\n  },\n  {\n    \"metric\": {\n      \"container\": \"prometheus-proxy\",\n      \"endpoint\": \"web\",\n      \"instance\": \"10.131.0.82:9091\",\n      \"job\": \"prometheus-k8s\",\n      \"namespace\": \"openshift-monitoring\",\n      \"pod\": \"prometheus-k8s-1\",\n      \"prometheus\": \"openshift-monitoring/k8s\",\n      \"rule_group\": \"/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules\",\n      \"service\": \"prometheus-k8s\"\n    },\n    \"value\": [\n      1632259998.636,\n      \"6.070271358746393\"\n    ]\n  }\n]",
        },
    ]
    promQL query returned unexpected results:
    increase(prometheus_rule_evaluation_failures_total[1h4m7s]) >= 1
    [
      {
        "metric": {
          "container": "prometheus-proxy",
          "endpoint": "web",
          "instance": "10.128.3.245:9091",
          "job": "prometheus-k8s",
          "namespace": "openshift-monitoring",
          "pod": "prometheus-k8s-1",
          "prometheus": "openshift-monitoring/k8s",
          "rule_group": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules",
          "service": "prometheus-k8s"
        },
        "value": [
          1632259998.636,
          "6.333333333333334"
        ]
      },
      {
        "metric": {
          "container": "prometheus-proxy",
          "endpoint": "web",
          "instance": "10.129.2.12:9091",
          "job": "prometheus-k8s",
          "namespace": "openshift-monitoring",
          "pod": "prometheus-k8s-0",
          "prometheus": "openshift-monitoring/k8s",
          "rule_group": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules",
          "service": "prometheus-k8s"
        },
        "value": [
          1632259998.636,
          "1.0064771653543307"
        ]
      },
      {
        "metric": {
          "container": "prometheus-proxy",
          "endpoint": "web",
          "instance": "10.131.0.82:9091",
          "job": "prometheus-k8s",
          "namespace": "openshift-monitoring",
          "pod": "prometheus-k8s-1",
          "prometheus": "openshift-monitoring/k8s",
          "rule_group": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules",
          "service": "prometheus-k8s"
        },
        "value": [
          1632259998.636,
          "6.070271358746393"
        ]
      }
    ]
occurred

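For anyone triaging similar failures: the e2e check boils down to running the promQL query above against the cluster monitoring stack and failing if any series are returned. A rough sketch of reproducing it by hand, assuming a logged-in oc session and the usual thanos-querier route in openshift-monitoring (the test computes its time window dynamically; [1h] below is just a stand-in):

# Query the cluster monitoring API roughly the way the e2e test does (sketch).
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
# Any series returned here means rule evaluations failed inside the window.
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
  --data-urlencode 'query=increase(prometheus_rule_evaluation_failures_total[1h]) >= 1'
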
Comment 1 Haoyu Sun 2021-09-22 07:47:01 UTC
The problem comes from the query of the alert "HighlyAvailableWorkloadIncorrectlySpread":

The metric "kube_pod_spec_volumes_persistentvolumeclaims_info" can have several series for the same (namespace, pod) (one per persistent volume claim volume), so the join on (namespace, pod) produces duplicates and the rule evaluation fails.

We can resolve the problem by changing the query as shown below.
=> replace "kube_pod_spec_volumes_persistentvolumeclaims_info" with "max by(namespace, pod, workload) (kube_pod_spec_volumes_persistentvolumeclaims_info)"

count without(node) (group by(node, workload, namespace) (kube_pod_info{node!=""}
  * on(namespace, pod) group_left(workload) (max by( namespace, pod, workload )(kube_pod_spec_volumes_persistentvolumeclaims_info)
  * on(namespace, pod) group_left(workload) (namespace_workload_pod:kube_pod_owner:relabel
  * on(namespace, workload, workload_type) group_left() (count without(pod) (namespace_workload_pod:kube_pod_owner:relabel{namespace=~"(openshift-.*|kube-.*|default)"})> 1))))) == 1
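
For reference, a quick way to check for the duplicate series that break the join, reusing the $TOKEN/$HOST variables from the sketch in the description (this check query is only for illustration, it is not part of the proposed fix):

# Pods with more than one kube_pod_spec_volumes_persistentvolumeclaims_info series
# share the same (namespace, pod) labels and break the one-to-one join above.
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
  --data-urlencode 'query=count by(namespace, pod) (kube_pod_spec_volumes_persistentvolumeclaims_info) > 1'

With the max by(namespace, pod, workload) wrapper, each (namespace, pod) collapses back to a single series, so the join no longer errors out.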

Comment 10 errata-xmlrpc 2022-03-10 16:12:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

