Bug 2006561 - [sig-instrumentation] Prometheus when installed on the cluster shouldn't have failing rules evaluation [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
Summary: [sig-instrumentation] Prometheus when installed on the cluster shouldn't have failing rules evaluation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Haoyu Sun
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 2011798
 
Reported: 2021-09-21 22:05 UTC by Ben Parees
Modified: 2023-01-06 11:28 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-canary=all
Last Closed: 2022-03-10 16:12:32 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1401 0 None open Bug 2006561: Prometheus when installed on the cluster shouldn't have failing rules evaluation 2021-10-04 16:22:44 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:13:04 UTC

Description Ben Parees 2021-09-21 22:05:47 UTC
[sig-instrumentation] Prometheus when installed on the cluster shouldn't have failing rules evaluation [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

is failing frequently in CI, see:
https://search.ci.openshift.org/?search=+Prometheus+when+installed+on+the+cluster+shouldn%27t+have+failing+rules+evaluation&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

(responsible for 1.44% of failures in the last 2 days)


Failed in the canary job here:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-canary/1440401339979927552
(has failed a few times in canary testing)

failure output:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:486]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "promQL query returned unexpected results:\nincrease(prometheus_rule_evaluation_failures_total[1h4m7s]) >= 1\n[\n  {\n    \"metric\": {\n      \"container\": \"prometheus-proxy\",\n      \"endpoint\": \"web\",\n      \"instance\": \"10.128.3.245:9091\",\n      \"job\": \"prometheus-k8s\",\n      \"namespace\": \"openshift-monitoring\",\n      \"pod\": \"prometheus-k8s-1\",\n      \"prometheus\": \"openshift-monitoring/k8s\",\n      \"rule_group\": \"/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules\",\n      \"service\": \"prometheus-k8s\"\n    },\n    \"value\": [\n      1632259998.636,\n      \"6.333333333333334\"\n    ]\n  },\n  {\n    \"metric\": {\n      \"container\": \"prometheus-proxy\",\n      \"endpoint\": \"web\",\n      \"instance\": \"10.129.2.12:9091\",\n      \"job\": \"prometheus-k8s\",\n      \"namespace\": \"openshift-monitoring\",\n      \"pod\": \"prometheus-k8s-0\",\n      \"prometheus\": \"openshift-monitoring/k8s\",\n      \"rule_group\": \"/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules\",\n      \"service\": \"prometheus-k8s\"\n    },\n    \"value\": [\n      1632259998.636,\n      \"1.0064771653543307\"\n    ]\n  },\n  {\n    \"metric\": {\n      \"container\": \"prometheus-proxy\",\n      \"endpoint\": \"web\",\n      \"instance\": \"10.131.0.82:9091\",\n      \"job\": \"prometheus-k8s\",\n      \"namespace\": \"openshift-monitoring\",\n      \"pod\": \"prometheus-k8s-1\",\n      \"prometheus\": \"openshift-monitoring/k8s\",\n      \"rule_group\": \"/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules\",\n      \"service\": \"prometheus-k8s\"\n    },\n    \"value\": [\n      1632259998.636,\n      \"6.070271358746393\"\n    ]\n  }\n]",
        },
    ]
    promQL query returned unexpected results:
    increase(prometheus_rule_evaluation_failures_total[1h4m7s]) >= 1
    [
      {
        "metric": {
          "container": "prometheus-proxy",
          "endpoint": "web",
          "instance": "10.128.3.245:9091",
          "job": "prometheus-k8s",
          "namespace": "openshift-monitoring",
          "pod": "prometheus-k8s-1",
          "prometheus": "openshift-monitoring/k8s",
          "rule_group": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules",
          "service": "prometheus-k8s"
        },
        "value": [
          1632259998.636,
          "6.333333333333334"
        ]
      },
      {
        "metric": {
          "container": "prometheus-proxy",
          "endpoint": "web",
          "instance": "10.129.2.12:9091",
          "job": "prometheus-k8s",
          "namespace": "openshift-monitoring",
          "pod": "prometheus-k8s-0",
          "prometheus": "openshift-monitoring/k8s",
          "rule_group": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules",
          "service": "prometheus-k8s"
        },
        "value": [
          1632259998.636,
          "1.0064771653543307"
        ]
      },
      {
        "metric": {
          "container": "prometheus-proxy",
          "endpoint": "web",
          "instance": "10.131.0.82:9091",
          "job": "prometheus-k8s",
          "namespace": "openshift-monitoring",
          "pod": "prometheus-k8s-1",
          "prometheus": "openshift-monitoring/k8s",
          "rule_group": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules",
          "service": "prometheus-k8s"
        },
        "value": [
          1632259998.636,
          "6.070271358746393"
        ]
      }
    ]
occurred

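For anyone triaging similar failures: the e2e check boils down to running the promQL query above against the cluster monitoring stack and failing if any series are returned. A rough sketch of reproducing it by hand, assuming a logged-in oc session and the usual thanos-querier route in openshift-monitoring (the test computes its time window dynamically; [1h] below is just a stand-in):

# Query the cluster monitoring API roughly the way the e2e test does (sketch).
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
# Any series returned here means rule evaluations failed inside the window.
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
  --data-urlencode 'query=increase(prometheus_rule_evaluation_failures_total[1h]) >= 1'
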
Comment 1 Haoyu Sun 2021-09-22 07:47:01 UTC
The problem comes from the query of the alert "HighlyAvailableWorkloadIncorrectlySpread":

The metric "kube_pod_spec_volumes_persistentvolumeclaims_info" can have several series for the same (namespace, pod) (one per persistent volume claim volume), so the join on (namespace, pod) produces duplicates and the rule evaluation fails.

We can resolve the problem by changing the query as shown below.
=> replace "kube_pod_spec_volumes_persistentvolumeclaims_info" with "max by(namespace, pod, workload) (kube_pod_spec_volumes_persistentvolumeclaims_info)"

count without(node) (group by(node, workload, namespace) (kube_pod_info{node!=""}
  * on(namespace, pod) group_left(workload) (max by( namespace, pod, workload )(kube_pod_spec_volumes_persistentvolumeclaims_info)
  * on(namespace, pod) group_left(workload) (namespace_workload_pod:kube_pod_owner:relabel
  * on(namespace, workload, workload_type) group_left() (count without(pod) (namespace_workload_pod:kube_pod_owner:relabel{namespace=~"(openshift-.*|kube-.*|default)"})> 1))))) == 1
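
For reference, a quick way to check for the duplicate series that break the join, reusing the $TOKEN/$HOST variables from the sketch in the description (this check query is only for illustration, it is not part of the proposed fix):

# Pods with more than one kube_pod_spec_volumes_persistentvolumeclaims_info series
# share the same (namespace, pod) labels and break the one-to-one join above.
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
  --data-urlencode 'query=count by(namespace, pod) (kube_pod_spec_volumes_persistentvolumeclaims_info) > 1'

With the max by(namespace, pod, workload) wrapper, each (namespace, pod) collapses back to a single series, so the join no longer errors out.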

Comment 10 errata-xmlrpc 2022-03-10 16:12:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

