[sig-instrumentation] Prometheus when installed on the cluster shouldn't have failing rules evaluation [Skipped:Disconnected] [Suite:openshift/conformance/parallel] is failing frequently in CI (responsible for 1.44% of failures in the last 2 days), see:

https://search.ci.openshift.org/?search=+Prometheus+when+installed+on+the+cluster+shouldn%27t+have+failing+rules+evaluation&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

It has also failed a few times in canary testing, for example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-canary/1440401339979927552

Failure output:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:486]: Unexpected error: <errors.aggregate | len:1, cap:1>
promQL query returned unexpected results:
increase(prometheus_rule_evaluation_failures_total[1h4m7s]) >= 1
[
  {
    "metric": {
      "container": "prometheus-proxy",
      "endpoint": "web",
      "instance": "10.128.3.245:9091",
      "job": "prometheus-k8s",
      "namespace": "openshift-monitoring",
      "pod": "prometheus-k8s-1",
      "prometheus": "openshift-monitoring/k8s",
      "rule_group": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules",
      "service": "prometheus-k8s"
    },
    "value": [
      1632259998.636,
      "6.333333333333334"
    ]
  },
  {
    "metric": {
      "container": "prometheus-proxy",
      "endpoint": "web",
      "instance": "10.129.2.12:9091",
      "job": "prometheus-k8s",
      "namespace": "openshift-monitoring",
      "pod": "prometheus-k8s-0",
      "prometheus": "openshift-monitoring/k8s",
      "rule_group": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules",
      "service": "prometheus-k8s"
    },
    "value": [
      1632259998.636,
      "1.0064771653543307"
    ]
  },
  {
    "metric": {
      "container": "prometheus-proxy",
      "endpoint": "web",
      "instance": "10.131.0.82:9091",
      "job": "prometheus-k8s",
      "namespace": "openshift-monitoring",
      "pod": "prometheus-k8s-1",
      "prometheus": "openshift-monitoring/k8s",
      "rule_group": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-cluster-monitoring-operator-prometheus-rules.yaml;openshift-general.rules",
      "service": "prometheus-k8s"
    },
    "value": [
      1632259998.636,
      "6.070271358746393"
    ]
  }
]
occurred
The problem comes from the query of the alert "HighlyAvailableWorkloadIncorrectlySpread": the metric "kube_pod_spec_volumes_persistentvolumeclaims_info" returns several series per (namespace, pod) pair (one series per PVC volume mounted by the pod), so the one-to-one join on (namespace, pod) fails, and each failed evaluation increments prometheus_rule_evaluation_failures_total, which is what the test checks. We can resolve the problem by replacing "kube_pod_spec_volumes_persistentvolumeclaims_info" with "max by(namespace, pod, workload) (kube_pod_spec_volumes_persistentvolumeclaims_info)", giving the following query:

    count without(node) (
      group by(node, workload, namespace) (
        kube_pod_info{node!=""}
        * on(namespace, pod) group_left(workload) (
          max by(namespace, pod, workload) (kube_pod_spec_volumes_persistentvolumeclaims_info)
          * on(namespace, pod) group_left(workload) (
            namespace_workload_pod:kube_pod_owner:relabel
            * on(namespace, workload, workload_type) group_left() (
              count without(pod) (namespace_workload_pod:kube_pod_owner:relabel{namespace=~"(openshift-.*|kube-.*|default)"}) > 1
            )
          )
        )
      )
    ) == 1
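To illustrate the failure mode in isolation: a join with on(...) and group_left allows many series only on the left-hand side, so Prometheus rejects the evaluation with "found duplicate series for the match group" when the right-hand side has more than one series per (namespace, pod). A minimal, simplified sketch of the pattern (not the full alert expression; labels reduced for readability):

    # Breaks when a pod mounts several PVCs (duplicate right-hand series):
    kube_pod_info * on(namespace, pod) group_left()
      kube_pod_spec_volumes_persistentvolumeclaims_info

    # Aggregating first guarantees one series per (namespace, pod):
    kube_pod_info * on(namespace, pod) group_left()
      max by(namespace, pod) (kube_pod_spec_volumes_persistentvolumeclaims_info)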
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056