Bug 1779324 - flake: prometheus_rule_evaluation_failures_total >= 1 had reported incorrect results
Summary: flake: prometheus_rule_evaluation_failures_total >= 1 had reported incorrect results
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.4.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-03 18:42 UTC by W. Trevor King
Modified: 2020-05-13 21:53 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: the evaluation of the "namespace:kube_pod_container_resource_requests_cpu_cores:sum" recording rule might occasionally fail.
Consequence: the "namespace:kube_pod_container_resource_requests_cpu_cores:sum" metric is missing.
Fix: the recording rule's expression has been fixed.
Result: the recording rule always evaluates successfully.
Clone Of:
Environment:
Last Closed: 2020-05-13 21:53:51 UTC
Target Upstream Version:
Embargoed:


Links:
- GitHub kubernetes-monitoring/kubernetes-mixin pull 306 (closed): Fix many-to-many errors with kube_pod_labels. Last updated: 2021-02-03 18:53:43 UTC
- GitHub openshift/cluster-monitoring-operator pull 588 (closed): jsonnet: bump kubernetes-mixin. Last updated: 2021-02-03 18:53:43 UTC
- Red Hat Product Errata RHBA-2020:0581. Last updated: 2020-05-13 21:53:53 UTC

Description W. Trevor King 2019-12-03 18:42:43 UTC
Failures like [1]:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:134]: Expected
    <map[string]error | len:1>: {
        "prometheus_rule_evaluation_failures_total >= 1": {
            s: "promQL query: prometheus_rule_evaluation_failures_total >= 1 had reported incorrect results: prometheus_rule_evaluation_failures_total{endpoint=\"web\", instance=\"10.131.0.14:9091\", job=\"prometheus-k8s\", namespace=\"openshift-monitoring\", pod=\"prometheus-k8s-0\", service=\"prometheus-k8s\"} => 1 @[1575377276.01]",
        },
    }
to be empty
...
failed: (8m0s) 2019-12-03T12:47:59 "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't have failing rules evaluation [Suite:openshift/conformance/parallel/minimal]"

We see ~2 of these a day [2], so the flake rate is not high, but it still impacts 4.3 release-informer success rates.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.3/700
[2]: https://search.svc.ci.openshift.org/?search=promQL+query%3A+prometheus_rule_evaluation_failures_total.*had+reported+incorrect+results&maxAge=336h&context=-1&type=all
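
For reference, the check that fails here boils down to asserting that the instant query below returns no samples; the query string is taken verbatim from the failure output (the assertion itself lives in prometheus_builds.go, per the trace above):

    # must return no samples on a healthy cluster
    prometheus_rule_evaluation_failures_total >= 1

    # a rate-based variant (a debugging suggestion, not what the test runs)
    # that shows whether failures are ongoing rather than a one-off counter
    # bump earlier in the run:
    rate(prometheus_rule_evaluation_failures_total[5m]) > 0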

Comment 2 Simon Pasquier 2019-12-04 08:40:09 UTC
At some point, prometheus-k8s-0 had two time series in kube_pod_labels for the same pod:

kube_pod_labels{endpoint="https-main", instance="10.128.2.5:8443", job="kube-state-metrics", label_apiserver="true", label_app="openshift-kube-apiserver", label_revision="6", namespace="openshift-kube-apiserver", pod="kube-apiserver-ip-10-0-137-175.ec2.internal", service="kube-state-metrics"} 1
kube_pod_labels{endpoint="https-main", instance="10.128.2.5:8443", job="kube-state-metrics", label_apiserver="true", label_app="openshift-kube-apiserver", label_revision="4", namespace="openshift-kube-apiserver", pod="kube-apiserver-ip-10-0-137-175.ec2.internal", service="kube-state-metrics"} 1

The only difference is the "label_revision" value. FWIW, prometheus-k8s-1 doesn't exhibit the same issue.
I've looked at the Prometheus data [1] from the test, but it doesn't contain the faulty metrics (I'm not sure which Prometheus pod it corresponds to).
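
With duplicate series like the two above, the rule's one-to-one join on (namespace, pod) no longer matches uniquely, so the evaluation fails with a many-to-many error. As a sketch, the pre-fix expression had roughly this shape (reconstructed by dropping the max aggregation from the fixed expression below; not necessarily the verbatim upstream rule):

    sum by (namespace, label_name) (
        sum by (namespace, pod) (
            kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
          * on (endpoint, instance, job, namespace, pod, service) group_left(phase)
            (kube_pod_status_phase{phase=~"Pending|Running"} == 1)
        )
        # one-to-one match: breaks when kube_pod_labels carries two series
        # for the same (namespace, pod), as shown above
      * on (namespace, pod) group_left(label_name)
        kube_pod_labels{job="kube-state-metrics"}
    )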

To get rid of the many-to-many errors, we need to aggregate kube_pod_labels explicitly on (namespace, pod, label_name):

sum by (namespace, label_name) (
    sum by (namespace, pod) (
        kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
      * on (endpoint, instance, job, namespace, pod, service) group_left(phase)
        (kube_pod_status_phase{phase=~"Pending|Running"} == 1)
    )
  * on (namespace, pod) group_left(label_name)
    max by (namespace, pod, label_name) (kube_pod_labels{job="kube-state-metrics"})
)
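
Applied to the two series above, the inner max collapses both label_revision variants into one series per (namespace, pod, label_name), so the join becomes many-to-one at worst (a minimal illustration of the dedup step on its own, not part of the shipped rule):

    # yields a single kube_pod_labels series per pod and label name
    max by (namespace, pod, label_name) (kube_pod_labels{job="kube-state-metrics"})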

[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.3/700/artifacts/e2e-aws-fips/metrics/prometheus.tar

Comment 9 errata-xmlrpc 2020-05-13 21:53:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

