Bug 1779324
| Summary: | flake: prometheus_rule_evaluation_failures_total >= 1 had reported incorrect results | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.0 | CC: | alegrand, anpicker, erooth, jcallen, kakkoyun, lcosic, mloibl, pkrupa, spasquie, surbania |
| Target Milestone: | --- | | |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: evaluation of the "namespace:kube_pod_container_resource_requests_cpu_cores:sum" recording rule might occasionally fail. Consequence: the "namespace:kube_pod_container_resource_requests_cpu_cores:sum" metric is missing. Fix: the recording rule's expression has been fixed. Result: the recording rule always evaluates successfully. (See the query sketch after this table.) | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-13 21:53:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
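As context for the Doc Text above, a minimal sketch of PromQL queries one might run against the affected Prometheus instance (assuming console or API access; each query is run separately) to observe the symptom. These queries are illustrative and not part of the original report:

```
# Any recent rule evaluation failures -- the condition the flaky e2e test asserts on:
increase(prometheus_rule_evaluation_failures_total[1h]) >= 1

# The recording rule's output series; an empty result is the "missing metric"
# consequence described in the Doc Text:
namespace:kube_pod_container_resource_requests_cpu_cores:sum
```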
Description
W. Trevor King
2019-12-03 18:42:43 UTC
At some point, prometheus-k8s-0 had 2 timeseries in kube_pod_labels for the same pod:

```
kube_pod_labels{endpoint="https-main", instance="10.128.2.5:8443", job="kube-state-metrics", label_apiserver="true", label_app="openshift-kube-apiserver", label_revision="6", namespace="openshift-kube-apiserver", pod="kube-apiserver-ip-10-0-137-175.ec2.internal", service="kube-state-metrics"} 1
kube_pod_labels{endpoint="https-main", instance="10.128.2.5:8443", job="kube-state-metrics", label_apiserver="true", label_app="openshift-kube-apiserver", label_revision="4", namespace="openshift-kube-apiserver", pod="kube-apiserver-ip-10-0-137-175.ec2.internal", service="kube-state-metrics"} 1
```

The only difference is the "label_revision" value. FWIW, prometheus-k8s-1 doesn't exhibit the same issue. I've looked at the Prometheus data [1] from the test, but it doesn't contain the faulty metrics (not sure which Prometheus pod it corresponds to).

To get rid of the many-to-many errors, we need to aggregate kube_pod_labels explicitly on (namespace, pod, label_name):

```
sum by(namespace, label_name) (
    sum by(namespace, pod) (
        kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
          * on(endpoint, instance, job, namespace, pod, service) group_left(phase)
            (kube_pod_status_phase{phase=~"Pending|Running"} == 1)
    )
  * on(namespace, pod) group_left(label_name)
    max by (namespace, pod, label_name) (kube_pod_labels{job="kube-state-metrics"})
)
```

[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.3/700/artifacts/e2e-aws-fips/metrics/prometheus.tar

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
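As a hedged verification sketch (not part of the original comment), the duplicate-series condition and the effect of the max by aggregation can be illustrated with simplified queries, run one at a time. The joins below use kube_pod_container_resource_requests_cpu_cores directly rather than the full recording-rule expression:

```
# Pods for which kube-state-metrics exports more than one kube_pod_labels series
# (the duplicate-series condition behind the many-to-many error):
count by (namespace, pod) (kube_pod_labels{job="kube-state-metrics"}) > 1

# Simplified join: if the right-hand side has duplicate series per (namespace, pod),
# Prometheus rejects this as "many-to-many matching not allowed":
kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
  * on(namespace, pod) group_left(label_name)
    kube_pod_labels{job="kube-state-metrics"}

# Same join with duplicates collapsed first, mirroring the fixed expression:
kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
  * on(namespace, pod) group_left(label_name)
    max by (namespace, pod, label_name) (kube_pod_labels{job="kube-state-metrics"})
```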