Description of problem:

Prometheus creates a recording rule based on kube_pod_status_scheduled:

record: namespace_name:kube_pod_container_resource_requests_cpu_cores:sum
expr: sum by(namespace, label_name) (sum by(namespace, pod) (kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"} and on(pod) kube_pod_status_scheduled{condition="true"}) * on(namespace, pod) group_left(label_name) label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)"))

which is then used in an alert:

alert: KubeCPUOvercommit
expr: sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum) / sum(node:node_num_cpu:sum) > (count(node:node_num_cpu:sum) - 1) / count(node:node_num_cpu:sum)
for: 5m
labels:
  severity: warning
annotations:
  message: Overcommited CPU resource requests on Pods, cannot tolerate node failure.

In this case, kube_pod_status_scheduled factors in every pod that has:

status:
  conditions:
  - type: PodScheduled

In the case of a build in OpenShift, all of the completed builds are included in the calculation because they have a PodScheduled status condition; the rule does not factor in the additional pod status of Running, Completed, or Failed:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-03-21T20:34:19Z
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2019-03-21T20:34:31Z
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: null
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2019-03-21T20:33:35Z
    status: "True"
    type: PodScheduled

# oc -n devhub-tools get pods
NAME                           READY   STATUS      RESTARTS   AGE
devhub-pr-458-4-build          0/1     Completed   0          1d
devhub-pr-460-2-build          0/1     Completed   0          7d
devhub-pr-473-1-build          0/1     Error       0          6d
devhub-pr-473-2-build          0/1     Completed   0          6d
devhub-pr-473-3-build          0/1     Completed   0          3d
devhub-pr-487-3-build          0/1     Completed   0          23h
devhub-pr-488-1-build          0/1     Completed   0          22h
devhub-pr-488-2-build          0/1     Completed   0          21h
devhub-pr-489-1-build          0/1     Completed   0          21h
devhub-pr-489-2-build          0/1     Completed   0          21h
devhub-pr-490-1-build          0/1     Completed   0          4h
devhub-pr-490-2-build          0/1     Error       0          3h
devhub-pr-490-3-build          0/1     Completed   0          3h
devhub-pr-490-4-build          0/1     Completed   0          47m
devhub-pr-491-1-build          0/1     Completed   0          58m
devhub-pr-491-2-build          0/1     Completed   0          34m
devhub-static-pr-458-3-build   0/1     Completed   0          1d
devhub-static-pr-460-1-build   0/1     Completed   0          7d
devhub-static-pr-473-1-build   0/1     Completed   0          6d
devhub-static-pr-491-1-build   0/1     Completed   0          49m
devhub-static-pr-491-2-build   0/1     Completed   0          24m

Version-Release number of selected component (if applicable):
OCP 3.11

How reproducible:
Always

Steps to Reproduce:
1. Run a number of builds and look at the overcommit.

Actual results:
Inaccurate overcommit numbers and alerts.

Expected results:
Overcommit should be representative of what is actually being used and not factor in completed or failed pods.

Additional info:
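For reference, one way to exclude completed and failed pods from the recording rule is to join against kube_pod_status_phase (restricted to the Pending and Running phases) rather than kube_pod_status_scheduled. This is an illustrative sketch only, not necessarily the exact expression that was merged:

```yaml
# Sketch: restrict the CPU-request sum to pods whose phase is Pending or
# Running, so Completed/Error build pods drop out of the overcommit math.
record: namespace_name:kube_pod_container_resource_requests_cpu_cores:sum
expr: |
  sum by(namespace, label_name) (
    sum by(namespace, pod) (
      kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
      and on(namespace, pod)
      kube_pod_status_phase{job="kube-state-metrics",phase=~"Pending|Running"} == 1
    )
    * on(namespace, pod) group_left(label_name)
    label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
  )
```

A pod's phase moves to Succeeded or Failed once its containers terminate, so filtering on phase (unlike the PodScheduled condition, which remains "True" after completion) naturally drops finished build pods.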
This is a valid bug; we have a fix in place for memory requests but not for CPU.
This bug has been patched in https://github.com/openshift/cluster-monitoring-operator/pull/304 which was just merged. - Matthias
PR has merged, so changing to MODIFIED.
Verified: the rules now apply only to pending and running pods for CPU & memory requests. Image: ose-cluster-monitoring-operator-v3.11.105-1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0794