Bug 1691893
| Summary: | KubeCPUOvercommit factors in Completed and Failed pods | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Matthew Robson <mrobson> |
| Component: | Monitoring | Assignee: | Matthias Loibl <mloibl> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.11.0 | CC: | adeshpan, anpicker, erooth, lserven, mloibl, nbhatt, steven.barre, surbania, zhuchkov.alex |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-06 02:00:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
This is a valid bug. We have a fix in place for memory requests, but not for CPU.

This bug has been patched in https://github.com/openshift/cluster-monitoring-operator/pull/304, which was just merged. - Matthias

PR has merged, so changing to MODIFIED.

The rules now apply only to the CPU and memory requests of pending and running pods.
Image: ose-cluster-monitoring-operator-v3.11.105-1

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:0794

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
Description of problem:

Prometheus creates a recording rule based on kube_pod_status_scheduled:

    record: namespace_name:kube_pod_container_resource_requests_cpu_cores:sum
    expr: sum by(namespace, label_name) (sum by(namespace, pod) (kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"} and on(pod) kube_pod_status_scheduled{condition="true"}) * on(namespace, pod) group_left(label_name) label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)"))

This record is then used in an alert:

    alert: KubeCPUOvercommit
    expr: sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum) / sum(node:node_num_cpu:sum) > (count(node:node_num_cpu:sum) - 1) / count(node:node_num_cpu:sum)
    for: 5m
    labels:
      severity: warning
    annotations:
      message: Overcommited CPU resource requests on Pods, cannot tolerate node failure.

In this case, kube_pod_status_scheduled factors in all pods with:

    status:
      conditions:
        type: PodScheduled

In the case of a build in OpenShift, all of the completed builds are included in the calculation because they have a PodScheduled status condition. The rule does not take into account whether the pod is actually Running, Completed or Failed:

    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: 2019-03-21T20:34:19Z
        reason: PodCompleted
        status: "True"
        type: Initialized
      - lastProbeTime: null
        lastTransitionTime: 2019-03-21T20:34:31Z
        reason: PodCompleted
        status: "False"
        type: Ready
      - lastProbeTime: null
        lastTransitionTime: null
        reason: PodCompleted
        status: "False"
        type: ContainersReady
      - lastProbeTime: null
        lastTransitionTime: 2019-03-21T20:33:35Z
        status: "True"
        type: PodScheduled

    # oc -n devhub-tools get pods
    NAME                           READY   STATUS      RESTARTS   AGE
    devhub-pr-458-4-build          0/1     Completed   0          1d
    devhub-pr-460-2-build          0/1     Completed   0          7d
    devhub-pr-473-1-build          0/1     Error       0          6d
    devhub-pr-473-2-build          0/1     Completed   0          6d
    devhub-pr-473-3-build          0/1     Completed   0          3d
    devhub-pr-487-3-build          0/1     Completed   0          23h
    devhub-pr-488-1-build          0/1     Completed   0          22h
    devhub-pr-488-2-build          0/1     Completed   0          21h
    devhub-pr-489-1-build          0/1     Completed   0          21h
    devhub-pr-489-2-build          0/1     Completed   0          21h
    devhub-pr-490-1-build          0/1     Completed   0          4h
    devhub-pr-490-2-build          0/1     Error       0          3h
    devhub-pr-490-3-build          0/1     Completed   0          3h
    devhub-pr-490-4-build          0/1     Completed   0          47m
    devhub-pr-491-1-build          0/1     Completed   0          58m
    devhub-pr-491-2-build          0/1     Completed   0          34m
    devhub-static-pr-458-3-build   0/1     Completed   0          1d
    devhub-static-pr-460-1-build   0/1     Completed   0          7d
    devhub-static-pr-473-1-build   0/1     Completed   0          6d
    devhub-static-pr-491-1-build   0/1     Completed   0          49m
    devhub-static-pr-491-2-build   0/1     Completed   0          24m

Version-Release number of selected component (if applicable):
OCP 3.11

How reproducible:
Always

Steps to Reproduce:
1. Run a number of builds and look at the overcommit ratio.

Actual results:
Inaccurate overcommit numbers and alerts.

Expected results:
Overcommit should be representative of what is actually being used and should not factor in completed or failed pods.

Additional info:
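To make the arithmetic behind the alert concrete, here is a minimal Python sketch of the KubeCPUOvercommit comparison: total requested CPU divided by total node CPU, checked against (nodes - 1) / nodes. It is only an illustration, not the operator's implementation. The `Pod` class, the function names, the node sizes and the per-pod requests are all hypothetical values chosen for the example. It assumes that pods shown as Completed or Error by `oc get pods` report the phases Succeeded and Failed respectively.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    phase: str          # "Pending", "Running", "Succeeded" or "Failed"
    cpu_request: float  # CPU cores requested by the pod's containers


# Phases that still hold a claim on node CPU.
ACTIVE_PHASES = {"Pending", "Running"}


def overcommit_ratio(pods, node_cpus, active_only):
    """Requested CPU divided by total node CPU.

    With active_only=False every pod is counted, which mirrors the
    behaviour reported in this bug. With active_only=True only Pending
    and Running pods are counted.
    """
    requested = sum(
        p.cpu_request
        for p in pods
        if not active_only or p.phase in ACTIVE_PHASES
    )
    return requested / sum(node_cpus)


def alert_fires(pods, node_cpus, active_only):
    """Apply the alert threshold: ratio > (nodes - 1) / nodes."""
    n = len(node_cpus)
    threshold = (n - 1) / n
    return overcommit_ratio(pods, node_cpus, active_only) > threshold


# Hypothetical cluster: three nodes with 4 cores each (threshold 2/3).
node_cpus = [4, 4, 4]

# Hypothetical workload: a few live pods plus finished build pods.
pods = [
    Pod("app-1", "Running", 2.0),
    Pod("app-2", "Running", 2.0),
    Pod("app-3", "Pending", 1.0),
]
pods += [Pod(f"build-{i}", "Succeeded", 0.5) for i in range(8)]
pods.append(Pod("build-err", "Failed", 0.5))

print("all pods:    fires =", alert_fires(pods, node_cpus, active_only=False))
print("active only: fires =", alert_fires(pods, node_cpus, active_only=True))
```

In this made-up cluster the live pods request 5 of 12 cores, well under the two-thirds threshold, yet counting the nine finished build pods raises the total to 9.5 cores and trips the alert.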