Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1691893

Summary: KubeCPUOvercommit factors in Completed and Failed pods
Product: OpenShift Container Platform
Component: Monitoring
Version: 3.11.0
Target Release: 3.11.z
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Type: Bug
Reporter: Matthew Robson <mrobson>
Assignee: Matthias Loibl <mloibl>
QA Contact: Junqi Zhao <juzhao>
CC: adeshpan, anpicker, erooth, lserven, mloibl, nbhatt, steven.barre, surbania, zhuchkov.alex
Last Closed: 2019-06-06 02:00:29 UTC

Description Matthew Robson 2019-03-22 18:49:21 UTC
Description of problem:

Prometheus creates a recording rule based on kube_pod_status_scheduled:

record: namespace_name:kube_pod_container_resource_requests_cpu_cores:sum
expr: |
  sum by(namespace, label_name) (
      sum by(namespace, pod) (
          kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
        and on(pod) kube_pod_status_scheduled{condition="true"}
      )
    * on(namespace, pod) group_left(label_name)
      label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
  )

This record is then used in an alert:

alert: KubeCPUOvercommit
expr: sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum)
  / sum(node:node_num_cpu:sum) > (count(node:node_num_cpu:sum) - 1) / count(node:node_num_cpu:sum)
for: 5m
labels:
  severity: warning
annotations:
  message: Overcommited CPU resource requests on Pods, cannot tolerate node failure.
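For context, the alert's right-hand side, (count - 1) / count, is the fraction of cluster CPU that would remain after losing one node; the alert fires when total requests exceed what the remaining nodes could hold. A small illustration with hypothetical numbers (Python):

```python
# The alert fires when the cluster's total CPU requests exceed the capacity
# of N-1 nodes, i.e. the workload could not be rescheduled after one node failure.
# All numbers below are hypothetical, for illustration only.
nodes = 4
cores_per_node = 8
total_cores = nodes * cores_per_node              # 32 cores in the cluster

threshold_fraction = (nodes - 1) / nodes          # (4 - 1) / 4 = 0.75
requested_cores = 26                              # sum of all pod CPU requests

overcommitted = requested_cores / total_cores > threshold_fraction
print(threshold_fraction, overcommitted)          # prints: 0.75 True
```

With completed and failed pods inflating requested_cores, this comparison can cross the threshold even though the live workload is well within capacity.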


In this case, kube_pod_status_scheduled factors in every pod that has:

status:
  conditions:
    type: PodScheduled

For OpenShift builds, all completed build pods are included in the calculation because they retain a PodScheduled condition even after finishing; the rule does not account for whether a pod is still Running or has already Completed or Failed:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-03-21T20:34:19Z
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2019-03-21T20:34:31Z
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: null
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2019-03-21T20:33:35Z
    status: "True"
    type: PodScheduled
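One way to express the intended behavior is to join against kube_pod_status_phase (also exported by kube-state-metrics) instead of kube_pod_status_scheduled, so that only Pending and Running pods are counted. This is a sketch of the idea, not necessarily the exact expression adopted in the eventual fix:

```yaml
# Sketch: restrict the request sum to pods whose phase is Pending or Running,
# so Completed and Failed pods no longer inflate the total.
record: namespace_name:kube_pod_container_resource_requests_cpu_cores:sum
expr: |
  sum by(namespace, label_name) (
      sum by(namespace, pod) (
          kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
        and on(pod) kube_pod_status_phase{phase=~"Pending|Running"} == 1
      )
    * on(namespace, pod) group_left(label_name)
      label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
  )
```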


# oc -n devhub-tools get pods
NAME                                 READY     STATUS      RESTARTS   AGE
devhub-pr-458-4-build                0/1       Completed   0          1d
devhub-pr-460-2-build                0/1       Completed   0          7d
devhub-pr-473-1-build                0/1       Error       0          6d
devhub-pr-473-2-build                0/1       Completed   0          6d
devhub-pr-473-3-build                0/1       Completed   0          3d
devhub-pr-487-3-build                0/1       Completed   0          23h
devhub-pr-488-1-build                0/1       Completed   0          22h
devhub-pr-488-2-build                0/1       Completed   0          21h
devhub-pr-489-1-build                0/1       Completed   0          21h
devhub-pr-489-2-build                0/1       Completed   0          21h
devhub-pr-490-1-build                0/1       Completed   0          4h
devhub-pr-490-2-build                0/1       Error       0          3h
devhub-pr-490-3-build                0/1       Completed   0          3h
devhub-pr-490-4-build                0/1       Completed   0          47m
devhub-pr-491-1-build                0/1       Completed   0          58m
devhub-pr-491-2-build                0/1       Completed   0          34m
devhub-static-pr-458-3-build         0/1       Completed   0          1d
devhub-static-pr-460-1-build         0/1       Completed   0          7d
devhub-static-pr-473-1-build         0/1       Completed   0          6d
devhub-static-pr-491-1-build         0/1       Completed   0          49m
devhub-static-pr-491-2-build         0/1       Completed   0          24m


Version-Release number of selected component (if applicable):

OCP 3.11

How reproducible:

Always


Steps to Reproduce:
1. Run a number of builds, then inspect the recorded CPU request sum and the KubeCPUOvercommit alert.

Actual results:

Inaccurate overcommit numbers and alerts.

Expected results:

Overcommit should be representative of what is actually being requested by live workloads, and should not factor in completed or failed pods.


Additional info:

Comment 1 lserven 2019-03-27 16:34:10 UTC
This is a valid bug; we have a fix in place for memory requests but not yet for CPU.

Comment 2 Matthias Loibl 2019-04-03 13:18:57 UTC
This bug has been patched in https://github.com/openshift/cluster-monitoring-operator/pull/304 which was just merged.

- Matthias

Comment 3 Andrew Pickering 2019-04-04 06:09:18 UTC
PR has merged, so changing to MODIFIED.

Comment 5 Junqi Zhao 2019-04-16 06:48:37 UTC
Verified that the rules now only apply to CPU & memory requests of pending and running pods.
image: ose-cluster-monitoring-operator-v3.11.105-1
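To spot-check the fix, one can query for request series that the old rule would have counted but that belong to pods which are no longer Pending or Running; with the corrected rules this set should not contribute to the recorded sum. A sketch, using the standard kube-state-metrics metric names:

```yaml
# Pods counted by the old scheduled-based join but not in phase Pending/Running:
sum by(namespace, pod) (
    kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
  and on(pod) kube_pod_status_scheduled{condition="true"}
) unless on(pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)
```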

Comment 8 errata-xmlrpc 2019-06-06 02:00:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0794

Comment 9 Red Hat Bugzilla 2023-09-14 05:25:50 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.