Description of problem:

Prometheus creates a recording rule based on kube_pod_status_scheduled:

record: namespace_name:kube_pod_container_resource_requests_cpu_cores:sum
expr: sum by(namespace, label_name) (sum by(namespace, pod) (kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"} and on(pod) kube_pod_status_scheduled{condition="true"}) * on(namespace, pod) group_left(label_name) label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)"))

which is then used in an alert:

alert: KubeCPUOvercommit
expr: sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum) / sum(node:node_num_cpu:sum) > (count(node:node_num_cpu:sum) - 1) / count(node:node_num_cpu:sum)
for: 5m
labels:
  severity: warning
annotations:
  message: Overcommited CPU resource requests on Pods, cannot tolerate node failure.

In this case, kube_pod_status_scheduled factors in every pod that has:

status:
  conditions:
  - type: PodScheduled

In the case of a build in OpenShift, all of the completed builds are included in the calculation because they have a PodScheduled status condition; the rule does not factor in the additional pod status of Running, Completed, or Failed:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-03-21T20:34:19Z
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2019-03-21T20:34:31Z
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: null
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2019-03-21T20:33:35Z
    status: "True"
    type: PodScheduled

# oc -n devhub-tools get pods
NAME                           READY   STATUS      RESTARTS   AGE
devhub-pr-458-4-build          0/1     Completed   0          1d
devhub-pr-460-2-build          0/1     Completed   0          7d
devhub-pr-473-1-build          0/1     Error       0          6d
devhub-pr-473-2-build          0/1     Completed   0          6d
devhub-pr-473-3-build          0/1     Completed   0          3d
devhub-pr-487-3-build          0/1     Completed   0          23h
devhub-pr-488-1-build          0/1     Completed   0          22h
devhub-pr-488-2-build          0/1     Completed   0          21h
devhub-pr-489-1-build          0/1     Completed   0          21h
devhub-pr-489-2-build          0/1     Completed   0          21h
devhub-pr-490-1-build          0/1     Completed   0          4h
devhub-pr-490-2-build          0/1     Error       0          3h
devhub-pr-490-3-build          0/1     Completed   0          3h
devhub-pr-490-4-build          0/1     Completed   0          47m
devhub-pr-491-1-build          0/1     Completed   0          58m
devhub-pr-491-2-build          0/1     Completed   0          34m
devhub-static-pr-458-3-build   0/1     Completed   0          1d
devhub-static-pr-460-1-build   0/1     Completed   0          7d
devhub-static-pr-473-1-build   0/1     Completed   0          6d
devhub-static-pr-491-1-build   0/1     Completed   0          49m
devhub-static-pr-491-2-build   0/1     Completed   0          24m

Version-Release number of selected component (if applicable):
OCP 3.11

How reproducible:
Always

Steps to Reproduce:
1. Run a number of builds and look at the overcommit.

Actual results:
Inaccurate overcommit numbers and alerts.

Expected results:
Overcommit should be representative of what is actually being used and not factor in completed or failed pods.

Additional info:
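For reference, one way to exclude completed and failed pods from the recording rule is to join against kube_pod_status_phase (restricted to the Pending and Running phases) rather than kube_pod_status_scheduled. This is an illustrative sketch only, not necessarily the exact expression that was merged:

```yaml
# Sketch: restrict the CPU-request sum to pods whose phase is Pending or
# Running, so Completed/Error build pods drop out of the overcommit math.
record: namespace_name:kube_pod_container_resource_requests_cpu_cores:sum
expr: |
  sum by(namespace, label_name) (
    sum by(namespace, pod) (
      kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
      and on(namespace, pod)
      kube_pod_status_phase{job="kube-state-metrics",phase=~"Pending|Running"} == 1
    )
    * on(namespace, pod) group_left(label_name)
    label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
  )
```

A pod's phase moves to Succeeded or Failed once its containers terminate, so filtering on phase (unlike the PodScheduled condition, which remains "True" after completion) naturally drops finished build pods.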
This is a valid bug; we have a fix in place for memory requests but not for CPU.
This bug has been patched in https://github.com/openshift/cluster-monitoring-operator/pull/304 which was just merged. - Matthias
PR has merged, so changing to MODIFIED.
Verified: the rules now apply only to pending and running pods for CPU & memory requests. Image: ose-cluster-monitoring-operator-v3.11.105-1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0794