Bug 1920159

Summary:

Shrink the default request reservation of some components in the cluster based on input from testing

Product:

OpenShift Container Platform

Reporter:

Clayton Coleman <ccoleman>

Component:

kube-apiserver

Assignee:

Clayton Coleman <ccoleman>

Status:

CLOSED NOTABUG

QA Contact:

Ke Wang <kewang>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

4.7

CC:

aos-bugs, dhellmann, ijolliff, mfojtik, mpatel, rfreiman, sttts, wking, wlewis, xxia

Target Milestone:

---

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2022-02-25 15:40:44 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
cpu request overage over a week on build01	none
build01 week cpu overage for kcm	none

Description Clayton Coleman 2021-01-25 17:06:50 UTC

A number of components requests could be tuned downwards a small amount (less than 20 or 30%) without impacting the ability of the cluster to function in high density environments due to improvements we have made to the CPU and memory usage of the product over releases.

Tuning these down gives small clusters (single-node and compact) more headroom and are still realistic given the data we have. This tuning would take into account the principles described in https://bugzilla.redhat.com/show_bug.cgi?id=1812583 and codified into https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#resources-and-limits.

This bug will track measuring the current effective usage in single-node, e2e, and upgrade in terms of proportionality of 

 Operator
Request (mcpus)
etcd 440
kube-apiserver 350
network 240
monitoring 232
authentication 180
ingress 120
openshift-apiserver 120
kube-controller-manager 110
kube-storage-version-migrator 110
openshift-controller-manager 110
machine-config 100
dns 85
console 50
machine-api 50

against each other and their actual usage in e2e runs. We will be looking for components that significantly over request during periods of high rate of change OR high rate traffic (i.e. the p90-p99(cpu) over a run, measured against other components).

We will be looking to concretely reduce the single node master contribution of components by 200mc in total, which should be achievable.

Notes

* kube-storage-version-migrator is likely oversubscribed
* monitoring will be handled separately
* authentication is likely oversubscribed substantially
* dns and network may need to be tweaked slightly.

Comment 1 Stefan Schimanski 2021-01-26 08:39:44 UTC

> kube-storage-version-migrator is likely oversubscribed

The operator should scale down the migrator if there is no active migration CR around. Note that it needs some memory to keep pages of objects in memory.

Comment 2 Clayton Coleman 2021-01-27 18:25:15 UTC

On the prod build01 and app.ci clusters, etcd requested usage is consistently below the request:

sort_desc(max without(container,endpoint,exported_namespace,exported_pod,instance,job,service,scheduler,priority,unit,resource) (label_replace(label_replace(kube_pod_resource_request{resource="cpu",exported_namespace!~"ci-op-.*|ci",exported_namespace="openshift-etcd"}, "pod", "$1", "exported_pod", "(.*)"), "namespace", "$1", "exported_namespace", "(.*)")) - on (pod,namespace) group_left() max by (pod,namespace) (rate(container_cpu_usage_seconds_total{container="",namespace="openshift-etcd"}[60m]))) * 1000

On average etcd used ~200mc less than requested, but max was 100mc less than requested and min was 300mc less than requested over a one week period.

I think we can tune each pod down by 100mc in 4.7

Comment 3 Clayton Coleman 2021-01-27 18:26:34 UTC

Etcd on app.ci in 4.6.13 (which fixed the issues with IO starvation from the BRQ scheduler changed) is roughly similar to build01 - the minimum gap between request and usage is about 100mc under, so we it should be safe to reduce.

Comment 4 Clayton Coleman 2021-01-27 18:34:28 UTC

Created attachment 1751335 [details]
cpu request overage over a week on build01

Comment 5 Clayton Coleman 2021-01-27 18:35:56 UTC

CPU usage of kube-apiserver is less clear cut, varies between 500mc under for all three to -200mc or more over.  The attached graph captures some of this count, I think a 50mc per pod reduction for kube-apiserver total would be reasonable and there is still plenty of room for burst.

Comment 6 Clayton Coleman 2021-01-27 18:40:16 UTC

Created attachment 1751340 [details]
build01 week cpu overage for kcm

The kube-controller-manager has a smaller request but even on loaded clusters is more efficient than it used to be.  Peak usage on app.ci (heavy operator load) is 50mc, peak usage on build01 is about 0.9 but only for specific intervals.  I think setting request closer to actual usage in these bulky clusters by removing 20mc of request is reasonable.

Comment 7 Clayton Coleman 2021-01-27 18:45:05 UTC

> The operator should scale down the migrator if there is no active migration CR around. Note that it needs some memory to keep pages of objects in memory.

Should as in "this is implemented today"?

Right now I see a single deployment with a 100mc core request which is not appropriate.  Burst is still enabled, even as is today, so a 10mc request is more accurate.

Comment 8 Clayton Coleman 2021-01-27 18:46:36 UTC

Altogether the minimal cpu request tunings above would remove 270mc of reservation and still give production clusters reasonable proportionality as well as headroom.

Comment 9 Clayton Coleman 2021-01-28 04:05:03 UTC

OVS is still requesting 100m in pods (it's actually running on the node), but actual usage the script logic in both 4.6 and 4.7 is roughly 5m.  Reducing that down to 15m to allow some headroom - 4.8 will remove that daemonset.

Comment 10 Clayton Coleman 2021-01-28 04:18:28 UTC

Combined test in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1354643623982927872

Comment 11 Clayton Coleman 2021-01-28 19:18:15 UTC

Second combined test also passed https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1354850792707723264

First had a flake that we pinned down to being a test issue https://github.com/openshift/origin/pull/25834.

I think this looks clean, releasing the holds.

Comment 13 Ke Wang 2021-02-05 08:34:46 UTC

Hi ccoleman, I found there are still https://github.com/openshift/cluster-kube-apiserver-operator/pull/1032, https://github.com/openshift/cluster-kube-controller-manager-operator/pull/500 and https://github.com/openshift/cluster-network-operator/pull/963 which have not been merged. Would you please merge them? I assigned the bug back first, when they are merged, will go on versification.