Bug 1920159 - Shrink the default request reservation of some components in the cluster based on input from testing
Summary: Shrink the default request reservation of some components in the cluster based on input from testing
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Clayton Coleman
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-25 17:06 UTC by Clayton Coleman
Modified: 2022-02-25 15:40 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-25 15:40:44 UTC
Target Upstream Version:
Embargoed:


Attachments
cpu request overage over a week on build01 (132.03 KB, image/png) - 2021-01-27 18:34 UTC, Clayton Coleman
build01 week cpu overage for kcm (189.39 KB, image/png) - 2021-01-27 18:40 UTC, Clayton Coleman


Links
Github openshift/cluster-etcd-operator pull 535 (Merged): Bug 1920159: CPU requests overstate actual needs, last updated 2022-02-25 15:37:51 UTC
Github openshift/cluster-kube-apiserver-operator pull 1032 (Merged): Bug 1920159: kube-apiservers overstate steady-state CPU needs slightly, last updated 2022-02-25 15:37:51 UTC
Github openshift/cluster-kube-controller-manager-operator pull 500 (Merged): Bug 1920159: Adjust CPU request for controller manager more precisely, last updated 2022-02-25 15:37:50 UTC
Github openshift/cluster-kube-storage-version-migrator-operator pull 41 (Merged): Bug 1920159: CPU request for migrator should not be higher than average use, last updated 2022-02-25 15:37:50 UTC
Github openshift/cluster-network-operator pull 963 (Merged): Bug 1920159: Reduce CPU requests of ovs daemonset, last updated 2022-02-25 15:37:49 UTC

Description Clayton Coleman 2021-01-25 17:06:50 UTC
The requests of a number of components could be tuned down a small amount (less than 20 or 30%) without impacting the ability of the cluster to function in high-density environments, due to improvements we have made to the CPU and memory usage of the product over releases.

Tuning these down gives small clusters (single-node and compact) more headroom, and the reduced values are still realistic given the data we have. This tuning would take into account the principles described in https://bugzilla.redhat.com/show_bug.cgi?id=1812583 and codified in https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#resources-and-limits.

This bug will track measuring the current effective usage in single-node, e2e, and upgrade runs, in terms of the proportionality of the current requests

Operator                        Request (mcpus)
etcd                            440
kube-apiserver                  350
network                         240
monitoring                      232
authentication                  180
ingress                         120
openshift-apiserver             120
kube-controller-manager         110
kube-storage-version-migrator   110
openshift-controller-manager    110
machine-config                  100
dns                              85
console                          50
machine-api                      50

against each other and against their actual usage in e2e runs. We will be looking for components that significantly over-request during periods of a high rate of change OR high-rate traffic (i.e. the p90-p99 of CPU over a run, measured against other components).

We will be looking to concretely reduce the single-node master contribution of these components by 200mc in total, which should be achievable.
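
As a rough sketch of how that comparison can be made (the metric names, namespace pattern, and windows below are illustrative, and this uses the kube-state-metrics request metric rather than the scheduler-exported kube_pod_resource_request series):

# p99 of pod-level CPU usage per namespace over a one-hour run, at 1m resolution, in millicores
quantile_over_time(0.99, sum by (namespace) (rate(container_cpu_usage_seconds_total{container="",namespace=~"openshift-.*"}[5m]))[1h:1m]) * 1000

# the corresponding per-namespace CPU requests, in millicores, for side-by-side comparison
sum by (namespace) (kube_pod_container_resource_requests{resource="cpu",namespace=~"openshift-.*"}) * 1000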

Notes

* kube-storage-version-migrator is likely oversubscribed
* monitoring will be handled separately
* authentication is likely oversubscribed substantially
* dns and network may need to be tweaked slightly.

Comment 1 Stefan Schimanski 2021-01-26 08:39:44 UTC
> kube-storage-version-migrator is likely oversubscribed

The operator should scale down the migrator if there is no active migration CR around. Note that it needs some memory to keep pages of objects in memory.

Comment 2 Clayton Coleman 2021-01-27 18:25:15 UTC
On the production build01 and app.ci clusters, etcd's actual usage is consistently below its request:

sort_desc(
  max without(container,endpoint,exported_namespace,exported_pod,instance,job,service,scheduler,priority,unit,resource) (
    label_replace(
      label_replace(
        kube_pod_resource_request{resource="cpu",exported_namespace!~"ci-op-.*|ci",exported_namespace="openshift-etcd"},
        "pod", "$1", "exported_pod", "(.*)"
      ),
      "namespace", "$1", "exported_namespace", "(.*)"
    )
  )
  - on (pod,namespace) group_left()
    max by (pod,namespace) (rate(container_cpu_usage_seconds_total{container="",namespace="openshift-etcd"}[60m]))
) * 1000

On average etcd used ~200mc less than requested; even at peak usage it was still 100mc under the request, and at minimum usage it was 300mc under, over a one-week period.

I think we can tune each etcd pod down by 100mc in 4.7.
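
One way to read the weekly min/avg/max of that gap directly is a *_over_time over the difference; a simplified sketch follows (it uses the kube-state-metrics request metric and drops the label_replace plumbing, so treat the exact join as an assumption):

# weekly minimum of (request - usage) per etcd pod, in millicores;
# swap min_over_time for avg_over_time / max_over_time to get the average and largest gap
min_over_time(
  (
    sum by (pod) (kube_pod_container_resource_requests{namespace="openshift-etcd",resource="cpu"})
    - sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="openshift-etcd",container=""}[60m]))
  )[7d:15m]
) * 1000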

Comment 3 Clayton Coleman 2021-01-27 18:26:34 UTC
Etcd on app.ci in 4.6.13 (which fixed the issues with IO starvation from the BFQ scheduler change) is roughly similar to build01 - the minimum gap between request and usage is about 100mc, so it should be safe to reduce.

Comment 4 Clayton Coleman 2021-01-27 18:34:28 UTC
Created attachment 1751335 [details]
cpu request overage over a week on build01

Comment 5 Clayton Coleman 2021-01-27 18:35:56 UTC
CPU usage of kube-apiserver is less clear cut; it varies between 500mc under the request for all three pods combined and 200mc or more over it. The attached graph captures some of this. I think a 50mc per-pod reduction for kube-apiserver would be reasonable, and there is still plenty of room for burst.

Comment 6 Clayton Coleman 2021-01-27 18:40:16 UTC
Created attachment 1751340 [details]
build01 week cpu overage for kcm

The kube-controller-manager has a smaller request, but even on loaded clusters it is more efficient than it used to be. Peak usage on app.ci (heavy operator load) is 50mc; peak usage on build01 is about 0.9, but only for specific intervals. I think moving the request closer to actual usage on these bulky clusters by removing 20mc of request is reasonable.

Comment 7 Clayton Coleman 2021-01-27 18:45:05 UTC
> The operator should scale down the migrator if there is no active migration CR around. Note that it needs some memory to keep pages of objects in memory.

Should as in "this is implemented today"?

Right now I see a single deployment with a 100mc CPU request, which is not appropriate. Bursting is still possible, even as things stand today, so a 10mc request is more accurate.

Comment 8 Clayton Coleman 2021-01-27 18:46:36 UTC
Altogether the minimal cpu request tunings above would remove 270mc of reservation and still give production clusters reasonable proportionality as well as headroom.

Comment 9 Clayton Coleman 2021-01-28 04:05:03 UTC
OVS is still requesting 100m in pods (it's actually running on the node), but the actual usage of the script logic in both 4.6 and 4.7 is roughly 5m. Reducing that down to 15m to allow some headroom - 4.8 will remove that daemonset.
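
To sanity-check that on a running cluster, a query along these lines works (the namespace and pod name pattern here are assumptions based on the openshift-sdn layout; OVN clusters place this differently):

# peak pod-level CPU of the ovs daemonset over a week, in millicores
max_over_time(
  (sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="openshift-sdn",pod=~"ovs-.*",container=""}[5m])))[7d:5m]
) * 1000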

Comment 11 Clayton Coleman 2021-01-28 19:18:15 UTC
Second combined test also passed https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1354850792707723264

The first run had a flake that we pinned down to a test issue: https://github.com/openshift/origin/pull/25834.

I think this looks clean, releasing the holds.

Comment 13 Ke Wang 2021-02-05 08:34:46 UTC
Hi ccoleman, I found that https://github.com/openshift/cluster-kube-apiserver-operator/pull/1032, https://github.com/openshift/cluster-kube-controller-manager-operator/pull/500, and https://github.com/openshift/cluster-network-operator/pull/963 have still not been merged. Would you please merge them? I am assigning the bug back for now; once they are merged, I will go on with verification.

