A number of components' CPU requests could be tuned down a small amount (less than 20-30%) without impacting the ability of the cluster to function in high density environments, due to improvements we have made to the CPU and memory usage of the product over releases. Tuning these down gives small clusters (single-node and compact) more headroom while remaining realistic given the data we have. This tuning would take into account the principles described in https://bugzilla.redhat.com/show_bug.cgi?id=1812583 and codified into https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#resources-and-limits.

This bug will track measuring the current effective usage in single-node, e2e, and upgrade runs, comparing the following requests against each other and against actual usage in e2e runs:

Operator                        Request (mcpus)
etcd                            440
kube-apiserver                  350
network                         240
monitoring                      232
authentication                  180
ingress                         120
openshift-apiserver             120
kube-controller-manager         110
kube-storage-version-migrator   110
openshift-controller-manager    110
machine-config                  100
dns                              85
console                          50
machine-api                      50

We will be looking for components that significantly over-request during periods of high rate of change OR high rate of traffic (i.e. the p90-p99 of CPU usage over a run, measured against other components); a sketch of the kind of query we would use follows the notes below. We will be looking to concretely reduce the single-node master contribution of components by 200mc in total, which should be achievable.

Notes:
* kube-storage-version-migrator is likely oversubscribed
* monitoring will be handled separately
* authentication is likely oversubscribed substantially
* dns and network may need to be tweaked slightly
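For illustration only, a query along the following lines compares a high quantile of per-namespace CPU usage against the CPU requested in that namespace. This is a sketch, not the exact query we will run; the 5m rate window, 1h lookback, and p95 quantile are placeholder choices, and it assumes the kube-state-metrics request metric is available:

  # ratio of p95 CPU usage (over the last hour) to requested CPU, per openshift namespace;
  # values well below 1 flag a component that consistently over-requests
  quantile_over_time(0.95,
    (sum by (namespace) (rate(container_cpu_usage_seconds_total{container="",namespace=~"openshift-.*"}[5m])))[1h:5m]
  )
  /
  sum by (namespace) (kube_pod_container_resource_requests{resource="cpu",namespace=~"openshift-.*"})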
> kube-storage-version-migrator is likely oversubscribed

The operator should scale down the migrator if there is no active migration CR around. Note that it needs some memory to keep pages of objects in memory.
On the prod build01 and app.ci clusters, etcd's actual usage is consistently below its request:

  sort_desc(
    max without(container,endpoint,exported_namespace,exported_pod,instance,job,service,scheduler,priority,unit,resource) (
      label_replace(
        label_replace(kube_pod_resource_request{resource="cpu",exported_namespace!~"ci-op-.*|ci",exported_namespace="openshift-etcd"},
          "pod", "$1", "exported_pod", "(.*)"),
        "namespace", "$1", "exported_namespace", "(.*)")
    )
    - on (pod,namespace) group_left() max by (pod,namespace) (rate(container_cpu_usage_seconds_total{container="",namespace="openshift-etcd"}[60m]))
  ) * 1000

Over a one-week period etcd used ~200mc less than requested on average; at peak usage the gap was still 100mc, and at minimum usage it was 300mc. I think we can tune each pod down by 100mc in 4.7.
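For reference, the weekly average/min/max figures above come from watching that request-minus-usage gap over time. A simplified sketch of the minimum-gap version is below; it assumes the kube-state-metrics request metric rather than the scheduler-side metric used above, and the 1w/1h subquery windows are arbitrary:

  # smallest request-minus-usage gap for each etcd pod over the past week, in millicores;
  # swapping min_over_time for avg_over_time / max_over_time gives the average and largest gaps
  min_over_time(
    (
      sum by (pod) (kube_pod_container_resource_requests{resource="cpu",namespace="openshift-etcd"})
      - sum by (pod) (rate(container_cpu_usage_seconds_total{container="",namespace="openshift-etcd"}[60m]))
    )[1w:1h]
  ) * 1000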
Etcd on app.ci in 4.6.13 (which fixed the issues with IO starvation from the BFQ scheduler change) is roughly similar to build01 - the minimum gap between request and usage is about 100mc, so it should be safe to reduce.
Created attachment 1751335
cpu request overage over a week on build01
CPU usage of kube-apiserver is less clear cut - the gap varies between 500mc under the request for all three pods and 200mc or more over it. The attached graph captures some of this; I think a 50mc per-pod reduction for kube-apiserver would be reasonable, and there is still plenty of room for burst.
Created attachment 1751340
build01 week cpu overage for kcm

The kube-controller-manager has a smaller request, but even on loaded clusters it is more efficient than it used to be. Peak usage on app.ci (heavy operator load) is 50mc; peak usage on build01 is about 0.9 cores, but only for specific intervals. I think moving the request closer to actual usage on these heavily loaded clusters by removing 20mc is reasonable.
> The operator should scale down the migrator if there is no active migration CR around. Note that it needs some memory to keep pages of objects in memory.

"Should" as in "this is implemented today"? Right now I see a single deployment with a 100mc CPU request, which is not appropriate. Bursting above the request is still allowed, even as things stand today, so a 10mc request is more accurate.
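As a sanity check on that, the migrator's actual CPU use can be compared against the 100mc request with something like the following; the openshift-kube-storage-version-migrator namespace and the 60m window are assumptions of this sketch:

  # observed CPU usage of the migrator pods over the last hour, in millicores;
  # values far under the current 100mc request would support dropping it to ~10mc
  sum by (pod) (
    rate(container_cpu_usage_seconds_total{container="",namespace="openshift-kube-storage-version-migrator"}[60m])
  ) * 1000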
Altogether, the minimal CPU request tunings above would remove 270mc of reservation and still give production clusters reasonable proportionality as well as headroom.
OVS is still requesting 100m in pods (it's actually running on the node), but actual usage of the script logic in both 4.6 and 4.7 is roughly 5m. Reducing that to 15m to allow some headroom - 4.8 will remove that daemonset.
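A quick way to double-check that ~5m figure is a query like the one below; it assumes the OVS daemonset pods run in the openshift-sdn namespace with an "ovs-" name prefix:

  # observed CPU usage of the OVS wrapper pods over the last hour, in millicores
  sum by (pod) (
    rate(container_cpu_usage_seconds_total{container="",namespace="openshift-sdn",pod=~"ovs-.*"}[60m])
  ) * 1000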
Combined test in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1354643623982927872
Second combined test also passed: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1354850792707723264

The first had a flake that we pinned down to a test issue: https://github.com/openshift/origin/pull/25834. I think this looks clean, releasing the holds.
Hi ccoleman, I found that https://github.com/openshift/cluster-kube-apiserver-operator/pull/1032, https://github.com/openshift/cluster-kube-controller-manager-operator/pull/500 and https://github.com/openshift/cluster-network-operator/pull/963 have still not been merged. Would you please merge them? I have assigned the bug back for now; once they are merged, I will continue with verification.