Bug 1812709 - Default openshift install requests too many CPU resources to install all components, requests of components on cluster are wrong [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Clayton Coleman
QA Contact: weiwei jiang
URL:
Whiteboard:
Duplicates: 1814048
Depends On: 1812583
Blocks: 1820432 1822770
Reported: 2020-03-11 22:33 UTC by Clayton Coleman
Modified: 2020-05-04 11:46 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1812583
Clones: 1821291
Environment:
Last Closed: 2020-05-04 11:45:48 UTC
Target Upstream Version:
wjiang: needinfo? (ccoleman)


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-etcd-operator pull 255 | 0 | None | closed | [release-4.4] Bug 1812709: Normalize CPU requests on masters | 2020-07-07 15:21:33 UTC
Github | openshift cluster-kube-apiserver-operator pull 796 | 0 | None | closed | [release-4.4] Bug 1812709: Normalize CPU requests on masters | 2020-07-07 15:21:33 UTC
Github | openshift cluster-kube-controller-manager-operator pull 377 | 0 | None | closed | [release-4.4] Bug 1812709: Normalize CPU requests on masters | 2020-07-07 15:21:33 UTC
Github | openshift cluster-kube-scheduler-operator pull 228 | 0 | None | closed | [release-4.4] Bug 1812709: Normalize CPU requests on masters | 2020-07-07 15:21:33 UTC
Github | openshift cluster-openshift-apiserver-operator pull 342 | 0 | None | closed | [release-4.4] Bug 1812709: Normalize CPU requests on masters | 2020-07-07 15:21:33 UTC
Red Hat Product Errata | RHBA-2020:0581 | 0 | None | None | None | 2020-05-04 11:46:15 UTC

Description Clayton Coleman 2020-03-11 22:33:14 UTC
+++ This bug was initially created as a clone of Bug #1812583 +++

Our default install is 3x4core masters, 3x2core workers. In an e2e run, which is a representative workload for customers with medium- to large-scale clusters, we use:

6.6 cores on average across masters (out of 12)
4.06 cores on average across workers (out of 6)

However, our default requests and limits for the component pods on this cluster are:

11.7 cores for the masters (out of 12)
8.04 cores for the workers (out of 6)

As a result, our default cluster cannot install correctly (pod CPU requests exceed what the nodes can allocate) even though it runs fine in practice, which makes this a 4.4 release blocker.
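As a rough check of the figures above, the sketch below divides each total by the capacity of its role. The dictionary and function names are illustrative only and do not come from any OpenShift tool; the numbers are the ones quoted in this description.

```python
# Cores available per role in the default install (3x4 masters, 3x2 workers).
capacity = {"master": 12.0, "worker": 6.0}
# Average cores used during the e2e run, as quoted above.
usage = {"master": 6.6, "worker": 4.06}
# Cores requested by default, as quoted above.
requested = {"master": 11.7, "worker": 8.04}


def fraction_of_capacity(values, capacity):
    """Return values[role] / capacity[role] for every role."""
    return {role: values[role] / capacity[role] for role in capacity}


request_fraction = fraction_of_capacity(requested, capacity)
usage_fraction = fraction_of_capacity(usage, capacity)

for role in capacity:
    print(f"{role}: requested {request_fraction[role]:.1%}, "
          f"used {usage_fraction[role]:.1%}")
```

A request fraction above 1.0 for a role means the scheduler cannot place every pod on that role's nodes, regardless of how little CPU the pods actually use.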

---

CPU is compressible, and in general we don't set requests based on how much CPU you use; instead, we establish ratios between components on the same role.  For instance, etcd and kube-apiserver should receive the most CPU on the masters because if they get starved, all other components suffer.

The other master components should request CPU based on a formula like:

etcd request * (component usage / etcd usage)

The worker components that run on all nodes should follow a similar rule, based around either kubelet or openshift-sdn.  However, large components on the nodes may need a tweak factor:

sdn request * (component usage / sdn usage)
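A minimal sketch of the two formulas above, assuming requests are expressed in millicores and usage in cores. The function name is hypothetical, and the 250m etcd request is an input chosen only for illustration; the usage figures are the etcd and kube-controller-manager values listed later in this report.

```python
def proportional_request_m(anchor_request_m, component_usage, anchor_usage):
    """Scale the anchor's request (etcd or sdn) by the usage ratio.

    anchor_request_m: request of the anchor component, in millicores
    component_usage:  measured CPU usage of the component, in cores
    anchor_usage:     measured CPU usage of the anchor, in cores
    """
    if anchor_usage <= 0:
        raise ValueError("anchor usage must be positive")
    return anchor_request_m * (component_usage / anchor_usage)


# Usage figures (cores) from the e2e run in this report.
etcd_usage = 1.6578208562491923
kcm_usage = 0.20461836474549036

# Hypothetical etcd request of 250m, used only to exercise the formula.
kcm_request_m = proportional_request_m(250, kcm_usage, etcd_usage)
print(f"kube-controller-manager request: {kcm_request_m:.1f}m")
```

The same function covers the worker case by passing the sdn request and sdn usage as the anchor; any tweak factor for large components would be an extra multiplier on the result.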


---

Request by namespace and role:

{namespace="openshift-monitoring",role="worker"}	4.009765625
{namespace="openshift-etcd",role="master"}	2.5810546875
{namespace="openshift-sdn",role="master"}	1.869140625
{namespace="openshift-sdn",role="worker"}	1.8046875
{namespace="openshift-kube-controller-manager",role="master"}	1.3212890625
{namespace="openshift-kube-apiserver",role="master"}	1.0810546875
{namespace="openshift-apiserver",role="master"}	0.90234375
{namespace="openshift-monitoring",role="master"}	0.736328125
{namespace="openshift-dns",role="worker"}	0.662109375
{namespace="openshift-dns",role="master"}	0.662109375
{namespace="openshift-controller-manager",role="master"}	0.603515625
{namespace="openshift-machine-config-operator",role="master"}	0.509765625
{namespace="openshift-image-registry",role="worker"}	0.466796875
{namespace="openshift-ingress",role="worker"}	0.40234375
{namespace="openshift-machine-config-operator",role="worker"}	0.240234375
{namespace="openshift-kube-storage-version-migrator",role="worker"}	0.201171875
{namespace="openshift-machine-api",role="master"}	0.181640625
{namespace="openshift-multus",role="master"}	0.12890625
{namespace="kube-system",role="master"}	0.123046875
{namespace="openshift-image-registry",role="master"}	0.10546875
{namespace="openshift-marketplace",role="worker"}	0.0859375
{namespace="openshift-console",role="master"}	0.0859375
{namespace="openshift-operator-lifecycle-manager",role="master"}	0.0859375
{namespace="openshift-cluster-node-tuning-operator",role="master"}	0.0859375
{namespace="openshift-kube-scheduler",role="master"}	0.0703125
{namespace="openshift-cluster-node-tuning-operator",role="worker"}	0.064453125
{namespace="openshift-multus",role="worker"}	0.064453125
{namespace="openshift-authentication",role="master"}	0.04296875
{namespace="openshift-dns-operator",role="master"}	0.041015625
{namespace="openshift-cluster-machine-approver",role="master"}	0.041015625
{namespace="openshift-cluster-version",role="master"}	0.041015625
{namespace="openshift-ingress-operator",role="master"}	0.041015625
{namespace="openshift-cluster-samples-operator",role="master"}	0.041015625
{namespace="openshift-network-operator",role="master"}	0.021484375
{namespace="openshift-kube-storage-version-migrator-operator",role="master"}	0.021484375
{namespace="openshift-service-ca-operator",role="master"}	0.021484375
{namespace="openshift-insights",role="master"}	0.021484375
{namespace="openshift-csi-snapshot-controller-operator",role="worker"}	0.021484375
{namespace="openshift-kube-controller-manager-operator",role="master"}	0.021484375
{namespace="openshift-authentication-operator",role="master"}	0.021484375
{namespace="openshift-cloud-credential-operator",role="master"}	0.021484375
{namespace="openshift-etcd-operator",role="master"}	0.021484375
{namespace="openshift-console-operator",role="master"}	0.021484375
{namespace="openshift-kube-apiserver-operator",role="master"}	0.021484375
{namespace="openshift-cluster-storage-operator",role="master"}	0.021484375
{namespace="openshift-controller-manager-operator",role="master"}	0.021484375
{namespace="openshift-service-catalog-controller-manager-operator",role="master"}	0.021484375
{namespace="openshift-service-ca",role="master"}	0.021484375
{namespace="openshift-marketplace",role="master"}	0.021484375
{namespace="openshift-csi-snapshot-controller",role="worker"}	0.021484375
{namespace="openshift-apiserver-operator",role="master"}	0.021484375
{namespace="openshift-kube-scheduler-operator",role="master"}	0.005859375
{namespace="openshift-service-catalog-apiserver-operator",role="master"}	0.005859375

Usage by namespace and role

{namespace="openshift-kube-apiserver",role="master"}	2.1619876333846624
{namespace="openshift-etcd",role="master"}	1.6578208562491923
{namespace="openshift-sdn",role="worker"}	0.7902127474534021
{namespace="openshift-apiserver",role="master"}	0.6654059737563104
{namespace="openshift-sdn",role="master"}	0.45042689393591295
{namespace="openshift-monitoring",role="worker"}	0.3504143483072635
{namespace="openshift-kube-controller-manager",role="master"}	0.20461836474549036
{namespace="openshift-etcd-operator",role="master"}	0.13743432789929938
{namespace="openshift-operator-lifecycle-manager",role="master"}	0.0884434181260195
{namespace="openshift-must-gather-kkjhl",role="master"}	0.08221540997755557
{namespace="openshift-monitoring",role="master"}	0.06295023150044035
{namespace="openshift-machine-config-operator",role="master"}	0.03892652122963432
{namespace="openshift-controller-manager",role="master"}	0.03838177104774991
{namespace="openshift-kube-apiserver-operator",role="master"}	0.03564176578869584
{namespace="openshift-ingress",role="worker"}	0.034161760609146746
{namespace="openshift-kube-scheduler",role="master"}	0.028997043612977058
{namespace="openshift-multus",role="master"}	0.028357733575176187
{namespace="openshift-cloud-credential-operator",role="master"}	0.02382491640890165
{namespace="openshift-kube-scheduler-operator",role="master"}	0.019677113395487122
{namespace="openshift-service-ca",role="master"}	0.019193497863481977
{namespace="openshift-kube-controller-manager-operator",role="master"}	0.017732005265559132
{namespace="openshift-marketplace",role="master"}	0.01595692508887505
{namespace="openshift-dns",role="worker"}	0.014646411228389108
{namespace="openshift-apiserver-operator",role="master"}	0.014566308284011975
{namespace="openshift-dns",role="master"}	0.013228066245094485
{namespace="openshift-image-registry",role="worker"}	0.013138524957335649
{namespace="openshift-marketplace",role="worker"}	0.012014092321005393
{namespace="openshift-console",role="master"}	0.007421924226788351
{namespace="openshift-image-registry",role="master"}	0.0071860119119124865
{namespace="openshift-authentication-operator",role="master"}	0.007069369592108443
{namespace="openshift-cluster-version",role="master"}	0.006795354406059571
{namespace="openshift-machine-config-operator",role="worker"}	0.006361723323325576
{namespace="openshift-authentication",role="master"}	0.006188662943334761
{namespace="openshift-machine-api",role="master"}	0.0052930512518087145
{namespace="openshift-network-operator",role="master"}	0.005136690421827466
{namespace="openshift-console-operator",role="master"}	0.004844998002650943
{namespace="openshift-controller-manager-operator",role="master"}	0.0045221224014901865
{namespace="openshift-multus",role="worker"}	0.0042029771037126965
{namespace="openshift-cluster-storage-operator",role="master"}	0.0038974091762590986
{namespace="openshift-service-ca-operator",role="master"}	0.0032243799219438558
{namespace="openshift-service-catalog-apiserver-operator",role="master"}	0.003034163254101611
{namespace="openshift-csi-snapshot-controller-operator",role="worker"}	0.002924888169013756
{namespace="openshift-service-catalog-controller-manager-operator",role="master"}	0.002260159790181888
{namespace="openshift-insights",role="master"}	0.002075358385262947
{namespace="openshift-cluster-samples-operator",role="master"}	0.002067038536157853
{namespace="openshift-cluster-node-tuning-operator",role="master"}	0.0019490981770514527
{namespace="openshift-kube-storage-version-migrator-operator",role="master"}	0.0018690529355203974
{namespace="kube-system",role="master"}	0.0013121596008085876
{namespace="openshift-ingress-operator",role="master"}	0.0012782041893306024
{namespace="openshift-csi-snapshot-controller",role="worker"}	0.0012313933297429486
{namespace="openshift-cluster-node-tuning-operator",role="worker"}	0.0009219469658359441
{namespace="openshift-cluster-machine-approver",role="master"}	0.0006950561574913539
{namespace="openshift-dns-operator",role="master"}	0.0006673132853684085
{namespace="openshift-kube-storage-version-migrator",role="worker"}	0.00018306567985376928

Request - usage

{namespace="openshift-monitoring",role="worker"}	3.6593512766927363
{namespace="openshift-sdn",role="master"}	1.418713731064087
{namespace="openshift-kube-controller-manager",role="master"}	1.1166706977545096
{namespace="openshift-sdn",role="worker"}	1.0144747525465978
{namespace="openshift-etcd",role="master"}	0.9232338312508075
{namespace="openshift-monitoring",role="master"}	0.6733778934995597
{namespace="openshift-dns",role="master"}	0.6488813087549055
{namespace="openshift-dns",role="worker"}	0.6474629637716109
{namespace="openshift-controller-manager",role="master"}	0.5651338539522501
{namespace="openshift-machine-config-operator",role="master"}	0.4708391037703657
{namespace="openshift-image-registry",role="worker"}	0.45365835004266436
{namespace="openshift-ingress",role="worker"}	0.3681819893908532
{namespace="openshift-apiserver",role="master"}	0.2369377762436895
{namespace="openshift-machine-config-operator",role="worker"}	0.23387265167667443
{namespace="openshift-kube-storage-version-migrator",role="worker"}	0.20098880932014623
{namespace="openshift-machine-api",role="master"}	0.1763475737481913
{namespace="kube-system",role="master"}	0.12173471539919141
{namespace="openshift-multus",role="master"}	0.10054851642482382
{namespace="openshift-image-registry",role="master"}	0.09828273808808752
{namespace="openshift-cluster-node-tuning-operator",role="master"}	0.08398840182294855
{namespace="openshift-console",role="master"}	0.07851557577321165
{namespace="openshift-marketplace",role="worker"}	0.07392340767899461
{namespace="openshift-cluster-node-tuning-operator",role="worker"}	0.06353117803416405
{namespace="openshift-multus",role="worker"}	0.060250147896287305
{namespace="openshift-kube-scheduler",role="master"}	0.04131545638702294
{namespace="openshift-dns-operator",role="master"}	0.040348311714631595
{namespace="openshift-cluster-machine-approver",role="master"}	0.04032056884250865
{namespace="openshift-ingress-operator",role="master"}	0.039737420810669395
{namespace="openshift-cluster-samples-operator",role="master"}	0.038948586463842146
{namespace="openshift-authentication",role="master"}	0.03678008705666524
{namespace="openshift-cluster-version",role="master"}	0.03422027059394043
{namespace="openshift-csi-snapshot-controller",role="worker"}	0.02025298167025705
{namespace="openshift-kube-storage-version-migrator-operator",role="master"}	0.019615322064479603
{namespace="openshift-insights",role="master"}	0.019409016614737054
{namespace="openshift-service-catalog-controller-manager-operator",role="master"}	0.01922421520981811
{namespace="openshift-csi-snapshot-controller-operator",role="worker"}	0.018559486830986245
{namespace="openshift-service-ca-operator",role="master"}	0.018259995078056146
{namespace="openshift-cluster-storage-operator",role="master"}	0.0175869658237409
{namespace="openshift-controller-manager-operator",role="master"}	0.016962252598509815
{namespace="openshift-console-operator",role="master"}	0.016639376997349055
{namespace="openshift-network-operator",role="master"}	0.016347684578172532
{namespace="openshift-authentication-operator",role="master"}	0.014415005407891557
{namespace="openshift-apiserver-operator",role="master"}	0.006918066715988025
{namespace="openshift-marketplace",role="master"}	0.00552744991112495
{namespace="openshift-kube-controller-manager-operator",role="master"}	0.0037523697344408677
{namespace="openshift-service-catalog-apiserver-operator",role="master"}	0.002825211745898389
{namespace="openshift-service-ca",role="master"}	0.0022908771365180228
{namespace="openshift-cloud-credential-operator",role="master"}	-0.00234054140890165
{namespace="openshift-operator-lifecycle-manager",role="master"}	-0.0025059181260194963
{namespace="openshift-kube-scheduler-operator",role="master"}	-0.013817738395487122
{namespace="openshift-kube-apiserver-operator",role="master"}	-0.014157390788695837
{namespace="openshift-etcd-operator",role="master"}	-0.11594995289929938
{namespace="openshift-kube-apiserver",role="master"}	-1.0809329458846624
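The "Request - usage" list is a per-key subtraction of the two earlier lists. A small sketch of that arithmetic on three of the rows follows; the dictionary layout and variable names are assumptions for illustration, and the values are copied from the lists above.

```python
# (namespace, role) -> cores requested, copied from the request list.
requests = {
    ("openshift-monitoring", "worker"): 4.009765625,
    ("openshift-etcd", "master"): 2.5810546875,
    ("openshift-kube-apiserver", "master"): 1.0810546875,
}
# (namespace, role) -> cores used, copied from the usage list.
usage = {
    ("openshift-monitoring", "worker"): 0.3504143483072635,
    ("openshift-etcd", "master"): 1.6578208562491923,
    ("openshift-kube-apiserver", "master"): 2.1619876333846624,
}

# Positive headroom means the namespace requests more than it uses;
# negative means it uses more than it requests.
headroom = {key: requests[key] - usage[key] for key in requests if key in usage}
ranked = sorted(headroom.items(), key=lambda item: item[1], reverse=True)

for (namespace, role), cores in ranked:
    print(f"{namespace} ({role}): {cores:+.3f} cores")
```

Keys present in only one list (for example the must-gather namespace, which has usage but no request entry) are skipped here, which matches how a PromQL binary operator drops series that have no match on the other side.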

We will update this shortly with the ratios that everyone should use.

--- Additional comment from Clayton Coleman on 2020-03-11 12:29:49 EDT ---

To gather these, grab the Prometheus data from an e2e run, find the time at which the e2e tests mostly stopped, set that as the current time in the graph selector of your promecius or local Prometheus instance, and run the queries:

(note the role labels only work on GCP and Azure, because the queries infer the role from "-m-" or "-w-" in the node name and AWS node names don't follow that pattern)

Requests per namespace: sort_desc(sum by (namespace,role) ((max without (id,endpoint,image,job,metrics_path,instance,name,service) (label_replace(label_replace(container_spec_cpu_shares{pod!="",namespace!~"e2e.*"}, "role", "master", "node", ".*-m-.*"), "role", "worker", "node", ".*-w-.*")) / 1024)) > 0)

Usage per namespace (measured from the last 15m): sort_desc(sum by (namespace,role) ((max without (id,endpoint,image,job,metrics_path,instance,name,service) (label_replace(label_replace(rate(container_cpu_usage_seconds_total{pod!="",container="",namespace!~"e2e.*"}[15m]), "role", "master", "node", ".*-m-.*"), "role", "worker", "node", ".*-w-.*")))) > 0)

Difference between them: sort_desc(sort_desc(sum by (namespace,role) ((max without (id,endpoint,image,job,metrics_path,instance,name,service) (label_replace(label_replace(container_spec_cpu_shares{pod!="",namespace!~"e2e.*"}, "role", "master", "node", ".*-m-.*"), "role", "worker", "node", ".*-w-.*")) / 1024))) - sort_desc(sum by (namespace,role) ((max without (id,endpoint,image,job,metrics_path,instance,name,service) (label_replace(label_replace(rate(container_cpu_usage_seconds_total{pod!="",container="",namespace!~"e2e.*"}[15m]), "role", "master", "node", ".*-m-.*"), "role", "worker", "node", ".*-w-.*"))))))

Actual usage: sum by (role) (label_replace(label_replace(rate(container_cpu_usage_seconds_total{id="/"}[15m]), "role", "master", "node", ".*-m-.*"), "role", "worker", "node", ".*-w-.*"))
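All four queries wrap their inner expression in the same pair of label_replace calls to derive a role label from the node name. The sketch below builds that wrapper as a string so the node-name patterns can be swapped for a platform with different naming. The helper name is hypothetical, and the default patterns are the ones used in the queries above; patterns for other platforms would have to be supplied by the reader.

```python
def with_role_label(expr, master_regex=".*-m-.*", worker_regex=".*-w-.*"):
    """Wrap a PromQL expression so matching nodes get role=master/worker."""
    inner = f'label_replace({expr}, "role", "master", "node", "{master_regex}")'
    return f'label_replace({inner}, "role", "worker", "node", "{worker_regex}")'


# Node-level CPU rate over the last 15 minutes, as in the last query above.
node_cpu_rate = 'rate(container_cpu_usage_seconds_total{id="/"}[15m])'
actual_usage = "sum by (role) (" + with_role_label(node_cpu_rate) + ")"
print(actual_usage)
```

With the default patterns, the printed string is identical to the "Actual usage" query.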

--- Additional comment from Clayton Coleman on 2020-03-11 13:53:29 EDT ---

After some basic data analysis, it's reasonable to say that out of the 6.6 cores in use, the fraction used by the key components on masters is:

25% etcd
33% kube-apiserver
10% openshift-apiserver
5%  kcm

That's 73% of total usage.  I would expect the requests to be roughly proportional to these percentages out of our arbitrary floor.  I think 3 cores requested is a reasonable base master spec for idle, in which case the requests would be:

330m kube-apiserver
250m etcd
100m openshift-apiserver
 50m kcm

And then the remaining components should have requests that consume no more than 270m, divvied up fairly based on their average use.  A good default is going to be 5m for an operator that doesn't serve traffic or answer lots of queries.

This would end up with us on masters having roughly 1 core set aside for core workload, and we would only schedule flex workloads on masters down to that single core.  In very large masters these numbers might have to flex upwards, but we can't solve that component by component.
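The arithmetic in this comment can be sketched as follows, assuming the 3-core floor is spread evenly over the three masters, giving a 1000m budget per master. That per-master budget is an inference from the "roughly 1 core" remark above, and the variable names are illustrative.

```python
# Assumption: 3 cores across 3 masters -> 1000m budget per master.
BUDGET_M = 1000

# Share of master CPU usage per key component, from the percentages above.
shares = {
    "kube-apiserver": 0.33,
    "etcd": 0.25,
    "openshift-apiserver": 0.10,
    "kcm": 0.05,
}

# Request for each key component, proportional to its share of usage.
requests_m = {name: round(BUDGET_M * share) for name, share in shares.items()}

# What is left for every other component on the master.
remaining_m = BUDGET_M - sum(requests_m.values())

for name, millicores in requests_m.items():
    print(f"{millicores:>4}m {name}")
print(f"{remaining_m:>4}m remaining for all other components")
```

Under these assumptions the key components take 730m, which leaves the 270m mentioned above for everything else.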

--- Additional comment from Clayton Coleman on 2020-03-11 14:40:20 EDT ---

Working on draft recommendations.

--- Additional comment from Clayton Coleman on 2020-03-11 18:31:34 EDT ---

Here is the draft recommendation that child bugs should follow:

Rules for component teams:

1. Determine your average CPU usage from the list above (breaking down any components that are split across namespaces).

2. Pods that run on a master that are not on all nodes (i.e. exclude dns, openshift-sdn, mcd) should have a request that is proportional to your CPU usage relative to kube-apiserver.

kube-apiserver is allocated 33% of all resources (2.16 cores out of 6.6 cores). Calculate your CPU usage relative to kube-apiserver, and then multiply 330m by your fraction of kube-apiserver use (e.g. kube-scheduler uses 0.028 cores, so 0.028 / 2.16 * 330m ≈ 4.3m)

Special cases:
* Certain infra components will be assigned slightly higher requests (kcm, scheduler, ocm, given known problems if they fall behind)
* Leader elected components should set their request to their expected usage across all pods, even though they will be over-provisioned (kcm should request Xm on all nodes, not Xm/3)
* No component should request less than 5m per pod; if the calculation comes out lower, it must still be set to 5m

3. Pods that run on a worker should be proportional to actual CPU use on our minimum e2e run. openshift-sdn uses 790m on workers, so between ovs and sdn per node there should be 790m/3 ~ 250m of CPU allocation between ovs and sdn pods.

4. Large infra components like monitoring should set their request proportional to the openshift-sdn namespace.

openshift-sdn is allocated 750m and uses 790m.  Prometheus uses 350m and requests 4 cores on workers.  Because prometheus has a significant vertical scaling component, it should probably be close to openshift-sdn in terms of requests, and if it needs more active resource growth the operator should manage that.  Node-exporter should be set relative to openshift-sdn.
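Rules 2 and 3 can be sketched as two small helpers. This reads rule 2 as "330m times the component's usage divided by kube-apiserver's usage", with the 5m floor applied afterwards; that reading, the rounding to whole millicores, and the function names are assumptions, not something the rules spell out. The usage values are taken from the usage list above.

```python
KUBE_APISERVER_USAGE_CORES = 2.16  # kube-apiserver usage in the e2e run
KUBE_APISERVER_REQUEST_M = 330     # kube-apiserver request from rule 2
MIN_REQUEST_M = 5                  # floor per pod from the special cases


def master_request_m(usage_cores):
    """Rule 2: scale 330m by usage relative to kube-apiserver, floor at 5m."""
    scaled = KUBE_APISERVER_REQUEST_M * usage_cores / KUBE_APISERVER_USAGE_CORES
    return max(MIN_REQUEST_M, round(scaled))


def worker_per_node_m(namespace_usage_cores, worker_count):
    """Rule 3: spread a namespace's measured worker usage over the workers."""
    return namespace_usage_cores * 1000 / worker_count


# openshift-apiserver and openshift-insights usage from the list above.
openshift_apiserver_m = master_request_m(0.6654059737563104)
insights_operator_m = master_request_m(0.002075358385262947)

# openshift-sdn uses about 790m across the 3 workers.
sdn_per_worker_m = worker_per_node_m(0.79, 3)

print(f"openshift-apiserver: {openshift_apiserver_m}m")
print(f"openshift-insights:  {insights_operator_m}m")
print(f"openshift-sdn per worker: {sdn_per_worker_m:.0f}m")
```

The special cases for kcm, scheduler, ocm and leader-elected components are not modelled here; they would be manual adjustments on top of the computed value.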

Comment 1 Scott Dodson 2020-03-17 13:20:06 UTC
*** Bug 1814048 has been marked as a duplicate of this bug. ***

Comment 2 W. Trevor King 2020-04-06 21:09:41 UTC
This bug should currently bring back everything from bug 1812583 that is applicable to 4.4 except for cluster-network-operator#530 which is being handled for 4.4 in https://bugzilla.redhat.com/show_bug.cgi?id=1821291#c1

Comment 6 weiwei jiang 2020-04-23 05:06:23 UTC
Checked with 4.4.0-0.nightly-2020-04-22-215658; the requests on each master are now less than 3 cores, so moving to verified.

$ oc describe nodes -l node-role.kubernetes.io/master= | grep -i Allocated -A 5
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests      Limits
  --------                   --------      ------
  cpu                        2139m (28%)   0 (0%)
  memory                     5589Mi (37%)  512Mi (3%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests      Limits
  --------                   --------      ------
  cpu                        2114m (28%)   0 (0%)
  memory                     5629Mi (37%)  512Mi (3%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests      Limits
  --------                   --------      ------
  cpu                        2369m (31%)   0 (0%)
  memory                     7019Mi (47%)  512Mi (3%)
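The same check can be scripted. The sketch below parses the CPU request column out of output shaped like the above and compares each node against the 3-core threshold from this comment. The regular expression and function name are assumptions based only on the output format shown here, and the sample is a shortened copy of that output.

```python
import re
from typing import List

# Matches lines such as "  cpu    2139m (28%)   0 (0%)" and captures 2139.
CPU_LINE = re.compile(r"^\s*cpu\s+(\d+)m\s+\(\d+%\)", re.MULTILINE)


def cpu_requests_m(describe_output: str) -> List[int]:
    """Return the CPU request in millicores for each node in the output."""
    return [int(match.group(1)) for match in CPU_LINE.finditer(describe_output)]


sample = """\
Allocated resources:
  Resource                   Requests      Limits
  --------                   --------      ------
  cpu                        2139m (28%)   0 (0%)
  memory                     5589Mi (37%)  512Mi (3%)
--
Allocated resources:
  Resource                   Requests      Limits
  --------                   --------      ------
  cpu                        2114m (28%)   0 (0%)
  memory                     5629Mi (37%)  512Mi (3%)
--
Allocated resources:
  Resource                   Requests      Limits
  --------                   --------      ------
  cpu                        2369m (31%)   0 (0%)
  memory                     7019Mi (47%)  512Mi (3%)
"""

master_requests = cpu_requests_m(sample)
all_under_three_cores = all(request < 3000 for request in master_requests)
print(master_requests, all_under_three_cores)
```

A request reported in whole cores rather than millicores would not match this pattern, so the expression would need extending for clusters where that occurs.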

Comment 7 weiwei jiang 2020-04-23 05:07:26 UTC
And also for workers.

$ oc describe nodes -l node-role.kubernetes.io/worker= | grep -i Allocated -A 5
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests      Limits
  --------                   --------      ------
  cpu                        984m (13%)    300m (4%)
  memory                     3737Mi (25%)  587Mi (3%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests      Limits
  --------                   --------      ------
  cpu                        1281m (17%)   100m (1%)
  memory                     3627Mi (24%)  537Mi (3%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests      Limits
  --------                   --------      ------
  cpu                        989m (13%)    300m (4%)
  memory                     3667Mi (24%)  587Mi (3%)

Comment 9 errata-xmlrpc 2020-05-04 11:45:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

