Bug 1812583
Summary: Default openshift install requests too many CPU resources to install all components; requests of components on cluster are wrong

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Installer | Assignee: | Clayton Coleman <ccoleman> |
| Installer sub component: | openshift-installer | QA Contact: | weiwei jiang <wjiang> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | adahiya, bparees, esimard, jeder, maszulik, miwilson, rphillips, sdodson, wking |
| Version: | 4.4 | Keywords: | ServiceDeliveryImpact |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1812709 1812719 1814048 1822770 (view as bug list) | Environment: | |
| Last Closed: | 2020-07-13 17:19:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1812709, 1814048, 1821291 | | |
Description
Clayton Coleman
2020-03-11 15:51:57 UTC
To gather these, grab an e2e run's Prometheus, find the time at which the e2e tests had mostly stopped, set that as the current time in your Prometheus (or a local Prometheus instance) graph selector, and run the queries below. (Note: the role labelling only works on GCP and Azure, because AWS node naming doesn't encode the role.)

Requests per namespace:

```
sort_desc(sum by (namespace,role) ((max without (id,endpoint,image,job,metrics_path,instance,name,service) (label_replace(label_replace(container_spec_cpu_shares{pod!="",namespace!~"e2e.*"}, "role", "master", "node", ".*-m-.*"), "role", "worker", "node", ".*-w-.*")) / 1024)) > 0)
```

Usage per namespace (measured over the last 15m):

```
sort_desc(sum by (namespace,role) ((max without (id,endpoint,image,job,metrics_path,instance,name,service) (label_replace(label_replace(rate(container_cpu_usage_seconds_total{pod!="",container="",namespace!~"e2e.*"}[15m]), "role", "master", "node", ".*-m-.*"), "role", "worker", "node", ".*-w-.*")))) > 0)
```

Difference between them:

```
sort_desc(sort_desc(sum by (namespace,role) ((max without (id,endpoint,image,job,metrics_path,instance,name,service) (label_replace(label_replace(container_spec_cpu_shares{pod!="",namespace!~"e2e.*"}, "role", "master", "node", ".*-m-.*"), "role", "worker", "node", ".*-w-.*")) / 1024))) - sort_desc(sum by (namespace,role) ((max without (id,endpoint,image,job,metrics_path,instance,name,service) (label_replace(label_replace(rate(container_cpu_usage_seconds_total{pod!="",container="",namespace!~"e2e.*"}[15m]), "role", "master", "node", ".*-m-.*"), "role", "worker", "node", ".*-w-.*"))))))
```

Actual usage:

```
sum by (role) (label_replace(label_replace(rate(container_cpu_usage_seconds_total{id="/"}[15m]), "role", "master", "node", ".*-m-.*"), "role", "worker", "node", ".*-w-.*"))
```

After some basic data analysis, it's reasonable to say that out of the 6.6 cores in use, the fraction used by the key components on masters is:

* 25% etcd
* 33% kube-apiserver
* 10% openshift-apiserver
* 5% kcm

That's 73% of total usage.
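One detail of the requests query above worth spelling out: it divides `container_spec_cpu_shares` by 1024 because Kubernetes maps a CPU request of 1 core to 1024 cgroup CPU shares, so shares / 1024 recovers the requested cores. A minimal illustration (the helper name is hypothetical):

```python
# Kubernetes converts a CPU request into cgroup CPU shares at a rate of
# 1024 shares per core, which is why the PromQL above divides by 1024.
def shares_to_cores(shares: float) -> float:
    """Convert cgroup CPU shares back into requested cores."""
    return shares / 1024

print(shares_to_cores(1024))  # 1.0  (a 1-core request)
print(shares_to_cores(256))   # 0.25 (a 250m request)
```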
I would expect the requests to be roughly proportional to these percentages out of our arbitrary floor. I think 3 cores requested is a reasonable base master spec for idle, in which case the requests would be:

* 330m kube-apiserver
* 250m etcd
* 10m openshift-apiserver
* 5m kcm

The remaining components should then have requests that consume no more than 270m, divvied up fairly based on their average use. A good default is going to be 5m for an operator that doesn't serve traffic or answer lots of queries. This would end up with us on masters having roughly 1 core set aside for core workload, and we would only schedule flex workloads on masters down to that single core. On very large masters these numbers might have to flex upwards, but we can't solve that component by component. Working on draft recommendations.

Here is the draft recommendation that child bugs should follow.

Rules for component teams:

1. Determine your average CPU usage from the list above (breaking down any components that are split across namespaces).
2. Pods that run on a master but are not on all nodes (i.e. excluding dns, openshift-sdn, mcd) should have a request that is proportional to your CPU usage relative to kube-apiserver. kube-apiserver is allocated 33% of all resources (2.16 cores out of 6.6 cores). Calculate your CPU usage relative to kube-apiserver, and then multiply 330m by your fraction of kube-apiserver use (i.e. kube-scheduler uses 0.028, so 0.028 * 330m = 9.24m). Special cases:
   * Certain infra components will be assigned slightly higher requests (kcm, scheduler, ocm, given known problems if they fall behind).
   * Leader-elected components should set their request to their expected usage across all pods, even though they will be over-provisioned (kcm should request Xm on all nodes, not Xm/3).
   * No component should request less than 5m per pod, but a request must be set (at minimum 5m).
3. Pods that run on a worker should have requests proportional to actual CPU use on our minimum e2e run.
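Rule 2 above can be sketched in a few lines. This is an illustrative helper, not anything shipped: the function name is hypothetical, and only the 330m kube-apiserver anchor, the 5m floor, and the 0.028 kube-scheduler fraction come from this comment.

```python
# Sketch of rule 2: a master component's CPU request is scaled by its
# usage relative to kube-apiserver, which is anchored at a 330m request.
# The 5m floor implements the "no component lower than 5m per pod" rule.
KUBE_APISERVER_REQUEST_M = 330.0  # millicores, per this comment
FLOOR_M = 5.0                     # minimum request per pod

def master_request_m(fraction_of_apiserver_use: float) -> float:
    """Request (millicores) for a component, given its CPU usage as a
    fraction of kube-apiserver's usage."""
    return max(FLOOR_M, fraction_of_apiserver_use * KUBE_APISERVER_REQUEST_M)

# kube-scheduler example from this comment: 0.028 * 330m = 9.24m
print(master_request_m(0.028))
# a tiny operator that barely uses CPU falls back to the 5m floor
print(master_request_m(0.001))
```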
openshift-sdn uses 790m on workers, so per node there should be roughly 790m/3 ≈ 250m of CPU allocation split between the ovs and sdn pods.

4. Large infra components like monitoring should set their request proportional to openshift-sdn. The openshift-sdn namespace is allocated 750m and uses 790m. Prometheus uses 350m and requests 4 on workers. Because prometheus has a significant vertical scaling component, it should probably be close to openshift-sdn in terms of requests, and if it needs more active resource growth the operator should manage that. Node-exporter should be set relative to openshift-sdn.

Commit message recommendation:

```
Normalize CPU requests on masters

The {x} uses approximately {percent}% of master CPU in a reasonable
medium sized workload. Given a 1 core per master baseline (since CPU is
compressible and shared), assign the kube-apiserver roughly
{desired_percent}% of that core on each master.
```

PRs will be opened tracking each component in the list above (core control plane), and then the worst offenders in the list will be updated:

* kube-controller-manager is allocated 10% (1/3 of kube-apiserver)
* openshift-apiserver is 10% (1/3 of kube-apiserver)
* etcd is 25% (~70-80% of kube-apiserver)

On the control plane, the remaining components are expected to allocate (100 - 33 - 25 - 10 - 10 =) 22%, i.e. 220m, between them. The most aggressive components will be targeted first; monitoring is the highest requester.

On the nodes, SDN is reasonably sized right now, and OVS is using about half of its 200m request. SDN is slightly more expensive (why?), so we should tune OVS down to 100m for now.

*** Bug 1814900 has been marked as a duplicate of this bug. ***

Additional work remains; moving back to ASSIGNED.

*** Bug 1820432 has been marked as a duplicate of this bug. ***

Checked with 4.5.0-0.nightly-2020-04-21-103613: all requests on each master are less than 3 cores, so moved to VERIFIED.
```
$ oc describe nodes -l node-role.kubernetes.io/master= | grep -i Allocated -A 5
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       2204m (29%)   0 (0%)
  memory    6359Mi (42%)  512Mi (3%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       2115m (28%)   0 (0%)
  memory    5939Mi (39%)  512Mi (3%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       2150m (28%)   0 (0%)
  memory    5969Mi (40%)  512Mi (3%)

$ oc describe nodes -l node-role.kubernetes.io/worker= | grep -i Allocated -A 5
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       849m (11%)    0 (0%)
  memory    3982Mi (26%)  512Mi (3%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       1100m (14%)   0 (0%)
  memory    4866Mi (32%)  512Mi (3%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       752m (10%)    0 (0%)
  memory    2158Mi (14%)  512Mi (3%)
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.