Created attachment 1205981 [details]
sample yaml files

Description of problem:

Master vertical scale-up of 500 nodes to 125 pause pods/node. The scale-up script is aggressive - it creates 10 namespaces at a time; each namespace has 4 DCs, 4 RCs and 6 running pods.

After running for 2 hours, a flood of the following messages started appearing in the master controllers log (messages_master1 log - messages start around Sep 28 17:56 and really pick up around Sep 28 18:00:23):

Sep 28 18:00:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:00:23.237148 74228 priorities.go:39] Combined requested resources 3200 from existing pods exceeds capacity 1000 on node 192.1.3.97
Sep 28 18:00:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:00:23.237155 74228 priorities.go:39] Combined requested resources 6710886400 from existing pods exceeds capacity 3975618560 on node 192.1.3.97
Sep 28 18:00:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:00:23.237170 74228 priorities.go:39] Combined requested resources 1700 from existing pods exceeds capacity 1000 on node 192.1.1.247
Sep 28 18:00:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:00:23.237180 74228 priorities.go:39] Combined requested resources 2300 from existing pods exceeds capacity 1000 on node 192.1.3.46

The odd thing is, the dc/rc/pod have no resource requests defined - see attached yaml.

At 18:02 the master controller fails over to master3. Unfortunately rsyslog throttled the messages on master1 directly before the failover; I'm opening a bz to get that throttling window widened. It looks like this in the master1 log:

Sep 28 18:01:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:01:23.450298 74228 priorities.go:39] Combined requested resources 2000 from existing pods exceeds capacity 1000 on node 192.1.2.45
Sep 28 18:01:23 svt-m-1 rsyslogd-2177: imjournal: begin to drop messages due to rate-limiting
Sep 28 18:08:52 svt-m-1 rsyslogd-2177: imjournal: 17351 messages lost due to rate-limiting

At this point master3 picks up as controller and you can see the activity in messages_master3 at Sep 28 18:02:47. At 18:03:17 master3 starts getting the "exceeds capacity" messages. From here it is off to the races: the master controller continuously fails over between cluster masters, never lasting very long. This went on until the AM of Sep 29, when things were shut down.

Configuration - OpenShift 3.3.0.32 on OSP9:
- 3 etcd (c4.8xlarge: 32 vCPU, 60GB RAM)
- 3 masters (c4.8xlarge as above)
- 1 master lb
- 2 router/registry
- 500 application nodes (1 vCPU, 4GB RAM)

Logs and yaml files for dc/rc/pod attached.

Version-Release number of selected component (if applicable):
3.3.0.32

How reproducible:
Unknown - 1 attempt so far.
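(For reference, the rsyslog change being requested in that other bz is just a wider imjournal rate-limiting window. A minimal sketch, assuming the stock RHEL 7 rsyslog config with the imjournal module; the values below are placeholders for illustration, not the ones actually filed:)

    # /etc/rsyslog.conf - illustrative values only
    $ModLoad imjournal
    $IMJournalStateFile imjournal.state
    $imjournalRatelimitInterval 300
    $imjournalRatelimitBurst 50000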
Mike, does the scaleup script change the nodes' capacity? It seems to me it does. The scheduler caches each node's allocated resources (mem, cpu, gpu), calculated from the resources of all pods. That allocated amount (approximate data in the cache) is compared with the capacity stored in the node status reported by each node (real data). In this case the recently reported capacities are:

node 192.1.3.97 capacity: 1 cpu, 3975618560B of memory
node 192.1.1.247 capacity: 1 cpu
node 192.1.3.46 capacity: 1 cpu

My theory:
1) each node is given some capacity
2) pods are scheduled and the allocated resources consume more than 50% of that capacity
3) the capacities of the nodes are decreased to half their original amount
4) the scheduler receives new node statuses with the decreased capacities
5) the scheduler (priority function) starts complaining about exceeded capacity

Can you check the initial capacity of each node to see if it changed? Can you check if some nodes are under memory pressure (kubectl describe node)?
To be more precise, the scheduler cache is populated with all pods that are scheduled or assumed to be scheduled. Once a scheduled pod is terminated, it is eventually removed from the cache and the total allocated amount is decreased. So if my theory is correct, the total allocated amount on each node should drop back below capacity once enough pods are evicted. In general, the scheduler cache does not magically add extra resources, so there must be some other hidden pods consuming resources.
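For context, a simplified sketch of where that message comes from - a paraphrase of the upstream least-requested priority, not the exact 3.3 source. The scheduler sums the requests of the pods it has cached for the node and compares that sum with the capacity from the most recent node status:

    package main

    import "fmt"

    // leastRequestedScore is a simplified paraphrase of the upstream
    // least-requested priority: requested is the sum of resource requests of
    // the pods the scheduler cache has assigned to the node; capacity comes
    // from the node status reported by the kubelet.
    func leastRequestedScore(requested, capacity int64) int64 {
        if capacity == 0 {
            return 0
        }
        if requested > capacity {
            // This is the condition behind the logged message:
            // "Combined requested resources <requested> from existing pods
            //  exceeds capacity <capacity> on node <name>"
            return 0
        }
        return ((capacity - requested) * 10) / capacity
    }

    func main() {
        // Values taken from the log lines above for node 192.1.3.97.
        fmt.Println(leastRequestedScore(3200, 1000))             // cpu (millicores) -> score 0
        fmt.Println(leastRequestedScore(6710886400, 3975618560)) // memory (bytes)   -> score 0
    }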
The script does not change node capacity. The script's only job is to continuously create new projects based on a template until the desired number of projects is reached. The template is here: https://github.com/openshift/svt/blob/master/openshift_scalability/config/pyconfigMasterVirtScalePause.yaml There is no modification of capacity. Your description of 192.1.3.97 in comment 4 is accurate. Each node has 1 CPU and 4GB RAM
Another way pods can get into the scheduler cache and increase the total amount of CPU/memory accounted for is the sync with the apiserver. Each time a new (and not terminated) pod is assigned to a node (which can happen outside of the scheduler) and the assignment is reflected in the apiserver, the pod is added to the scheduler cache. Is the issue always reproducible? Can you provide a list of all pods (kubectl describe pod POD) that are currently not terminated and running on an affected node?
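For reference, a minimal client-go sketch of the listing I'm asking for - every non-terminated pod the apiserver has bound to a given node. This is written against a current client-go, not the 3.3-era API, and the node name is a placeholder:

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Placeholder: one of the affected nodes from the report above.
        nodeName := "192.1.3.97"

        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(config)

        // Pods bound to the node that are not Succeeded or Failed - roughly
        // the set the scheduler cache tracks for the node.
        pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
            FieldSelector: "spec.nodeName=" + nodeName +
                ",status.phase!=Succeeded,status.phase!=Failed",
        })
        if err != nil {
            panic(err)
        }
        for _, p := range pods.Items {
            fmt.Printf("%s/%s %s\n", p.Namespace, p.Name, p.Status.Phase)
        }
    }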
Jan, I ran a test - not sure if it answers comment 9 the way you expected - if not, let me know what to modify. Unlike the original issue, this is on 3.4, not 3.3. I can re-run on 3.3 if requested.

- Cluster with 2 small application nodes (1 core, 3.7 GB RAM)
- Ran the cluster loader script requesting 20 pods/node without specifying any resource requests (as in the original issue)
- Immediately, messages like this started popping in the master log:

Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.569915 95355 priorities.go:51] Combined requested resources 4194304000 from existing pods exceeds capacity 3777728512 on node ip-172-31-17-185.us-west-2.compute.internal
Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.569948 95355 priorities.go:51] Combined requested resources 2000 from existing pods exceeds capacity 1000 on node ip-172-31-17-186.us-west-2.compute.internal
Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.569952 95355 priorities.go:51] Combined requested resources 4194304000 from existing pods exceeds capacity 3777728512 on node ip-172-31-17-186.us-west-2.compute.internal
Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.817014 95355 priorities.go:51] Combined requested resources 2100 from existing pods exceeds capacity 1000 on node ip-172-31-17-186.us-west-2.compute.internal
Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.817033 95355 priorities.go:51] Combined requested resources 4404019200 from existing pods exceeds capacity 3777728512 on node ip-172-31-17-186.us-west-2.compute.internal
Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.817038 95355 priorities.go:51] Combined requested resources 2000 from existing pods exceeds capacity 1000 on node ip-172-31-17-185.us-west-2.compute.internal
Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.817042 95355 priorities.go:51] Combined requested resources 4194304000 from existing pods exceeds capacity 3777728512 on node ip-172-31-17-185.us-west-2.compute.internal

- All 40 pods (20/node) started successfully and entered Running state
- oc describe pod of one of the pods after it starts:

root@ip-172-31-26-26: ~/svt/openshift_scalability # oc describe pod -n clusterproject0 pausepods39
Name:           pausepods39
Namespace:      clusterproject0
Security Policy: anyuid
Node:           ip-172-31-17-185.us-west-2.compute.internal/172.31.17.185
Start Time:     Tue, 04 Oct 2016 10:57:20 -0400
Labels:         name=pause-amd64
Status:         Running
IP:             172.20.3.21
Controllers:    <none>
Containers:
  pause-amd64:
    Container ID:   docker://f94e3c32d57c835f7be588219bb3dc696a2f0bb3e8e8a3144c6a4df107d6aa3a
    Image:          gcr.io/google_containers/pause-amd64:3.0
    Image ID:       docker://sha256:99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2
    Port:           8080/TCP
    State:          Running
      Started:      Tue, 04 Oct 2016 10:57:59 -0400
    Ready:          True
    Restart Count:  0
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ubv9y (ro)
    Environment Variables: <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         True
  PodScheduled  True
Volumes:
  default-token-ubv9y:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-ubv9y
QoS Class:      BestEffort
Tolerations:    <none>
Events:
  FirstSeen  LastSeen  Count  From                                                    SubobjectPath                 Type    Reason     Message
  ---------  --------  -----  ----                                                    -------------                 ------  ------     -------
  2m         2m        1      {default-scheduler }                                                                  Normal  Scheduled  Successfully assigned pausepods39 to ip-172-31-17-185.us-west-2.compute.internal
  1m         1m        1      {kubelet ip-172-31-17-185.us-west-2.compute.internal}   spec.containers{pause-amd64}  Normal  Pulled     Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
  1m         1m        1      {kubelet ip-172-31-17-185.us-west-2.compute.internal}   spec.containers{pause-amd64}  Normal  Created    Created container with docker id f94e3c32d57c; Security:[seccomp=unconfined]
  1m         1m        1      {kubelet ip-172-31-17-185.us-west-2.compute.internal}   spec.containers{pause-amd64}  Normal  Started    Started container with docker id f94e3c32d57c
The kubelet config on these nodes is max-pods=250 and pods-per-node=0.
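One reading of the numbers above (an assumption on my part, not verified against this exact build): the priority code charges a small non-zero default request for pods that declare none - the upstream defaults I'm aware of are 100 millicores and 200MiB per pod. With 20 BestEffort pods per node that works out exactly to the logged values:

    package main

    import "fmt"

    func main() {
        const (
            assumedMilliCPUPerPod = 100               // assumed default CPU charge per request-less pod
            assumedMemoryPerPod   = 200 * 1024 * 1024 // assumed default memory charge in bytes
        )
        pods := int64(20) // pods per node in the test above

        fmt.Println(pods*assumedMilliCPUPerPod, "millicores vs capacity 1000") // 2000
        fmt.Println(pods*assumedMemoryPerPod, "bytes vs capacity 3777728512")  // 4194304000
    }

If that is what is happening, the "exceeds capacity" messages are expected even with no requests defined in the dc/rc/pod specs, and the harmful part is the sheer volume of logging rather than the accounting itself.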
We've made a number of efforts to turn down the noise during failure scenarios. In 1.5/3.5 we turned down and analyzed all the hot-loop logging paths. In 1.6 we aggregated the summary information into a single event and updated the pod message with the same overall summary. The primary objective of this original issue was to avoid spamming the logs to the point of log clipping when failure conditions occur; those modifications have been made upstream. We should probably stress the capacity dimensions again and see if we can re-create the failover conditions.
This should behave the way you want in OCP v3.5.0.10 or newer.
Preparing cluster horizontal run on 3.5.0.17 - will watch for this issue.
Verified on 3.5.0.50. Forced several master-api failovers during a cluster horizontal scale-up and the error logging was informative and useful. No "rolling" (multiple sequential) failovers were seen. Namespace and deployment creation continued successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0884