Description of problem:
This issue can be reproduced on IPI on OSP or IPI on BM, since those platforms run static pods (coredns, keepalived and mdns-publisher) that have resource requests.
related bug: https://bugzilla.redhat.com/show_bug.cgi?id=1753067
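For context, each of those static pods carries a resources.requests stanza along these lines (this is a sketch; the image reference and request values are illustrative, not taken from the shipped manifests):

```yaml
# Fragment of a static pod manifest (e.g. coredns); values are illustrative.
spec:
  containers:
  - name: coredns
    image: quay.io/openshift/coredns:latest  # placeholder image reference
    resources:
      requests:
        cpu: 100m     # illustrative request
        memory: 200Mi # illustrative request
```

Because these requests are only accounted for once the mirror pod exists, they matter for the scheduling race described below.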
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create ClusterAutoscaler and MachineAutoscaler CRs.
2. oc adm new-project openshift-kni-infra
3. Create a deployment whose pods force the cluster to scale up; its container is:
   - name: busybox
     image: busybox
     command: ["/bin/sh", "-c", "echo 'this should be in the logs' && sleep 86400"]
4. Check the pods.
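The steps above can be sketched as a full reproducer deployment (the deployment name, replica count, and memory request are illustrative assumptions; only the container name and command come from the steps above):

```yaml
# Hypothetical reproducer; replicas and requests are sized to force a scale-up.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up
spec:
  replicas: 30
  selector:
    matchLabels:
      app: scale-up
  template:
    metadata:
      labels:
        app: scale-up
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["/bin/sh", "-c", "echo 'this should be in the logs' && sleep 86400"]
        resources:
          requests:
            memory: 2Gi  # illustrative; large enough to trigger autoscaling
```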
After a while, some pods go OutOfmemory:
$ oc get pod
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-default-5dd4b8d85-dtrzz 1/1 Running 0 23h
cluster-autoscaler-operator-59b86c4d95-4r5wb 1/1 Running 0 24h
machine-api-controllers-776587cf7d-9ddqx 3/3 Running 0 24h
machine-api-operator-5bc8f8df49-pnf4c 1/1 Running 0 24h
scale-up-5f76786964-24tlg 1/1 Running 0 9m35s
scale-up-5f76786964-252lw 0/1 OutOfmemory 0 56s
scale-up-5f76786964-2cvmm 0/1 OutOfmemory 0 38s
scale-up-5f76786964-2k7cc 0/1 OutOfmemory 0 4m39s
scale-up-5f76786964-2kngx 0/1 OutOfmemory 0 64s
scale-up-5f76786964-2wmk2 0/1 OutOfmemory 0 60s
scale-up-5f76786964-2z5jc 1/1 Running 0 9m35s
scale-up-5f76786964-4fbv4 0/1 OutOfmemory 0 4m33s
scale-up-5f76786964-4fdlx 0/1 OutOfmemory 0 49s
scale-up-5f76786964-4n5c4 1/1 Running 0 9m35s
scale-up-5f76786964-4n8tr 0/1 OutOfmemory 0 73s
Expected results:
The autoscaler works well and the scaled-up pods run.
This issue happens after applying the workaround mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1753067#c3.
Can you share the autoscaler logs?
Created attachment 1616672 [details]
Does every node require all three (coredns, keepalived and mdns-publisher) static pods?
@Joel Smith, I think you are right: the static pods' mirror pods are created after the scheduler may already have scheduled other workloads to the node, so those workload pods end up in OutOfmemory status. The autoscaler is working as expected, but it only handles pods in the Pending state, while the added workload pods are always OutOfmemory.
I would be curious whether the following patch helps this issue. This BZ was created at around the same time as it merged.
I didn't manage to test the backport yet, but I'll try to do it next sprint.
This appears to be fixed in 4.5, based upon my testing.
Whether because of https://github.com/openshift/origin/pull/23812 or something else, the current behavior is that a static pod will preempt a pod that has been scheduled to a node if the node doesn't have enough resources for the static pod.
I have tried a few scenarios for autoscaling, all including static pods on all worker nodes. I have not seen a pod in OutOfMemory status; they correctly stay Pending, trigger a scaling event, and eventually deploy to the new node. Tested on the latest release (4.5.2).
I would say this is verified, but I am seeing unexpected behavior with the autoscaler: multiple nodes get spun up when only one should be needed to satisfy the memory requests, and removing the reproducer deployment doesn't scale back down completely. I am double-checking my math and the expected behavior of the autoscaler in the latest code.
I am going to stand by my statement that this has been verified as fixed. All other issues are unrelated and I can track them down separately. I am a little concerned that it is not clear what actually fixed it, but it is fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.