Description of problem:

The cluster autoscaler on this cluster has been configured to scale down nodes, but it is not scaling down nodes that appear to meet the scale-down criteria. The autoscaler can scale down successfully, but has only done so on a few occasions over the span of a week. The cluster owner in this scenario scales down their workloads between 01:00 UTC and 09:00 UTC each day and expects this to have a larger impact on the number of cluster nodes. A must-gather for the cluster exhibiting this behaviour will be attached to the Bugzilla in a comment.

For example, node "ip-10-244-54-105.ec2.internal" was running no non-core-cluster workloads and had minimal CPU/memory consumption for an extended period of time, yet was not considered for scale-down:

Non-terminated Pods: (16 in total)
  Namespace                                Name                                  CPU Requests   CPU Limits   Memory Requests   Memory Limits   AGE
  ---------                                ----                                  ------------   ----------   ---------------   -------------   ---
  openshift-cluster-csi-drivers            aws-ebs-csi-driver-node-jhk9k         30m (1%)       0 (0%)       150Mi (1%)        0 (0%)          31h
  openshift-cluster-node-tuning-operator   tuned-stbk7                           10m (0%)       0 (0%)       50Mi (0%)         0 (0%)          31h
  openshift-dns                            dns-default-vbl42                     60m (2%)       0 (0%)       110Mi (0%)        0 (0%)          31h
  openshift-dns                            node-resolver-9kf6w                   5m (0%)        0 (0%)       21Mi (0%)         0 (0%)          31h
  openshift-image-registry                 node-ca-gt66p                         10m (0%)       0 (0%)       10Mi (0%)         0 (0%)          31h
  openshift-ingress-canary                 ingress-canary-dvk9v                  10m (0%)       0 (0%)       20Mi (0%)         0 (0%)          31h
  openshift-machine-config-operator        machine-config-daemon-8s92c           40m (1%)       0 (0%)       100Mi (0%)        0 (0%)          31h
  openshift-marketplace                    redhat-operators-hqstb                10m (0%)       0 (0%)       50Mi (0%)         0 (0%)          31h
  openshift-monitoring                     node-exporter-q9s85                   9m (0%)        0 (0%)       47Mi (0%)         0 (0%)          31h
  openshift-monitoring                     sre-dns-latency-exporter-x4k48        0 (0%)         0 (0%)       0 (0%)            0 (0%)          31h
  openshift-multus                         multus-additional-cni-plugins-ghvvr   10m (0%)       0 (0%)       10Mi (0%)         0 (0%)          31h
  openshift-multus                         multus-tvs85                          10m (0%)       0 (0%)       65Mi (0%)         0 (0%)          31h
  openshift-multus                         network-metrics-daemon-n88xb          20m (0%)       0 (0%)       120Mi (0%)        0 (0%)          31h
  openshift-network-diagnostics            network-check-target-8r8m8            10m (0%)       0 (0%)       15Mi (0%)         0 (0%)          31h
  openshift-sdn                            sdn-dqlgt                             110m (3%)      0 (0%)       220Mi (1%)        0 (0%)          31h
  openshift-security                       splunkforwarder-ds-mwcl5              0 (0%)         0 (0%)       0 (0%)            0 (0%)          31h
---
  Resource   Requests     Limits
  --------   --------     ------
  cpu        344m (11%)   0 (0%)
  memory     988Mi (6%)   0 (0%)
---

The same applies to node "ip-10-244-54-117.ec2.internal".

No non-core-cluster pods on the cluster were observed to be:
- using local storage (hostPath or emptyDir volumes)
- running in kube-system
- blocked by PDBs

Version-Release number of selected component (if applicable): 4.8.28
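For context, scale-down behaviour on OpenShift is driven by the ClusterAutoscaler resource. The customer's actual configuration is in the attached must-gather; the snippet below is only an illustrative sketch of the kind of configuration described above (field values are placeholders, not the customer's settings):

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  scaleDown:
    enabled: true        # scale-down is enabled on the affected cluster
    unneededTime: 10m    # placeholder: how long a node must look unneeded before removal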
I got a chance to investigate the must-gather more deeply, and I think Matt correctly identified the root cause here. From the original text, we expected node "ip-10-244-54-105.ec2.internal" to be scaled down by the autoscaler. Looking at the pods in the openshift-marketplace namespace, I see this:

NAME                     READY   STATUS    RESTARTS   AGE   IP             NODE
redhat-operators-hqstb   1/1     Running   0          1d    10.130.10.18   ip-10-244-54-105.ec2.internal

So we clearly have a marketplace pod running on that node. Looking deeper into the pod manifest, Matt is again spot on; we see this (clipped to just the relevant portion):

ownerReferences:
- apiVersion: operators.coreos.com/v1alpha1
  blockOwnerDeletion: false
  controller: false
  kind: CatalogSource
  name: redhat-operators

And indeed, looking through the autoscaler drain code, it will not be able to remove this pod: its only owner is a CatalogSource, which the drain logic does not recognize as a replicating controller, so the pod is treated as non-evictable and blocks scale-down of the node.

So, how do we handle the fix? I have a couple of ideas, but I think a permanent fix will take some time, as I need to do some research about CatalogSources and how we can control them. With that said, here are some possibilities:

1. Quick fix: delete the blocking pod. This is very manual, but it should at least prove that the autoscaler will scale down those nodes, and hopefully the pod will be rescheduled onto a different node. If it isn't, this could cause more frustration.

2. Change the expendable-pod priority cutoff by adjusting "podPriorityThreshold" in the ClusterAutoscaler (see the sketch after this list). I noticed that the marketplace pods run at priority 0, so the user could set the threshold to "1", which instructs the autoscaler to delete pods below that priority regardless of their owner. *NOTE* this could be highly deleterious if their workload pods are not above priority "0", so be careful with this.

3. Change the way the marketplace pods are deployed so they do not land on autoscaler-enabled MachineSets. I'm not sure this is possible, but perhaps there is a way to label the autoscaler MachineSets so that marketplace pods avoid them. If so, this would be the easiest and most fruitful fix.

4. Modify the autoscaler code to understand CatalogSources in the drain code. This will require discussion with upstream and investigation to determine whether it is appropriate. If this marketplace problem is limited to OpenShift only, an upstream change probably won't happen, but we could always consider carrying a patch for this situation.

At this point I need to investigate how the marketplace works to determine what we can do. Matt, if you have suggestions for people to connect with on the marketplace team, I would be grateful to learn more =)
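To make option 2 concrete, here is a minimal sketch (not a recommendation) of what that ClusterAutoscaler change could look like; the threshold value is only an example and carries the warning above, since any workload pods still running at the default priority 0 would also become expendable:

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  # pods with priority below this value are considered expendable and will not
  # block scale-down; 1 makes the priority-0 marketplace pods expendable, but
  # also any other priority-0 pods
  podPriorityThreshold: 1
  scaleDown:
    enabled: true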
OK, a little more research and a few more answers. It looks like this is a known issue with the marketplace: https://github.com/operator-framework/operator-lifecycle-manager/issues/2666. There is also an upstream patch for it: https://github.com/operator-framework/operator-lifecycle-manager/pull/2669. That patch points to another possible mitigation: the user could annotate the marketplace pods with "cluster-autoscaler.kubernetes.io/safe-to-evict", which tells the autoscaler that it may evict those pods. This would still be a manual process of adding the annotation, but it is another tool for mitigating the issue.
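For reference, the mitigation amounts to the following annotation on the catalog pod (a minimal sketch of the pod metadata; the value must be the string "true"):

metadata:
  annotations:
    # tells the cluster autoscaler this pod may be evicted during node scale-down
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"

On an existing pod it could be applied manually with, for example, "oc -n openshift-marketplace annotate pod redhat-operators-hqstb cluster-autoscaler.kubernetes.io/safe-to-evict=true" (pod name taken from the example above), keeping in mind that the annotation is lost whenever the catalog pod is recreated.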
Given that this bug is being tracked by the team that works on the upstream marketplace-operator, I am changing the component to OLM. Ideally this situation will be resolved by the upstream fix that has been proposed.
This appears to have been fixed in https://github.com/openshift/operator-framework-olm/pull/300, specifically by this commit: https://github.com/openshift/operator-framework-olm/pull/300/commits/b5b3a041b77a33e56c5708835a8136e3a72801b9
pass on 4.11 --

[root@preserve-olm-env2 2053343]# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-05-083948   True        False         172m    Cluster version is 4.11.0-0.nightly-2022-07-05-083948
[root@preserve-olm-env2 2053343]#
[root@preserve-olm-env2 2053343]# oc project openshift-marketplace
Now using project "openshift-marketplace" on server "https://api.qe-daily-0706.qe.devcluster.openshift.com:6443".
[root@preserve-olm-env2 2053343]# oc get pod
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-qcfvk               1/1     Running   0          7h32m
community-operators-cj9sb               1/1     Running   0          7h32m
marketplace-operator-6bd7679ddd-mltkb   1/1     Running   0          7h37m
qe-app-registry-bbnc9                   1/1     Running   0          3h56m
redhat-marketplace-bmv8b                1/1     Running   0          7h32m
redhat-operators-pdhjh                  1/1     Running   0          7h32m
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-bbnc9 -o yaml|grep safe-to-evict
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-bbnc9 -o yaml|grep hostIP
  hostIP: 10.0.141.32

//add new node to move the pod to that node
//get the node from pod's information and then get machineset from the node's info
[root@preserve-olm-env2 2053343]# oc get machineset qe-daily-0706-q64pj-worker-ap-southeast-1a -o yaml -n openshift-machine-api > ms.yaml
[root@preserve-olm-env2 2053343]# vi ms.yaml
[root@preserve-olm-env2 2053343]# cat ms.yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"
  labels:
    machine.openshift.io/cluster-api-cluster: qe-daily-0706-q64pj
  name: wk
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: qe-daily-0706-q64pj
      machine.openshift.io/cluster-api-machineset: wk
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: qe-daily-0706-q64pj
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: wk
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          ami:
            id: ami-09a19b51d526c1385
          apiVersion: machine.openshift.io/v1beta1
          blockDevices:
          - ebs:
              encrypted: true
              iops: 0
              kmsKey:
                arn: ""
              volumeSize: 120
              volumeType: gp3
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: qe-daily-0706-q64pj-worker-profile
          instanceType: m5.xlarge
          kind: AWSMachineProviderConfig
          metadata:
            creationTimestamp: null
          metadataServiceOptions: {}
          placement:
            availabilityZone: ap-southeast-1a
            region: ap-southeast-1
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - qe-daily-0706-q64pj-worker-sg
          subnet:
            filters:
            - name: tag:Name
              values:
              - qe-daily-0706-q64pj-private-ap-southeast-1a
          tags:
          - name: kubernetes.io/cluster/qe-daily-0706-q64pj
            value: owned
          userDataSecret:
            name: worker-user-data
[root@preserve-olm-env2 2053343]#
[root@preserve-olm-env2 2053343]# oc apply -f ms.yaml
machineset.machine.openshift.io/wk created
[root@preserve-olm-env2 2053343]# oc get machine -A
NAMESPACE               NAME                                               PHASE     TYPE        REGION           ZONE              AGE
openshift-machine-api   qe-daily-0706-q64pj-master-0                       Running   m5.xlarge   ap-southeast-1   ap-southeast-1a   7h25m
openshift-machine-api   qe-daily-0706-q64pj-master-1                       Running   m5.xlarge   ap-southeast-1   ap-southeast-1b   7h25m
openshift-machine-api   qe-daily-0706-q64pj-master-2                       Running   m5.xlarge   ap-southeast-1   ap-southeast-1c   7h25m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1a-fcqq2   Running   m5.xlarge   ap-southeast-1   ap-southeast-1a   7h19m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1b-lnprf   Running   m5.xlarge   ap-southeast-1   ap-southeast-1b   7h19m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1c-gqm5x   Running   m5.xlarge   ap-southeast-1   ap-southeast-1c   7h19m
openshift-machine-api   wk-gfx2v                                           Provisioned   m5.xlarge   ap-southeast-1   ap-southeast-1a   75s
[root@preserve-olm-env2 2053343]# oc get machine -A
NAMESPACE               NAME                                               PHASE     TYPE        REGION           ZONE              AGE
openshift-machine-api   qe-daily-0706-q64pj-master-0                       Running   m5.xlarge   ap-southeast-1   ap-southeast-1a   7h30m
openshift-machine-api   qe-daily-0706-q64pj-master-1                       Running   m5.xlarge   ap-southeast-1   ap-southeast-1b   7h30m
openshift-machine-api   qe-daily-0706-q64pj-master-2                       Running   m5.xlarge   ap-southeast-1   ap-southeast-1c   7h30m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1a-fcqq2   Running   m5.xlarge   ap-southeast-1   ap-southeast-1a   7h24m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1b-lnprf   Running   m5.xlarge   ap-southeast-1   ap-southeast-1b   7h24m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1c-gqm5x   Running   m5.xlarge   ap-southeast-1   ap-southeast-1c   7h24m
openshift-machine-api   wk-gfx2v                                           Running   m5.xlarge   ap-southeast-1   ap-southeast-1a   5m40s
[root@preserve-olm-env2 2053343]# oc adm top node
NAME                                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-139-30.ap-southeast-1.compute.internal    123m         3%     1723Mi          11%
ip-10-0-141-32.ap-southeast-1.compute.internal    399m         11%    4926Mi          33%
ip-10-0-148-131.ap-southeast-1.compute.internal   884m         25%    8690Mi          60%
ip-10-0-165-22.ap-southeast-1.compute.internal    766m         21%    8405Mi          58%
ip-10-0-173-186.ap-southeast-1.compute.internal   927m         26%    5125Mi          35%
ip-10-0-196-197.ap-southeast-1.compute.internal   694m         19%    10075Mi         69%
ip-10-0-217-14.ap-southeast-1.compute.internal    987m         28%    6586Mi          45%

//ip-10-0-139-30.ap-southeast-1.compute.internal is new added node
[root@preserve-olm-env2 2053343]# oc get pod
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-qcfvk               1/1     Running   0          7h32m
community-operators-cj9sb               1/1     Running   0          7h32m
marketplace-operator-6bd7679ddd-mltkb   1/1     Running   0          7h37m
qe-app-registry-bbnc9                   1/1     Running   0          3h56m
redhat-marketplace-bmv8b                1/1     Running   0          7h32m
redhat-operators-pdhjh                  1/1     Running   0          7h32m
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-bbnc9 -o yaml|grep hostIP
  hostIP: 10.0.141.32
[root@preserve-olm-env2 2053343]# oc delete pod qe-app-registry-bbnc9
pod "qe-app-registry-bbnc9" deleted
[root@preserve-olm-env2 2053343]# oc get pod
NAME                                    READY   STATUS              RESTARTS   AGE
certified-operators-qcfvk               1/1     Running             0          7h34m
community-operators-cj9sb               1/1     Running             0          7h34m
marketplace-operator-6bd7679ddd-mltkb   1/1     Running             0          7h39m
qe-app-registry-hm7fq                   0/1     ContainerCreating   0          3s
redhat-marketplace-bmv8b                1/1     Running             0          7h34m
redhat-operators-pdhjh                  1/1     Running             0          7h34m
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-hm7fq -o yaml|grep hostIP
  hostIP: 10.0.139.30

// catsrc pod move to that node ip-10-0-139-30.ap-southeast-1.compute.internal
[root@preserve-olm-env2 2053343]# cat clusterauto.yaml
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: default
spec:
  resourceLimits:
    maxNodesTotal: 6
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
[root@preserve-olm-env2 2053343]#
[root@preserve-olm-env2 2053343]# oc apply -f clusterauto.yaml
clusterautoscaler.autoscaling.openshift.io/default created
[root@preserve-olm-env2 2053343]# cat machinesetauto.yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: wkma
  namespace: openshift-machine-api
spec:
  maxReplicas: 1
  minReplicas: 0
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: wk
[root@preserve-olm-env2 2053343]#
[root@preserve-olm-env2 2053343]# oc apply -f machinesetauto.yaml
machineautoscaler.autoscaling.openshift.io/wkma created
[root@preserve-olm-env2 2053343]# oc get po -n openshift-machine-api
NAME                                          READY   STATUS    RESTARTS        AGE
cluster-autoscaler-default-6f496d446-q69wd    1/1     Running   0               65s
cluster-autoscaler-operator-b9f6b4779-47nh6   2/2     Running   0               7h43m
cluster-baremetal-operator-fd8749f6f-rl9k5    2/2     Running   0               7h43m
machine-api-controllers-666c749d87-jngnn      7/7     Running   1 (7h37m ago)   7h38m
machine-api-operator-5db457cd7c-xtzsn         2/2     Running   0               7h43m
[root@preserve-olm-env2 2053343]# oc logs cluster-autoscaler-default-6f496d446-q69wd -n openshift-machine-api
I0706 06:55:11.411231 1 main.go:430] Cluster Autoscaler 1.24.0
I0706 06:55:12.490993 1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler...
[root@preserve-olm-env2 2053343]# oc logs cluster-autoscaler-default-6f496d446-q69wd -n openshift-machine-api
I0706 06:55:11.411231 1 main.go:430] Cluster Autoscaler 1.24.0
I0706 06:55:12.490993 1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler...
I0706 06:57:39.487241 1 leaderelection.go:258] successfully acquired lease openshift-machine-api/cluster-autoscaler
W0706 06:57:39.505696 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
I0706 06:57:39.517297 1 cloud_provider_builder.go:29] Building clusterapi cloud provider.
W0706 06:57:39.517317 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
W0706 06:57:39.517605 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0706 06:57:39.517617 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0706 06:57:39.523565 1 clusterapi_controller.go:345] Using version "v1beta1" for API group "machine.openshift.io"
I0706 06:57:39.537105 1 clusterapi_controller.go:422] Resource "machinesets" available
I0706 06:57:39.537212 1 clusterapi_controller.go:422] Resource "machinesets/status" available
I0706 06:57:39.537248 1 clusterapi_controller.go:422] Resource "machinesets/scale" available
I0706 06:57:39.537274 1 clusterapi_controller.go:422] Resource "machines" available
I0706 06:57:39.537299 1 clusterapi_controller.go:422] Resource "machines/status" available
I0706 06:57:39.537325 1 clusterapi_controller.go:422] Resource "machinehealthchecks" available
I0706 06:57:39.537349 1 clusterapi_controller.go:422] Resource "machinehealthchecks/status" available
I0706 06:57:39.643307 1 main.go:322] Registered cleanup signal handler
I0706 06:57:39.643455 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0706 06:57:39.688297 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 44.789511ms
W0706 06:57:49.681685 1 clusterstate.go:423] AcceptableRanges have not been populated yet. Skip checking
I0706 06:57:50.450322 1 static_autoscaler.go:445] No unschedulable pods
I0706 06:57:51.254245 1 legacy.go:717] No candidates for scale down
I0706 06:57:51.278744 1 delete.go:103] Successfully added DeletionCandidateTaint on node ip-10-0-139-30.ap-southeast-1.compute.internal
I0706 06:58:02.267569 1 static_autoscaler.go:445] No unschedulable pods
I0706 06:58:03.094767 1 delete.go:103] Successfully added ToBeDeletedTaint on node ip-10-0-139-30.ap-southeast-1.compute.internal
I0706 06:58:03.100234 1 actuator.go:194] Scale-down: removing node ip-10-0-139-30.ap-southeast-1.compute.internal, utilization: {0.12685714285714286 0.11808737326873289 0 cpu 0.12685714285714286}, pods to reschedule: qe-app-registry-hm7fq
I0706 06:58:04.280747 1 request.go:601] Waited for 1.178014024s due to client-side throttling, not priority and fairness, request: POST:https://172.30.0.1:443/api/v1/namespaces/openshift-network-diagnostics/pods/network-check-target-nm7wq/eviction
I0706 06:58:04.691708 1 drain.go:139] Not deleted yet openshift-marketplace/qe-app-registry-hm7fq
I0706 06:58:09.697792 1 drain.go:150] All pods removed from ip-10-0-139-30.ap-southeast-1.compute.internal
...

//the node is removed
[root@preserve-olm-env2 2053343]# oc get pod
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-qcfvk               1/1     Running   0          7h47m
community-operators-cj9sb               1/1     Running   0          7h47m
marketplace-operator-6bd7679ddd-mltkb   1/1     Running   0          7h52m
qe-app-registry-srskp                   1/1     Running   0          7m2s
redhat-marketplace-bmv8b                1/1     Running   0          7h47m
redhat-operators-pdhjh                  1/1     Running   0          7h47m
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-srskp -o yaml|grep hostIP
  hostIP: 10.0.141.32

//catsrc pod move to other node
[root@preserve-olm-env2 2053343]# oc adm top node
NAME                                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-141-32.ap-southeast-1.compute.internal    602m         17%    5035Mi          34%
ip-10-0-148-131.ap-southeast-1.compute.internal   794m         22%    8875Mi          61%
ip-10-0-165-22.ap-southeast-1.compute.internal    580m         16%    8564Mi          59%
ip-10-0-173-186.ap-southeast-1.compute.internal   964m         27%    5114Mi          35%
ip-10-0-196-197.ap-southeast-1.compute.internal   824m         23%    10123Mi         70%
ip-10-0-217-14.ap-southeast-1.compute.internal    1016m        29%    6605Mi          45%
[root@preserve-olm-env2 2053343]#
--
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.