Bug 2053343 - Cluster Autoscaler not scaling down nodes which seem to qualify for scale-down
Description Matt Bargenquast 2022-02-11 02:57:09 UTC
Description of problem:

The cluster autoscaler on a cluster has been configured to scale down nodes, but it is not scaling down nodes that appear to meet the scale-down criteria.

The autoscaler can scale down successfully, but has done so only a few times over the span of a week. The cluster owner in this scenario scales down their workloads between 01:00 UTC and 09:00 UTC each day and expects this to have a larger effect on the number of cluster nodes.

A must-gather for the specific cluster exhibiting this behaviour will be included in a comment attached to the Bugzilla.

For example, node "ip-10-244-54-105.ec2.internal" ran no non-core cluster workloads and had minimal CPU/memory resource consumption for an extended period of time, yet was not considered for scale-down.

Non-terminated Pods:                      (16 in total)                                                                                                                                                                             
  Namespace                               Name                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE                                                                                      
  ---------                               ----                                   ------------  ----------  ---------------  -------------  ---                                                                                      
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-jhk9k          30m (1%)      0 (0%)      150Mi (1%)       0 (0%)         31h                                                                                      
  openshift-cluster-node-tuning-operator  tuned-stbk7                            10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         31h                                                                                      
  openshift-dns                           dns-default-vbl42                      60m (2%)      0 (0%)      110Mi (0%)       0 (0%)         31h                                                                                      
  openshift-dns                           node-resolver-9kf6w                    5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         31h                                                                                      
  openshift-image-registry                node-ca-gt66p                          10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         31h                                                                                      
  openshift-ingress-canary                ingress-canary-dvk9v                   10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         31h                                                                                      
  openshift-machine-config-operator       machine-config-daemon-8s92c            40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         31h                                                                                      
  openshift-marketplace                   redhat-operators-hqstb                 10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         31h                                                                                      
  openshift-monitoring                    node-exporter-q9s85                    9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         31h                                                                                      
  openshift-monitoring                    sre-dns-latency-exporter-x4k48         0 (0%)        0 (0%)      0 (0%)           0 (0%)         31h                                                                                      
  openshift-multus                        multus-additional-cni-plugins-ghvvr    10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         31h                                                                                      
  openshift-multus                        multus-tvs85                           10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         31h                                                                                      
  openshift-multus                        network-metrics-daemon-n88xb           20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         31h                                                                                      
  openshift-network-diagnostics           network-check-target-8r8m8             10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         31h                                                                                      
  openshift-sdn                           sdn-dqlgt                              110m (3%)     0 (0%)      220Mi (1%)       0 (0%)         31h                                                                                      
  openshift-security                      splunkforwarder-ds-mwcl5               0 (0%)        0 (0%)      0 (0%)           0 (0%)         31h                                
  Resource                    Requests    Limits         
  --------                    --------    ------         
  cpu                         344m (11%)  0 (0%)         
  memory                      988Mi (6%)  0 (0%)         

The same applies to node "ip-10-244-54-117.ec2.internal".
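
As a cross-check, the per-pod requests in the listing for ip-10-244-54-105.ec2.internal can be summed and compared with the reported totals. The sketch below also compares the reported request percentages with a scale-down utilization threshold of 0.5. That value is the upstream cluster-autoscaler default and is an assumption here, since the cluster's actual setting is not shown in this report. The helper name is hypothetical, and the sketch ignores details such as whether DaemonSet pods are counted.

```python
# (CPU millicores, memory Mi) requests copied from the pod listing above,
# in the same order as the rows.
POD_REQUESTS = [
    (30, 150), (10, 50), (60, 110), (5, 21), (10, 10), (10, 20),
    (40, 100), (10, 50), (9, 47), (0, 0), (10, 10), (10, 65),
    (20, 120), (10, 15), (110, 220), (0, 0),
]

# Assumption: upstream default of --scale-down-utilization-threshold.
ASSUMED_UTILIZATION_THRESHOLD = 0.5


def below_threshold(cpu_fraction, mem_fraction,
                    threshold=ASSUMED_UTILIZATION_THRESHOLD):
    """Return True when the larger request fraction is under the threshold."""
    return max(cpu_fraction, mem_fraction) < threshold


cpu_total = sum(cpu for cpu, _ in POD_REQUESTS)
mem_total = sum(mem for _, mem in POD_REQUESTS)
print(f"cpu={cpu_total}m memory={mem_total}Mi")  # cpu=344m memory=988Mi

# 11% CPU and 6% memory are the request fractions reported in the listing.
print(below_threshold(0.11, 0.06))  # True
```

Under these assumptions the node is well below the utilization threshold, so utilization alone would not explain why it was kept.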

No non-core-cluster pods on the cluster were observed to be:
- using local storage (with hostpath + emptydir set)
- running in kube-system
- being blocked by PDBs

Version-Release number of selected component (if applicable):


Comment 10 Michael McCune 2022-02-23 15:00:55 UTC
i got a chance to investigate deeper into the must-gather and i think that Matt correctly identified the root cause here.

from the original text, we can see that we expected node "ip-10-244-54-105.ec2.internal" to be scaled down by the autoscaler. when i look at the pods in the openshift-marketplace namespace i see this:

NAME                                          READY  STATUS     RESTARTS  AGE    IP            NODE
redhat-operators-hqstb                        1/1    Running    0         1d  ip-10-244-54-105.ec2.internal

we clearly have a pod running on the node. when looking deeper into the pod manifest, Matt again is spot on, we see this owner reference: (i have clipped just the relevant portion)

  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: CatalogSource
    name: redhat-operators

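and indeed, looking through the autoscaler drain code, it will not be able to evict this pod: the owner reference shown above is a CatalogSource with "controller: false", which is not one of the controller kinds the drain code knows how to handle.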
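
A minimal sketch of the kind of check involved, written in Python for illustration only. It is not the autoscaler's actual code: the set of controller kinds and both helper names are assumptions, and the real drain logic considers more cases (mirror pods, local storage, PDBs and so on).

```python
# Controller kinds that this sketch treats as "replicated" owners.
# Assumption: illustrative list, not copied from the autoscaler source.
KNOWN_CONTROLLER_KINDS = {
    "ReplicationController", "ReplicaSet", "Job", "DaemonSet", "StatefulSet",
}


def controller_of(pod):
    """Return the owner reference marked controller: true, or None."""
    for ref in pod.get("metadata", {}).get("ownerReferences", []):
        if ref.get("controller"):
            return ref
    return None


def blocks_scale_down(pod):
    """Hypothetical helper: True if the pod would keep its node from draining."""
    ref = controller_of(pod)
    return ref is None or ref["kind"] not in KNOWN_CONTROLLER_KINDS


# Owner reference taken from the clipped manifest above.
marketplace_pod = {
    "metadata": {
        "name": "redhat-operators-hqstb",
        "ownerReferences": [{
            "apiVersion": "operators.coreos.com/v1alpha1",
            "blockOwnerDeletion": False,
            "controller": False,
            "kind": "CatalogSource",
            "name": "redhat-operators",
        }],
    }
}

print(blocks_scale_down(marketplace_pod))  # True
```

Because the CatalogSource reference has "controller: false", the sketch finds no controlling owner at all, which is enough to treat the pod as one that cannot be safely rescheduled.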

so, now for the fix, how do we handle this?

i have a couple ideas, but i think getting a permanent fix will take some time as i will need to do some research about the CatalogSources and how we can control them. but with that said, here are some possibilities:

1. quick fix, delete the pod that is blocking. this is very manual, but should at least prove that the autoscaler will scale down those nodes, and hopefully the pod will move to a different node. but if it doesn't, this could cause more frustration.

2. change the expendable pod priority cut off by adjusting "podPriorityThreshold" in the ClusterAutoscaler. i noticed that the marketplace pods are running at priority 0. it is possible that the user could set priority threshold to "1", which would instruct the autoscaler to delete pods below that priority regardless of their owner. *NOTE* this could be highly deleterious if their workload pods are not above priority "0", so be careful with this.

3. change the way the marketplace pods are deployed to make sure they don't land on autoscaler enabled machinesets. i'm not sure if this is possible, but perhaps there is a way to label the autoscaler machinesets so that the marketplace pods do not land there. if so, this would be the easiest and most fruitful fix.

4. modify the autoscaler code to understand CatalogSources in the drain code. this will require some discussion with upstream and investigation to determine if this is appropriate. if this marketplace problem is limited to openshift only, then making an upstream change will probably not happen, but we could always consider carrying a patch for this situation.
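
To illustrate option 2 and the warning attached to it, here is a small sketch of how a priority cut-off of 1 would classify pods. The helper name and the two workload pod names are hypothetical, and treating a pod with no priority as priority 0 is an assumption of the sketch.

```python
def is_expendable(pod_priority, cutoff):
    """Hypothetical helper: a pod is expendable when its priority is below the cutoff.

    Assumption: a pod with no priority set is treated as priority 0.
    """
    priority = 0 if pod_priority is None else pod_priority
    return priority < cutoff


# Only the marketplace pod and its priority of 0 come from this report;
# the other two entries are made-up workloads for comparison.
pods = {
    "redhat-operators-hqstb": 0,
    "workload-without-priority": None,
    "workload-with-priority": 1000,
}

cutoff = 1  # the "podPriorityThreshold" value suggested in option 2
expendable = sorted(name for name, prio in pods.items() if is_expendable(prio, cutoff))
print(expendable)  # ['redhat-operators-hqstb', 'workload-without-priority']
```

The marketplace pod becomes expendable, but so does any workload pod that also sits at priority 0, which is the risk called out in option 2.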

at this point, i will need to investigate around how the marketplace works to determine what we can do. Matt, if you have suggestions on people to connect with on the marketplace team, i would be grateful to learn more =)

Comment 11 Michael McCune 2022-02-23 16:47:27 UTC
ok, a little more research, and a few more answers.

it looks like this is a known issue with the marketplace, https://github.com/operator-framework/operator-lifecycle-manager/issues/2666

there is also a patch in the upstream for it, https://github.com/operator-framework/operator-lifecycle-manager/pull/2669

that patch gives another possible way to mitigate this: the user could annotate the marketplace pods with "cluster-autoscaler.kubernetes.io/safe-to-evict" set to "true", which would tell the autoscaler that it can evict those pods. this would still be a manual process of adding the annotation, but it's another tool to mitigate this issue.
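
As a rough sketch of the manual mitigation, the snippet below builds a JSON merge patch that sets the annotation on a pod. The function name is hypothetical; only the annotation key and its "true" value come from this report.

```python
import json

SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def safe_to_evict_patch():
    """Return a JSON merge patch that sets the safe-to-evict annotation to "true"."""
    return json.dumps({"metadata": {"annotations": {SAFE_TO_EVICT: "true"}}})


print(safe_to_evict_patch())
```

The resulting string could be passed to something like `oc patch pod <pod-name> -n openshift-marketplace --type merge -p '<patch>'`. The annotation would likely need to be reapplied whenever the pod is recreated.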

Comment 12 Michael McCune 2022-02-23 17:20:48 UTC
given that this bug is being tracked by the team that works on the upstream marketplace-operator, i am changing the component to OLM. ideally this situation will be solved by the upstream bug fix that has been proposed.

Comment 15 kuiwang 2022-07-06 07:57:22 UTC
pass on 4.11

[root@preserve-olm-env2 2053343]# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-05-083948   True        False         172m    Cluster version is 4.11.0-0.nightly-2022-07-05-083948
[root@preserve-olm-env2 2053343]# 

[root@preserve-olm-env2 2053343]# oc project openshift-marketplace
Now using project "openshift-marketplace" on server "https://api.qe-daily-0706.qe.devcluster.openshift.com:6443".

[root@preserve-olm-env2 2053343]# oc get pod
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-qcfvk               1/1     Running   0          7h32m
community-operators-cj9sb               1/1     Running   0          7h32m
marketplace-operator-6bd7679ddd-mltkb   1/1     Running   0          7h37m
qe-app-registry-bbnc9                   1/1     Running   0          3h56m
redhat-marketplace-bmv8b                1/1     Running   0          7h32m
redhat-operators-pdhjh                  1/1     Running   0          7h32m
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-bbnc9 -o yaml|grep safe-to-evict
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-bbnc9 -o yaml|grep hostIP

//add a new node so that the pod can be moved onto it
//get the node from the pod's information, then get the machineset from the node's info
[root@preserve-olm-env2 2053343]# oc get machineset qe-daily-0706-q64pj-worker-ap-southeast-1a -o yaml -n openshift-machine-api > ms.yaml
[root@preserve-olm-env2 2053343]# vi ms.yaml 
[root@preserve-olm-env2 2053343]# cat ms.yaml 
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"
  labels:
    machine.openshift.io/cluster-api-cluster: qe-daily-0706-q64pj
  name: wk
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: qe-daily-0706-q64pj
      machine.openshift.io/cluster-api-machineset: wk
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: qe-daily-0706-q64pj
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: wk
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          ami:
            id: ami-09a19b51d526c1385
          apiVersion: machine.openshift.io/v1beta1
          blockDevices:
          - ebs:
              encrypted: true
              iops: 0
              kmsKey:
                arn: ""
              volumeSize: 120
              volumeType: gp3
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: qe-daily-0706-q64pj-worker-profile
          instanceType: m5.xlarge
          kind: AWSMachineProviderConfig
          metadata:
            creationTimestamp: null
          metadataServiceOptions: {}
          placement:
            availabilityZone: ap-southeast-1a
            region: ap-southeast-1
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - qe-daily-0706-q64pj-worker-sg
          subnet:
            filters:
            - name: tag:Name
              values:
              - qe-daily-0706-q64pj-private-ap-southeast-1a
          tags:
          - name: kubernetes.io/cluster/qe-daily-0706-q64pj
            value: owned
          userDataSecret:
            name: worker-user-data
[root@preserve-olm-env2 2053343]# 
[root@preserve-olm-env2 2053343]# oc apply -f ms.yaml 
machineset.machine.openshift.io/wk created
[root@preserve-olm-env2 2053343]# oc get machine -A
NAMESPACE               NAME                                               PHASE         TYPE        REGION           ZONE              AGE
openshift-machine-api   qe-daily-0706-q64pj-master-0                       Running       m5.xlarge   ap-southeast-1   ap-southeast-1a   7h25m
openshift-machine-api   qe-daily-0706-q64pj-master-1                       Running       m5.xlarge   ap-southeast-1   ap-southeast-1b   7h25m
openshift-machine-api   qe-daily-0706-q64pj-master-2                       Running       m5.xlarge   ap-southeast-1   ap-southeast-1c   7h25m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1a-fcqq2   Running       m5.xlarge   ap-southeast-1   ap-southeast-1a   7h19m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1b-lnprf   Running       m5.xlarge   ap-southeast-1   ap-southeast-1b   7h19m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1c-gqm5x   Running       m5.xlarge   ap-southeast-1   ap-southeast-1c   7h19m
openshift-machine-api   wk-gfx2v                                           Provisioned   m5.xlarge   ap-southeast-1   ap-southeast-1a   75s

[root@preserve-olm-env2 2053343]# oc get machine -A
NAMESPACE               NAME                                               PHASE     TYPE        REGION           ZONE              AGE
openshift-machine-api   qe-daily-0706-q64pj-master-0                       Running   m5.xlarge   ap-southeast-1   ap-southeast-1a   7h30m
openshift-machine-api   qe-daily-0706-q64pj-master-1                       Running   m5.xlarge   ap-southeast-1   ap-southeast-1b   7h30m
openshift-machine-api   qe-daily-0706-q64pj-master-2                       Running   m5.xlarge   ap-southeast-1   ap-southeast-1c   7h30m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1a-fcqq2   Running   m5.xlarge   ap-southeast-1   ap-southeast-1a   7h24m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1b-lnprf   Running   m5.xlarge   ap-southeast-1   ap-southeast-1b   7h24m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1c-gqm5x   Running   m5.xlarge   ap-southeast-1   ap-southeast-1c   7h24m
openshift-machine-api   wk-gfx2v                                           Running   m5.xlarge   ap-southeast-1   ap-southeast-1a   5m40s

[root@preserve-olm-env2 2053343]# oc adm top node 
NAME                                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-10-0-139-30.ap-southeast-1.compute.internal    123m         3%     1723Mi          11%       
ip-10-0-141-32.ap-southeast-1.compute.internal    399m         11%    4926Mi          33%       
ip-10-0-148-131.ap-southeast-1.compute.internal   884m         25%    8690Mi          60%       
ip-10-0-165-22.ap-southeast-1.compute.internal    766m         21%    8405Mi          58%       
ip-10-0-173-186.ap-southeast-1.compute.internal   927m         26%    5125Mi          35%       
ip-10-0-196-197.ap-southeast-1.compute.internal   694m         19%    10075Mi         69%       
ip-10-0-217-14.ap-southeast-1.compute.internal    987m         28%    6586Mi          45%     

//ip-10-0-139-30.ap-southeast-1.compute.internal is the newly added node

[root@preserve-olm-env2 2053343]# oc get pod
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-qcfvk               1/1     Running   0          7h32m
community-operators-cj9sb               1/1     Running   0          7h32m
marketplace-operator-6bd7679ddd-mltkb   1/1     Running   0          7h37m
qe-app-registry-bbnc9                   1/1     Running   0          3h56m
redhat-marketplace-bmv8b                1/1     Running   0          7h32m
redhat-operators-pdhjh                  1/1     Running   0          7h32m
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-bbnc9 -o yaml|grep hostIP
[root@preserve-olm-env2 2053343]# oc delete pod qe-app-registry-bbnc9
pod "qe-app-registry-bbnc9" deleted
[root@preserve-olm-env2 2053343]# oc get pod
NAME                                    READY   STATUS              RESTARTS   AGE
certified-operators-qcfvk               1/1     Running             0          7h34m
community-operators-cj9sb               1/1     Running             0          7h34m
marketplace-operator-6bd7679ddd-mltkb   1/1     Running             0          7h39m
qe-app-registry-hm7fq                   0/1     ContainerCreating   0          3s
redhat-marketplace-bmv8b                1/1     Running             0          7h34m
redhat-operators-pdhjh                  1/1     Running             0          7h34m
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-hm7fq -o yaml|grep hostIP
//the catsrc pod moved to the new node ip-10-0-139-30.ap-southeast-1.compute.internal
[root@preserve-olm-env2 2053343]# cat clusterauto.yaml 
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: default
spec:
  resourceLimits:
    maxNodesTotal: 6
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
[root@preserve-olm-env2 2053343]# 
[root@preserve-olm-env2 2053343]# oc apply -f clusterauto.yaml 
clusterautoscaler.autoscaling.openshift.io/default created
[root@preserve-olm-env2 2053343]# cat machinesetauto.yaml 
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: wkma
  namespace: openshift-machine-api
spec:
  maxReplicas: 1
  minReplicas: 0
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: wk
[root@preserve-olm-env2 2053343]# 
[root@preserve-olm-env2 2053343]# oc apply -f machinesetauto.yaml 
machineautoscaler.autoscaling.openshift.io/wkma created
[root@preserve-olm-env2 2053343]# oc get po -n openshift-machine-api
NAME                                          READY   STATUS    RESTARTS        AGE
cluster-autoscaler-default-6f496d446-q69wd    1/1     Running   0               65s
cluster-autoscaler-operator-b9f6b4779-47nh6   2/2     Running   0               7h43m
cluster-baremetal-operator-fd8749f6f-rl9k5    2/2     Running   0               7h43m
machine-api-controllers-666c749d87-jngnn      7/7     Running   1 (7h37m ago)   7h38m
machine-api-operator-5db457cd7c-xtzsn         2/2     Running   0               7h43m
[root@preserve-olm-env2 2053343]# oc logs cluster-autoscaler-default-6f496d446-q69wd -n openshift-machine-api
I0706 06:55:11.411231       1 main.go:430] Cluster Autoscaler 1.24.0
I0706 06:55:12.490993       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler...
[root@preserve-olm-env2 2053343]# oc logs cluster-autoscaler-default-6f496d446-q69wd -n openshift-machine-api
I0706 06:55:11.411231       1 main.go:430] Cluster Autoscaler 1.24.0
I0706 06:55:12.490993       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler...
I0706 06:57:39.487241       1 leaderelection.go:258] successfully acquired lease openshift-machine-api/cluster-autoscaler
W0706 06:57:39.505696       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
I0706 06:57:39.517297       1 cloud_provider_builder.go:29] Building clusterapi cloud provider.
W0706 06:57:39.517317       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
W0706 06:57:39.517605       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0706 06:57:39.517617       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0706 06:57:39.523565       1 clusterapi_controller.go:345] Using version "v1beta1" for API group "machine.openshift.io"
I0706 06:57:39.537105       1 clusterapi_controller.go:422] Resource "machinesets" available
I0706 06:57:39.537212       1 clusterapi_controller.go:422] Resource "machinesets/status" available
I0706 06:57:39.537248       1 clusterapi_controller.go:422] Resource "machinesets/scale" available
I0706 06:57:39.537274       1 clusterapi_controller.go:422] Resource "machines" available
I0706 06:57:39.537299       1 clusterapi_controller.go:422] Resource "machines/status" available
I0706 06:57:39.537325       1 clusterapi_controller.go:422] Resource "machinehealthchecks" available
I0706 06:57:39.537349       1 clusterapi_controller.go:422] Resource "machinehealthchecks/status" available
I0706 06:57:39.643307       1 main.go:322] Registered cleanup signal handler
I0706 06:57:39.643455       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0706 06:57:39.688297       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 44.789511ms
W0706 06:57:49.681685       1 clusterstate.go:423] AcceptableRanges have not been populated yet. Skip checking
I0706 06:57:50.450322       1 static_autoscaler.go:445] No unschedulable pods
I0706 06:57:51.254245       1 legacy.go:717] No candidates for scale down
I0706 06:57:51.278744       1 delete.go:103] Successfully added DeletionCandidateTaint on node ip-10-0-139-30.ap-southeast-1.compute.internal
I0706 06:58:02.267569       1 static_autoscaler.go:445] No unschedulable pods
I0706 06:58:03.094767       1 delete.go:103] Successfully added ToBeDeletedTaint on node ip-10-0-139-30.ap-southeast-1.compute.internal
I0706 06:58:03.100234       1 actuator.go:194] Scale-down: removing node ip-10-0-139-30.ap-southeast-1.compute.internal, utilization: {0.12685714285714286 0.11808737326873289 0 cpu 0.12685714285714286}, pods to reschedule: qe-app-registry-hm7fq
I0706 06:58:04.280747       1 request.go:601] Waited for 1.178014024s due to client-side throttling, not priority and fairness, request: POST:
I0706 06:58:04.691708       1 drain.go:139] Not deleted yet openshift-marketplace/qe-app-registry-hm7fq
I0706 06:58:09.697792       1 drain.go:150] All pods removed from ip-10-0-139-30.ap-southeast-1.compute.internal

//the node is removed
[root@preserve-olm-env2 2053343]# oc get pod
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-qcfvk               1/1     Running   0          7h47m
community-operators-cj9sb               1/1     Running   0          7h47m
marketplace-operator-6bd7679ddd-mltkb   1/1     Running   0          7h52m
qe-app-registry-srskp                   1/1     Running   0          7m2s
redhat-marketplace-bmv8b                1/1     Running   0          7h47m
redhat-operators-pdhjh                  1/1     Running   0          7h47m
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-srskp -o yaml|grep hostIP

//the catsrc pod moved to another node
[root@preserve-olm-env2 2053343]# oc adm top node 
NAME                                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-10-0-141-32.ap-southeast-1.compute.internal    602m         17%    5035Mi          34%       
ip-10-0-148-131.ap-southeast-1.compute.internal   794m         22%    8875Mi          61%       
ip-10-0-165-22.ap-southeast-1.compute.internal    580m         16%    8564Mi          59%       
ip-10-0-173-186.ap-southeast-1.compute.internal   964m         27%    5114Mi          35%       
ip-10-0-196-197.ap-southeast-1.compute.internal   824m         23%    10123Mi         70%       
ip-10-0-217-14.ap-southeast-1.compute.internal    1016m        29%    6605Mi          45%       
[root@preserve-olm-env2 2053343]# 


