Bug 2053343
Summary: Cluster Autoscaler not scaling down nodes which seem to qualify for scale-down
Product: OpenShift Container Platform
Reporter: Matt Bargenquast <mbargenq>
Component: OLM
Assignee: Per da Silva <pegoncal>
OLM sub component: OLM
QA Contact: kuiwang
Status: CLOSED ERRATA
Docs Contact:
Severity: low
Priority: medium
CC: agreene, aos-bugs, bbabbar, jiazha, kramraja, krizza, mimccune, oarribas, pegoncal, pmagotra, wking
Version: 4.8
Keywords: ServiceDeliveryImpact
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Catalog source pods created by operator-marketplace were preventing nodes from draining.
Consequence: The cluster autoscaler could not scale down nodes hosting a catalog source pod.
Fix: The cluster-autoscaler.kubernetes.io/safe-to-evict annotation was added to catalog source pods.
Result: The autoscaler can now evict catalog source pods and scale down the nodes they run on (a sketch of the annotation check follows these header fields).
Story Points: ---
Clone Of:
Cloned As: 2057740 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:49:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
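As a reference for the Doc Text above, the fix shows up on a cluster as the safe-to-evict annotation on each catalog source pod in openshift-marketplace. A minimal way to check for it on a running cluster (the loop below is an illustration, not part of the original report; pod names differ per cluster):

    # List every pod in openshift-marketplace and report whether it carries the annotation
    for p in $(oc get pods -n openshift-marketplace -o name); do
      echo "== $p"
      oc get "$p" -n openshift-marketplace -o yaml \
        | grep 'cluster-autoscaler.kubernetes.io/safe-to-evict' \
        || echo "   (annotation not set)"
    done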
Description
Matt Bargenquast
2022-02-11 02:57:09 UTC
I got a chance to investigate deeper into the must-gather and I think that Matt correctly identified the root cause here. From the original text, we can see that we expected node "ip-10-244-54-105.ec2.internal" to be scaled down by the autoscaler. When I look at the pods in the openshift-marketplace namespace, I see this:

NAME                     READY   STATUS    RESTARTS   AGE   IP             NODE
redhat-operators-hqstb   1/1     Running   0          1d    10.130.10.18   ip-10-244-54-105.ec2.internal

We clearly have a pod running on the node. Looking deeper into the pod manifest, Matt again is spot on; we see this (I have clipped just the relevant portion):

ownerReferences:
- apiVersion: operators.coreos.com/v1alpha1
  blockOwnerDeletion: false
  controller: false
  kind: CatalogSource
  name: redhat-operators

And indeed, looking through the autoscaler drain code, it will not be able to remove this entry.

So, now for the fix, how do we handle this? I have a couple of ideas, but I think a permanent fix will take some time, as I will need to do some research about CatalogSources and how we can control them. With that said, here are some possibilities:

1. Quick fix: delete the pod that is blocking. This is very manual, but should at least prove that the autoscaler will scale down those nodes, and hopefully the pod will move to a different node. But if it doesn't, this could cause more frustration.

2. Change the expendable pod priority cutoff by adjusting "podPriorityThreshold" in the ClusterAutoscaler. I noticed that the marketplace pods are running at priority 0. It is possible that the user could set the priority threshold to "1", which would instruct the autoscaler to delete pods below that priority regardless of their owner. *NOTE* this could be highly deleterious if their workload pods are not above priority "0", so be careful with this.

3. Change the way the marketplace pods are deployed to make sure they don't land on autoscaler-enabled machinesets. I'm not sure if this is possible, but perhaps there is a way to label the autoscaler machinesets so that the marketplace pods do not land there. If so, this would be the easiest and most fruitful fix.

4. Modify the autoscaler code to understand CatalogSources in the drain code. This will require some discussion with upstream and investigation to determine if this is appropriate. If this marketplace problem is limited to OpenShift only, then making an upstream change will probably not happen, but we could always consider carrying a patch for this situation.

At this point, I will need to investigate how the marketplace works to determine what we can do. Matt, if you have suggestions on people to connect with on the marketplace team, I would be grateful to learn more =)

OK, a little more research, and a few more answers. It looks like this is a known issue with the marketplace: https://github.com/operator-framework/operator-lifecycle-manager/issues/2666. There is also a patch upstream for it: https://github.com/operator-framework/operator-lifecycle-manager/pull/2669. That patch gives another possible way to mitigate this: the user could annotate the marketplace pods with "cluster-autoscaler.kubernetes.io/safe-to-evict", which would tell the autoscaler that it can evict those pods. This would still be a manual process of adding the annotation, but it's another tool to mitigate this issue (a sketch of both workarounds follows this comment). Given that this bug is being tracked by the team that works on the upstream marketplace-operator, I am changing the component to OLM.
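For reference, the two manual mitigations described above could look roughly like this (a sketch only: the pod name is a placeholder, the annotation has to be reapplied if the pod is recreated, and raising podPriorityThreshold carries the caution from option 2):

    # Workaround from the upstream patch discussion: mark a catalog source pod as safe to evict
    oc annotate pod <catalog-source-pod> -n openshift-marketplace \
        cluster-autoscaler.kubernetes.io/safe-to-evict="true"

    # Option 2 above: raise the expendable-pod priority cutoff on the default ClusterAutoscaler
    # (only do this if real workload pods run at a priority above 0)
    oc patch clusterautoscaler default --type merge -p '{"spec":{"podPriorityThreshold":1}}'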
Ideally this situation will be solved by the upstream bug fix that has been proposed.

Seems like this was fixed in https://github.com/openshift/operator-framework-olm/pull/300, specifically this commit: https://github.com/openshift/operator-framework-olm/pull/300/commits/b5b3a041b77a33e56c5708835a8136e3a72801b9

pass on 4.11
--
[root@preserve-olm-env2 2053343]# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-05-083948   True        False         172m    Cluster version is 4.11.0-0.nightly-2022-07-05-083948
[root@preserve-olm-env2 2053343]#
[root@preserve-olm-env2 2053343]# oc project openshift-marketplace
Now using project "openshift-marketplace" on server "https://api.qe-daily-0706.qe.devcluster.openshift.com:6443".
[root@preserve-olm-env2 2053343]# oc get pod
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-qcfvk               1/1     Running   0          7h32m
community-operators-cj9sb               1/1     Running   0          7h32m
marketplace-operator-6bd7679ddd-mltkb   1/1     Running   0          7h37m
qe-app-registry-bbnc9                   1/1     Running   0          3h56m
redhat-marketplace-bmv8b                1/1     Running   0          7h32m
redhat-operators-pdhjh                  1/1     Running   0          7h32m
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-bbnc9 -o yaml|grep safe-to-evict
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-bbnc9 -o yaml|grep hostIP
  hostIP: 10.0.141.32

//add new node to move the pod to that node
//get the node from pod's information and then get machineset from the node's info
[root@preserve-olm-env2 2053343]# oc get machineset qe-daily-0706-q64pj-worker-ap-southeast-1a -o yaml -n openshift-machine-api > ms.yaml
[root@preserve-olm-env2 2053343]# vi ms.yaml
[root@preserve-olm-env2 2053343]# cat ms.yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"
  labels:
    machine.openshift.io/cluster-api-cluster: qe-daily-0706-q64pj
  name: wk
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: qe-daily-0706-q64pj
      machine.openshift.io/cluster-api-machineset: wk
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: qe-daily-0706-q64pj
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: wk
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          ami:
            id: ami-09a19b51d526c1385
          apiVersion: machine.openshift.io/v1beta1
          blockDevices:
          - ebs:
              encrypted: true
              iops: 0
              kmsKey:
                arn: ""
              volumeSize: 120
              volumeType: gp3
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: qe-daily-0706-q64pj-worker-profile
          instanceType: m5.xlarge
          kind: AWSMachineProviderConfig
          metadata:
            creationTimestamp: null
          metadataServiceOptions: {}
          placement:
            availabilityZone: ap-southeast-1a
            region: ap-southeast-1
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - qe-daily-0706-q64pj-worker-sg
          subnet:
            filters:
            - name: tag:Name
              values:
              - qe-daily-0706-q64pj-private-ap-southeast-1a
          tags:
          - name: kubernetes.io/cluster/qe-daily-0706-q64pj
            value: owned
          userDataSecret:
            name: worker-user-data
[root@preserve-olm-env2 2053343]#
[root@preserve-olm-env2 2053343]# oc apply -f ms.yaml
machineset.machine.openshift.io/wk created
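(Not part of the original transcript: while waiting for the new MachineSet to produce a node, the machines can be watched with the command below, assuming the name wk used above.)

    # Watch machines until the wk machine reaches the Running phase
    oc get machines -n openshift-machine-api -w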
[root@preserve-olm-env2 2053343]# oc get machine -A
NAMESPACE               NAME                                               PHASE         TYPE        REGION           ZONE              AGE
openshift-machine-api   qe-daily-0706-q64pj-master-0                       Running       m5.xlarge   ap-southeast-1   ap-southeast-1a   7h25m
openshift-machine-api   qe-daily-0706-q64pj-master-1                       Running       m5.xlarge   ap-southeast-1   ap-southeast-1b   7h25m
openshift-machine-api   qe-daily-0706-q64pj-master-2                       Running       m5.xlarge   ap-southeast-1   ap-southeast-1c   7h25m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1a-fcqq2   Running       m5.xlarge   ap-southeast-1   ap-southeast-1a   7h19m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1b-lnprf   Running       m5.xlarge   ap-southeast-1   ap-southeast-1b   7h19m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1c-gqm5x   Running       m5.xlarge   ap-southeast-1   ap-southeast-1c   7h19m
openshift-machine-api   wk-gfx2v                                           Provisioned   m5.xlarge   ap-southeast-1   ap-southeast-1a   75s
[root@preserve-olm-env2 2053343]# oc get machine -A
NAMESPACE               NAME                                               PHASE     TYPE        REGION           ZONE              AGE
openshift-machine-api   qe-daily-0706-q64pj-master-0                       Running   m5.xlarge   ap-southeast-1   ap-southeast-1a   7h30m
openshift-machine-api   qe-daily-0706-q64pj-master-1                       Running   m5.xlarge   ap-southeast-1   ap-southeast-1b   7h30m
openshift-machine-api   qe-daily-0706-q64pj-master-2                       Running   m5.xlarge   ap-southeast-1   ap-southeast-1c   7h30m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1a-fcqq2   Running   m5.xlarge   ap-southeast-1   ap-southeast-1a   7h24m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1b-lnprf   Running   m5.xlarge   ap-southeast-1   ap-southeast-1b   7h24m
openshift-machine-api   qe-daily-0706-q64pj-worker-ap-southeast-1c-gqm5x   Running   m5.xlarge   ap-southeast-1   ap-southeast-1c   7h24m
openshift-machine-api   wk-gfx2v                                           Running   m5.xlarge   ap-southeast-1   ap-southeast-1a   5m40s
[root@preserve-olm-env2 2053343]# oc adm top node
NAME                                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-139-30.ap-southeast-1.compute.internal    123m         3%     1723Mi          11%
ip-10-0-141-32.ap-southeast-1.compute.internal    399m         11%    4926Mi          33%
ip-10-0-148-131.ap-southeast-1.compute.internal   884m         25%    8690Mi          60%
ip-10-0-165-22.ap-southeast-1.compute.internal    766m         21%    8405Mi          58%
ip-10-0-173-186.ap-southeast-1.compute.internal   927m         26%    5125Mi          35%
ip-10-0-196-197.ap-southeast-1.compute.internal   694m         19%    10075Mi         69%
ip-10-0-217-14.ap-southeast-1.compute.internal    987m         28%    6586Mi          45%

//ip-10-0-139-30.ap-southeast-1.compute.internal is new added node
[root@preserve-olm-env2 2053343]# oc get pod
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-qcfvk               1/1     Running   0          7h32m
community-operators-cj9sb               1/1     Running   0          7h32m
marketplace-operator-6bd7679ddd-mltkb   1/1     Running   0          7h37m
qe-app-registry-bbnc9                   1/1     Running   0          3h56m
redhat-marketplace-bmv8b                1/1     Running   0          7h32m
redhat-operators-pdhjh                  1/1     Running   0          7h32m
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-bbnc9 -o yaml|grep hostIP
  hostIP: 10.0.141.32
[root@preserve-olm-env2 2053343]# oc delete pod qe-app-registry-bbnc9
pod "qe-app-registry-bbnc9" deleted
[root@preserve-olm-env2 2053343]# oc get pod
NAME                                    READY   STATUS              RESTARTS   AGE
certified-operators-qcfvk               1/1     Running             0          7h34m
community-operators-cj9sb               1/1     Running             0          7h34m
marketplace-operator-6bd7679ddd-mltkb   1/1     Running             0          7h39m
qe-app-registry-hm7fq                   0/1     ContainerCreating   0          3s
redhat-marketplace-bmv8b                1/1     Running             0          7h34m
redhat-operators-pdhjh                  1/1     Running             0          7h34m
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-hm7fq -o yaml|grep hostIP
  hostIP: 10.0.139.30

// catsrc pod move to that node ip-10-0-139-30.ap-southeast-1.compute.internal
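(Side note, not from the original transcript: the hostIP greps above can also be replaced by a single wide listing, which prints the node name for each marketplace pod directly.)

    # Show which node each marketplace pod is scheduled on
    oc get pods -n openshift-marketplace -o wide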
[root@preserve-olm-env2 2053343]# cat clusterauto.yaml
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: default
spec:
  resourceLimits:
    maxNodesTotal: 6
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
[root@preserve-olm-env2 2053343]#
[root@preserve-olm-env2 2053343]# oc apply -f clusterauto.yaml
clusterautoscaler.autoscaling.openshift.io/default created
[root@preserve-olm-env2 2053343]# cat machinesetauto.yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: wkma
  namespace: openshift-machine-api
spec:
  maxReplicas: 1
  minReplicas: 0
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: wk
[root@preserve-olm-env2 2053343]#
[root@preserve-olm-env2 2053343]# oc apply -f machinesetauto.yaml
machineautoscaler.autoscaling.openshift.io/wkma created
[root@preserve-olm-env2 2053343]# oc get po -n openshift-machine-api
NAME                                          READY   STATUS    RESTARTS        AGE
cluster-autoscaler-default-6f496d446-q69wd    1/1     Running   0               65s
cluster-autoscaler-operator-b9f6b4779-47nh6   2/2     Running   0               7h43m
cluster-baremetal-operator-fd8749f6f-rl9k5    2/2     Running   0               7h43m
machine-api-controllers-666c749d87-jngnn      7/7     Running   1 (7h37m ago)   7h38m
machine-api-operator-5db457cd7c-xtzsn         2/2     Running   0               7h43m
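(Another optional check, not in the original run: the autoscaler's scale-down decisions can be followed live; the deployment name below is inferred from the pod listed above.)

    # Stream the cluster autoscaler logs while the scale-down happens
    oc logs -f deployment/cluster-autoscaler-default -n openshift-machine-api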
[root@preserve-olm-env2 2053343]# oc logs cluster-autoscaler-default-6f496d446-q69wd -n openshift-machine-api
I0706 06:55:11.411231       1 main.go:430] Cluster Autoscaler 1.24.0
I0706 06:55:12.490993       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler...
[root@preserve-olm-env2 2053343]# oc logs cluster-autoscaler-default-6f496d446-q69wd -n openshift-machine-api
I0706 06:55:11.411231       1 main.go:430] Cluster Autoscaler 1.24.0
I0706 06:55:12.490993       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler...
I0706 06:57:39.487241       1 leaderelection.go:258] successfully acquired lease openshift-machine-api/cluster-autoscaler
W0706 06:57:39.505696       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
I0706 06:57:39.517297       1 cloud_provider_builder.go:29] Building clusterapi cloud provider.
W0706 06:57:39.517317       1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
W0706 06:57:39.517605       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0706 06:57:39.517617       1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0706 06:57:39.523565       1 clusterapi_controller.go:345] Using version "v1beta1" for API group "machine.openshift.io"
I0706 06:57:39.537105       1 clusterapi_controller.go:422] Resource "machinesets" available
I0706 06:57:39.537212       1 clusterapi_controller.go:422] Resource "machinesets/status" available
I0706 06:57:39.537248       1 clusterapi_controller.go:422] Resource "machinesets/scale" available
I0706 06:57:39.537274       1 clusterapi_controller.go:422] Resource "machines" available
I0706 06:57:39.537299       1 clusterapi_controller.go:422] Resource "machines/status" available
I0706 06:57:39.537325       1 clusterapi_controller.go:422] Resource "machinehealthchecks" available
I0706 06:57:39.537349       1 clusterapi_controller.go:422] Resource "machinehealthchecks/status" available
I0706 06:57:39.643307       1 main.go:322] Registered cleanup signal handler
I0706 06:57:39.643455       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0706 06:57:39.688297       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 44.789511ms
W0706 06:57:49.681685       1 clusterstate.go:423] AcceptableRanges have not been populated yet. Skip checking
I0706 06:57:50.450322       1 static_autoscaler.go:445] No unschedulable pods
I0706 06:57:51.254245       1 legacy.go:717] No candidates for scale down
I0706 06:57:51.278744       1 delete.go:103] Successfully added DeletionCandidateTaint on node ip-10-0-139-30.ap-southeast-1.compute.internal
I0706 06:58:02.267569       1 static_autoscaler.go:445] No unschedulable pods
I0706 06:58:03.094767       1 delete.go:103] Successfully added ToBeDeletedTaint on node ip-10-0-139-30.ap-southeast-1.compute.internal
I0706 06:58:03.100234       1 actuator.go:194] Scale-down: removing node ip-10-0-139-30.ap-southeast-1.compute.internal, utilization: {0.12685714285714286 0.11808737326873289 0 cpu 0.12685714285714286}, pods to reschedule: qe-app-registry-hm7fq
I0706 06:58:04.280747       1 request.go:601] Waited for 1.178014024s due to client-side throttling, not priority and fairness, request: POST:https://172.30.0.1:443/api/v1/namespaces/openshift-network-diagnostics/pods/network-check-target-nm7wq/eviction
I0706 06:58:04.691708       1 drain.go:139] Not deleted yet openshift-marketplace/qe-app-registry-hm7fq
I0706 06:58:09.697792       1 drain.go:150] All pods removed from ip-10-0-139-30.ap-southeast-1.compute.internal
...

//the node is removed
[root@preserve-olm-env2 2053343]# oc get pod
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-qcfvk               1/1     Running   0          7h47m
community-operators-cj9sb               1/1     Running   0          7h47m
marketplace-operator-6bd7679ddd-mltkb   1/1     Running   0          7h52m
qe-app-registry-srskp                   1/1     Running   0          7m2s
redhat-marketplace-bmv8b                1/1     Running   0          7h47m
redhat-operators-pdhjh                  1/1     Running   0          7h47m
[root@preserve-olm-env2 2053343]# oc get pod qe-app-registry-srskp -o yaml|grep hostIP
  hostIP: 10.0.141.32

//catsrc pod move to other node
[root@preserve-olm-env2 2053343]# oc adm top node
NAME                                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-141-32.ap-southeast-1.compute.internal    602m         17%    5035Mi          34%
ip-10-0-148-131.ap-southeast-1.compute.internal   794m         22%    8875Mi          61%
ip-10-0-165-22.ap-southeast-1.compute.internal    580m         16%    8564Mi          59%
ip-10-0-173-186.ap-southeast-1.compute.internal   964m         27%    5114Mi          35%
ip-10-0-196-197.ap-southeast-1.compute.internal   824m         23%    10123Mi         70%
ip-10-0-217-14.ap-southeast-1.compute.internal    1016m        29%    6605Mi          45%
[root@preserve-olm-env2 2053343]#
--
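(A final optional check, not in the original run, assuming the MachineSet wk and MachineAutoscaler wkma created earlier: confirm the temporary capacity was scaled back down.)

    # The wk MachineSet should report 0 desired/current replicas after the scale-down
    oc get machineset wk -n openshift-machine-api
    oc get machineautoscaler wkma -n openshift-machine-api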
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.