Created attachment 1851797 [details]
autoscaler logs

Description of problem:
The autoscaler should not scale down nodes whose utilization is above the scale-down utilization threshold, but it removes them anyway. For example, the nodes are removed even if I set utilizationThreshold: "0.001".

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-17-223655

How reproducible:
Always

Steps to Reproduce:
1. Create a ClusterAutoscaler:

apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
    utilizationThreshold: "0.001"

2. Create a MachineAutoscaler (a sketch of a matching manifest is included under Additional info below):

$ oc get machineautoscaler
NAME                REF KIND     REF NAME               MIN   MAX   AGE
machineautoscaler   MachineSet   huliu-033-tnd6f-test   1     3     170m

3. Scale down the cluster-version-operator and cluster-autoscaler-operator so the autoscaler deployment can be edited directly, then raise the autoscaler log verbosity:

$ oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0
$ oc scale deployment cluster-autoscaler-operator --replicas=0
$ oc edit deploy cluster-autoscaler-default
  - --v=4

4. Create a workload to trigger a scale-up.
5. Wait for the new nodes to join the cluster, then delete the workload.
6. Check whether the machines are scaled down, and check the autoscaler logs.

Actual results:
Nodes are removed even though "--scale-down-utilization-threshold=0.001" is set:

spec:
  containers:
  - args:
    - --logtostderr
    - --v=4
    - --cloud-provider=clusterapi
    - --namespace=openshift-machine-api
    - --scale-down-enabled=true
    - --scale-down-delay-after-add=10s
    - --scale-down-delay-after-delete=10s
    - --scale-down-delay-after-failure=10s
    - --scale-down-unneeded-time=10s
    - --scale-down-utilization-threshold=0.001

After adding the workload, machineset huliu-033-tnd6f-test scales up to 3 machines:

$ oc get machine
NAME                             PHASE     TYPE        REGION   ZONE      AGE
huliu-033-tnd6f-master-0         Running   bx2d-4x16   eu-gb    eu-gb-1   24h
huliu-033-tnd6f-master-1         Running   bx2d-4x16   eu-gb    eu-gb-2   24h
huliu-033-tnd6f-master-2         Running   bx2d-4x16   eu-gb    eu-gb-3   24h
huliu-033-tnd6f-test-6ff57       Running   bx2d-4x16   eu-gb    eu-gb-3   13m
huliu-033-tnd6f-test-brvps       Running   bx2d-4x16   eu-gb    eu-gb-3   8m3s
huliu-033-tnd6f-test-xkrrf       Running   bx2d-4x16   eu-gb    eu-gb-3   9m34s
huliu-033-tnd6f-worker-1-7v4bl   Running   bx2d-4x16   eu-gb    eu-gb-1   22h

After removing the workload, machineset huliu-033-tnd6f-test scales down to 1 machine:

$ oc get node
NAME                             STATUS   ROLES    AGE   VERSION
huliu-033-tnd6f-master-0         Ready    master   25h   v1.23.0+60f5a1c
huliu-033-tnd6f-master-1         Ready    master   25h   v1.23.0+60f5a1c
huliu-033-tnd6f-master-2         Ready    master   25h   v1.23.0+60f5a1c
huliu-033-tnd6f-test-brvps       Ready    worker   19m   v1.23.0+60f5a1c
huliu-033-tnd6f-worker-1-7v4bl   Ready    worker   22h   v1.23.0+60f5a1c

$ oc logs -f cluster-autoscaler-default-7c7bd99d87-cstm8 | grep utilization
…
I0119 06:00:08.735925       1 scale_down.go:444] Node huliu-033-tnd6f-test-brvps is not suitable for removal - memory utilization too big (0.087781)
I0119 06:00:20.565116       1 scale_down.go:444] Node huliu-033-tnd6f-test-xkrrf is not suitable for removal - memory utilization too big (0.087781)
I0119 06:00:20.565396       1 scale_down.go:444] Node huliu-033-tnd6f-test-brvps is not suitable for removal - memory utilization too big (0.087781)
I0119 06:00:32.398823       1 scale_down.go:444] Node huliu-033-tnd6f-test-xkrrf is not suitable for removal - memory utilization too big (0.087781)
I0119 06:00:32.399000       1 scale_down.go:444] Node huliu-033-tnd6f-test-brvps is not suitable for removal - memory utilization too big (0.087781)
I0119 06:00:44.246585       1 scale_down.go:444] Node huliu-033-tnd6f-test-xkrrf is not suitable for removal - memory utilization too big (0.087781)
I0119 06:00:44.246810       1 scale_down.go:444] Node huliu-033-tnd6f-test-brvps is not suitable for removal - memory utilization too big (0.087781)

Expected results:
The cluster autoscaler should use utilizationThreshold to decide whether a node can be considered for scale-down; only nodes whose utilization is below the threshold should be candidates for removal.

Additional info:
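For reference, a MachineAutoscaler manifest consistent with the "oc get machineautoscaler" output in step 2 would look roughly like the following. This is only a sketch: the min/max replicas and MachineSet reference are taken from the listing above, while the namespace is an assumption based on where Machine API objects normally live.

apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "machineautoscaler"           # name taken from the listing above
  namespace: "openshift-machine-api"  # assumed namespace for Machine API objects
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: huliu-033-tnd6f-test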
Do you happen to have a must-gather for the cluster on which you replicated this bug? It would be good to see what the `cluster-autoscaler-default` deployment looked like.
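(For the record, a must-gather can typically be collected with something like the following; the destination directory is arbitrary and only shown as an example.)

$ oc adm must-gather --dest-dir=./must-gather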
Scratch that: the configuration is printed in the logs, so I'll review the logs.
I0119 05:55:53.919184       1 clusterapi_controller.go:556] node "huliu-033-tnd6f-test-6ff57" is in nodegroup "MachineSet/openshift-machine-api/huliu-033-tnd6f-test"
I0119 05:55:53.919231       1 scale_down.go:444] Node huliu-033-tnd6f-test-6ff57 is not suitable for removal - memory utilization too big (0.087781)
I0119 05:56:04.755697       1 static_autoscaler.go:335] 3 unregistered nodes present
I0119 05:56:04.755739       1 static_autoscaler.go:611] Removing unregistered node ibmvpc://huliu-033-tnd6f/eu-gb-3/huliu-033-tnd6f-test-6ff57

So it scaled the machine down because it decided that it was unregistered, which is odd given that it had just noted the utilization of this same node. I will need to refresh my memory on what an unregistered node is and work out why IBM is not registering its nodes.
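(As an aside, the relevant lines can be pulled straight out of the running autoscaler with something like the command below; the deployment name and namespace match the ones used in the reproduction steps, but are otherwise an assumption about this cluster's layout.)

$ oc logs -n openshift-machine-api deploy/cluster-autoscaler-default | grep -E 'unregistered|not suitable for removal'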
I think we will need to see a must-gather for this, as well as the logs for the IBM machine controller. For some reason, the autoscaler is treating these new instances as never becoming nodes in the cluster. I have a feeling we will need to see the Node, Machine, and MachineSet objects from when this happens, as well as the logs for the machine controller.

In reference to the output above, these machines appear to exist:

huliu-033-tnd6f-test-6ff57   Running   bx2d-4x16   eu-gb   eu-gb-3   13m
huliu-033-tnd6f-test-xkrrf   Running   bx2d-4x16   eu-gb   eu-gb-3   9m34s

but have no equivalent in the node listing. These are the first and last references I see in the logs to these nodes:

huliu-033-tnd6f-test-6ff57
W0119 05:40:48.912285       1 clusterapi_controller.go:455] Machine "huliu-033-tnd6f-test-6ff57" has no providerID
I0119 05:56:04.755739       1 static_autoscaler.go:611] Removing unregistered node ibmvpc://huliu-033-tnd6f/eu-gb-3/huliu-033-tnd6f-test-6ff57

huliu-033-tnd6f-test-xkrrf
W0119 05:44:28.274254       1 clusterapi_controller.go:455] Machine "huliu-033-tnd6f-test-xkrrf" has no providerID
I0119 06:03:30.986062       1 static_autoscaler.go:611] Removing unregistered node ibmvpc://huliu-033-tnd6f/eu-gb-3/huliu-033-tnd6f-test-xkrrf

It looks like these nodes stayed unregistered for more than 15 minutes, which means they should be reaped by the autoscaler, given the max-node-provision-time:

I0119 04:51:19.924415       1 flags.go:52] FLAG: --max-node-provision-time="15m0s"

So, for some reason these machines never became nodes, and the autoscaler properly deleted them as unregistered. I think to get to the bottom of this mystery we'll need the information I mentioned at the top of this comment. I am switching the component to Other Providers, as I believe this is not an issue with the autoscaler.
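(For context, the 15-minute window corresponds to the autoscaler's --max-node-provision-time flag. In the OpenShift ClusterAutoscaler resource this is exposed as spec.maxNodeProvisionTime, so on a platform that is slow to register nodes it could be raised along these lines. This is a sketch only; the 30m value is illustrative, not a recommendation.)

apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  maxNodeProvisionTime: 30m   # illustrative value; the default corresponds to the 15m0s flag above
  scaleDown:
    enabled: true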
must-gather: http://file.rdu.redhat.com/~zhsun/must-gather.local.6680862744128874926.zip
Having looked at the must-gather, I can see the issue. Taking a single instance as an example:

The ProviderID on the Machine:
ibmvpc://zhsunibm-nf2zt/eu-gb-1/zhsunibm-nf2zt-worker-1-5vszq

The ProviderID on the Node:
ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm-nf2zt/0787_d3914005-41f1-4b81-83b7-e0db02df3aa7

These need to match; otherwise the autoscaler can't relate the two objects. They should also just match in general, since they refer to the same instance. We need to fix this before we ship 4.10, otherwise this will be very hard to fix down the line.
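(A quick way to compare the two sides is to print the providerID field from both object types; a sketch, using standard jsonpath output for readability:)

$ oc get machines -n openshift-machine-api -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'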
Just as a follow-up here: we have talked with IBM, and they have an engineer looking into a patch for the Machine API actuator.
Verified.

clusterversion: 4.10.0-0.nightly-2022-01-22-102609

Tested with the above steps; "--scale-down-utilization-threshold" works as expected. If I set utilizationThreshold: "0.001", the nodes are not removed. And the providerIDs match:

$ oc get machine -o yaml | grep providerID
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/0787_e78bb302-6a64-4d80-9014-7ae20d6198cf
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/0797_393d5d52-d90f-4cf5-ad35-baa59ed0a345
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/07a7_aebab996-5132-4d2c-86e5-0dcfc0fa5bfd
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/0787_3edd0f98-459c-4850-810a-adecdfd8ed18
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/0797_4f9f4b1e-bb36-4d42-8f4e-49bc11447061
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/07a7_b4eaa9f2-bc7b-4d99-9590-bcd3976dd3a3
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/07a7_ac90b255-e169-4ad6-ad05-4c0061ca8b63
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/07a7_92b8b70d-2e97-4b80-b333-ec329db3f4f9

$ oc get node -o yaml | grep providerID
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/0787_e78bb302-6a64-4d80-9014-7ae20d6198cf
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/0797_393d5d52-d90f-4cf5-ad35-baa59ed0a345
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/07a7_aebab996-5132-4d2c-86e5-0dcfc0fa5bfd
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/0787_3edd0f98-459c-4850-810a-adecdfd8ed18
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/0797_4f9f4b1e-bb36-4d42-8f4e-49bc11447061
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/07a7_b4eaa9f2-bc7b-4d99-9590-bcd3976dd3a3
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/07a7_ac90b255-e169-4ad6-ad05-4c0061ca8b63
    providerID: ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///zhsunibm24-z2nc2/07a7_92b8b70d-2e97-4b80-b333-ec329db3f4f9
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056