Failed to verify.

clusterversion: 4.4.0-0.nightly-2020-06-27-171816

Steps:

1. Edit machineset zhsun62944-fvwvq-worker-us-east-2c with "replicas: "

# oc get machineset
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsun62944-fvwvq-worker-us-east-2a   1         1         1       1           7h8m
zhsun62944-fvwvq-worker-us-east-2b   1         1         1       1           7h8m
zhsun62944-fvwvq-worker-us-east-2c             1         1       1           7h8m

2. Create clusterautoscaler

3. Create machineautoscaler with machineset zhsun62944-fvwvq-worker-us-east-2c

# oc get machineautoscaler
NAME       REF KIND     REF NAME                             MIN   MAX   AGE
worker-c   MachineSet   zhsun62944-fvwvq-worker-us-east-2c   1     3     2m34s

4. Check the autoscaler logs

E0629 05:58:58.268468 1 pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found
I0629 05:58:58.269214 1 scale_down.go:776] No candidates for scale down
I0629 05:59:08.282893 1 static_autoscaler.go:343] No unschedulable pods
E0629 05:59:08.283011 1 pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found

5. Create workload to scale up the cluster; the autoscaler couldn't scale up

I0629 05:52:06.869902 1 scale_up.go:271] Pod openshift-machine-api/scale-up-788b7f7c75-dg8mh is unschedulable
I0629 05:52:06.869911 1 scale_up.go:271] Pod openshift-machine-api/scale-up-788b7f7c75-6h7k4 is unschedulable
I0629 05:52:06.871060 1 scale_up.go:423] No expansion options
E0629 05:52:06.871367 1 pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found
I've just taken a look at the code paths that lead to the log line `pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found`, and I think this is an issue unrelated to this bug.

That log comes from here, where the node group ID is looked up as a key in a map:
https://github.com/openshift/kubernetes-autoscaler/blob/2ec541e3e31778428a75e0fc16469ba22d94e4bd/cluster-autoscaler/processors/nodes/pre_filtering_processor.go#L60-L64

That map comes from:
https://github.com/openshift/kubernetes-autoscaler/blob/2ec541e3e31778428a75e0fc16469ba22d94e4bd/cluster-autoscaler/utils/utils.go#L26-L38

This suggests the only reason the group size would not be present is an error in `TargetSize()`, except that in our implementation it's impossible for that method to return an error.

I'll spin up a cluster and see if I can reproduce this.
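To make the lookup described above concrete, here is a minimal Go sketch of the pattern: a size map is built by calling `TargetSize()` on each node group, and a later check fails with "group size not found" when a group's ID is not a key in that map. The names `nodeGroup`, `fakeGroup`, `groupSizes` and `checkGroupSize` are hypothetical stand-ins, not the autoscaler's real types, and the skip-on-error behaviour is an assumption based on the description above rather than a copy of the linked code.

```go
package main

import "fmt"

// nodeGroup is a hypothetical, trimmed-down stand-in for the autoscaler's
// node group abstraction. Only the two methods needed here are modelled.
type nodeGroup interface {
	ID() string
	TargetSize() (int, error)
}

// fakeGroup is an illustrative implementation used only for this sketch.
type fakeGroup struct {
	id   string
	size int
	fail bool // when true, TargetSize returns an error
}

func (g fakeGroup) ID() string { return g.id }

func (g fakeGroup) TargetSize() (int, error) {
	if g.fail {
		return 0, fmt.Errorf("cannot determine target size for %s", g.id)
	}
	return g.size, nil
}

// groupSizes builds a map from node group ID to target size.
// Assumption: groups whose TargetSize call fails are left out of the map.
func groupSizes(groups []nodeGroup) map[string]int {
	sizes := make(map[string]int)
	for _, g := range groups {
		size, err := g.TargetSize()
		if err != nil {
			continue
		}
		sizes[g.ID()] = size
	}
	return sizes
}

// checkGroupSize looks the ID up as a map key and reports a missing entry
// with the same wording as the log line in this bug.
func checkGroupSize(sizes map[string]int, id string) (int, error) {
	size, found := sizes[id]
	if !found {
		return 0, fmt.Errorf("error while checking node group size %s: group size not found", id)
	}
	return size, nil
}
```

In this sketch there are two ways to hit the error: `TargetSize()` fails for the group, or the group was never in the list passed to `groupSizes` in the first place. Both produce the same message, so the log line alone does not distinguish them.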
OK, I've found the issue. The bug isn't fully resolved, but it is hidden in later releases by the autoscaling-from-zero work that was done.

https://github.com/openshift/kubernetes-autoscaler/blob/2ec541e3e31778428a75e0fc16469ba22d94e4bd/cluster-autoscaler/cloudprovider/openshiftmachineapi/machineapi_controller.go#L365
This line requires that the replicas be set, otherwise it will fail.

https://github.com/openshift/kubernetes-autoscaler/blob/4abdca547be45a251ba33486324f0cac8664ca25/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go#L510
In the master branch, this was hidden because the bug was verified on a platform that supports scaling from zero.
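As an illustration of why an unset replicas field can make such a check fail, the following Go sketch models replicas as an optional pointer field. It is not the code at the linked lines: `machineSetSpec`, `strictReplicas` and `replicasOrDefault` are hypothetical names, and the fallback value passed by the caller is an assumption, not the behaviour of any particular release.

```go
package main

import "fmt"

// machineSetSpec is a hypothetical, trimmed-down spec. Replicas is a pointer
// so that "not set" (nil) can be told apart from an explicit value.
type machineSetSpec struct {
	Replicas *int32
}

// strictReplicas models a code path that requires replicas to be set and
// fails otherwise.
func strictReplicas(spec machineSetSpec) (int32, error) {
	if spec.Replicas == nil {
		return 0, fmt.Errorf("replicas is not set")
	}
	return *spec.Replicas, nil
}

// replicasOrDefault models a more tolerant code path that falls back to a
// caller-supplied value when replicas is unset. The choice of fallback is
// left to the caller because it is an assumption in this sketch.
func replicasOrDefault(spec machineSetSpec, fallback int32) int32 {
	if spec.Replicas == nil {
		return fallback
	}
	return *spec.Replicas
}
```

With a spec like the one in the reproduction steps (replicas left empty), `strictReplicas` returns an error while `replicasOrDefault` still yields a usable size.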
We've decided we need to revert this, as the fix is not complete for 4.4. We will complete the fix in master and start the backport process, but by the time it is backported to 4.5 it is unlikely to be worth completing for 4.4.
This is fixed in newer releases and is of low priority and low impact to users. It is not worth backporting at this point.