Description of problem:
ClusterAutoscaler doesn't create a new machine and a new node when needed.

Version-Release number of selected component (if applicable):
Client Version: 4.3.23-202005230952-4fb2d4d
Server Version: 4.6.0-0.ci-2020-07-21-114552
Kubernetes Version: v1.17.0-alpha.0.7867+649a587b0a0f5d-dirty

How reproducible:
Every time

Steps to Reproduce:
Cluster setup: 2 deployed workers and 1 worker that is only provisioned. Full instructions are in the attached test case.
1. Create a new BareMetalHost (bmh) and wait for it to be in the "Ready" state.
2. Create a new ClusterAutoscaler (maxNodesTotal=3).
3. Create a new MachineAutoscaler targeting the machineset (min-replicas=1 and max-replicas=3).
4. Create an httpd deployment that requests 6500Mi of memory per container (so that each node can only run one pod), then create 1 pod from this deployment.
5. Scale the httpd deployment to 3 replicas.
6. Two pods should be running (one on each worker), but the third application pod is Pending because the cluster does not have enough resources to schedule it.

Actual results:
The pod is still pending.

Expected results:
After a few minutes I expect a new machine to be created, then a new node, which then takes the pending pod.

Additional info:
Link to must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather-bz1861642.zip
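For reference, steps 2 and 3 above correspond to manifests along these lines (a minimal sketch; the metadata names and the machineset name placeholder are illustrative, not taken from the cluster in this report):

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  maxNodesTotal: 3
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-scaler            # illustrative name
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: <machineset-name>     # substitute the actual machineset
```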
Ok I tested this (4.6.0-0.ci-2020-07-21-114552), and I believe it's a problem with the test case rather than any issue with the autoscaling itself.

I'm not completely clear on the criteria used for making the scaling decision, but it seems that having a pod doing nothing (I tried the httpd example and a busybox container sleeping) while there are pending pods is not sufficient to trigger the scale-up.

Instead, I created a new container which runs the "stress" tool to simulate memory pressure. The Dockerfile looks like:

$ cat Dockerfile
FROM docker.io/centos:centos8
RUN dnf install -y epel-release && dnf install -y stress && dnf clean all

I built this and pushed it to my local registry. I then applied the autoscaler and machineautoscaler manifests (note the yaml files are available at https://gist.github.com/hardys/41a77adb69661d6c97e722905c0db169):

$ oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.ostest.test.metalkube.org:6443".

$ oc get machineset -n openshift-machine-api
NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-wb5t4-worker-0   2         2         2       2           15h

$ oc apply -f autoscaler.yaml
clusterautoscaler.autoscaling.openshift.io/default created

$ oc apply -f machine_as.yaml
machineautoscaler.autoscaling.openshift.io/scale-automatic created

Then I switched to a new project and created a pod running the stress container:

$ oc new-project auto-scaling
Now using project "auto-scaling" on server "https://api.ostest.test.metalkube.org:6443"

$ oc apply -f stress.yaml
deployment.apps/stress-deployment created

$ oc get pods
NAME                                 READY   STATUS    RESTARTS   AGE
stress-deployment-77c4dd6786-fdk56   1/1     Running   0          10s
stress-deployment-77c4dd6786-tr9s5   1/1     Running   0          10s

$ oc get machineset -n openshift-machine-api
NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-wb5t4-worker-0   2         2         2       2           15h

I then scaled up the deployment:

$ oc scale deployment --replicas=5 stress-deployment
deployment.apps/stress-deployment scaled

$ oc get pods
NAME                                 READY   STATUS    RESTARTS   AGE
stress-deployment-77c4dd6786-fdk56   1/1     Running   0          38s
stress-deployment-77c4dd6786-mp7mb   0/1     Pending   0          5s
stress-deployment-77c4dd6786-nql92   0/1     Pending   0          5s
stress-deployment-77c4dd6786-szh2s   0/1     Pending   0          5s
stress-deployment-77c4dd6786-tr9s5   1/1     Running   0          38s

We see the machineset scale up and a new machine in the "Provisioning" state:

$ oc get machineset -n openshift-machine-api
NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-wb5t4-worker-0   3         3         2       2           15h

$ oc get machines -n openshift-machine-api
NAME                          PHASE          TYPE   REGION   ZONE   AGE
ostest-wb5t4-master-0         Running                               15h
ostest-wb5t4-master-1         Running                               15h
ostest-wb5t4-master-2         Running                               15h
ostest-wb5t4-worker-0-nfkxn   Running                               15h
ostest-wb5t4-worker-0-rfwvl   Provisioning                          21s
ostest-wb5t4-worker-0-z8gbd   Running                               15h

A short time later (after adding an extra BMH resource), we see the machine is associated with a BMH and marked as Provisioned:

$ oc get machines -n openshift-machine-api | grep ostest-wb5t4-worker-0-rfwvl
ostest-wb5t4-worker-0-rfwvl   Provisioned   11m

$ oc get bmh -n openshift-machine-api | grep ostest-wb5t4-worker-0-rfwvl
ostest-extra-worker-0   OK   inspecting   ostest-wb5t4-worker-0-rfwvl   ipmi://[fd2e:6f44:5dd8:c956::1]:6235

However, it takes some time for the BMH resource to be provisioned and for the node to join the cluster, which appears to result in the Machine getting deleted:

$ oc get machines -n openshift-machine-api
NAME                          PHASE         TYPE   REGION   ZONE   AGE
ostest-wb5t4-master-0         Running                              16h
ostest-wb5t4-master-1         Running                              16h
ostest-wb5t4-master-2         Running                              16h
ostest-wb5t4-worker-0-6rvjs   Provisioned                          18m
ostest-wb5t4-worker-0-nfkxn   Running                              15h
ostest-wb5t4-worker-0-rfwvl   Deleting                             45m
ostest-wb5t4-worker-0-srrg2   Deleting                             19m
ostest-wb5t4-worker-0-z8gbd   Running                              15h

So to make this work correctly we have to ensure that whatever triggers the machine deletion waits longer. I'm not clear if the scale-down timeouts are relevant here; there doesn't seem to be any other interface in the docs that could influence this behavior:
https://docs.openshift.com/container-platform/4.1/machine_management/applying-autoscaling.html#cluster-autoscaler-cr_applying-autoscaling
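For reference, the stress.yaml applied above would be a deployment roughly like the following (a sketch only: the image reference, stress arguments, and memory request here are illustrative assumptions, not the exact manifest, which is in the gist linked earlier):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-deployment
  namespace: auto-scaling
spec:
  replicas: 2
  selector:
    matchLabels:
      app: stress
  template:
    metadata:
      labels:
        app: stress
    spec:
      containers:
      - name: stress
        # image built from the Dockerfile above and pushed to a local registry
        image: <local-registry>/stress:latest
        # stress allocates memory to create real pressure, unlike an idle httpd pod
        command: ["stress", "--vm", "1", "--vm-bytes", "4G"]
        resources:
          requests:
            memory: 6500Mi
```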
Ok so it seems that the cluster autoscaler defaults to waiting only 15 minutes for a node after a machine is created: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca

The max-node-provision-time argument appears to control this, but AFAICS it's not yet supported by openshift/cluster-autoscaler-operator, so we'll have to add it there to enable a longer waiting time for baremetal deployments.
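Once the operator grows support for this, setting a longer timeout could look something like the sketch below. The spec field name here is an assumption, mirroring the upstream --max-node-provision-time flag in the operator's camelCase convention; at the time of this comment it was not yet accepted by the ClusterAutoscaler CRD:

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  maxNodesTotal: 3
  # assumed field name: give slow baremetal provisioning longer than the
  # upstream 15m default before the new Machine is abandoned and deleted
  maxNodeProvisionTime: 45m
```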
I looked at Steven's patch for the cluster-autoscaler-operator today. It looks mostly good and I feel we can probably merge it once a few details are worked out. I'd also like to get a few reviews from other team members since we are modifying the CRD.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196