Bug 1861642
| Field | Value |
| --- | --- |
| Summary | baremetal: Cluster Autoscaler Operator doesn't expose max-node-provision-time arg of CA |
| Product | OpenShift Container Platform |
| Component | Cloud Compute |
| Sub component | BareMetal Provider |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Version | 4.6 |
| Target Release | 4.6.0 |
| Reporter | Daniel <dmaizel> |
| Assignee | Steven Hardy <shardy> |
| QA Contact | Daniel <dmaizel> |
| CC | beth.white, dmaizel, mimccune, rbartal, stbenjam |
| Keywords | Triaged, UpcomingSprint |
| Type | Bug |
| Doc Type | Enhancement |
| Last Closed | 2020-10-27 16:21:20 UTC |

Doc Text: A new interface, `maxNodeProvisionTime`, was added to the ClusterAutoscaler resource. It can be used to control how long the cluster autoscaler waits for a new machine to be provisioned before considering the provisioning attempt failed.
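A minimal sketch of the new field on the ClusterAutoscaler resource (the `45m` value is illustrative, not a default; baremetal provisioning commonly needs more than the upstream 15-minute wait discussed below):

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  # How long to wait for a requested machine to become a node before
  # the autoscaler treats the provisioning attempt as failed.
  maxNodeProvisionTime: 45m
```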
Description
Daniel
2020-07-29 07:00:11 UTC
Link to must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather-bz1861642.zip

---

Ok, I tested this (4.6.0-0.ci-2020-07-21-114552), and I believe it's a problem with the test case rather than any issue with the autoscaling itself. I'm not completely clear on the criteria used for making the scaling decision, but it seems that having a pod doing nothing (I tried the httpd example and a sleeping busybox container) while there are pending pods is not sufficient to trigger the scale-up.

Instead, I created a new container which runs the "stress" tool to simulate memory pressure. The Dockerfile looks like:

```
$ cat Dockerfile
FROM docker.io/centos:centos8
RUN dnf install -y epel-release && dnf install -y stress && dnf clean all
```

I built this and pushed it to my local registry. I then applied the autoscaler and machineautoscaler manifests (the yaml files are available at https://gist.github.com/hardys/41a77adb69661d6c97e722905c0db169):

```
$ oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.ostest.test.metalkube.org:6443".
$ oc get machineset -n openshift-machine-api
NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-wb5t4-worker-0   2         2         2       2           15h
$ oc apply -f autoscaler.yaml
clusterautoscaler.autoscaling.openshift.io/default created
$ oc apply -f machine_as.yaml
machineautoscaler.autoscaling.openshift.io/scale-automatic created
```

Then I switched to a new project and created a deployment running the stress container:

```
$ oc new-project auto-scaling
Now using project "auto-scaling" on server "https://api.ostest.test.metalkube.org:6443"
$ oc apply -f stress.yaml
deployment.apps/stress-deployment created
$ oc get pods
NAME                                 READY   STATUS    RESTARTS   AGE
stress-deployment-77c4dd6786-fdk56   1/1     Running   0          10s
stress-deployment-77c4dd6786-tr9s5   1/1     Running   0          10s
$ oc get machineset -n openshift-machine-api
NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-wb5t4-worker-0   2         2         2       2           15h
```
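The exact manifests used for this test are in the gist linked above; purely as an illustration, a stress deployment of roughly this shape would drive the scale-up (the image path, stress arguments, and memory request here are assumptions, not the values actually used):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-deployment
  namespace: auto-scaling
spec:
  replicas: 2
  selector:
    matchLabels:
      app: stress
  template:
    metadata:
      labels:
        app: stress
    spec:
      containers:
      - name: stress
        # Hypothetical registry path; the reproduction used an image built
        # from the Dockerfile above and pushed to a local registry.
        image: local-registry.example.com/stress:latest
        # Continuously allocate memory so the pods exert real pressure.
        command: ["stress", "--vm", "1", "--vm-bytes", "1G"]
        resources:
          requests:
            # The scheduler counts requests, not usage: sized so that
            # additional replicas become Pending on the existing workers.
            memory: 2Gi
```

For what it's worth, this also bears on the test-case question above: the upstream cluster autoscaler scales up in response to Pending pods that fail scheduling due to insufficient resources, so it is the declared resource request, rather than the actual load a pod generates, that drives the decision.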
I then scaled up the deployment:

```
$ oc scale deployment --replicas=5 stress-deployment
deployment.apps/stress-deployment scaled
$ oc get pods
NAME                                 READY   STATUS    RESTARTS   AGE
stress-deployment-77c4dd6786-fdk56   1/1     Running   0          38s
stress-deployment-77c4dd6786-mp7mb   0/1     Pending   0          5s
stress-deployment-77c4dd6786-nql92   0/1     Pending   0          5s
stress-deployment-77c4dd6786-szh2s   0/1     Pending   0          5s
stress-deployment-77c4dd6786-tr9s5   1/1     Running   0          38s
```

We see the machineset scale up and a new machine in the "Provisioning" state:

```
$ oc get machineset -n openshift-machine-api
NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-wb5t4-worker-0   3         3         2       2           15h
$ oc get machines -n openshift-machine-api
NAME                          PHASE          TYPE   REGION   ZONE   AGE
ostest-wb5t4-master-0         Running                               15h
ostest-wb5t4-master-1         Running                               15h
ostest-wb5t4-master-2         Running                               15h
ostest-wb5t4-worker-0-nfkxn   Running                               15h
ostest-wb5t4-worker-0-rfwvl   Provisioning                          21s
ostest-wb5t4-worker-0-z8gbd   Running                               15h
```

A short time later (after adding an extra BMH resource), we see the machine is associated with a BMH and marked as Provisioned:

```
$ oc get machines -n openshift-machine-api | grep ostest-wb5t4-worker-0-rfwvl
ostest-wb5t4-worker-0-rfwvl   Provisioned   11m
$ oc get bmh -n openshift-machine-api | grep ostest-wb5t4-worker-0-rfwvl
ostest-extra-worker-0   OK   inspecting   ostest-wb5t4-worker-0-rfwvl   ipmi://[fd2e:6f44:5dd8:c956::1]:6235
```

However, it takes some time for the BMH resource to be provisioned and for the node to join the cluster, which appears to result in the Machine getting deleted:

```
$ oc get machines -n openshift-machine-api
NAME                          PHASE         TYPE   REGION   ZONE   AGE
ostest-wb5t4-master-0         Running                              16h
ostest-wb5t4-master-1         Running                              16h
ostest-wb5t4-master-2         Running                              16h
ostest-wb5t4-worker-0-6rvjs   Provisioned                          18m
ostest-wb5t4-worker-0-nfkxn   Running                              15h
ostest-wb5t4-worker-0-rfwvl   Deleting                             45m
ostest-wb5t4-worker-0-srrg2   Deleting                             19m
ostest-wb5t4-worker-0-z8gbd   Running                              15h
```

So to make this work correctly, we have to ensure that whatever triggers that machine deletion waits longer. I'm not clear whether the scale-down timeouts are relevant here; there doesn't seem to be any other interface in the docs that could influence this behavior: https://docs.openshift.com/container-platform/4.1/machine_management/applying-autoscaling.html#cluster-autoscaler-cr_applying-autoscaling

---

Ok, so it seems that the cluster autoscaler defaults to waiting only 15 minutes for a node after a machine is created: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca

The max-node-provision-time argument appears to control this, but AFAICS it's not yet supported by openshift/cluster-autoscaler-operator, so we'll have to add it to enable a longer waiting time for baremetal deployments.

---

I looked at Steven's patch for the cluster-autoscaler-operator today. It looks mostly good, and I feel we can probably merge once a few details are worked out. I'd also like to get a few reviews from other team members since we are modifying the CRD.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
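For completeness, a quick way to confirm the new knob is wired through on a cluster carrying the fix (a sketch: the deployment name assumes the operator's `cluster-autoscaler-<name>` naming for a ClusterAutoscaler called `default`, and `--max-node-provision-time` is the upstream flag the field should map to):

```
$ oc patch clusterautoscaler default --type merge -p '{"spec":{"maxNodeProvisionTime":"45m"}}'
$ oc -n openshift-machine-api get deployment cluster-autoscaler-default \
    -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep max-node-provision-time
```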