Bug 1861642 - baremetal: Cluster Autoscaler Operator doesn't expose max-node-provision-time arg of CA
Summary: baremetal: Cluster Autoscaler Operator doesn't expose max-node-provision-time...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Steven Hardy
QA Contact: Daniel
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-29 07:00 UTC by Daniel
Modified: 2020-10-27 16:21 UTC
CC: 5 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
A new field, maxNodeProvisionTime, was added to the ClusterAutoscaler resource. It can be used to control how long the cluster autoscaler waits for a new machine to be provisioned before considering the provisioning failed.
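For example, the new field can be set on the ClusterAutoscaler custom resource like this (a minimal sketch; the 45m value is illustrative, not a recommended default):

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: "default"
spec:
  # Wait up to 45 minutes for a newly created machine to become a node
  # before treating the provisioning as failed (the upstream default is 15m).
  maxNodeProvisionTime: 45m
```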
Clone Of:
Environment:
Last Closed: 2020-10-27 16:21:20 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-autoscaler-operator pull 158 0 None closed Bug 1861642: Add maxNodeProvisionTime for baremetal 2020-11-13 06:01:52 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:21:39 UTC

Description Daniel 2020-07-29 07:00:11 UTC
Description of problem:
ClusterAutoscaler doesn't create a new machine and a new node when needed. 

Version-Release number of selected component (if applicable):
Client Version: 4.3.23-202005230952-4fb2d4d
Server Version: 4.6.0-0.ci-2020-07-21-114552
Kubernetes Version: v1.17.0-alpha.0.7867+649a587b0a0f5d-dirty


How reproducible:
Every time

Steps to Reproduce:
Cluster setup: 2 deployed workers and 1 only provisioned worker
* Full instructions are in the attached test case.

1. Create a new bmh and wait for it to be in a "Ready" state
2. Create a new ClusterAutoscaler (maxNodesTotal=3)
3. Create a new MachineAutoscaler with the machineset specified (min-replicas=1 and max-replicas=3)
4. Create an httpd deployment that requests 6500Mi of memory for each container (so that each node can only run 1 pod), then create 1 pod from this deployment.
5. Scale the httpd deployment to 3 replica pods
6. Two pods should be running (1 on each worker), but 1 application pod is pending because the cluster does not have enough resources to schedule it
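The manifests for steps 2 and 3 would look roughly like this (a sketch; the MachineAutoscaler name and machineset name are placeholders, not taken from the test case):

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: "default"
spec:
  resourceLimits:
    maxNodesTotal: 3
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-scaler            # placeholder name
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: example-worker-0       # placeholder machineset name
```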

Actual results:
The pod is still pending...

Expected results:
After a few minutes I expect a new machine to be created, then a new node to join the cluster and take the pending pod.

Additional info:

Comment 2 Steven Hardy 2020-07-30 09:11:53 UTC
Ok I tested this (4.6.0-0.ci-2020-07-21-114552), and I believe it's a problem with the test case rather than any issue with the auto-scaling itself.

I'm not completely clear on the criteria used for making the scaling decision, but it seems that having a pod doing nothing (I tried the httpd example and a busybox container sleeping) while there are pending pods is not sufficient to trigger the scale-up.

Instead, I created a new container which runs the "stress" tool to simulate memory pressure, Dockerfile looks like:

  $ cat Dockerfile 
  FROM docker.io/centos:centos8
  RUN dnf install -y epel-release && dnf install -y stress && dnf clean all

I built this and pushed it to my local registry.
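The stress.yaml applied below lives in the gist linked in the next paragraph; a rough equivalent looks like this (the image path, resource request, and stress flags here are assumptions for illustration):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: stress
  template:
    metadata:
      labels:
        app: stress
    spec:
      containers:
      - name: stress
        # assumption: the image built from the Dockerfile above,
        # pushed to a local registry
        image: my-local-registry/stress:latest
        # illustrative stress invocation: one worker allocating memory
        # to generate real memory pressure on the node
        command: ["stress", "--vm", "1", "--vm-bytes", "6G", "--vm-hang", "0"]
        resources:
          requests:
            memory: "6500Mi"
```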

I then applied the autoscaler and machineautoscaler manifests (note yaml files are available at https://gist.github.com/hardys/41a77adb69661d6c97e722905c0db169):

  $ oc project openshift-machine-api
  Now using project "openshift-machine-api" on server "https://api.ostest.test.metalkube.org:6443".

  $ oc get machineset -n openshift-machine-api
  NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
  ostest-wb5t4-worker-0   2         2         2       2           15h


  $ oc apply -f autoscaler.yaml 
  clusterautoscaler.autoscaling.openshift.io/default created

  $ oc apply -f machine_as.yaml 
  machineautoscaler.autoscaling.openshift.io/scale-automatic created

Then I switched to a new project and created a pod running the stress container

  $ oc new-project auto-scaling
  Now using project "auto-scaling" on server "https://api.ostest.test.metalkube.org:6443"

  $ oc apply -f stress.yaml
  deployment.apps/stress-deployment created

  $ oc get pods
  NAME                                 READY   STATUS    RESTARTS   AGE
  stress-deployment-77c4dd6786-fdk56   1/1     Running   0          10s
  stress-deployment-77c4dd6786-tr9s5   1/1     Running   0          10s

  $ oc get machineset -n openshift-machine-api
  NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
  ostest-wb5t4-worker-0   2         2         2       2           15h

I then scaled up the deployment:

  $ oc scale deployment --replicas=5 stress-deployment
  deployment.apps/stress-deployment scaled
  $ oc get pods
  NAME                                 READY   STATUS    RESTARTS   AGE
  stress-deployment-77c4dd6786-fdk56   1/1     Running   0          38s
  stress-deployment-77c4dd6786-mp7mb   0/1     Pending   0          5s
  stress-deployment-77c4dd6786-nql92   0/1     Pending   0          5s
  stress-deployment-77c4dd6786-szh2s   0/1     Pending   0          5s
  stress-deployment-77c4dd6786-tr9s5   1/1     Running   0          38s

We see the machineset scale-up and a new machine in "Provisioning" state:

  $ oc get machineset -n openshift-machine-api
  NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
  ostest-wb5t4-worker-0   3         3         2       2           15h

  $ oc get machines -n openshift-machine-api
  NAME                          PHASE          TYPE   REGION   ZONE   AGE
  ostest-wb5t4-master-0         Running                               15h
  ostest-wb5t4-master-1         Running                               15h
  ostest-wb5t4-master-2         Running                               15h
  ostest-wb5t4-worker-0-nfkxn   Running                               15h
  ostest-wb5t4-worker-0-rfwvl   Provisioning                          21s
  ostest-wb5t4-worker-0-z8gbd   Running                               15h

A short time later (after adding an extra BMH resource), we see the machine is associated with a BMH and marked as provisioned:

  $ oc get machines -n openshift-machine-api | grep ostest-wb5t4-worker-0-rfwvl
  ostest-wb5t4-worker-0-rfwvl   Provisioned                          11m

  $ oc get bmh -n openshift-machine-api | grep ostest-wb5t4-worker-0-rfwvl
  ostest-extra-worker-0   OK       inspecting               ostest-wb5t4-worker-0-rfwvl   ipmi://[fd2e:6f44:5dd8:c956::1]:6235

However, it takes some time for the BMH resource to be provisioned and for the node to join the cluster, which appears to result in the Machine getting deleted:

  $ oc get machines -n openshift-machine-api
  NAME                          PHASE         TYPE   REGION   ZONE   AGE
  ostest-wb5t4-master-0         Running                              16h
  ostest-wb5t4-master-1         Running                              16h
  ostest-wb5t4-master-2         Running                              16h
  ostest-wb5t4-worker-0-6rvjs   Provisioned                          18m
  ostest-wb5t4-worker-0-nfkxn   Running                              15h
  ostest-wb5t4-worker-0-rfwvl   Deleting                             45m
  ostest-wb5t4-worker-0-srrg2   Deleting                             19m
  ostest-wb5t4-worker-0-z8gbd   Running                              15h

So to make this work correctly we have to ensure that whatever triggers the machine deletion waits longer. I'm not clear whether the scale-down timeouts are relevant here; there doesn't seem to be any other interface in the docs that could influence this behavior:

https://docs.openshift.com/container-platform/4.1/machine_management/applying-autoscaling.html#cluster-autoscaler-cr_applying-autoscaling

Comment 3 Steven Hardy 2020-07-30 09:34:13 UTC
Ok, so it seems that the cluster autoscaler defaults to waiting only 15 minutes for a node to appear after a machine is created:

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca

The max-node-provision-time argument appears to control this, but AFAICS it's not yet supported by openshift/cluster-autoscaler-operator, so we'll have to add it to allow a longer wait time for baremetal deployments.
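Once exposed by the operator, setting the field on the ClusterAutoscaler CR should translate into the corresponding cluster-autoscaler flag (a sketch of the proposed mapping; the 30m value is illustrative):

```yaml
# ClusterAutoscaler CR fragment (proposed field)
spec:
  maxNodeProvisionTime: 30m
# which the operator would render on the cluster-autoscaler
# deployment's container args as:
#   --max-node-provision-time=30m
```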

Comment 4 Michael McCune 2020-07-31 14:12:17 UTC
I looked at Steven's patch for the cluster-autoscaler-operator today. It looks mostly good and I feel we can probably merge once a few details are worked out. I'd also like to get a few reviews from other team members since we are modifying the CRD.

Comment 9 errata-xmlrpc 2020-10-27 16:21:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

