Description of problem:
On OCP 4.6 we started supporting autoscaling by adding a providerID to our nodes and machines in OCP. We saw that in certain situations, when we define the autoscaler with:

  delayAfterAdd: 5s
  delayAfterDelete: 5s

we end up in a situation where working nodes are removed in favor of new nodes that have just started. When looking into the autoscaler logs we noticed many "unregistered" warnings, such as:

  static_autoscaler.go:320] 2 unregistered nodes present
  static_autoscaler.go:592] Removing unregistered node 7e4772d1-c272-4cea-b5d7-041aa0667d23
  static_autoscaler.go:608] Failed to remove node 7e4772d1-c272-4cea-b5d7-041aa0667d23: node group min size reached, skipping unregistered node removal

We then understood that once the cluster autoscaler decides it can scale down, it first evicts the old worker nodes; when we define low timeouts and there is pressure on the cluster, we can sometimes end up in a situation where most of the nodes are in a deleting/provisioning state. When we looked at how a node is marked as unregistered by the autoscaler [1], we saw that we have a problem with the machine provider ID value.

[1] https://github.com/kubernetes/autoscaler/blob/fde90dee450cb4626d4d683a83e623af1753c075/cluster-autoscaler/clusterstate/clusterstate.go#L968-L985

How reproducible:
100%

Steps to Reproduce:
1. Deploy a cluster with an autoscaler: https://docs.openshift.com/container-platform/4.5/machine_management/applying-autoscaling.html
2. Look at the autoscaler logs
*** Bug 1883979 has been marked as a duplicate of this bug. ***
*** Bug 1881051 has been marked as a duplicate of this bug. ***
*** Bug 1880136 has been marked as a duplicate of this bug. ***
Verified on: openshift-4.6.0-0.nightly-2020-10-02-065738

Steps:
1. Have OCP with 3 masters and 3 workers
2. # cat cluster_autoscaler.yaml
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  podPriorityThreshold: -10
  resourceLimits:
    maxNodesTotal: 9
    cores:
      min: 24
      max: 40
    memory:
      min: 96
      max: 256
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 30s
    unneededTime: 30s
3. # oc create -f cluster_autoscaler.yaml
4. # cat machine_autoscaler.yaml
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "primary-jnzvt-worker-0"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 3
  maxReplicas: 6
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: primary-jnzvt-worker-0
5. # oc create -f machine_autoscaler.yaml
6. # oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: autoscaler-demo
EOF
7. # cat scale-up.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up
  labels:
    app: scale-up
spec:
  replicas: 18
  selector:
    matchLabels:
      app: scale-up
  template:
    metadata:
      labels:
        app: scale-up
    spec:
      containers:
      - name: origin-base
        image: openshift/origin-base
        resources:
          requests:
            memory: 2Gi
        command:
        - /bin/sh
        - "-c"
        - "echo 'this should be in the logs' && sleep 86400"
      terminationGracePeriodSeconds: 0
8. # oc apply -n autoscaler-demo -f scale-up.yaml
9.
Waited around 1 hr and checked the results:
# oc get pods -n autoscaler-demo && oc get machine -n openshift-machine-api
NAME                        READY   STATUS    RESTARTS   AGE
scale-up-5fd5c67f64-24wk5   1/1     Running   0          78m
scale-up-5fd5c67f64-2qd9c   1/1     Running   0          79m
scale-up-5fd5c67f64-58tjg   1/1     Running   0          79m
scale-up-5fd5c67f64-7fh9q   1/1     Running   0          78m
scale-up-5fd5c67f64-7q5bc   1/1     Running   0          79m
scale-up-5fd5c67f64-cdhv5   1/1     Running   0          79m
scale-up-5fd5c67f64-cv4rl   1/1     Running   0          79m
scale-up-5fd5c67f64-fkq9t   1/1     Running   0          79m
scale-up-5fd5c67f64-grzl2   1/1     Running   0          79m
scale-up-5fd5c67f64-jb57z   1/1     Running   0          79m
scale-up-5fd5c67f64-jhnp9   1/1     Running   0          79m
scale-up-5fd5c67f64-rtq82   1/1     Running   0          79m
scale-up-5fd5c67f64-v5pnd   1/1     Running   0          79m
scale-up-5fd5c67f64-vfq6r   1/1     Running   0          79m
scale-up-5fd5c67f64-wv9ld   1/1     Running   0          79m
scale-up-5fd5c67f64-xmgq6   1/1     Running   0          78m
scale-up-5fd5c67f64-z4wtj   1/1     Running   0          79m
scale-up-5fd5c67f64-zm4j9   1/1     Running   0          79m
NAME                           PHASE     TYPE   REGION   ZONE   AGE
primary-8hhkw-master-0         Running                          3h51m
primary-8hhkw-master-1         Running                          3h51m
primary-8hhkw-master-2         Running                          3h51m
primary-8hhkw-worker-0-4xvnf   Running                          129m
primary-8hhkw-worker-0-pb5kv   Running                          3h40m
primary-8hhkw-worker-0-pcntp   Running                          129m
primary-8hhkw-worker-0-sxkl2   Running                          78m
10.
Changed scale-up.yaml to scale to 24 containers, waited around 20 min, and checked the results:
# oc get pods -n autoscaler-demo && oc get machine -n openshift-machine-api
NAME                        READY   STATUS    RESTARTS   AGE
scale-up-5fd5c67f64-24wk5   1/1     Running   0          97m
scale-up-5fd5c67f64-2qd9c   1/1     Running   0          99m
scale-up-5fd5c67f64-58tjg   1/1     Running   0          99m
scale-up-5fd5c67f64-7fh9q   1/1     Running   0          97m
scale-up-5fd5c67f64-7q5bc   1/1     Running   0          99m
scale-up-5fd5c67f64-8tx9c   1/1     Running   0          28m
scale-up-5fd5c67f64-b9zk9   1/1     Running   0          28m
scale-up-5fd5c67f64-cdhv5   1/1     Running   0          99m
scale-up-5fd5c67f64-cv4rl   1/1     Running   0          99m
scale-up-5fd5c67f64-fkq9t   1/1     Running   0          99m
scale-up-5fd5c67f64-fkxhf   1/1     Running   0          28m
scale-up-5fd5c67f64-grzl2   1/1     Running   0          99m
scale-up-5fd5c67f64-jb57z   1/1     Running   0          99m
scale-up-5fd5c67f64-jhnp9   1/1     Running   0          99m
scale-up-5fd5c67f64-qxnfk   1/1     Running   0          27m
scale-up-5fd5c67f64-rtq82   1/1     Running   0          99m
scale-up-5fd5c67f64-v5pnd   1/1     Running   0          99m
scale-up-5fd5c67f64-vfq6r   1/1     Running   0          99m
scale-up-5fd5c67f64-wv9ld   1/1     Running   0          99m
scale-up-5fd5c67f64-xm72p   1/1     Running   0          27m
scale-up-5fd5c67f64-xmgq6   1/1     Running   0          97m
scale-up-5fd5c67f64-xn6cb   1/1     Running   0          28m
scale-up-5fd5c67f64-z4wtj   1/1     Running   0          99m
scale-up-5fd5c67f64-zm4j9   1/1     Running   0          99m
NAME                           PHASE     TYPE   REGION   ZONE   AGE
primary-8hhkw-master-0         Running                          4h10m
primary-8hhkw-master-1         Running                          4h10m
primary-8hhkw-master-2         Running                          4h10m
primary-8hhkw-worker-0-4xvnf   Running                          148m
primary-8hhkw-worker-0-pb5kv   Running                          4h
primary-8hhkw-worker-0-pcntp   Running                          148m
primary-8hhkw-worker-0-sxkl2   Running                          97m
primary-8hhkw-worker-0-zjr6g   Running                          28m
11.
Checked the provider ID in the new machines:
# oc describe {node,machine}/{primary-8hhkw-worker-0-sxkl2,primary-8hhkw-worker-0-zjr6g} -n openshift-machine-api | egrep "ProviderID|Provider ID"
ProviderID:    ovirt://c17bec45-2237-433c-9661-ccdce2aa3dbc
ProviderID:    ovirt://39bfbf06-d803-4bf6-ab1e-bb3a4324013a
Provider ID:   ovirt://c17bec45-2237-433c-9661-ccdce2aa3dbc
Provider ID:   ovirt://39bfbf06-d803-4bf6-ab1e-bb3a4324013a

Results: Extra machines were created successfully, the provider IDs match, and the new machines and new containers remained stable over time.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196