Bug 1883978

Summary: [oVirt] autoscaler detects nodes as unregistered and tries to delete them
Product: OpenShift Container Platform Reporter: Gal Zaidman <gzaidman>
Component: Cloud Compute Assignee: Gal Zaidman <gzaidman>
Cloud Compute sub component: oVirt Provider QA Contact: Guilherme Santos <gdeolive>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: high CC: aoconnor, apjagtap
Version: 4.6   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:47:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Gal Zaidman 2020-09-30 16:27:45 UTC
Description of problem:

On OCP 4.6 we started supporting autoscaling by adding providerID to our nodes and machines in OCP.
We saw that on certain situations when we define the autoscaler with:
delayAfterAdd: 5s
delayAfterDelete: 5s

We ended up in a situation where healthy, working nodes were removed in favor of new nodes that had only just started.

When looking into the autoscaler logs we noticed many "unregistered node" warnings, such as:

static_autoscaler.go:320] 2 unregistered nodes present
static_autoscaler.go:592] Removing unregistered node 7e4772d1-c272-4cea-b5d7-041aa0667d23
static_autoscaler.go:608] Failed to remove node 7e4772d1-c272-4cea-b5d7-041aa0667d23: node group min size reached, skipping unregistered node removal

We then understood that when the cluster autoscaler decides it can scale down, it first evicts the old worker nodes. When we define low delay values and there is pressure on the cluster, we can end up in a situation where most of the nodes are in a deleting/provisioning state.

When we looked at how a node is marked as unregistered by the autoscaler [1], we saw that we have a problem with the machine's providerID value.

[1]https://github.com/kubernetes/autoscaler/blob/fde90dee450cb4626d4d683a83e623af1753c075/cluster-autoscaler/clusterstate/clusterstate.go#L968-L985
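The check at [1] boils down to set membership: any instance the cloud provider reports that has no registered Node claiming its provider ID is treated as "unregistered". A minimal Python sketch of that idea (the real implementation is Go in clusterstate.go; names and data here are illustrative, not taken from the actual code):

```python
# Sketch (illustrative names) of the unregistered-node check: an instance
# known to the cloud provider with no matching registered Node (matched by
# provider ID) is considered "unregistered" and becomes a removal candidate.

def unregistered_nodes(cloud_instance_ids, registered_node_provider_ids):
    """Return cloud instance IDs with no registered node claiming them."""
    registered = set(registered_node_provider_ids)
    return [iid for iid in cloud_instance_ids if iid not in registered]

# The oVirt bug: machines carried a provider ID the nodes did not match,
# so healthy workers showed up here and were scheduled for removal.
machines = ["ovirt://aaa", "ovirt://bbb"]
nodes = ["ovirt://aaa"]  # node for "bbb" registered with a different/missing ID
print(unregistered_nodes(machines, nodes))  # ["ovirt://bbb"]
```

This is why a mismatched providerID on a healthy worker is enough to get it flagged for deletion.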

How reproducible:
100%

Steps to Reproduce:
1. Deploy a cluster with an autoscaler https://docs.openshift.com/container-platform/4.5/machine_management/applying-autoscaling.html
2. Look at the autoscaler logs

Comment 1 Gal Zaidman 2020-10-01 07:02:08 UTC
*** Bug 1883979 has been marked as a duplicate of this bug. ***

Comment 2 Gal Zaidman 2020-10-01 07:04:52 UTC
*** Bug 1881051 has been marked as a duplicate of this bug. ***

Comment 3 Gal Zaidman 2020-10-01 07:06:24 UTC
*** Bug 1880136 has been marked as a duplicate of this bug. ***

Comment 5 Guilherme Santos 2020-10-02 16:59:16 UTC
Verified on:
openshift-4.6.0-0.nightly-2020-10-02-065738

Steps: 
1. Have OCP with 3 masters and 3 workers

2. # cat cluster_autoscaler.yaml
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  podPriorityThreshold: -10
  resourceLimits:
    maxNodesTotal: 9
    cores:
      min: 24
      max: 40
    memory:
      min: 96
      max: 256
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 30s
    unneededTime: 30s

3. # oc create -f cluster_autoscaler.yaml

4. # cat machine_autoscaler.yaml
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "primary-jnzvt-worker-0"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 3
  maxReplicas: 6
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: primary-jnzvt-worker-0

5. # oc create -f machine_autoscaler.yaml

6. # oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: autoscaler-demo
EOF

7. # cat scale-up.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up
  labels:
    app: scale-up
spec:
  replicas: 18
  selector:
    matchLabels:
      app: scale-up
  template:
    metadata:
      labels:
        app: scale-up
    spec:
      containers:
      - name: origin-base
        image: openshift/origin-base
        resources:
          requests:
            memory: 2Gi
        command:
        - /bin/sh
        - "-c"
        - "echo 'this should be in the logs' && sleep 86400"
      terminationGracePeriodSeconds: 0

8. # oc apply -n autoscaler-demo -f scale-up.yaml
9. Waited around 1 hr and checked the results:
# oc get pods -n autoscaler-demo && oc get machine -n openshift-machine-api
NAME                        READY   STATUS    RESTARTS   AGE
scale-up-5fd5c67f64-24wk5   1/1     Running   0          78m
scale-up-5fd5c67f64-2qd9c   1/1     Running   0          79m
scale-up-5fd5c67f64-58tjg   1/1     Running   0          79m
scale-up-5fd5c67f64-7fh9q   1/1     Running   0          78m
scale-up-5fd5c67f64-7q5bc   1/1     Running   0          79m
scale-up-5fd5c67f64-cdhv5   1/1     Running   0          79m
scale-up-5fd5c67f64-cv4rl   1/1     Running   0          79m
scale-up-5fd5c67f64-fkq9t   1/1     Running   0          79m
scale-up-5fd5c67f64-grzl2   1/1     Running   0          79m
scale-up-5fd5c67f64-jb57z   1/1     Running   0          79m
scale-up-5fd5c67f64-jhnp9   1/1     Running   0          79m
scale-up-5fd5c67f64-rtq82   1/1     Running   0          79m
scale-up-5fd5c67f64-v5pnd   1/1     Running   0          79m
scale-up-5fd5c67f64-vfq6r   1/1     Running   0          79m
scale-up-5fd5c67f64-wv9ld   1/1     Running   0          79m
scale-up-5fd5c67f64-xmgq6   1/1     Running   0          78m
scale-up-5fd5c67f64-z4wtj   1/1     Running   0          79m
scale-up-5fd5c67f64-zm4j9   1/1     Running   0          79m
NAME                           PHASE     TYPE   REGION   ZONE   AGE
primary-8hhkw-master-0         Running                          3h51m
primary-8hhkw-master-1         Running                          3h51m
primary-8hhkw-master-2         Running                          3h51m
primary-8hhkw-worker-0-4xvnf   Running                          129m
primary-8hhkw-worker-0-pb5kv   Running                          3h40m
primary-8hhkw-worker-0-pcntp   Running                          129m
primary-8hhkw-worker-0-sxkl2   Running                          78m


10. Changed scale-up.yaml to scale to 24 replicas, waited around 20 min, and checked the results:
# oc get pods -n autoscaler-demo && oc get machine -n openshift-machine-api
NAME                        READY   STATUS    RESTARTS   AGE
scale-up-5fd5c67f64-24wk5   1/1     Running   0          97m
scale-up-5fd5c67f64-2qd9c   1/1     Running   0          99m
scale-up-5fd5c67f64-58tjg   1/1     Running   0          99m
scale-up-5fd5c67f64-7fh9q   1/1     Running   0          97m
scale-up-5fd5c67f64-7q5bc   1/1     Running   0          99m
scale-up-5fd5c67f64-8tx9c   1/1     Running   0          28m
scale-up-5fd5c67f64-b9zk9   1/1     Running   0          28m
scale-up-5fd5c67f64-cdhv5   1/1     Running   0          99m
scale-up-5fd5c67f64-cv4rl   1/1     Running   0          99m
scale-up-5fd5c67f64-fkq9t   1/1     Running   0          99m
scale-up-5fd5c67f64-fkxhf   1/1     Running   0          28m
scale-up-5fd5c67f64-grzl2   1/1     Running   0          99m
scale-up-5fd5c67f64-jb57z   1/1     Running   0          99m
scale-up-5fd5c67f64-jhnp9   1/1     Running   0          99m
scale-up-5fd5c67f64-qxnfk   1/1     Running   0          27m
scale-up-5fd5c67f64-rtq82   1/1     Running   0          99m
scale-up-5fd5c67f64-v5pnd   1/1     Running   0          99m
scale-up-5fd5c67f64-vfq6r   1/1     Running   0          99m
scale-up-5fd5c67f64-wv9ld   1/1     Running   0          99m
scale-up-5fd5c67f64-xm72p   1/1     Running   0          27m
scale-up-5fd5c67f64-xmgq6   1/1     Running   0          97m
scale-up-5fd5c67f64-xn6cb   1/1     Running   0          28m
scale-up-5fd5c67f64-z4wtj   1/1     Running   0          99m
scale-up-5fd5c67f64-zm4j9   1/1     Running   0          99m
NAME                           PHASE     TYPE   REGION   ZONE   AGE
primary-8hhkw-master-0         Running                          4h10m
primary-8hhkw-master-1         Running                          4h10m
primary-8hhkw-master-2         Running                          4h10m
primary-8hhkw-worker-0-4xvnf   Running                          148m
primary-8hhkw-worker-0-pb5kv   Running                          4h
primary-8hhkw-worker-0-pcntp   Running                          148m
primary-8hhkw-worker-0-sxkl2   Running                          97m
primary-8hhkw-worker-0-zjr6g   Running                          28m

11. Checked the provider ID in the new machines:
# oc describe {node,machine}/{primary-8hhkw-worker-0-sxkl2,primary-8hhkw-worker-0-zjr6g} -n openshift-machine-api | egrep "ProviderID|Provider ID"
ProviderID:                               ovirt://c17bec45-2237-433c-9661-ccdce2aa3dbc
ProviderID:                               ovirt://39bfbf06-d803-4bf6-ab1e-bb3a4324013a
  Provider ID:  ovirt://c17bec45-2237-433c-9661-ccdce2aa3dbc
  Provider ID:  ovirt://39bfbf06-d803-4bf6-ab1e-bb3a4324013a
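The equality being verified in step 11 is that each node's ProviderID is a well-formed "ovirt://<uuid>" string identical to its machine's Provider ID. A small sketch of that check, using the UUIDs from the output above (the helper name is illustrative):

```python
# Sketch: verify that a node/machine pair share the same well-formed
# "ovirt://<uuid>" provider ID, as checked manually in step 11.
import re

OVIRT_ID = re.compile(
    r"^ovirt://[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)

def ids_match(node_provider_id: str, machine_provider_id: str) -> bool:
    """True when both IDs are equal and the shared ID is a valid oVirt provider ID."""
    return (bool(OVIRT_ID.match(node_provider_id))
            and node_provider_id == machine_provider_id)

# Pairs from the `oc describe` output above:
assert ids_match("ovirt://c17bec45-2237-433c-9661-ccdce2aa3dbc",
                 "ovirt://c17bec45-2237-433c-9661-ccdce2aa3dbc")
assert ids_match("ovirt://39bfbf06-d803-4bf6-ab1e-bb3a4324013a",
                 "ovirt://39bfbf06-d803-4bf6-ab1e-bb3a4324013a")
```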


Results:
Extra machines were created successfully, the provider IDs match, and the new machines and pods remained stable over time.

Comment 8 errata-xmlrpc 2020-10-27 16:47:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196