
Bug 1881051

Summary: ClusterAutoscaler keeps creating a new Machine even when not needed
Product: OpenShift Container Platform
Component: Cloud Compute
Sub component: oVirt Provider
Version: 4.6
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED DUPLICATE
Type: Bug
Reporter: Jan Zmeskal <jzmeskal>
Assignee: OCP on RHV Team <ocprhvteam>
QA Contact: Lucie Leistnerova <lleistne>
CC: gzaidman, mimccune
Last Closed: 2020-10-01 07:04:52 UTC

Description Jan Zmeskal 2020-09-21 12:56:14 UTC
Description of problem:
When you create a MachineAutoscaler with Min Replicas: 1, a new Machine is immediately created even though one Machine already exists in the managed MachineSet. After a while, that new Machine is gone again and yet another one is being created.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-10-145837

How reproducible:
100 %

Steps to Reproduce:
1. Have an OCP 4.6 cluster deployed with IPI with 3 masters and 3 workers
2. # oc project openshift-machine-api
3. # oc scale --replicas=1 machineset primary-q5fmn-worker-0
4. Create a ClusterAutoscaler similar to this:

# oc get ca default -o yaml | grep -A50 "^spec:"
spec:
  podPriorityThreshold: -10
  resourceLimits:
    cores:
      max: 20
      min: 4
    maxNodesTotal: 6
    memory:
      max: 128
      min: 8
  scaleDown:
    delayAfterAdd: 5s
    delayAfterDelete: 5s
    delayAfterFailure: 30s
    enabled: true
    unneededTime: 30s

5. Double-check the number of Machines

# oc get machine
NAME                           PHASE     TYPE   REGION   ZONE   AGE
primary-q5fmn-master-0         Running                          3h8m
primary-q5fmn-master-1         Running                          3h8m
primary-q5fmn-master-2         Running                          3h8m
primary-q5fmn-worker-0-d4tkw   Running                          23m

6. Create a MachineAutoscaler like this:

# oc get ma primary-q5fmn-worker-0 -o yaml | grep -A50 "^spec:"
spec:
  maxReplicas: 3
  minReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: primary-q5fmn-worker-0
status:
  lastTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: primary-q5fmn-worker-0

7. A new worker Machine is immediately being created (note that d4tkw's age is 25 minutes, while pksmr's age is 61 seconds). Pay close attention to these names. Also take note of the timestamp.

# date
Mon Sep 21 15:32:25 IDT 2020

# oc get machine
NAME                           PHASE          TYPE   REGION   ZONE   AGE
primary-q5fmn-master-0         Running                               3h10m
primary-q5fmn-master-1         Running                               3h10m
primary-q5fmn-master-2         Running                               3h10m
primary-q5fmn-worker-0-d4tkw   Running                               25m
primary-q5fmn-worker-0-pksmr   Provisioning                          61s

8. After a while, the original worker (d4tkw) is being deleted.

# date
Mon Sep 21 15:48:26 IDT 2020

# oc get machine
NAME                           PHASE         TYPE   REGION   ZONE   AGE
primary-q5fmn-master-0         Running                              3h27m
primary-q5fmn-master-1         Running                              3h27m
primary-q5fmn-master-2         Running                              3h27m
primary-q5fmn-worker-0-d4tkw   Deleting                             42m
primary-q5fmn-worker-0-pksmr   Provisioned                          18m

9. And yet again, new workers are being created (two of them, actually).

# date
Mon Sep 21 15:51:01 IDT 2020

# oc get machine
NAME                           PHASE          TYPE   REGION   ZONE   AGE
primary-q5fmn-master-0         Running                               3h30m
primary-q5fmn-master-1         Running                               3h30m
primary-q5fmn-master-2         Running                               3h30m
primary-q5fmn-worker-0-9j7b9                                         3s
primary-q5fmn-worker-0-d4tkw   Deleting                              45m
primary-q5fmn-worker-0-lvmn8   Provisioning                          28s
primary-q5fmn-worker-0-pksmr   Provisioned                           20m

Actual results:
New Machines keep being created and existing ones deleted in a loop. I don't know for how long this keeps going, quite possibly forever.

Expected results:
The autoscaler should only add new Machines when load increases and delete the old ones when load decreases.

Additional info:
This might or might not be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1880136. The behaviour seems quite similar. However, the reporter of the linked bug sees the issue once the load on the cluster increases. In contrast, I see this pattern from the very moment I create a MachineAutoscaler.

Comment 2 Jan Zmeskal 2020-09-21 13:43:35 UTC
One further observation: When I set Min Replicas of the MachineAutoscaler to 2, I did not observe the described behaviour.
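
For reference, that corresponds to a MachineAutoscaler spec roughly like the one shown in the description, with only minReplicas changed:

spec:
  maxReplicas: 3
  minReplicas: 2
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: primary-q5fmn-worker-0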

Comment 3 Gal Zaidman 2020-09-21 14:04:44 UTC
Targeting 4.7 since we will probably not be able to support autoscaling on 4.6.

Comment 4 Michael McCune 2020-09-21 16:47:55 UTC
i'm not seeing the logs for the cluster-autoscaler in the tar attached. i see a file named "cluster-autoscaler-default.log", but that appears to be logs from the cluster-autoscaler-operator. did i miss something?

Comment 5 Jan Zmeskal 2020-09-22 12:35:05 UTC
Hi Michael, sorry if I forgot to attach some log file, but I'm afraid it's too late for me now, as I've lost the environment.

Comment 6 Jan Zmeskal 2020-09-22 14:14:29 UTC
I was afraid that I had made a configuration mistake when I originally reported this bug. The reason is that both my masters and my workers have 4 CPU cores and 16 GB of memory. Therefore, in the original case, with my four VMs (16 cores), I was almost at the max capacity (20 cores). This shouldn't be a problem in the first place, but I decided to reproduce the issue just to be sure. This time I used these autoscalers:

oc describe ca default | grep -A50 "^Spec:"
Spec:
  Pod Priority Threshold:  -10
  Resource Limits:
    Cores:
      Max:            36
      Min:            16
    Max Nodes Total:  9
    Memory:
      Max:  256
      Min:  64
  Scale Down:
    Delay After Add:      10s
    Delay After Delete:   10s
    Delay After Failure:  30s
    Enabled:              true
    Unneeded Time:        30s
Events:                   <none>

Spec:
  Max Replicas:  6
  Min Replicas:  1
  Scale Target Ref:
    API Version:  machine.openshift.io/v1beta1
    Kind:         MachineSet
    Name:         primary-xxtjk-worker-0
Status:
  Last Target Ref:
    API Version:  machine.openshift.io/v1beta1
    Kind:         MachineSet
    Name:         primary-xxtjk-worker-0
Events:
  Type    Reason            Age                   From                           Message
  ----    ------            ----                  ----                           -------
  Normal  SuccessfulUpdate  5m38s (x5 over 6m8s)  machine_autoscaler_controller  Updated MachineAutoscaler target: openshift-machine-api/primary-xxtjk-worker-0

The result is quite similar to the one in my original description.

Original state:
oc get machine
NAME                           PHASE     TYPE   REGION   ZONE   AGE
primary-xxtjk-master-0         Running                          129m
primary-xxtjk-master-1         Running                          129m
primary-xxtjk-master-2         Running                          129m
primary-xxtjk-worker-0-s79r9   Running                          116m

State after about half an hour:
oc get machine
NAME                           PHASE          TYPE   REGION   ZONE   AGE
primary-xxtjk-master-0         Running                               158m
primary-xxtjk-master-1         Running                               158m
primary-xxtjk-master-2         Running                               158m
primary-xxtjk-worker-0-ldd44   Deleting                              28m
primary-xxtjk-worker-0-xvh2k   Provisioning                          4m59s
primary-xxtjk-worker-0-zfnlm   Provisioned                           13m

I'm also attaching all the containers' logs (except for the kube-rbac-proxy containers) from the openshift-machine-api namespace.
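
For reference, per-container logs like these can be gathered with plain oc commands along the following lines; pod and container names vary per cluster, so this is just a sketch:

# dump the logs of every container in the namespace, skipping the kube-rbac-proxy sidecars
for p in $(oc -n openshift-machine-api get pods -o name); do
  for c in $(oc -n openshift-machine-api get "$p" -o jsonpath='{.spec.containers[*].name}'); do
    case "$c" in kube-rbac-proxy*) continue ;; esac
    oc -n openshift-machine-api logs "$p" -c "$c" > "$(basename "$p")-$c.log"
  done
done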

Comment 8 Michael McCune 2020-09-22 14:30:13 UTC
no worries Jan

i did notice something in the ClusterAutoscaler definition:

>     maxNodesTotal: 6

this is the total nodes for the entire cluster, so when the cluster started it was at max size. this doesn't explain the erroneous machine creations, but it gives us another avenue to explore. there might be some sort of bug around the node counting when it is already at max size.
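
as a quick sanity check, the current node count can be compared against maxNodesTotal (this assumes cluster-admin access; it's just the generic way to count nodes):

# count all nodes in the cluster, masters included
oc get nodes --no-headers | wc -l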

i have also seen the type of oscillation you describe here, but it usually happened when the providerID was not properly assigned from the cloud provider layer. i don't think that is the issue here, but it's another avenue of exploration. you will see this in the logs for the cluster-autoscaler.

Comment 9 Jan Zmeskal 2020-09-22 14:34:06 UTC
Hi Michael, 

> when the cluster started it was at max size

That's actually not the case. I had 3 masters and 1 worker. But just to be sure, I increased this number to 9 in comment 6 anyway.

Comment 10 Michael McCune 2020-09-22 14:42:08 UTC
Jan,

i looked at the logs from the reproduction and i think there might be an issue with the ovirt provider not properly adding the providerID to the machines it creates. i see a bunch of output in the autoscaler logs that corresponds to this:

```
I0922 13:51:58.562239       1 scale_down.go:487] Scale-down calculation: ignoring 1 nodes unremovable in the last 5m0s
I0922 13:52:04.567185       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache                          
W0922 13:52:04.579097       1 clusterapi_controller.go:440] Machine "primary-xxtjk-worker-0-ldd44" has no providerID                                           
I0922 13:52:04.579154       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 11.872748ms
W0922 13:52:08.842300       1 clusterapi_controller.go:440] Machine "primary-xxtjk-worker-0-ldd44" has no providerID  
I0922 13:52:08.897232       1 static_autoscaler.go:320] 1 unregistered nodes present                                                                           
I0922 13:52:09.390413       1 static_autoscaler.go:402] No unschedulable pods                                                                                  
I0922 13:52:09.806995       1 scale_down.go:487] Scale-down calculation: ignoring 1 nodes unremovable in the last 5m0s
I0922 13:52:20.194422       1 static_autoscaler.go:320] 2 unregistered nodes present                                                                           
I0922 13:52:20.666270       1 static_autoscaler.go:402] No unschedulable pods                                                                          
```

that looks like it has deleted the machine primary-xxtjk-worker-0-ldd44, which we see deleted in your output.
i think it's worth looking at the machine resource objects that are being created for these new machines. it could be that there is an error with the ovirt provider adding this information.
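
for example, something along these lines lists whether spec.providerID has been set on each Machine (assuming the default openshift-machine-api namespace used above):

```
# print each Machine together with its spec.providerID (an empty/<none> value means it was never assigned)
oc -n openshift-machine-api get machines -o custom-columns=NAME:.metadata.name,PROVIDERID:.spec.providerID
```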

another avenue of exploration would be to raise the autoscaler verbosity level and record those logs; it might give more details about the failure state. to do this you would need to scale down the CVO and CAO and then set the environment variable `CLUSTER_AUTOSCALER_VERBOSITY=4` in the deployment for the autoscaler.
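
a rough sketch of those steps, assuming the usual default deployment names (they may differ on a given cluster):

```
# stop the operators that would otherwise revert the change
oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=0
oc -n openshift-machine-api scale deployment/cluster-autoscaler-operator --replicas=0
# bump the autoscaler log verbosity
oc -n openshift-machine-api set env deployment/cluster-autoscaler-default CLUSTER_AUTOSCALER_VERBOSITY=4
```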

Comment 13 Gal Zaidman 2020-10-01 07:04:52 UTC
This bug is caused by what we are describing in bug 1883978.

*** This bug has been marked as a duplicate of bug 1883978 ***