Bug 1881051
| Summary: | ClusterAutoscaler keeps creating a new Machine even when not needed | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jan Zmeskal <jzmeskal> |
| Component: | Cloud Compute | Assignee: | OCP on RHV Team <ocprhvteam> |
| Cloud Compute sub component: | oVirt Provider | QA Contact: | Lucie Leistnerova <lleistne> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | gzaidman, mimccune |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-01 07:04:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
One further observation: when I set Min Replicas of the MachineAutoscaler to 2, I did not observe the described behaviour.

Targeting to 4.7 since we will probably not be able to support autoscaling on 4.6.

i'm not seeing the logs for the cluster-autoscaler in the tar attached. i see a file named "cluster-autoscaler-default.log", but that appears to be logs from the cluster-autoscaler-operator. did i miss something?

Hi Michael, sorry if I forgot to attach some log file, but I'm afraid it's too late for me now, as I've lost the environment. I was afraid that I had made a configuration mistake when I originally reported this bug. The reason is that both my masters and my workers have 4 CPU cores and 16 GB of memory. Therefore, in the original case, with my four VMs (16 cores), I was almost at the max capacity (20 cores). This shouldn't be a problem in the first place, but I decided to reproduce the issue just to be sure. This time I used these autoscalers:
```
oc describe ca default | grep -A50 "^Spec:"
Spec:
  Pod Priority Threshold:  -10
  Resource Limits:
    Cores:
      Max:  36
      Min:  16
    Max Nodes Total:  9
    Memory:
      Max:  256
      Min:  64
  Scale Down:
    Delay After Add:      10s
    Delay After Delete:   10s
    Delay After Failure:  30s
    Enabled:              true
    Unneeded Time:        30s
Events:  <none>
```
And the MachineAutoscaler:

```
Spec:
  Max Replicas:  6
  Min Replicas:  1
  Scale Target Ref:
    API Version:  machine.openshift.io/v1beta1
    Kind:         MachineSet
    Name:         primary-xxtjk-worker-0
Status:
  Last Target Ref:
    API Version:  machine.openshift.io/v1beta1
    Kind:         MachineSet
    Name:         primary-xxtjk-worker-0
Events:
  Type    Reason            Age                   From                           Message
  ----    ------            ----                  ----                           -------
  Normal  SuccessfulUpdate  5m38s (x5 over 6m8s)  machine_autoscaler_controller  Updated MachineAutoscaler target: openshift-machine-api/primary-xxtjk-worker-0
```
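For anyone reproducing the same configuration, here is a minimal sketch of manifests that should produce the spec shown above. The field values are copied from the describe output; the apiVersions and the MachineAutoscaler name/namespace are my assumptions, so verify them against your cluster before applying.

```
# Hedged reconstruction of the two autoscaler resources described above.
# Values come from the `oc describe` output; names and apiVersions are assumed.
oc apply -f - <<'EOF'
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  podPriorityThreshold: -10
  resourceLimits:
    maxNodesTotal: 9
    cores:
      min: 16
      max: 36
    memory:
      min: 64
      max: 256
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 30s
    unneededTime: 30s
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: primary-xxtjk-worker-0   # assumed to match the target MachineSet
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 6
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: primary-xxtjk-worker-0
EOF
```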
The result is quite similar to what I described originally.
Original state:
```
oc get machine
NAME                           PHASE     TYPE   REGION   ZONE   AGE
primary-xxtjk-master-0         Running                          129m
primary-xxtjk-master-1         Running                          129m
primary-xxtjk-master-2         Running                          129m
primary-xxtjk-worker-0-s79r9   Running                          116m
```
State after about half an hour:
```
oc get machine
NAME                           PHASE          TYPE   REGION   ZONE   AGE
primary-xxtjk-master-0         Running                               158m
primary-xxtjk-master-1         Running                               158m
primary-xxtjk-master-2         Running                               158m
primary-xxtjk-worker-0-ldd44   Deleting                              28m
primary-xxtjk-worker-0-xvh2k   Provisioning                          4m59s
primary-xxtjk-worker-0-zfnlm   Provisioned                           13m
```
I'm also attaching all of the containers' logs (except for the kube-rbac-proxy containers) from the openshift-machine-api namespace.
no worries Jan
i did notice something in the ClusterAutoscaler definition:
> maxNodesTotal: 6
this is the total node count for the entire cluster (masters included), so when the cluster started it was already at max size. this doesn't explain the erroneous machine creations, but it gives us another avenue to explore: there might be some sort of bug in the node counting when the cluster is already at max size.
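A quick way to compare the two numbers (a small sketch using standard oc options; the field path follows the ClusterAutoscaler spec shown earlier in this bug):

```
# Count all nodes (masters + workers); the autoscaler compares this total
# against spec.resourceLimits.maxNodesTotal on the ClusterAutoscaler.
oc get nodes --no-headers | wc -l

# Show the configured ceiling for comparison.
oc get clusterautoscaler default -o jsonpath='{.spec.resourceLimits.maxNodesTotal}{"\n"}'
```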
i have also seen the type of oscillation you describe here, but it usually happened when the providerID was not properly assigned by the cloud provider layer. i don't think that is the issue here, but it's another avenue of exploration. you would see this in the logs for the cluster-autoscaler.
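A hedged sketch for checking this on a live cluster (standard oc custom-columns output, nothing specific to this bug):

```
# List each Machine's providerID; an empty value means the provider never set it.
oc get machines -n openshift-machine-api \
  -o custom-columns=NAME:.metadata.name,PROVIDERID:.spec.providerID

# Compare against the providerID registered on each Node.
oc get nodes -o custom-columns=NAME:.metadata.name,PROVIDERID:.spec.providerID
```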
Hi Michael,

> when the cluster started it was at max size

That's actually not the case. I had 3 masters and 1 worker. But just to be sure, I increased this limit to 9 in comment 6 anyway.

Jan, i looked at the logs for the reproduction and i think there might be an issue with the ovirt provider properly adding the providerID to the machines it creates. i see a bunch of output in the autoscaler logs that corresponds to this:

```
I0922 13:51:58.562239 1 scale_down.go:487] Scale-down calculation: ignoring 1 nodes unremovable in the last 5m0s
I0922 13:52:04.567185 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
W0922 13:52:04.579097 1 clusterapi_controller.go:440] Machine "primary-xxtjk-worker-0-ldd44" has no providerID
I0922 13:52:04.579154 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 11.872748ms
W0922 13:52:08.842300 1 clusterapi_controller.go:440] Machine "primary-xxtjk-worker-0-ldd44" has no providerID
I0922 13:52:08.897232 1 static_autoscaler.go:320] 1 unregistered nodes present
I0922 13:52:09.390413 1 static_autoscaler.go:402] No unschedulable pods
I0922 13:52:09.806995 1 scale_down.go:487] Scale-down calculation: ignoring 1 nodes unremovable in the last 5m0s
I0922 13:52:20.194422 1 static_autoscaler.go:320] 2 unregistered nodes present
I0922 13:52:20.666270 1 static_autoscaler.go:402] No unschedulable pods
```

that looks like it has deleted the machine primary-xxtjk-worker-0-ldd44, which we see deleted in your output. i think it's worth looking at the machine resource objects that are being created for these new machines; it could be that there is an error with the ovirt provider adding this information. another avenue of exploration would be to raise the autoscaler verbosity level and record those logs, which might give more details about the failure state. to do this you would need to scale down the CVO and CAO and then set the environment variable `CLUSTER_AUTOSCALER_VERBOSITY=4` in the deployment for the autoscaler (see the sketch after this comment).

This bug is caused by what we are describing in bug 1883978.

*** This bug has been marked as a duplicate of bug 1883978 ***
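A minimal sketch of the verbosity change suggested above, assuming the default deployment names used on OCP 4.6 (cluster-version-operator in openshift-cluster-version; cluster-autoscaler-operator and cluster-autoscaler-default in openshift-machine-api); verify the names with `oc get deploy` first:

```
# Stop the operators so they don't revert the manual change (assumed deployment names).
oc scale deployment/cluster-version-operator -n openshift-cluster-version --replicas=0
oc scale deployment/cluster-autoscaler-operator -n openshift-machine-api --replicas=0

# Raise the autoscaler's log verbosity via the env var mentioned above.
oc set env deployment/cluster-autoscaler-default -n openshift-machine-api \
  CLUSTER_AUTOSCALER_VERBOSITY=4

# Collect the more verbose logs.
oc logs -f deployment/cluster-autoscaler-default -n openshift-machine-api
```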
Description of problem:
When you create a MachineAutoscaler with Min Replicas: 1, a new Machine is immediately created even though one Machine already exists in the managed MachineSet. After a while, the new Machine is gone again and another new one is being created.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-10-145837

How reproducible:
100 %

Steps to Reproduce:
1. Have an OCP 4.6 cluster deployed with IPI with 3 masters and 3 workers
2. # oc project openshift-machine-api
3. # oc scale --replicas=1 machineset primary-q5fmn-worker-0
4. Create a ClusterAutoscaler similar to this:
```
# oc get ca default -o yaml | grep -A50 "^spec:"
spec:
  podPriorityThreshold: -10
  resourceLimits:
    cores:
      max: 20
      min: 4
    maxNodesTotal: 6
    memory:
      max: 128
      min: 8
  scaleDown:
    delayAfterAdd: 5s
    delayAfterDelete: 5s
    delayAfterFailure: 30s
    enabled: true
    unneededTime: 30s
```
6. Double-check the number of Machines
```
# oc get machine
NAME                           PHASE     TYPE   REGION   ZONE   AGE
primary-q5fmn-master-0         Running                          3h8m
primary-q5fmn-master-1         Running                          3h8m
primary-q5fmn-master-2         Running                          3h8m
primary-q5fmn-worker-0-d4tkw   Running                          23m
```
7. Create a MachineAutoscaler like this
```
# oc get ma primary-q5fmn-worker-0 -o yaml | grep -A50 "^spec:"
spec:
  maxReplicas: 3
  minReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: primary-q5fmn-worker-0
status:
  lastTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: primary-q5fmn-worker-0
```
8. A new worker Machine is immediately created (note that d4tkw's age is 25 minutes, while pksmr's age is 61 seconds). Pay close attention to these names. Also take note of the timestamp.
```
# date
Mon Sep 21 15:32:25 IDT 2020
# oc get machine
NAME                           PHASE          TYPE   REGION   ZONE   AGE
primary-q5fmn-master-0         Running                               3h10m
primary-q5fmn-master-1         Running                               3h10m
primary-q5fmn-master-2         Running                               3h10m
primary-q5fmn-worker-0-d4tkw   Running                               25m
primary-q5fmn-worker-0-pksmr   Provisioning                          61s
```
9. After a while, the original worker (d4tkw) is being deleted.
```
# date
Mon Sep 21 15:48:26 IDT 2020
# oc get machine
NAME                           PHASE         TYPE   REGION   ZONE   AGE
primary-q5fmn-master-0         Running                              3h27m
primary-q5fmn-master-1         Running                              3h27m
primary-q5fmn-master-2         Running                              3h27m
primary-q5fmn-worker-0-d4tkw   Deleting                             42m
primary-q5fmn-worker-0-pksmr   Provisioned                          18m
```
10. And yet again, a new worker is being created (two of them, actually).
```
# date
Mon Sep 21 15:51:01 IDT 2020
# oc get machine
NAME                           PHASE          TYPE   REGION   ZONE   AGE
primary-q5fmn-master-0         Running                               3h30m
primary-q5fmn-master-1         Running                               3h30m
primary-q5fmn-master-2         Running                               3h30m
primary-q5fmn-worker-0-9j7b9                                         3s
primary-q5fmn-worker-0-d4tkw   Deleting                              45m
primary-q5fmn-worker-0-lvmn8   Provisioning                          28s
primary-q5fmn-worker-0-pksmr   Provisioned                           20m
```

Actual results:
I don't know for how long this keeps going, quite possibly forever.

Expected results:
The autoscaler should only add new Machines when load increases and delete old ones when load decreases.

Additional info:
This might or might not be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1880136. The behaviour seems quite similar. However, the reporter of the linked bug sees the issue once the load on the cluster increases, whereas I see this pattern from the very moment I create a MachineAutoscaler.
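For anyone retrying these steps, a small sketch of how the Machine churn above can be watched in real time (standard oc options; the namespace is the one used throughout this bug):

```
# Watch Machine phase transitions live to capture the create/delete churn described above.
oc get machine -n openshift-machine-api -w \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,CREATED:.metadata.creationTimestamp
```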