Bug 1943194
Summary: when using gpus, more nodes than needed are created by the node autoscaler
Product: OpenShift Container Platform
Component: Cloud Compute
Sub component: Cluster Autoscaler
Reporter: raffaele spazzoli <rspazzol>
Assignee: Michael McCune <mimccune>
QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: akamra, aos-bugs, avardeva, kpouget, llopezmo, mimccune, miyadav, sd-ecosystem, selyousf, talessio, zhsun
Version: 4.7
Target Release: 4.13.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: The cluster autoscaler is used with GPU min/max resource limits and with workloads that require GPU access to schedule, but the "cluster-api/accelerator" label is not set on Nodes with GPUs.
Consequence: Due to the time required to install the GPU drivers on the OpenShift node, it is possible in some cases for the autoscaler to create extra nodes in the GPU enabled MachineSet.
Fix: The cluster-autoscaler-operator has been modified to warn the user when their MachineSets do not have the appropriate labels for GPU awareness, and when the user has submitted an invalid value for the GPU type in the ClusterAutoscaler resource.
Result: The autoscaler can more accurately detect when GPU-enabled nodes have their drivers installed, and thus does not create the extra nodes.
Last Closed: 2023-05-17 22:46:32 UTC
Type: Bug
Description
raffaele spazzoli
2021-03-25 14:47:37 UTC
I've been able to reproduce the problem, which is known upstream (cf. this note: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#special-note-on-gpu-instances). The upstream solution/workaround is to add a label to the newly created nodes, so that the autoscaler knows that they will have GPUs once the driver and everything else is loaded. OpenShift loads the Kubernetes autoscaler with "--cloud-provider=clusterapi", so the label to use is "cluster-api/accelerator=true" [1] (there is no known GPU type for this cloud provider). To get the new nodes automatically labeled, the label must be specified in the MachineSet used by the autoscaler:

```
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
spec:
  template:
    spec:
      metadata:
        labels:
          cluster-api/accelerator: "true"
```

1: https://github.com/kubernetes/autoscaler/blob/6432771415846dc0f4ff9ee71dfd307c4e72aa9e/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_provider.go#L40

I changed the component to "Machine Config Operator" as there is no component directly related to the autoscaler or the openshift-machine-api.

> I change the component to "Machine Config Operator" as there is no component directly related to the autoscaler or the openshift-machine-api
There is. It's not directly called machine-api but rather "cloud compute". Autoscaler is a sub-component.
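For readers unfamiliar with why the accelerator label matters, the gating behavior it enables can be sketched roughly as follows. This is a simplified illustrative model only, not actual cluster-autoscaler code; the function names and node representation are invented for the sketch.

```python
# Illustrative sketch: models how an accelerator label lets an autoscaler
# distinguish "GPU node still installing its driver" from "node with no GPU".
# Names and data shapes are invented; this is not cluster-autoscaler code.

def node_is_gpu_ready(labels: dict, allocatable: dict) -> bool:
    """A node labeled as a GPU node only counts as ready once the driver
    has registered an allocatable nvidia.com/gpu resource."""
    expects_gpu = "cluster-api/accelerator" in labels
    has_gpu = allocatable.get("nvidia.com/gpu", 0) > 0
    return has_gpu if expects_gpu else True

def nodes_still_coming_up(nodes: list) -> int:
    """Nodes the autoscaler should wait on instead of provisioning more."""
    return sum(
        1 for n in nodes if not node_is_gpu_ready(n["labels"], n["allocatable"])
    )

# A freshly booted GPU node: labeled, but the driver is not installed yet.
booting = {"labels": {"cluster-api/accelerator": "true"}, "allocatable": {}}
# The same node after the driver has registered the GPU resource.
ready = {"labels": {"cluster-api/accelerator": "true"},
         "allocatable": {"nvidia.com/gpu": 1}}

print(nodes_still_coming_up([booting]))  # 1: wait, don't add another node
print(nodes_still_coming_up([ready]))    # 0: capacity is usable
```

Without the label, the booting node looks like an ordinary Ready node with zero GPUs, so the pending GPU pods still appear unschedulable and the autoscaler asks for yet another machine.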
Thanks Yu. Cloud compute team - so is this a matter of updating the OpenShift docs to document the current behavior upstream - https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#special-note-on-gpu-instances?

(In reply to Ashish Kamra from comment #4)
> Thanks Yu. Cloud compute team - so is this a matter of updating the
> OpenShift docs to document the current behavior upstream?

At the very least that sounds like a reasonable first step. I think we should also consider handling this through a feature on our cluster-autoscaler-operator, or attempting to create a fix upstream. The reason I suggest adding a feature to the cluster-autoscaler-operator is that we could effectively know when groups with GPUs will be utilized, and could automate the addition of the node labels in those cases.

KCS article written: https://access.redhat.com/solutions/6055181

Just wanted to report back after the sig-autoscaling meeting. We have a couple of options to fix this, which I am going to investigate:

* Option 1: create a node label, similar to what the upstream uses for AWS and GKE [0], that would apply to the cluster-api provider implementation we use in the cluster autoscaler. For OpenShift this will also require a change to how/when we apply these node labels.
* Option 2: create a more generic approach which the cluster-autoscaler would use internally to mitigate this issue for all cloud providers.

Option 1 is the most direct fix, and I will start investigating how we could do this upstream and in OpenShift. Most likely this work will not land for the 4.8 release. The upstream community is open to accepting option 2 as a possible solution, but this will require more research to determine the best methodology for introducing this behavior strictly in the autoscaler.
I am hoping to investigate option 2 while implementing option 1. Regardless of which option we implement, I would expect these changes to land in OpenShift once feature freeze for 4.8 has ended.

[0] https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#special-note-on-gpu-instances

I have created a Jira card so that our team can prioritize and plan this work: https://issues.redhat.com/browse/OCPCLOUD-1180

Is there any update regarding this issue? Thank you very much.

Michael, I was wondering if this couldn't be fixed with the help of NFD labels [0][1]:

* The NVIDIA GPU Operator relies on NFD labels to discover GPU nodes (any of these 3) [2]:

> var gpuNodeLabels = map[string]string{
>     "feature.node.kubernetes.io/pci-10de.present":      "true",
>     "feature.node.kubernetes.io/pci-0302_10de.present": "true",
>     "feature.node.kubernetes.io/pci-0300_10de.present": "true",
> }

* Newly created nodes will be marked with NFD labels much earlier than their `nvidia.com/gpu` resource will appear (it only needs to spawn an NFD worker pod performing an lspci), but "much earlier" isn't immediate, the way a MachineSet node label is ...

0: https://github.com/openshift/cluster-nfd-operator
1: https://github.com/kubernetes-sigs/node-feature-discovery
2: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/blob/master/controllers/state_manager.go#L39

@llopezmo our team has discussed this and we have planned to do the work. It will take some time to plan it and create the necessary code/tests/docs; please follow the Jira card for the latest information: https://issues.redhat.com/browse/OCPCLOUD-1180. I am hopeful we will schedule this work during the 4.9 release cycle, but I can't give a more accurate estimate than that.

@kpouget that is good info, thank you! We might be able to use those labels (at least for NVIDIA), but I think we need something more generic if it is to be used in the cluster-autoscaler for a wider solution.
The really sticky part of this issue is that we will need to add this functionality to the cluster-autoscaler to ensure that when it simulates a scaling event, it takes into account the nodes that are marked with GPUs but have not yet become available for scheduling GPU workloads.

mimccune, I appreciate your update. Thank you very much. I will follow the Jira ticket.

@rspazzol I am working to reproduce this so that I can instrument the cluster-autoscaler to learn a little more about why it's failing (there is some code in the autoscaler that should be reacting to this situation). Would you be able to create a must-gather with autoscaler logs for when this is happening on your cluster?

@michael I don't have that environment up and running anymore. You can recreate my environment by following these instructions: https://github.com/raffaelespazzoli/kubeflow-ocp

Thanks Raffaele! Just wanted to leave a comment: I am making some progress on fixing this, and I will continue to update the Jira ticket https://issues.redhat.com/browse/OCPCLOUD-1180 with my findings.

(cross-posting this from the Jira ticket) I have been testing out this interaction and I am not able to reproduce the error condition where it creates too many nodes. Perhaps the timing is better on my cluster, so the GPU driver compilation happens quickly enough to prevent the autoscaler from creating more nodes. With that said, I have confirmed that applying the label `cluster-api/accelerator` in the `machineset.spec.template.spec.metadata` will cause the autoscaler to consider those nodes as unready until the GPU driver has been deployed. In the short term, we need to make an errata (or perhaps a knowledge base article, I'm not sure) that instructs users to add the `cluster-api/accelerator` label to the MachineSets that will be used with GPU instances and autoscaling.
In full, it should look something like this (this example only shows the affected fields):

```
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
spec:
  template:
    spec:
      metadata:
        labels:
          cluster-api/accelerator: ""
```

In the longer term we will need to add logic to our machine controller actuators that will add the label automatically when they detect an instance type that uses GPUs. We should contribute this work to the upstream cluster-api community as they need this patch as well. But this will take us slightly longer to complete, so the errata will help users who are impacted immediately.

Michael,
> i have been testing out this interaction and i am not able to reproduce the error condition where it creates too many nodes. perhaps the timing is better on my cluster and so the gpu driver compilation is happening quick enough to prevent the autoscaler from creating more nodes.
maybe an easy way to reproduce the original issue is to simply not deploy the GPU Operator. This way the autoscaler will create a new machine, but the node will never gain a `nvidia.com/gpu` resource. If the autoscaler creates another machine --> bug; if the autoscaler waits forever for the node to become "GPU-ready" --> bug fixed
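Kevin's bug/bug-fixed decision rule can be expressed as a tiny model. This is illustrative only: the node shape, function name, and simplified scale-up rule are invented, not autoscaler code.

```python
# Sketch of the reproduction logic above: with no GPU Operator installed, a
# labeled node never gains nvidia.com/gpu. A label-aware autoscaler should
# wait on it forever; a label-unaware one keeps adding machines (the bug).
# All names and the decision rule are simplified inventions for illustration.

def extra_machines_requested(pending_gpu_pods: int, nodes: list,
                             label_aware: bool) -> int:
    if pending_gpu_pods == 0:
        return 0
    schedulable_gpus = sum(n["allocatable"].get("nvidia.com/gpu", 0)
                           for n in nodes)
    if schedulable_gpus >= pending_gpu_pods:
        return 0
    if label_aware:
        # Labeled nodes without a registered GPU are capacity on its way.
        upcoming = sum(1 for n in nodes
                       if "cluster-api/accelerator" in n["labels"]
                       and n["allocatable"].get("nvidia.com/gpu", 0) == 0)
        if upcoming > 0:
            return 0  # wait for the driver instead of over-provisioning
    return 1

# One machine already created, GPU Operator never deployed:
node = {"labels": {"cluster-api/accelerator": ""}, "allocatable": {}}
print(extra_machines_requested(1, [node], label_aware=False))  # 1 -> bug
print(extra_machines_requested(1, [node], label_aware=True))   # 0 -> bug fixed
```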
(In reply to Kevin Pouget from comment #22)
> maybe an easy way to reproduce the original issue is to simply not deploy
> the GPU Operator. This way the autoscaler will create a new machine, but the
> node will never gain a `nvidia.com/gpu` resource. If the autoscaler creates
> another machine --> bug; if the autoscaler waits forever for the node to
> become "GPU-ready" --> bug fixed

That's an interesting idea, I will give it a try. There is definitely a bug here, but I think it has more to do with our lack of labeling on these nodes.

Just wanted to report back after trying out Kevin's suggestion. The cluster did not do what I expected. Basically, here is what I did:

1. create a cluster
2. start the autoscaler
3. add NFD
4. create a machineset with a GPU instance, add it to autoscaling
5. start a deployment with a GPU resource limit

I expected to see some activity, but the autoscaler never considered my machineset as an option for scaling, and I'm not sure why yet. I confirmed that the machineset did advertise GPU availability, but the autoscaler did not consider it a viable candidate. I will probably need to dig deeper to understand this, but I think the label option is our best "fix" for the time being.

After some investigation and discussions, I think we have a short-term solution and a longer-term solution. In the short term, I have created this patch [0] for our cluster-autoscaler-operator which will look for MachineSets that have GPU capacity and then label them properly for the autoscaler to invoke its GPU custom node processor. I am testing this patch out on a live cluster, but I have a good feeling it will alleviate the over-provisioning. In the longer term, I am working with the upstream cluster-api community to ensure that our infrastructure provider controllers properly label MachineSets when GPU capacity is detected. This might take several releases to fix completely, though, as it will depend on community support from the upstream.
[0] https://github.com/openshift/cluster-autoscaler-operator/pull/223

@Michael, in my testing, more nodes than needed are still created. I feel we need to update this file https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scale_up.go#L111 and add some code to calculate the GPU total. I set the gpu min/max to 1/16, and the instance type is "p2.8xlarge" with 8 GPUs, so I expect to have two nodes at most, but it scales up to 5 nodes.

1. create a cluster
2. create a clusterautoscaler

```
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 16
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
```

3. create a machineset with "p2.8xlarge" and add the `cluster-api/accelerator` label
https://privatebin-it-iso.int.open.paas.redhat.com/?ce2bb51b9fa5130a#26BxMgeYQLoADngVFnfBjzBo4Z7LHb6xVm6gjvX6kkm3

```
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "8"
    machine.openshift.io/memoryMb: "499712"
    machine.openshift.io/vCPU: "32"
  ...
spec:
  metadata:
    labels:
      cluster-api/accelerator: ""
  providerSpec:
```

4. create a machineautoscaler

```
$ oc get machineautoscaler
NAME                REF KIND     REF NAME                               MIN   MAX   AGE
machineautoscaler   MachineSet   zhsunaws1018-58vff-worker-us-east-2c   1     5     2m48s
```

5. add a workload to scale up
workload: https://privatebin-it-iso.int.open.paas.redhat.com/?aac7e94374510001#6p7DFH9A9v9CJGQZLGrpSPhC7D5Wui8QpdeNd2qL7UjR

6.
check the autoscaler log and machines

```
I1019 07:29:46.619912 1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunaws1018-58vff-worker-us-east-2c
I1019 07:29:46.619935 1 scale_up.go:472] Estimated 4 nodes needed in MachineSet/openshift-machine-api/zhsunaws1018-58vff-worker-us-east-2c
I1019 07:29:46.818502 1 scale_up.go:586] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunaws1018-58vff-worker-us-east-2c 1->5 (max: 5)}]
I1019 07:29:46.818540 1 scale_up.go:675] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaws1018-58vff-worker-us-east-2c size to 5
W1019 07:29:58.243046 1 clusterapi_controller.go:455] Machine "zhsunaws1018-58vff-worker-us-east-2c-twxrt" has no providerID
W1019 07:29:58.243064 1 clusterapi_controller.go:455] Machine "zhsunaws1018-58vff-worker-us-east-2c-4hgmb" has no providerID
W1019 07:29:58.268297 1 clusterapi_controller.go:455] Machine "zhsunaws1018-58vff-worker-us-east-2c-twxrt" has no providerID
W1019 07:29:58.268318 1 clusterapi_controller.go:455] Machine "zhsunaws1018-58vff-worker-us-east-2c-4hgmb" has no providerID
```

```
$ oc get machine
NAME                                         PHASE          TYPE         REGION      ZONE         AGE
zhsunaws1018-58vff-master-0                  Running        m5.xlarge    us-east-2   us-east-2a   19h
zhsunaws1018-58vff-master-1                  Running        m5.xlarge    us-east-2   us-east-2b   19h
zhsunaws1018-58vff-master-2                  Running        m5.xlarge    us-east-2   us-east-2c   19h
zhsunaws1018-58vff-worker-us-east-2a-cwf8v   Running        m5.large     us-east-2   us-east-2a   19h
zhsunaws1018-58vff-worker-us-east-2b-2gs8l   Running        m5.large     us-east-2   us-east-2b   19h
zhsunaws1018-58vff-worker-us-east-2c-4hgmb   Provisioning                                         10m
zhsunaws1018-58vff-worker-us-east-2c-65cwv   Running        p2.8xlarge   us-east-2   us-east-2c   10m
zhsunaws1018-58vff-worker-us-east-2c-hbg2q   Running        p2.8xlarge   us-east-2   us-east-2c   22m
zhsunaws1018-58vff-worker-us-east-2c-mwvdx   Running        p2.8xlarge   us-east-2   us-east-2c   10m
zhsunaws1018-58vff-worker-us-east-2c-twxrt   Provisioning                                         10m
```

$ oc edit machine zhsunaws1018-58vff-worker-us-east-2c-4hgmb

providerStatus:
```
  conditions:
  - lastTransitionTime: "2021-10-19T07:29:57Z"
    message: "error creating EC2 instance: InsufficientInstanceCapacity: We currently
      do not have sufficient p2.8xlarge capacity in the Availability Zone you requested
      (us-east-2c). Our system will be working on provisioning additional capacity.
      You can currently get p2.8xlarge capacity by not specifying an Availability
      Zone in your request or choosing us-east-2a, us-east-2b.\n\tstatus code: 500,
      request id: 74aaed8c-f18f-4728-8c4a-02cdba9278b1"
    reason: MachineCreationFailed
```

@zhsun thanks for the thorough test. I think there are a couple of issues, though:

1. You shouldn't need to manually add the `cluster-api/accelerator` label to the machineset; the CAO should do this automatically.
2. The workload you specified is asking for 80 replicas, with each replica requesting 32Gi of memory. We should change this request to a limit of `nvidia.com/gpu: 1` to ensure that the scheduler is trying to find GPU-enabled nodes to run on.

I have a feeling the extra nodes exist because we are asking for a large amount of memory with each replica. With the request changed to a limit, we should be able to use 40 replicas, which /should/ create 5 nodes (8 GPUs per node, 5 nodes in the node group). This is a sample deployment I have used for testing this fix:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-sleep
  template:
    metadata:
      labels:
        app: gpu-sleep
    spec:
      containers:
      - name: gpu-sleep
        image: quay.io/elmiko/busybox
        resources:
          limits:
            nvidia.com/gpu: 1
        command:
        - sleep
        - "3600"
```

@Michael thank you. I tested again, and more nodes than needed are still created. I still set the gpu min/max to 1/16; the instance type is "p2.8xlarge" with 8 GPUs, so I expect to have two nodes at most, but it scales up to 4 nodes. Am I missing something?

1. create a cluster
2.
create a machineset with "p2.8xlarge"

```
metadata:
  annotations:
    autoscaling.openshift.io/machineautoscaler: openshift-machine-api/machineautoscaler
    machine.openshift.io/GPU: "8"
    machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "5"
    machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
    machine.openshift.io/memoryMb: "499712"
    machine.openshift.io/vCPU: "32"
  creationTimestamp: "2021-10-22T03:50:07Z"
  generation: 15
  labels:
    cluster-api/accelerator: ""
```

3. deploy NFD and the GPU operator, referring to https://docs.nvidia.com/datacenter/cloud-native/openshift/cluster-entitlement.html

```
$ oc get pods,daemonset -n gpu-operator-resources
NAME                                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
daemonset.apps/gpu-feature-discovery                 0         0         0       0            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   6h12m
daemonset.apps/nvidia-container-toolkit-daemonset    0         0         0       0            0           nvidia.com/gpu.deploy.container-toolkit=true       6h12m
daemonset.apps/nvidia-dcgm                           0         0         0       0            0           nvidia.com/gpu.deploy.dcgm=true                    6h12m
daemonset.apps/nvidia-dcgm-exporter                  0         0         0       0            0           nvidia.com/gpu.deploy.dcgm-exporter=true           6h12m
daemonset.apps/nvidia-device-plugin-daemonset        0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true           6h12m
daemonset.apps/nvidia-driver-daemonset               0         0         0       0            0           nvidia.com/gpu.deploy.driver=true                  6h12m
daemonset.apps/nvidia-mig-manager                    0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             6h12m
daemonset.apps/nvidia-node-status-exporter           0         0         0       0            0           nvidia.com/gpu.deploy.node-status-exporter=true    6h12m
daemonset.apps/nvidia-operator-validator             0         0         0       0            0           nvidia.com/gpu.deploy.operator-validator=true      6h12m
```

4. create a clusterautoscaler

```
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 16
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
```

5.
create a machineautoscaler

```
$ oc get machineautoscaler
NAME                REF KIND     REF NAME                            MIN   MAX   AGE
machineautoscaler   MachineSet   zhsun1022-b96mx-worker-us-east-2c   1     5     136m
```

6. add a workload to scale up

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-sleep
spec:
  replicas: 30
  selector:
    matchLabels:
      app: gpu-sleep
  template:
    metadata:
      labels:
        app: gpu-sleep
    spec:
      containers:
      - name: gpu-sleep
        image: quay.io/elmiko/busybox
        resources:
          limits:
            nvidia.com/gpu: 1
        command:
        - sleep
        - "3600"
```

7. check the autoscaler log and machines

```
I1022 14:20:40.246003 1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsun1022-b96mx-worker-us-east-2c
I1022 14:20:40.246023 1 scale_up.go:472] Estimated 3 nodes needed in MachineSet/openshift-machine-api/zhsun1022-b96mx-worker-us-east-2c
I1022 14:20:40.443299 1 scale_up.go:586] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsun1022-b96mx-worker-us-east-2c 1->4 (max: 5)}]
I1022 14:20:40.443328 1 scale_up.go:675] Scale-up: setting group MachineSet/openshift-machine-api/zhsun1022-b96mx-worker-us-east-2c size to 4
```

@Michael I tested again with the instance type "g4dn.xlarge" with 1 GPU, and more nodes than needed are still created. The clusterautoscaler sets the gpu min/max to 1/2, so I expect to have 2 nodes at most, but it scales up to 5 nodes.

1. create a cluster
2. create a machineset with "g4dn.xlarge"

```
annotations:
  autoscaling.openshift.io/machineautoscaler: openshift-machine-api/machineautoscaler
  machine.openshift.io/GPU: "1"
  machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "5"
  machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
  machine.openshift.io/memoryMb: "16384"
  machine.openshift.io/vCPU: "4"
labels:
  cluster-api/accelerator: ""
```

3.
deploy NFD and the GPU operator, referring to https://docs.nvidia.com/datacenter/cloud-native/openshift/cluster-entitlement.html

```
$ oc get pods,daemonset -n gpu-operator-resources
NAME                                           READY   STATUS      RESTARTS   AGE
pod/gpu-feature-discovery-9bm5w                1/1     Running     0          25m
pod/nvidia-container-toolkit-daemonset-zm9f9   1/1     Running     0          25m
pod/nvidia-cuda-validator-jvlb7                0/1     Completed   0          20m
pod/nvidia-dcgm-9b6kx                          1/1     Running     0          25m
pod/nvidia-dcgm-exporter-kbbw7                 1/1     Running     0          25m
pod/nvidia-device-plugin-daemonset-j2kds       1/1     Running     0          25m
pod/nvidia-device-plugin-validator-9vncc       0/1     Completed   0          19m
pod/nvidia-driver-daemonset-f65rw              1/1     Running     0          25m
pod/nvidia-node-status-exporter-dj4s4          1/1     Running     0          25m
pod/nvidia-operator-validator-hkhr9            1/1     Running     0          25m

NAME                                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
daemonset.apps/gpu-feature-discovery                 1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   25m
daemonset.apps/nvidia-container-toolkit-daemonset    1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       25m
daemonset.apps/nvidia-dcgm                           1         1         1       1            1           nvidia.com/gpu.deploy.dcgm=true                    25m
daemonset.apps/nvidia-dcgm-exporter                  1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true           25m
daemonset.apps/nvidia-device-plugin-daemonset        1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true           25m
daemonset.apps/nvidia-driver-daemonset               1         1         1       1            1           nvidia.com/gpu.deploy.driver=true                  25m
daemonset.apps/nvidia-mig-manager                    0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             25m
daemonset.apps/nvidia-node-status-exporter           1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true    25m
daemonset.apps/nvidia-operator-validator             1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true
```

4. create a clusterautoscaler

```
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 2
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
```

5.
create a machineautoscaler

```
$ oc get machineautoscaler
NAME                REF KIND     REF NAME                            MIN   MAX   AGE
machineautoscaler   MachineSet   zhsun1025-snfmv-worker-us-east-2c   1     5     14m
```

6. add a workload to scale up

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-sleep
spec:
  replicas: 10
  selector:
    matchLabels:
      app: gpu-sleep
  template:
    metadata:
      labels:
        app: gpu-sleep
    spec:
      containers:
      - name: gpu-sleep
        image: quay.io/elmiko/busybox
        resources:
          limits:
            nvidia.com/gpu: 1
        command:
        - sleep
        - "3600"
```

7. check the autoscaler log and machines

```
I1025 07:31:11.458630 1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsun1025-snfmv-worker-us-east-2c
I1025 07:31:11.458655 1 scale_up.go:472] Estimated 9 nodes needed in MachineSet/openshift-machine-api/zhsun1025-snfmv-worker-us-east-2c
I1025 07:31:11.653606 1 scale_up.go:586] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsun1025-snfmv-worker-us-east-2c 1->5 (max: 5)}]
I1025 07:31:11.653638 1 scale_up.go:675] Scale-up: setting group MachineSet/openshift-machine-api/zhsun1025-snfmv-worker-us-east-2c size to 5
I1025 07:31:23.268925 1 static_autoscaler.go:335] 4 unregistered nodes present
```

(In reply to sunzhaohua from comment #31)
> @Michael thank you. I tested again, still more nodes than needed are
> created. I still set gpu min/max to 1/16, the instacetype is "p2.8xlarge"
> with 8 gpus, so I am looking forward to having two nodes at most, but it
> will scale up to 4 nodes. Am I missing something?

This actually appears to have created the appropriate number of machines. Your deployment is asking for 30 replicas, each one asking for a GPU. The instances have 8 GPUs each, so 4 instances would be 32 GPUs, which fits the 30 replicas you asked for. It also does not create an extra node (5 would be the max for that node group).

(In reply to sunzhaohua from comment #32)
> @Michael tested again with instacetype "g4dn.xlarge" with 1 gpus, still more
> nodes are created.
> clusterautoscaler set gpu min/max to 1/2, so I am
> looking forward to having 2 nodes at most, but it will scale up to 5 nodes.

It appears that the autoscaler created the appropriate number of instances (within its limits). Your deployment asked for 10 replicas, each requesting 1 GPU. Each instance only has a single GPU, and the node group maximum is 5. The autoscaler scaled up to 5 instances (its max), and it should have had 5 pods pending since it could not make more nodes. I think the tests you have shown are both accurate in terms of the expected activity.

If you want to craft another test, though, I would suggest re-running the second test but setting the replicas to 3 on the deployment. You should see 3 nodes in the autoscaler node group at the end, so it should create 2 if it starts with 1. With these levels you should be able to see that the autoscaler creates the appropriate number of instances without creating too many.

Oops Zhaohua, I just noticed the min/max settings on that last run; my apologies. That does look like a bug. I'll have to investigate the min/max issue.

@zhsun would it be possible for you to run the last test with the autoscaler using `--v=4` and capture the log file?

must-gather: http://file.rdu.redhat.com/~zhsun/must-gather.local.7665988797093375454.tar.gz
The machine keeps being created and then deleted, in a loop.

Just had a quick look through what's going on in the logs; I think there's an issue with the cluster autoscaler configuration being used in this test:

```
scaleDown:
  enabled: true
  delayAfterAdd: 10s
  delayAfterDelete: 10s
  delayAfterFailure: 10s
  unneededTime: 10s
```

Because we have unneededTime as 10s, I don't think the GPU operator is getting enough time to initialise the Nodes, which means the pods can't schedule, which causes the nodes to scale away because they are empty. Can we try again but increase all of the scale down timings a bit, maybe make them all 120s, to give some more time for things to settle?
Also, looking at the gpu-sleep workload: you are requesting 10 replicas, and each replica requests 1 GPU, so you will need to create 10 instances to fulfil this request. Could we perhaps change this to 3 replicas instead? This would be in the middle of the scale-up range; we expect it to scale to 3 replicas, but it could scale to 5 if there was a bug, so if it only creates 3 machines, we know it's not over-creating.

Just to clarify this point:

> Because we have unneededTime as 10s, I don't think the GPU operator is getting enough time to initialise the Nodes

In our nightly CI, it takes ~8 min to wait for the full deployment of the GPU computing stack:

> Playbook run took 0 days, 0 hours, 8 minutes, 34 seconds

among which 7 min are for the wait for the driver deployment:

> Thursday 25 November 2021 23:53:52 +0000 (0:00:00.022) 0:00:06.943 *****
> TASK: gpu_operator_wait_deployment : Wait for the GPU Operator to validate the driver deployment

So 8 minutes seems to be a good estimate of the time between the operator being deployed (or the node being ready) and the GPU Pods starting to be executed.

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-psap-ci-artifacts-release-4.9-gpu-operator-e2e-18x/1464006155771580416/artifacts/gpu-operator-e2e-18x/nightly/artifacts/010__gpu_operator__wait_deployment/_ansible.log

Sorry, I haven't had a chance to come back to this issue. I concur with what Joel is saying, though; I have a feeling the failure we are seeing might be related to our specific configuration. If we run this test with values that a customer might use, or something closer to those values, I think we will see the tests pass.

Sorry for the confusion, I will try again with unneededTime: 10m and post the result here.

(In reply to Joel Speed from comment #40)
> Also, looking at the gpu sleep, you are requesting 10 replicas, each replica
> requests 1 gpu, so you will need to create 10 instances to fulfil this
> request.
> Could we perhaps change this to 3 replicas instead? This would be in the
> middle of the scale up range, we expect it to scale to 3 replicas, but it
> could scale to 5 if there was a bug, so if it only creates 3 machines, we
> know it's not over creating.

@Joel the clusterautoscaler gpus min/max setting is 1/2, so I think it should create 2 machines at most; if it creates 3 machines, it's still over-creating.

```
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 2
```

Right, ok, I misunderstood how the test was being controlled there; this makes more sense now. Not sure why that's not working as expected; I will do some investigation.

@zhsun did you run the test again with the 10 minute unneededTime? I'm just curious where we left off on this bug.

@Michael I tested it with the clusterautoscaler below, with all scale down timings set to 10m. The pods can schedule, the machines are running and do not scale away, and this works as expected. The only issue is that the clusterautoscaler gpus min/max setting is 1/2, so it should have 2 machines at most, but it creates more machines than needed. For example, in my testing, it created another 4 new machines.
```
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 2
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 10m
    delayAfterFailure: 10m
    unneededTime: 10m
```

```
$ oc get machine
NAME                                      PHASE     TYPE          REGION      ZONE         AGE
zhsunaws5-fzz8v-master-0                  Running   m6i.xlarge    us-east-2   us-east-2a   12h
zhsunaws5-fzz8v-master-1                  Running   m6i.xlarge    us-east-2   us-east-2b   12h
zhsunaws5-fzz8v-master-2                  Running   m6i.xlarge    us-east-2   us-east-2c   12h
zhsunaws5-fzz8v-worker-us-east-2a-rgkg9   Running   m6i.large     us-east-2   us-east-2a   12h
zhsunaws5-fzz8v-worker-us-east-2b-cl464   Running   m6i.large     us-east-2   us-east-2b   12h
zhsunaws5-fzz8v-worker-us-east-2c-486fz   Running   g4dn.xlarge   us-east-2   us-east-2c   19m
zhsunaws5-fzz8v-worker-us-east-2c-9lj8f   Running   g4dn.xlarge   us-east-2   us-east-2c   19m
zhsunaws5-fzz8v-worker-us-east-2c-gjhzz   Running   g4dn.xlarge   us-east-2   us-east-2c   19m
zhsunaws5-fzz8v-worker-us-east-2c-jbqgf   Running   g4dn.xlarge   us-east-2   us-east-2c   57m
zhsunaws5-fzz8v-worker-us-east-2c-sph7h   Running   g4dn.xlarge   us-east-2   us-east-2c   19m
```

Thanks for the update Zhaohua! I talked with Joel about this a little bit, and I have a suspicion that there might be a disconnect between how the autoscaler waits for the GPU nodes to become active (e.g. driver installed) and how it registers the nodes' resources with respect to the limits. We'll have to do some more debugging.

This bug is being extremely difficult to solve. My current thinking is that there is an interaction happening in the core autoscaler between the way it calculates the maximum resources within the cluster and the way it plans to add new GPU nodes. Because the autoscaler must wait for the GPU nodes to have a driver ready before the pending pods will schedule to them, it may not be aware that GPU nodes are being added when it calculates the maximums. I am continuing to investigate along these lines.
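The suspected disconnect in the limit accounting can be illustrated with a toy model. This is a hypothetical reading of the comment above, not the actual scale_up.go logic; all names and data shapes are invented.

```python
# Toy model of the suspected limit-accounting gap: if the ClusterAutoscaler's
# gpus max only counts GPUs already registered by the driver, nodes that are
# still installing drivers contribute zero, and the limit never binds.
# Hypothetical illustration of the hypothesis above, not real autoscaler code.

def gpus_counted_toward_limit(nodes: list) -> int:
    return sum(n["allocatable"].get("nvidia.com/gpu", 0) for n in nodes)

def scale_up_allowed(nodes: list, gpus_per_new_node: int, gpu_max: int) -> bool:
    return gpus_counted_toward_limit(nodes) + gpus_per_new_node <= gpu_max

# gpus max is 2, and two g4dn.xlarge nodes (1 GPU each) are already booting
# but have no driver yet, so their GPUs are invisible to the limit check:
booting = [{"allocatable": {}}, {"allocatable": {}}]
print(scale_up_allowed(booting, gpus_per_new_node=1, gpu_max=2))  # True

# Once the drivers register, the same capacity would have blocked the scale-up:
installed = [{"allocatable": {"nvidia.com/gpu": 1}},
             {"allocatable": {"nvidia.com/gpu": 1}}]
print(scale_up_allowed(installed, gpus_per_new_node=1, gpu_max=2))  # False
```

Under this model, each scan that runs before the drivers finish installing sees zero counted GPUs and approves another machine, which would match the observed behavior of five g4dn.xlarge machines despite a gpus max of 2.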
no new progress here, i am still investigating.

Thank you for your update and your efforts on this, Michael

Still working on this; Mike got some inspiration last week at the sig-autoscaling meetup at KubeCon, looking to get back to this soon

@zhsun i've created a new PR for the CAO that i'm hoping will alleviate this issue in machinesets where there is at least 1 replica; i'm not 100% positive that it will solve the issue when scaling from/to zero. i've changed the CAO to add the cluster-autoscaler accelerator label to the `machineset.spec.template.spec.metadata.labels`, which should cause all the machines and nodes made from that machineset to also have the accelerator label. my hope is that this will cause the autoscaler to see that new nodes will have a GPU before the capacity has been set on the node object. this pattern is inspired by actions that our users have taken to alleviate the root cause. i'd be happy to know if this fixes the bug with respect to scaling from 1, and then in a separate test scaling from 0.

Tried again, clusterpolicy still wasn't ready, clusterversion 4.12.0-0.ci.test-2022-11-08-004217-ci-ln-slqh742-latest

Checked that the "cluster-api/accelerator" label can be added to nodes automatically. Didn't install NFD and the GPU operator; created the clusterautoscaler and machineautoscaler and added workload to scale, and more nodes than expected are created. Not sure if this test is ok for the bug, or whether we must install NFD and GPU.
```
$ oc get node --show-labels | grep accelerator
ip-10-0-210-38.us-east-2.compute.internal   Ready   worker   89m   v1.25.2+93b33ea   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g4dn.xlarge,beta.kubernetes.io/os=linux,cluster-api/accelerator=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-210-38.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=g4dn.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
```

```
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 2
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 10m
    delayAfterFailure: 10m
    unneededTime: 10m
```

```
$ oc get machineautoscaler
NAME                  REF KIND     REF NAME                             MIN   MAX   AGE
machineautoscaler-3   MachineSet   zhsun11811-b7tsc-worker-us-east-2c   1     5     102m
```

```
$ oc get machine
NAME                                       PHASE     TYPE          REGION      ZONE         AGE
zhsun11811-b7tsc-master-0                  Running   m6i.xlarge    us-east-2   us-east-2a   60m
zhsun11811-b7tsc-master-1                  Running   m6i.xlarge    us-east-2   us-east-2b   60m
zhsun11811-b7tsc-master-2                  Running   m6i.xlarge    us-east-2   us-east-2c   60m
zhsun11811-b7tsc-worker-us-east-2a-tjvx9   Running   m6i.xlarge    us-east-2   us-east-2a   57m
zhsun11811-b7tsc-worker-us-east-2b-vhzf9   Running   m6i.xlarge    us-east-2   us-east-2b   57m
zhsun11811-b7tsc-worker-us-east-2c-dxz6l   Running   g4dn.xlarge   us-east-2   us-east-2c   16m
zhsun11811-b7tsc-worker-us-east-2c-fv6z9   Running   g4dn.xlarge   us-east-2   us-east-2c   16m
zhsun11811-b7tsc-worker-us-east-2c-p5jdv   Running   g4dn.xlarge   us-east-2   us-east-2c   16m
zhsun11811-b7tsc-worker-us-east-2c-ppddb   Running   g4dn.xlarge   us-east-2   us-east-2c   16m
zhsun11811-b7tsc-worker-us-east-2c-xjmgx   Running   g4dn.xlarge   us-east-2   us-east-2c   20m
```

thanks Zhaohua, i'm glad to see that the label
is present. i have a feeling that the nfd/gpu operator will be needed at some point, as the autoscaler is waiting to see the capacity appear on the node for the gpu device. i have another idea about adding a fix to the autoscaler, i will investigate that as well.

@zhsun i know we are having some issues with the NFD operator currently, so i created a patch for the 4.11 release[0] and i wonder if we could test to see if this solution works on a 4.11 cluster? my theory here is that this change should work on any version of openshift, and since it does not touch the autoscaler it should be safe to try this change on a previous cluster version (4.11). i mainly want to see if this solution is improving the situation for users. also, when we test we should be sure to use ClusterAutoscaler values that are similar to what users will use in production; this is to help ensure we aren't using values that are too short when waiting for machines to become nodes.

[0] https://github.com/openshift/cluster-autoscaler-operator/pull/257

Michael, I set up a 4.11 cluster with https://github.com/openshift/cluster-autoscaler-operator/pull/257, and still more nodes than expected are created. clusterversion: 4.11.0-0.ci.test-2022-11-29-012944-ci-ln-l4jhj3b-latest

These are my steps:

1. Create a machineset with "g4dn.xlarge"

```
metadata:
  annotations:
    autoscaling.openshift.io/machineautoscaler: openshift-machine-api/machineautoscaler-3
    machine.openshift.io/GPU: "1"
    machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "4"
    machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"
```

2.
Create machineautoscaler

```
$ oc get machineautoscaler
NAME                  REF KIND     REF NAME                               MIN   MAX   AGE
machineautoscaler-3   MachineSet   zhsun1943194-2sh27-worker-us-east-2c   1     4     18m
```

3. Deploy NFD and GPU

```
$ oc get pods,daemonset -n nvidia-gpu-operator
NAME                                                      READY   STATUS      RESTARTS   AGE
pod/gpu-feature-discovery-258v5                           1/1     Running     0          22m
pod/gpu-operator-5dc8fc87c7-9gxpm                         1/1     Running     0          23m
pod/nvidia-container-toolkit-daemonset-dttnk              1/1     Running     0          22m
pod/nvidia-cuda-validator-skzgk                           0/1     Completed   0          17m
pod/nvidia-dcgm-exporter-s9rlk                            1/1     Running     0          22m
pod/nvidia-dcgm-rdq8g                                     1/1     Running     0          22m
pod/nvidia-device-plugin-daemonset-grwzf                  1/1     Running     0          22m
pod/nvidia-device-plugin-validator-dd4n5                  0/1     Completed   0          16m
pod/nvidia-driver-daemonset-411.86.202211232221-0-6st54   2/2     Running     0          22m
pod/nvidia-node-status-exporter-5gmd2                     1/1     Running     0          22m
pod/nvidia-operator-validator-zzm6d                       1/1     Running     0          21m

NAME                                                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                        AGE
daemonset.apps/gpu-feature-discovery                           1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                     22m
daemonset.apps/nvidia-container-toolkit-daemonset              1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true                                                                         22m
daemonset.apps/nvidia-dcgm                                     1         1         1       1            1           nvidia.com/gpu.deploy.dcgm=true                                                                                      22m
daemonset.apps/nvidia-dcgm-exporter                            1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                             22m
daemonset.apps/nvidia-device-plugin-daemonset                  1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true                                                                             22m
daemonset.apps/nvidia-driver-daemonset-411.86.202211232221-0   1         1         1       1            1           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202211232221-0,nvidia.com/gpu.deploy.driver=true   22m
daemonset.apps/nvidia-mig-manager                              0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                                                               22m
daemonset.apps/nvidia-node-status-exporter                     1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true                                                                      22m
daemonset.apps/nvidia-operator-validator                       1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true
```

4.
Create clusterautoscaler

```
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 2
  scaleDown:
    enabled: true
    delayAfterAdd: 15m
    delayAfterDelete: 15m
    delayAfterFailure: 15m
    unneededTime: 15m
```

```
$ oc get deploy cluster-autoscaler-default -o yaml
spec:
  containers:
  - args:
    - --logtostderr
    - --v=1
    - --cloud-provider=clusterapi
    - --namespace=openshift-machine-api
    - --leader-elect-lease-duration=137s
    - --leader-elect-renew-deadline=107s
    - --leader-elect-retry-period=26s
    - --gpu-total=nvidia.com/gpu:1:2
    - --scale-down-enabled=true
    - --scale-down-delay-after-add=15m
    - --scale-down-delay-after-delete=15m
    - --scale-down-delay-after-failure=15m
    - --scale-down-unneeded-time=15m
```

5. Add workload to scale up

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-sleep
spec:
  replicas: 10
  selector:
    matchLabels:
      app: gpu-sleep
  template:
    metadata:
      labels:
        app: gpu-sleep
    spec:
      containers:
      - name: gpu-sleep
        image: quay.io/elmiko/busybox
        resources:
          limits:
            nvidia.com/gpu: 1
        command:
        - sleep
        - "3600"
```

6.
Check autoscaler log and machines and pods

```
I1129 04:18:03.639817       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsun1943194-2sh27-worker-us-east-2c
I1129 04:18:03.639840       1 scale_up.go:472] Estimated 9 nodes needed in MachineSet/openshift-machine-api/zhsun1943194-2sh27-worker-us-east-2c
I1129 04:18:03.834810       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsun1943194-2sh27-worker-us-east-2c 1->4 (max: 4)}]
I1129 04:18:03.834841       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsun1943194-2sh27-worker-us-east-2c size to 4
```

```
$ oc get machine
NAME                                         PHASE     TYPE          REGION      ZONE         AGE
zhsun1943194-2sh27-master-0                  Running   m6i.xlarge    us-east-2   us-east-2a   151m
zhsun1943194-2sh27-master-1                  Running   m6i.xlarge    us-east-2   us-east-2b   151m
zhsun1943194-2sh27-master-2                  Running   m6i.xlarge    us-east-2   us-east-2c   151m
zhsun1943194-2sh27-worker-us-east-2a-xm478   Running   m6i.xlarge    us-east-2   us-east-2a   149m
zhsun1943194-2sh27-worker-us-east-2b-phxqr   Running   m6i.xlarge    us-east-2   us-east-2b   149m
zhsun1943194-2sh27-worker-us-east-2c-4vk9h   Running   g4dn.xlarge   us-east-2   us-east-2c   60m
zhsun1943194-2sh27-worker-us-east-2c-mdtzh   Running   g4dn.xlarge   us-east-2   us-east-2c   23m
zhsun1943194-2sh27-worker-us-east-2c-rrcpk   Running   g4dn.xlarge   us-east-2   us-east-2c   23m
zhsun1943194-2sh27-worker-us-east-2c-wcmtm   Running   g4dn.xlarge   us-east-2   us-east-2c   23m
```

$ oc get node --show-labels | grep gpu
ip-10-0-198-222.us-east-2.compute.internal Ready worker 19m v1.24.6+5658434
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g4dn.xlarge,beta.kubernetes.io/os=linux,cluster-api/accelerator=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,feature.node.kubernetes.io/cpu-cpuid.ADX=true,feature.node.kubernetes.io/cpu-cpuid.AESNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX2=true,feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true,feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true,feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true,feature.node.kubernetes.io/cpu-cpuid.AVX512F=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX=true,feature.node.kubernetes.io/cpu-cpuid.FMA3=true,feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true,feature.node.kubernetes.io/cpu-cpuid.MPX=true,feature.node.kubernetes.io/cpu-hardware_multithreading=true,feature.node.kubernetes.io/kernel-config.NO_HZ=true,feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true,feature.node.kubernetes.io/kernel-selinux.enabled=true,feature.node.kubernetes.io/kernel-version.full=4.18.0-372.32.1.el8_6.x86_64,feature.node.kubernetes.io/kernel-version.major=4,feature.node.kubernetes.io/kernel-version.minor=18,feature.node.kubernetes.io/kernel-version.revision=0,feature.node.kubernetes.io/pci-10de.present=true,feature.node.kubernetes.io/pci-1d0f.present=true,feature.node.kubernetes.io/storage-nonrotationaldisk=true,feature.node.kubernetes.io/system-os_release.ID=rhcos,feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION=4.11,feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202211232221-0,feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.6,feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4,feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=11,feature.node.kubernetes.io/system-os_release.VERSION_ID=4.11,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-198-222.u
s-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=g4dn.xlarge,node.openshift.io/os_id=rhcos,nvidia.com/cuda.driver.major=515,nvidia.com/cuda.driver.minor=65,nvidia.com/cuda.driver.rev=01,nvidia.com/cuda.runtime.major=11,nvidia.com/cuda.runtime.minor=7,nvidia.com/gfd.timestamp=1669696050,nvidia.com/gpu.compute.major=7,nvidia.com/gpu.compute.minor=5,nvidia.com/gpu.count=1,nvidia.com/gpu.deploy.container-toolkit=true,nvidia.com/gpu.deploy.dcgm-exporter=true,nvidia.com/gpu.deploy.dcgm=true,nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.deploy.gpu-feature-discovery=true,nvidia.com/gpu.deploy.node-status-exporter=true,nvidia.com/gpu.deploy.nvsm=,nvidia.com/gpu.deploy.operator-validator=true,nvidia.com/gpu.family=turing,nvidia.com/gpu.machine=g4dn.xlarge,nvidia.com/gpu.memory=15360,nvidia.com/gpu.present=true,nvidia.com/gpu.product=Tesla-T4,nvidia.com/gpu.replicas=1,nvidia.com/mig.capable=false,nvidia.com/mig.strategy=single,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c ip-10-0-206-111.us-east-2.compute.internal Ready worker 56m v1.24.6+5658434 
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g4dn.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,feature.node.kubernetes.io/cpu-cpuid.ADX=true,feature.node.kubernetes.io/cpu-cpuid.AESNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX2=true,feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true,feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true,feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true,feature.node.kubernetes.io/cpu-cpuid.AVX512F=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX=true,feature.node.kubernetes.io/cpu-cpuid.FMA3=true,feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true,feature.node.kubernetes.io/cpu-cpuid.MPX=true,feature.node.kubernetes.io/cpu-hardware_multithreading=true,feature.node.kubernetes.io/kernel-config.NO_HZ=true,feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true,feature.node.kubernetes.io/kernel-selinux.enabled=true,feature.node.kubernetes.io/kernel-version.full=4.18.0-372.32.1.el8_6.x86_64,feature.node.kubernetes.io/kernel-version.major=4,feature.node.kubernetes.io/kernel-version.minor=18,feature.node.kubernetes.io/kernel-version.revision=0,feature.node.kubernetes.io/pci-10de.present=true,feature.node.kubernetes.io/pci-1d0f.present=true,feature.node.kubernetes.io/storage-nonrotationaldisk=true,feature.node.kubernetes.io/system-os_release.ID=rhcos,feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION=4.11,feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202211232221-0,feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.6,feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4,feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=11,feature.node.kubernetes.io/system-os_release.VERSION_ID=4.11,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-206-111.us-east-2.compute.internal
,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=g4dn.xlarge,node.openshift.io/os_id=rhcos,nvidia.com/cuda.driver.major=515,nvidia.com/cuda.driver.minor=65,nvidia.com/cuda.driver.rev=01,nvidia.com/cuda.runtime.major=11,nvidia.com/cuda.runtime.minor=7,nvidia.com/gfd.timestamp=1669694211,nvidia.com/gpu.compute.major=7,nvidia.com/gpu.compute.minor=5,nvidia.com/gpu.count=1,nvidia.com/gpu.deploy.container-toolkit=true,nvidia.com/gpu.deploy.dcgm-exporter=true,nvidia.com/gpu.deploy.dcgm=true,nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.deploy.gpu-feature-discovery=true,nvidia.com/gpu.deploy.node-status-exporter=true,nvidia.com/gpu.deploy.nvsm=,nvidia.com/gpu.deploy.operator-validator=true,nvidia.com/gpu.family=turing,nvidia.com/gpu.machine=g4dn.xlarge,nvidia.com/gpu.memory=15360,nvidia.com/gpu.present=true,nvidia.com/gpu.product=Tesla-T4,nvidia.com/gpu.replicas=1,nvidia.com/mig.capable=false,nvidia.com/mig.strategy=single,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c ip-10-0-217-193.us-east-2.compute.internal Ready worker 19m v1.24.6+5658434 
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g4dn.xlarge,beta.kubernetes.io/os=linux,cluster-api/accelerator=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,feature.node.kubernetes.io/cpu-cpuid.ADX=true,feature.node.kubernetes.io/cpu-cpuid.AESNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX2=true,feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true,feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true,feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true,feature.node.kubernetes.io/cpu-cpuid.AVX512F=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX=true,feature.node.kubernetes.io/cpu-cpuid.FMA3=true,feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true,feature.node.kubernetes.io/cpu-cpuid.MPX=true,feature.node.kubernetes.io/cpu-hardware_multithreading=true,feature.node.kubernetes.io/kernel-config.NO_HZ=true,feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true,feature.node.kubernetes.io/kernel-selinux.enabled=true,feature.node.kubernetes.io/kernel-version.full=4.18.0-372.32.1.el8_6.x86_64,feature.node.kubernetes.io/kernel-version.major=4,feature.node.kubernetes.io/kernel-version.minor=18,feature.node.kubernetes.io/kernel-version.revision=0,feature.node.kubernetes.io/pci-10de.present=true,feature.node.kubernetes.io/pci-1d0f.present=true,feature.node.kubernetes.io/storage-nonrotationaldisk=true,feature.node.kubernetes.io/system-os_release.ID=rhcos,feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION=4.11,feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202211232221-0,feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.6,feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4,feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=11,feature.node.kubernetes.io/system-os_release.VERSION_ID=4.11,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-217-193.u
s-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=g4dn.xlarge,node.openshift.io/os_id=rhcos,nvidia.com/cuda.driver.major=515,nvidia.com/cuda.driver.minor=65,nvidia.com/cuda.driver.rev=01,nvidia.com/cuda.runtime.major=11,nvidia.com/cuda.runtime.minor=7,nvidia.com/gfd.timestamp=1669696062,nvidia.com/gpu.compute.major=7,nvidia.com/gpu.compute.minor=5,nvidia.com/gpu.count=1,nvidia.com/gpu.deploy.container-toolkit=true,nvidia.com/gpu.deploy.dcgm-exporter=true,nvidia.com/gpu.deploy.dcgm=true,nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.deploy.gpu-feature-discovery=true,nvidia.com/gpu.deploy.node-status-exporter=true,nvidia.com/gpu.deploy.nvsm=,nvidia.com/gpu.deploy.operator-validator=true,nvidia.com/gpu.family=turing,nvidia.com/gpu.machine=g4dn.xlarge,nvidia.com/gpu.memory=15360,nvidia.com/gpu.present=true,nvidia.com/gpu.product=Tesla-T4,nvidia.com/gpu.replicas=1,nvidia.com/mig.capable=false,nvidia.com/mig.strategy=single,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c ip-10-0-218-151.us-east-2.compute.internal Ready worker 19m v1.24.6+5658434 
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g4dn.xlarge,beta.kubernetes.io/os=linux,cluster-api/accelerator=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,feature.node.kubernetes.io/cpu-cpuid.ADX=true,feature.node.kubernetes.io/cpu-cpuid.AESNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX2=true,feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true,feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true,feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true,feature.node.kubernetes.io/cpu-cpuid.AVX512F=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX=true,feature.node.kubernetes.io/cpu-cpuid.FMA3=true,feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true,feature.node.kubernetes.io/cpu-cpuid.MPX=true,feature.node.kubernetes.io/cpu-hardware_multithreading=true,feature.node.kubernetes.io/kernel-config.NO_HZ=true,feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true,feature.node.kubernetes.io/kernel-selinux.enabled=true,feature.node.kubernetes.io/kernel-version.full=4.18.0-372.32.1.el8_6.x86_64,feature.node.kubernetes.io/kernel-version.major=4,feature.node.kubernetes.io/kernel-version.minor=18,feature.node.kubernetes.io/kernel-version.revision=0,feature.node.kubernetes.io/pci-10de.present=true,feature.node.kubernetes.io/pci-1d0f.present=true,feature.node.kubernetes.io/storage-nonrotationaldisk=true,feature.node.kubernetes.io/system-os_release.ID=rhcos,feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION=4.11,feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202211232221-0,feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.6,feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4,feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=11,feature.node.kubernetes.io/system-os_release.VERSION_ID=4.11,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-218-151.u
s-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=g4dn.xlarge,node.openshift.io/os_id=rhcos,nvidia.com/cuda.driver.major=515,nvidia.com/cuda.driver.minor=65,nvidia.com/cuda.driver.rev=01,nvidia.com/cuda.runtime.major=11,nvidia.com/cuda.runtime.minor=7,nvidia.com/gfd.timestamp=1669696035,nvidia.com/gpu.compute.major=7,nvidia.com/gpu.compute.minor=5,nvidia.com/gpu.count=1,nvidia.com/gpu.deploy.container-toolkit=true,nvidia.com/gpu.deploy.dcgm-exporter=true,nvidia.com/gpu.deploy.dcgm=true,nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.deploy.gpu-feature-discovery=true,nvidia.com/gpu.deploy.node-status-exporter=true,nvidia.com/gpu.deploy.nvsm=,nvidia.com/gpu.deploy.operator-validator=true,nvidia.com/gpu.family=turing,nvidia.com/gpu.machine=g4dn.xlarge,nvidia.com/gpu.memory=15360,nvidia.com/gpu.present=true,nvidia.com/gpu.product=Tesla-T4,nvidia.com/gpu.replicas=1,nvidia.com/mig.capable=false,nvidia.com/mig.strategy=single,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c

```
$ oc get po | grep Running
gpu-sleep-5f6b7684d9-c8v42   1/1   Running   0   26m
gpu-sleep-5f6b7684d9-dszvf   1/1   Running   0   26m
gpu-sleep-5f6b7684d9-rrvsq   1/1   Running   0   26m
gpu-sleep-5f6b7684d9-v9drk   1/1   Running   0   26m
```

must-gather: https://drive.google.com/file/d/1I8EpH39TCaKSJlgxUNJvp61YoDcDT77d/view?usp=sharing

Michael, for the min/max issue, I noticed that for cpu and memory we have functions[0] to calculate Cores and Memory; I wonder if we need such methods for gpu too. Not sure if this is the cause of this issue.

[0] https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/core/scale_up.go#L110

thanks Zhaohua, this bug is really challenging! re: min/max cpu and memory, the custom resources are handled a little differently and i think this function is being used[0].
i'm curious: if you add the `cluster-api/accelerator` label to the machineset.spec.template.spec.metadata.labels, would that change the results of this test? i have another idea about how this might be misreported internally in the autoscaler; i will make a patch to test and report back here once i have it ready.

[0] https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/core/scale_up.go#L142

> i'm curious if you add the `cluster-api/accelerator` label to the machineset.spec.template.spec.metadata.labels would that change the results of this test?
i just want to expand on this a little. i'm curious to add this label on the machineset because i want to see the nodes getting the label from the beginning of the cluster creation/scaling.
when i look at the must-gather, i see this node ip-10-0-193-180.us-east-2.compute.internal, which appears to be in the machineset containing the gpu nodes but it has no labels from the NFD nor does it have the accelerator label. i'm curious why this node didn't get marked up by the NFD; i also note that there is no machine associated with this node. i'm wondering if this is causing some issue with the way the autoscaler predicts the other nodes in the node group.
```
---
apiVersion: v1
kind: Node
metadata:
annotations:
cloud.network.openshift.io/egress-ipconfig: '[{"interface":"eni-0fd59257409d99a7d","ifaddr":{"ipv4":"10.0.192.0/19"},"capacity":{"ipv4":14,"ipv6":15}}]'
csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-0638b4a343fdb5544"}'
machine.openshift.io/machine: openshift-machine-api/zhsun1943194-2sh27-master-2
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-master-92f44cabc65149f5b1445e43bfc836fa
machineconfiguration.openshift.io/desiredConfig: rendered-master-92f44cabc65149f5b1445e43bfc836fa
machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-master-92f44cabc65149f5b1445e43bfc836fa
machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-master-92f44cabc65149f5b1445e43bfc836fa
machineconfiguration.openshift.io/reason: ""
machineconfiguration.openshift.io/state: Done
nfd.node.kubernetes.io/master.version: undefined
volumes.kubernetes.io/controller-managed-attach-detach: "true"
creationTimestamp: "2022-11-29T02:09:35Z"
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/instance-type: m6i.xlarge
beta.kubernetes.io/os: linux
failure-domain.beta.kubernetes.io/region: us-east-2
failure-domain.beta.kubernetes.io/zone: us-east-2c
kubernetes.io/arch: amd64
kubernetes.io/hostname: ip-10-0-193-180.us-east-2.compute.internal
kubernetes.io/os: linux
node-role.kubernetes.io/master: ""
node.kubernetes.io/instance-type: m6i.xlarge
node.openshift.io/os_id: rhcos
topology.ebs.csi.aws.com/zone: us-east-2c
topology.kubernetes.io/region: us-east-2
topology.kubernetes.io/zone: us-east-2c
name: ip-10-0-193-180.us-east-2.compute.internal
resourceVersion: "233949"
uid: 3accde6a-3dd0-4758-af59-21ea88ba3bfa
spec:
providerID: aws:///us-east-2c/i-0638b4a343fdb5544
taints:
- effect: NoSchedule
key: node-role.kubernetes.io/master
status:
addresses:
- address: 10.0.193.180
type: InternalIP
- address: ip-10-0-193-180.us-east-2.compute.internal
type: Hostname
- address: ip-10-0-193-180.us-east-2.compute.internal
type: InternalDNS
allocatable:
attachable-volumes-aws-ebs: "39"
cpu: 3500m
ephemeral-storage: "114396791822"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 14978424Ki
pods: "250"
capacity:
attachable-volumes-aws-ebs: "39"
cpu: "4"
ephemeral-storage: 125293548Ki
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 16129400Ki
pods: "250"
conditions:
- lastHeartbeatTime: "2022-11-29T10:09:13Z"
lastTransitionTime: "2022-11-29T02:09:35Z"
message: kubelet has sufficient memory available
reason: KubeletHasSufficientMemory
status: "False"
type: MemoryPressure
- lastHeartbeatTime: "2022-11-29T10:09:13Z"
lastTransitionTime: "2022-11-29T02:09:35Z"
message: kubelet has no disk pressure
reason: KubeletHasNoDiskPressure
status: "False"
type: DiskPressure
- lastHeartbeatTime: "2022-11-29T10:09:13Z"
lastTransitionTime: "2022-11-29T02:09:35Z"
message: kubelet has sufficient PID available
reason: KubeletHasSufficientPID
status: "False"
type: PIDPressure
- lastHeartbeatTime: "2022-11-29T10:09:13Z"
lastTransitionTime: "2022-11-29T02:10:36Z"
message: kubelet is posting ready status
reason: KubeletReady
status: "True"
type: Ready
daemonEndpoints:
kubeletEndpoint:
Port: 10250
nodeInfo:
architecture: amd64
bootID: 0c6287da-d00a-4501-a114-0f53064e9808
containerRuntimeVersion: cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
kernelVersion: 4.18.0-372.32.1.el8_6.x86_64
kubeProxyVersion: v1.24.6+5658434
kubeletVersion: v1.24.6+5658434
machineID: ec213c2957969c8f9d0824db59c089d7
operatingSystem: linux
osImage: Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)
systemUUID: ec213c29-5796-9c8f-9d08-24db59c089d7
```
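The node dump above can be triaged programmatically. As a sketch (the helper name `unlabeled_gpu_nodes` is illustrative, operating on node objects decoded from `oc get nodes -o json`), this is the kind of check that surfaces nodes in a GPU machineset that never received the accelerator label and therefore cannot be counted correctly by the autoscaler:

```python
ACCELERATOR_LABEL = "cluster-api/accelerator"

def unlabeled_gpu_nodes(nodes: list) -> list:
    """Return names of nodes that lack the accelerator label."""
    return [n["metadata"]["name"]
            for n in nodes
            if ACCELERATOR_LABEL not in n["metadata"].get("labels", {})]

# Minimal stand-ins for decoded Node objects from the must-gather.
nodes = [
    {"metadata": {"name": "ip-10-0-193-180.us-east-2.compute.internal",
                  "labels": {}}},
    {"metadata": {"name": "ip-10-0-210-38.us-east-2.compute.internal",
                  "labels": {ACCELERATOR_LABEL: ""}}},
]
print(unlabeled_gpu_nodes(nodes))
```

For the node above, the label set contains no `cluster-api/accelerator` key, so it would be flagged.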
Michael, I manually added the `cluster-api/accelerator` label to the machineset.spec.template.spec.metadata.labels; the result is the same as before. must-gather: https://drive.google.com/file/d/1qQbTrThYSAozLqofag1N9ZewQlrcUWOx/view?usp=sharing

thanks Zhaohua, i appreciate all the extra testing =) are you creating a new machineset for the gpu nodes or are you modifying one of the original machinesets? the reason i ask is because i am wondering, if we have a node from the original machineset which does not have gpu capability, and then later the machineset adds the gpu capability, whether the autoscaler might be confused by seeing a node with no gpu capacity and no accelerator label. from the must-gather in comment 71 it looked like there was an old node in the machineset 2c which was created without a gpu, which made me curious.

looking through the must-gather from comment 74, i can see that there is a node which belongs to machineset 2c but has no corresponding machine:

node ip-10-0-204-161.us-east-2.compute.internal

this node does not have gpu capacity and was created before the other nodes from machineset 2c; the other nodes do have the gpu capacity.

node / creation time
ip-10-0-204-161.us-east-2.compute.internal 2022-11-30T01:44:45Z
ip-10-0-218-222.us-east-2.compute.internal 2022-11-30T02:55:59Z
ip-10-0-215-109.us-east-2.compute.internal 2022-11-30T03:24:57Z
ip-10-0-194-174.us-east-2.compute.internal 2022-11-30T03:25:08Z
ip-10-0-221-5.us-east-2.compute.internal 2022-11-30T03:25:07Z

because the autoscaler looks at node objects within a node group (a machineset in our case), i have a feeling it is seeing this node with no gpu capacity and that causes the autoscaler to become confused about what that node group produces. we might need to test this with a fresh machineset so that there are no nodes from the old machines still in the system, or we need to make sure that none of the old node objects are still in the api server when we run the gpu test.
sorry Zhaohua, i missed that the node in question is part of the control plane. just to be clear, i don't think we need to retest this right now. we are going to prioritize this bug in our next sprint, so i will get set up to run tests locally.

Thanks Michael. Yes, for the must-gather from comment 74, I was modifying one of the original machinesets: update the instanceType to g4dn.xlarge, then delete the original machine. After that I also tried creating a new machineset with instanceType g4dn.xlarge, and the result is the same.

Before adding workload:

```
$ oc get machine
NAME                                        PHASE     TYPE          REGION      ZONE         AGE
zhsun30-p92x6-master-0                      Running   m6i.xlarge    us-east-2   us-east-2a   4h6m
zhsun30-p92x6-master-1                      Running   m6i.xlarge    us-east-2   us-east-2b   4h6m
zhsun30-p92x6-master-2                      Running   m6i.xlarge    us-east-2   us-east-2c   4h6m
zhsun30-p92x6-worker-us-east-2a-tb6t9       Running   m6i.xlarge    us-east-2   us-east-2a   4h4m
zhsun30-p92x6-worker-us-east-2b-cb8ll       Running   m6i.xlarge    us-east-2   us-east-2b   4h4m
zhsun30-p92x6-worker-us-east-2b-gcp-vvqnn   Running   g4dn.xlarge   us-east-2   us-east-2b   40m
```

After adding workload:

```
$ oc get machine
NAME                                        PHASE     TYPE          REGION      ZONE         AGE
zhsun30-p92x6-master-0                      Running   m6i.xlarge    us-east-2   us-east-2a   4h41m
zhsun30-p92x6-master-1                      Running   m6i.xlarge    us-east-2   us-east-2b   4h41m
zhsun30-p92x6-master-2                      Running   m6i.xlarge    us-east-2   us-east-2c   4h41m
zhsun30-p92x6-worker-us-east-2a-tb6t9       Running   m6i.xlarge    us-east-2   us-east-2a   4h38m
zhsun30-p92x6-worker-us-east-2b-cb8ll       Running   m6i.xlarge    us-east-2   us-east-2b   4h38m
zhsun30-p92x6-worker-us-east-2b-gcp-6wbvc   Running   g4dn.xlarge   us-east-2   us-east-2b   33m
zhsun30-p92x6-worker-us-east-2b-gcp-84v22   Running   g4dn.xlarge   us-east-2   us-east-2b   33m
zhsun30-p92x6-worker-us-east-2b-gcp-hpzzs   Running   g4dn.xlarge   us-east-2   us-east-2b   33m
zhsun30-p92x6-worker-us-east-2b-gcp-vvqnn   Running   g4dn.xlarge   us-east-2   us-east-2b   75m
```

thanks again Zhaohua, i appreciate your thoroughness. i'm planning to take a deeper look in the next sprint.

making a note here about my latest tests.
Based on some of the feedback from Zhaohua, and an internal examination of the code, I am investigating how we propagate the GPU types internally in the autoscaler through this interface function[0]. In its core scaling routines, the autoscaler attempts to match the requested resource against the available resources on the projected node. This is done as a string comparison against the name of the resource (e.g. "nvidia.com/gpu"), with a fallback to the value "generic". Given that we do not implement the interface function described in the reference link, I am investigating whether the core autoscaler does not become aware of the specific GPU type until after it has already created too many nodes. I will continue to investigate, but it appears we are still having trouble finding the root of this problem.

[0] https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_provider.go#L116

Leaving an update: I am making some progress on this and found a section in the autoscaler code that observes the value of the "cluster-api/accelerator" label on Node objects. I think we will need to make sure that nodes carry this label, and that its value contains the type of GPU added by the node. For us this would look like "cluster-api/accelerator: nvidia.com/gpu". I am working to confirm this and to create a patch for the cluster-autoscaler-operator that will handle adding the label automatically. I will update when I have more information.

It appears that the value "nvidia.com/gpu" is not valid for a label, but as far as I can tell the autoscaler assumes it can match the label to the resource type. I'm going to need to dig deeper on this.

OK, I think I've found the root cause.
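The label-validity problem mentioned above can be checked directly: Kubernetes label values must be at most 63 characters and may contain only alphanumerics, dashes, underscores, and dots, beginning and ending with an alphanumeric. The `/` in "nvidia.com/gpu" therefore makes it unusable as a label value. A small illustrative sketch of that check (not the operator's actual validation code):

```python
import re

# Kubernetes label *values* must be empty, or start and end with an
# alphanumeric character with only [-_.A-Za-z0-9] in between, max 63 chars.
LABEL_VALUE_RE = re.compile(r"^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$")

def is_valid_label_value(value: str) -> bool:
    """Return True if `value` can be used as a Kubernetes label value."""
    return len(value) <= 63 and LABEL_VALUE_RE.match(value) is not None

# The '/' in the full resource name disqualifies it as a label value,
# which is why the GPU type had to be shortened to "nvidia.com".
print(is_valid_label_value("nvidia.com/gpu"))  # False
print(is_valid_label_value("nvidia.com"))      # True
```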
It looks like the autoscaler attempts to match the value from the command line flag for the limits (e.g. `--gpu-total`) against the value of the `cluster-api/accelerator` label when doing a scale up. During the scale up, it checks the value of the label against the value specified in the command line flag to determine whether it can create the new resources. What is happening currently is that we specify the resources in our ClusterAutoscaler resource like this:

```
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 2
```

which becomes a command line flag to the autoscaler that looks like:

```
--gpu-total=nvidia.com/gpu:1:2
```

which means that we would need a label that looks like:

```
cluster-api/accelerator: nvidia.com/gpu
```

and herein lies the problem: that is not a valid label value.

To solve this, I changed the ClusterAutoscaler resource to look like this:

```
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com
        min: 1
        max: 2
```

and then changed the label to look like this:

```
cluster-api/accelerator: nvidia.com
```

which allowed the autoscaler to match the values properly; it observed the limits and did not create extra nodes.

So, what does this all mean for a solution? I have a couple of ideas:

1. Create a KCS article to discuss this solution (or update one if it exists).
2. Change the cluster-autoscaler-operator to use the values from the resourceLimits.gpus type fields to create the labels for the MachineSets. This will also involve some validation of those values.

I have talked with the team about this a little further and we have an idea for an automated solution.
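Putting the working configuration together: the `type` in the ClusterAutoscaler's GPU limits and the node label applied through the MachineSet template must carry the same (label-safe) value. A consolidated sketch based on the values above; the MachineSet name is hypothetical:

```yaml
# ClusterAutoscaler: the GPU limit type must be a valid label value.
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com   # must match the accelerator label value below
        min: 1
        max: 2
---
# GPU-enabled MachineSet: label new nodes so the autoscaler knows they
# will provide GPUs once the drivers finish installing.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: example-gpu-machineset   # hypothetical name
spec:
  template:
    spec:
      metadata:
        labels:
          cluster-api/accelerator: nvidia.com
```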
We will start by writing the KCS article to inform users how to fix this when they encounter it, and also update our documentation to better reflect how the values in the ClusterAutoscaler resource should be used. Second, we will need to add some logic to the way the cluster-autoscaler-operator handles its API, most likely adding a field to the MachineAutoscaler resource, plus validations for the field inputs on both the ClusterAutoscaler and MachineAutoscaler resources. In effect, the user will need to specify the limits for GPUs in the ClusterAutoscaler resource and also specify the associated resource in the MachineAutoscaler resource. I will get started writing the KCS article soon; the code change will take us a little longer to design and plan.

I have proposed an update to the KCS article: https://access.redhat.com/solutions/6055181

I am also working with the team on a design to fix this. Since it will require a change to our API it will take some time to implement, but I will update here when I have a better idea of scheduling.

Quick update about progress on this bug: I am closing some of the PRs I had open, as we have settled on a solution. I am going to create a patch for the CAO that validates the .spec.resourceLimits.gpus field to ensure that the values fall within the guidelines for the accelerator labels. I am also going to add documentation that describes how the user must modify their MachineSet resource to ensure that the GPU limits are respected.

We chose this direction, rather than having the CAO make all the changes, because a change must happen to the MachineSet, and that could be unexpected for some users. Rather than hiding the automation of the accelerator label, we will make it explicit so the process is clearer to end users. I will also talk with the team about adding validations to the .spec.template.spec.metadata.labels field of the MachineSet.
I tested it locally, following the same steps as in comment#94; everything works as expected. I will move this to Verified. Thanks Michael and Milind.

```
I0303 04:58:16.289194  1 scale_up.go:282] Best option to resize: MachineSet/openshift-machine-api/mihuang-hyci1010-sf9sl-worker-us-east-2c
I0303 04:58:16.289299  1 scale_up.go:286] Estimated 9 nodes needed in MachineSet/openshift-machine-api/mihuang-hyci1010-sf9sl-worker-us-east-2c
I0303 04:58:16.289356  1 resource_manager.go:171] Capping scale-up size due to limit for resource nvidia.com
I0303 04:58:16.289377  1 scale_up.go:405] Final scale-up plan: [{MachineSet/openshift-machine-api/mihuang-hyci1010-sf9sl-worker-us-east-2c 1->2 (max: 5)}]
I0303 04:58:16.289412  1 scale_up.go:608] Scale-up: setting group MachineSet/openshift-machine-api/mihuang-hyci1010-sf9sl-worker-us-east-2c size to 2
```

```
$ oc get machine
NAME                                             PHASE     TYPE          REGION      ZONE         AGE
mihuang-hyci1010-sf9sl-master-0                  Running   m6i.xlarge    us-east-2   us-east-2a   161m
mihuang-hyci1010-sf9sl-master-1                  Running   m6i.xlarge    us-east-2   us-east-2b   161m
mihuang-hyci1010-sf9sl-master-2                  Running   m6i.xlarge    us-east-2   us-east-2c   161m
mihuang-hyci1010-sf9sl-worker-us-east-2a-82h84   Running   m6i.xlarge    us-east-2   us-east-2a   156m
mihuang-hyci1010-sf9sl-worker-us-east-2b-xd2dk   Running   m6i.xlarge    us-east-2   us-east-2b   156m
mihuang-hyci1010-sf9sl-worker-us-east-2c-9m5jq   Running   g4dn.xlarge   us-east-2   us-east-2c   3m27s
mihuang-hyci1010-sf9sl-worker-us-east-2c-lvrt4   Running   g4dn.xlarge   us-east-2   us-east-2c   36m
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.13.0 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:1326