Bug 1943194 - when using gpus, more nodes than needed are created by the node autoscaler
Summary: when using gpus, more nodes than needed are created by the node autoscaler
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.13.0
Assignee: Michael McCune
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-25 14:47 UTC by raffaele spazzoli
Modified: 2023-05-17 22:46 UTC
CC: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cluster autoscaler is used with GPU min/max limits enabled and with workloads that require GPU access to schedule, while the "cluster-api/accelerator" label is not set on Nodes with GPUs.
Consequence: Due to the time required to install the GPU drivers on the OpenShift node, it is possible in some cases for the autoscaler to create extra nodes in the GPU-enabled MachineSet.
Fix: The cluster-autoscaler-operator has been modified to warn the user when their MachineSets do not have the appropriate labels for GPU awareness, and when the user has submitted an invalid value for the GPU type in the ClusterAutoscaler resource.
Result: The autoscaler is able to more accurately detect when GPU-enabled nodes have their drivers installed, and thus will not create the extra nodes.
Clone Of:
Environment:
Last Closed: 2023-05-17 22:46:32 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-autoscaler-operator pull 223 (Merged): Bug 1943194: add logic to detect GPU capacity and update accordingly (last updated 2022-11-07 09:14:22 UTC)
Github openshift cluster-autoscaler-operator pull 268 (open): Bug 1943194: update GPU resource limits type to have validation (last updated 2023-01-26 22:23:09 UTC)
Red Hat Knowledge Base (Solution) 6055181 (last updated 2022-01-21 15:08:03 UTC)
Red Hat Product Errata RHSA-2023:1326 (last updated 2023-05-17 22:46:44 UTC)

Description raffaele spazzoli 2021-03-25 14:47:37 UTC
Description of problem:

I observe the following:
1. a pod needing gpu resources cannot be scheduled for lack of resources
2. the node autoscaler decides to create a new node.
3. the new node is created and joins the cluster, but the pod still cannot be scheduled because the node is not ready to receive gpu workloads, as the nfd and gpu operators are still working.
4. the node autoscaler decides to create another node
... this continues multiple times until:
X. the pod is scheduled.


Version-Release number of selected component (if applicable):
OCP 4.7.0
nfd.4.6.0-202103010126.p0
gpu-operator-certified.v1.6.2

How reproducible:
100% of the time in the cluster I was working on.


Steps to Reproduce:
1. create a machineset with gpu machines
2. deploy nfd and gpu operator to work on the nodes of the above machineset
3. fill up the initial nodes with gpu workload until you reach the available capacity and a pod cannot be scheduled.

Actual results:

described above


Expected results:

If possible, no additional nodes should be created.


Additional info:

Comment 1 Kevin Pouget 2021-04-08 10:40:26 UTC
I've been able to reproduce the problem, which is known upstream (cf this note: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#special-note-on-gpu-instances).


The upstream solution/workaround is to add a label to the newly created nodes, so that the autoscaler knows that they will have GPUs once the driver and everything else is loaded.


OpenShift runs the Kubernetes autoscaler with "--cloud-provider=clusterapi", so the label to use is "cluster-api/accelerator=true" [1] (there are no known GPU types for this cloud provider).


To get the new nodes automatically labeled, the label must be specified in the MachineSet used by the auto-scaler:

> apiVersion: machine.openshift.io/v1beta1
> kind: MachineSet
> metadata:
> spec:
>   template:
>     spec:
>       metadata:
>         labels:
>           cluster-api/accelerator: "true"


1: https://github.com/kubernetes/autoscaler/blob/6432771415846dc0f4ff9ee71dfd307c4e72aa9e/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_provider.go#L40

Comment 2 Kevin Pouget 2021-04-08 14:33:46 UTC
I changed the component to "Machine Config Operator" as there is no component directly related to the autoscaler or the openshift-machine-api

Comment 3 Yu Qi Zhang 2021-04-08 17:59:20 UTC
> I changed the component to "Machine Config Operator" as there is no component directly related to the autoscaler or the openshift-machine-api

There is. It's not directly called machine-api but rather "cloud compute". Autoscaler is a sub-component.

Comment 4 Ashish Kamra 2021-04-08 18:10:24 UTC
Thanks Yu. Cloud compute team - so is this a matter of updating the OpenShift docs to document the current behavior upstream - https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#special-note-on-gpu-instances?

Comment 5 Michael McCune 2021-04-08 21:18:24 UTC
(In reply to Ashish Kamra from comment #4)
> Thanks Yu. Cloud compute team - so is this a matter of updating the
> OpenShift docs to document the current behavior upstream -
> https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/
> cloudprovider/aws#special-note-on-gpu-instances?

at the very least that sounds like a reasonable first step. i think we should probably consider handling this through a feature on our cluster-autoscaler-operator, or attempting to create a fix upstream. the reason i suggest adding a feature to the cluster-autoscaler-operator is that we could effectively know when groups with gpus will be utilized and could automate the addition of the node labels in those cases.

Comment 7 Lucas López Montero 2021-05-19 10:26:48 UTC
KCS article written: https://access.redhat.com/solutions/6055181.

Comment 10 Michael McCune 2021-05-25 16:02:26 UTC
just wanted to report back after the sig autoscaling meeting.

we have a couple options to fix this which i am going to investigate.

* option 1, we can create a node label, similar to what the upstream uses for AWS and GKE[0], that would apply to the cluster-api provider implementation that we use in the cluster autoscaler. for openshift this will also require a change to how/when we apply these node labels.

* option 2, we can create a more generic approach which the cluster-autoscaler would use internally to mitigate this issue for all cloud providers.

option 1 is the most direct fix, and i will start investigating how we could do this in the upstream and openshift. most likely this work will not land for the 4.8 release.

the upstream community is open to accepting option 2 as a possible solution, but this will require more research to determine the best methodology for introducing this behavior strictly in the autoscaler. i am hoping to investigate option 2 while implementing option 1.

regardless of which option we implement, i would expect these changes to land in openshift once feature freeze for 4.8 has ended.


[0] https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#special-note-on-gpu-instances

Comment 11 Michael McCune 2021-05-25 20:48:58 UTC
i have created a jira card so that our team can prioritize and plan this work, https://issues.redhat.com/browse/OCPCLOUD-1180

Comment 12 Lucas López Montero 2021-06-21 07:31:27 UTC
Is there any update regarding this issue? Thank you very much.

Comment 13 Kevin Pouget 2021-06-21 07:59:56 UTC
Michael, I was wondering if this couldn't be fixed with the help of NFD labels [0][1]:

* the NVIDIA GPU Operator relies on NFD labels to discover GPU nodes (any of these 3) [2]:

> var gpuNodeLabels = map[string]string{
> 	"feature.node.kubernetes.io/pci-10de.present":      "true",
> 	"feature.node.kubernetes.io/pci-0302_10de.present": "true",
> 	"feature.node.kubernetes.io/pci-0300_10de.present": "true",
> }


* newly-created nodes will be marked with NFD labels much sooner than their `nvidia.com/gpu` resource will appear (it only takes spawning an NFD worker pod that performs an lspci)


but "much sooner" isn't immediate, as a MachineSet node label would be ...

0: https://github.com/openshift/cluster-nfd-operator
1: https://github.com/kubernetes-sigs/node-feature-discovery
2: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/blob/master/controllers/state_manager.go#L39
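
For illustration only: a minimal Go sketch, not GPU Operator code, of the membership check Kevin describes; gpuNodeLabels is copied from the snippet quoted above and hasGPUNodeLabel is a hypothetical helper.

```
package main

import "fmt"

// The three NFD labels that the NVIDIA GPU Operator treats as "this node has
// an NVIDIA GPU" (copied from the gpuNodeLabels map quoted above).
var gpuNodeLabels = map[string]string{
	"feature.node.kubernetes.io/pci-10de.present":      "true",
	"feature.node.kubernetes.io/pci-0302_10de.present": "true",
	"feature.node.kubernetes.io/pci-0300_10de.present": "true",
}

// hasGPUNodeLabel reports whether a node's labels contain any of the NFD GPU
// labels above. Hypothetical helper, for illustration only.
func hasGPUNodeLabel(nodeLabels map[string]string) bool {
	for key, value := range gpuNodeLabels {
		if nodeLabels[key] == value {
			return true
		}
	}
	return false
}

func main() {
	node := map[string]string{"feature.node.kubernetes.io/pci-10de.present": "true"}
	fmt.Println(hasGPUNodeLabel(node)) // true
}
```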

Comment 14 Michael McCune 2021-06-21 21:22:42 UTC
@llopezmo 

our team has discussed this and we have planned to do the work. it will take some time to plan it and create the necessary code/tests/docs. please follow the jira card for the latest information: https://issues.redhat.com/browse/OCPCLOUD-1180

i am hopeful we will schedule this work during the 4.9 release cycle, but i can't give a more accurate estimate than that.


@kpouget 

that is good info, thank you! we might be able to use those labels (at least for nvidia), but i think we need something more generic if it is to be used in the cluster-autoscaler for a wider solution.

the real sticky part of this issue is that we will need to add this functionality into the cluster-autoscaler to ensure that when it simulates a scaling event, it takes into account the nodes that are marked with GPUs but have not become available for scheduling GPU workloads yet.

Comment 15 Lucas López Montero 2021-06-22 06:57:10 UTC
mimccune I appreciate your update. Thank you very much. I will follow up on the Jira ticket.

Comment 16 Michael McCune 2021-07-20 19:28:07 UTC
@rspazzol i am working to reproduce this so that i can instrument the cluster-autoscaler to learn a little more about why it's failing (there is some code in the autoscaler that should be reacting to this situation). would you be able to create a must-gather with autoscaler logs for when this is happening on your cluster?

Comment 17 raffaele spazzoli 2021-07-20 19:40:14 UTC
@michael I don't have that environment up and running anymore.
you can recreate my environment by following these instructions:
https://github.com/raffaelespazzoli/kubeflow-ocp

Comment 18 Michael McCune 2021-07-20 19:44:24 UTC
thanks raffaele!

Comment 19 Michael McCune 2021-08-03 20:11:42 UTC
just wanted to leave a comment. i am making some progress on fixing this, i will continue to update the jira ticket https://issues.redhat.com/browse/OCPCLOUD-1180 with my findings.

Comment 20 Michael McCune 2021-08-26 17:07:58 UTC
(cross posting this from the jira ticket)

i have been testing out this interaction and i am not able to reproduce the error condition where it creates too many nodes. perhaps the timing is better on my cluster and so the gpu driver compilation is happening quick enough to prevent the autoscaler from creating more nodes.

but, with that said, i have confirmed that applying the label `cluster-api/accelerator` in the `machineset.spec.template.spec.metadata` will cause the autoscaler to consider those nodes as unready until the gpu driver has been deployed.

in the short term, we need to make an errata (or perhaps a knowledge base article, i'm not sure) that instructs users to add the `cluster-api/accelerator` label to their machinesets that will be used with gpu instances and autoscaling. in full, it should look something like this:

{{
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
spec:
  template:
    spec:
      metadata:
        labels:
          cluster-api/accelerator: ""
}}

(this example only shows the affected fields)

in the longer term we will need to add logic to our machine controller actuators that will add the label automatically when it detects an instance type that uses gpus. we should contribute this work to the upstream cluster-api community as they need this patch as well. but, this will take us slightly longer to complete, so the errata will help users who are impacted immediately.
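
For illustration of the longer-term idea above, a minimal Go sketch (not actuator code) that maps instance types to GPU counts and adds the accelerator label to a MachineSet template's labels, represented here as a plain map. The hard-coded table is hypothetical (its entries match the instance types used later in this bug); a real implementation would query the cloud provider's instance metadata instead.

```
package main

import "fmt"

// Hypothetical lookup of GPU counts per instance type, for illustration only.
var gpuCountByInstanceType = map[string]int{
	"p2.8xlarge":  8,
	"g4dn.xlarge": 1,
}

// labelTemplateIfGPU adds the accelerator label to a MachineSet's template
// labels (a plain map here) when the instance type is known to have GPUs.
func labelTemplateIfGPU(instanceType string, templateLabels map[string]string) {
	if gpuCountByInstanceType[instanceType] > 0 {
		templateLabels["cluster-api/accelerator"] = ""
	}
}

func main() {
	labels := map[string]string{}
	labelTemplateIfGPU("g4dn.xlarge", labels)
	fmt.Println(labels) // map[cluster-api/accelerator:]
}
```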

Comment 22 Kevin Pouget 2021-08-27 07:34:55 UTC
Michael,

> i have been testing out this interaction and i am not able to reproduce the error condition where it creates too many nodes. perhaps the timing is better on my cluster and so the gpu driver compilation is happening quick enough to prevent the autoscaler from creating more nodes.

maybe an easy way to reproduce the original issue is to simply not deploy the GPU Operator. This way the autoscaler will create a new machine, but the node will never gain a `nvidia.com/gpu` resource. If the autoscaler creates another machine --> bug; if the autoscaler waits forever for the node to become "GPU-ready" --> bug fixed

Comment 23 Michael McCune 2021-08-27 12:42:24 UTC
(In reply to Kevin Pouget from comment #22)
> maybe an easy way to reproduce the original issue is to simply not deploy
> the GPU Operator. This way the autoscaler will create a new machine, but the
> node will never gain a `nvidia.com/gpu` resource. If the autoscaler creates
> another machine --> bug; if the autoscaler waits forever for the node to
> become "GPU-ready" --> bug fixed

that's an interesting idea, i will give it a try. there is definitely a bug here, but i think it has more to do with our lack of labeling on these nodes.

Comment 24 Michael McCune 2021-08-27 17:37:21 UTC
just wanted to report back after trying out Kevin's suggestion. the cluster did not do what i expected. basically, here is what i did

1. create a cluster
2. start autoscaler
3. add nfd
4. create machineset with gpu instance, add to autoscaling
5. start a deployment with a gpu resource limit

i expected to see some activity, but the autoscaler never considered my machineset as an option for scaling. i'm not sure why yet. i confirmed that the machineset did advertise gpu availability, but the autoscaler did not consider it a viable candidate. i will probably need to dig deeper to understand this, but i think the label option is our best "fix" for the time being.

Comment 25 Michael McCune 2021-10-05 21:03:24 UTC
after some investigation and discussions, i think we have a short term solution and a longer term solution.

in the short term, i have created this patch[0] for our cluster-autoscaler-operator which will look for MachineSets that have GPU capacity and then label them properly so that the autoscaler will invoke its custom GPU node processor. i am testing this patch out on a live cluster but i have a good feeling it will alleviate the over-provisioning.

in the longer term, i am working with the upstream cluster-api community to ensure that our infrastructure provider controllers are properly labeling MachineSets when GPU capacity is detected. this might take several releases to complete, though, as it will depend on community support from the upstream.


[0] https://github.com/openshift/cluster-autoscaler-operator/pull/223
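
For illustration only: a rough Go sketch of the detection idea behind the patch, not the PR's actual code. It keys off the machine.openshift.io/GPU capacity annotation that appears on the MachineSets in the tests below; annotations and labels are plain maps here instead of API objects, and needsAcceleratorLabel is a hypothetical helper.

```
package main

import (
	"fmt"
	"strconv"
)

// needsAcceleratorLabel returns true when a MachineSet advertises GPU
// capacity via its machine.openshift.io/GPU annotation but its template
// labels are missing the cluster-api/accelerator label.
func needsAcceleratorLabel(annotations, templateLabels map[string]string) bool {
	gpus, err := strconv.Atoi(annotations["machine.openshift.io/GPU"])
	if err != nil || gpus == 0 {
		return false
	}
	_, labeled := templateLabels["cluster-api/accelerator"]
	return !labeled
}

func main() {
	annotations := map[string]string{"machine.openshift.io/GPU": "8"}
	fmt.Println(needsAcceleratorLabel(annotations, map[string]string{})) // true
}
```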

Comment 29 sunzhaohua 2021-10-19 07:46:52 UTC
@Michael, in my testing, more nodes than needed are still created. I feel we need to update this file https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scale_up.go#L111 and add some code to calculate the gpu total. I set the gpu min/max to 1/16, and the instance type is "p2.8xlarge" with 8 gpus, so I expect two nodes at most, but it scales up to 5 nodes.

1. create a cluster
2. create clusterautoscaler
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 16
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s

3. create a machineset with "p2.8xlarge" and add `cluster-api/accelerator` label
https://privatebin-it-iso.int.open.paas.redhat.com/?ce2bb51b9fa5130a#26BxMgeYQLoADngVFnfBjzBo4Z7LHb6xVm6gjvX6kkm3
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "8"
    machine.openshift.io/memoryMb: "499712"
    machine.openshift.io/vCPU: "32"
...
    spec:
      metadata:
        labels:
          cluster-api/accelerator: ""
      providerSpec:
4. create machineautoscaler
$ oc get machineautoscaler
NAME                REF KIND     REF NAME                               MIN   MAX   AGE
machineautoscaler   MachineSet   zhsunaws1018-58vff-worker-us-east-2c   1     5     2m48s

5. add workload to scale up
workload: https://privatebin-it-iso.int.open.paas.redhat.com/?aac7e94374510001#6p7DFH9A9v9CJGQZLGrpSPhC7D5Wui8QpdeNd2qL7UjR

6. check autoscaler log and machines
I1019 07:29:46.619912       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunaws1018-58vff-worker-us-east-2c
I1019 07:29:46.619935       1 scale_up.go:472] Estimated 4 nodes needed in MachineSet/openshift-machine-api/zhsunaws1018-58vff-worker-us-east-2c
I1019 07:29:46.818502       1 scale_up.go:586] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunaws1018-58vff-worker-us-east-2c 1->5 (max: 5)}]
I1019 07:29:46.818540       1 scale_up.go:675] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaws1018-58vff-worker-us-east-2c size to 5
W1019 07:29:58.243046       1 clusterapi_controller.go:455] Machine "zhsunaws1018-58vff-worker-us-east-2c-twxrt" has no providerID
W1019 07:29:58.243064       1 clusterapi_controller.go:455] Machine "zhsunaws1018-58vff-worker-us-east-2c-4hgmb" has no providerID
W1019 07:29:58.268297       1 clusterapi_controller.go:455] Machine "zhsunaws1018-58vff-worker-us-east-2c-twxrt" has no providerID
W1019 07:29:58.268318       1 clusterapi_controller.go:455] Machine "zhsunaws1018-58vff-worker-us-east-2c-4hgmb" has no providerID

$ oc get machine
NAME                                         PHASE          TYPE         REGION      ZONE         AGE
zhsunaws1018-58vff-master-0                  Running        m5.xlarge    us-east-2   us-east-2a   19h
zhsunaws1018-58vff-master-1                  Running        m5.xlarge    us-east-2   us-east-2b   19h
zhsunaws1018-58vff-master-2                  Running        m5.xlarge    us-east-2   us-east-2c   19h
zhsunaws1018-58vff-worker-us-east-2a-cwf8v   Running        m5.large     us-east-2   us-east-2a   19h
zhsunaws1018-58vff-worker-us-east-2b-2gs8l   Running        m5.large     us-east-2   us-east-2b   19h
zhsunaws1018-58vff-worker-us-east-2c-4hgmb   Provisioning                                         10m
zhsunaws1018-58vff-worker-us-east-2c-65cwv   Running        p2.8xlarge   us-east-2   us-east-2c   10m
zhsunaws1018-58vff-worker-us-east-2c-hbg2q   Running        p2.8xlarge   us-east-2   us-east-2c   22m
zhsunaws1018-58vff-worker-us-east-2c-mwvdx   Running        p2.8xlarge   us-east-2   us-east-2c   10m
zhsunaws1018-58vff-worker-us-east-2c-twxrt   Provisioning                                         10m

$ oc edit machine zhsunaws1018-58vff-worker-us-east-2c-4hgmb
  providerStatus:
    conditions:
    - lastTransitionTime: "2021-10-19T07:29:57Z"
      message: "error creating EC2 instance: InsufficientInstanceCapacity: We currently do not have sufficient p2.8xlarge capacity in the Availability Zone you requested (us-east-2c). Our system will be working on provisioning additional capacity. You can currently get p2.8xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2a, us-east-2b.\n\tstatus code: 500, request id: 74aaed8c-f18f-4728-8c4a-02cdba9278b1"
      reason: MachineCreationFailed

Comment 30 Michael McCune 2021-10-19 18:47:22 UTC
@zhsun thanks for the thorough test, i think there are a couple issues though.

1. you shouldn't need to manually add the `cluster-api/accelerator` label to the machineset, the CAO should do this automatically.
2. the workload you specified is asking for 80 replicas with each replica requesting 32Gi of memory. we should change this request to be a limit of `nvidia.com/gpu: 1` to ensure that the scheduler is trying to find gpu-enabled nodes to run on. i have a feeling the extra nodes are because we are asking for a large amount of memory with each replica. with the request changed to a limit, we should be able to use 40 replicas, which /should/ create 5 nodes (8 GPUs per node, 5 nodes in the node group).

this is a sample deployment i have used for testing this fix

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-sleep
  template:
    metadata:
      labels:
        app: gpu-sleep
    spec:
      containers:
      - name: gpu-sleep
        image: quay.io/elmiko/busybox
        resources:
          limits:
            nvidia.com/gpu: 1
        command:
          - sleep
          - "3600"
```

Comment 31 sunzhaohua 2021-10-22 14:55:45 UTC
@Michael thank you. I tested again, and more nodes than needed are still created. I still set the gpu min/max to 1/16, and the instance type is "p2.8xlarge" with 8 gpus, so I expect two nodes at most, but it scales up to 4 nodes. Am I missing something?

1. create a cluster
2. create a machineset with "p2.8xlarge" 
metadata:
  annotations:
    autoscaling.openshift.io/machineautoscaler: openshift-machine-api/machineautoscaler
    machine.openshift.io/GPU: "8"
    machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "5"
    machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
    machine.openshift.io/memoryMb: "499712"
    machine.openshift.io/vCPU: "32"
  creationTimestamp: "2021-10-22T03:50:07Z"
  generation: 15
  labels:
    cluster-api/accelerator: ""
3. deploy nfd and gpu operator refer to https://docs.nvidia.com/datacenter/cloud-native/openshift/cluster-entitlement.html
$ oc get pods,daemonset -n gpu-operator-resources                                       [22:54:11]
NAME                                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
daemonset.apps/gpu-feature-discovery                0         0         0       0            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   6h12m
daemonset.apps/nvidia-container-toolkit-daemonset   0         0         0       0            0           nvidia.com/gpu.deploy.container-toolkit=true       6h12m
daemonset.apps/nvidia-dcgm                          0         0         0       0            0           nvidia.com/gpu.deploy.dcgm=true                    6h12m
daemonset.apps/nvidia-dcgm-exporter                 0         0         0       0            0           nvidia.com/gpu.deploy.dcgm-exporter=true           6h12m
daemonset.apps/nvidia-device-plugin-daemonset       0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true           6h12m
daemonset.apps/nvidia-driver-daemonset              0         0         0       0            0           nvidia.com/gpu.deploy.driver=true                  6h12m
daemonset.apps/nvidia-mig-manager                   0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             6h12m
daemonset.apps/nvidia-node-status-exporter          0         0         0       0            0           nvidia.com/gpu.deploy.node-status-exporter=true    6h12m
daemonset.apps/nvidia-operator-validator            0         0         0       0            0           nvidia.com/gpu.deploy.operator-validator=true      6h12m

4. create clusterautoscaler
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 16
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s

5. create machineautoscaler
$  oc get machineautoscaler                                                              
NAME                REF KIND     REF NAME                            MIN   MAX   AGE
machineautoscaler   MachineSet   zhsun1022-b96mx-worker-us-east-2c   1     5     136m

6. add workload to scale up
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-sleep
spec:
  replicas: 30
  selector:
    matchLabels:
      app: gpu-sleep
  template:
    metadata:
      labels:
        app: gpu-sleep
    spec:
      containers:
      - name: gpu-sleep
        image: quay.io/elmiko/busybox
        resources:
          limits:
            nvidia.com/gpu: 1
        command:
          - sleep
          - "3600"
7. check autoscaler log and machines
I1022 14:20:40.246003       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsun1022-b96mx-worker-us-east-2c
I1022 14:20:40.246023       1 scale_up.go:472] Estimated 3 nodes needed in MachineSet/openshift-machine-api/zhsun1022-b96mx-worker-us-east-2c
I1022 14:20:40.443299       1 scale_up.go:586] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsun1022-b96mx-worker-us-east-2c 1->4 (max: 5)}]
I1022 14:20:40.443328       1 scale_up.go:675] Scale-up: setting group MachineSet/openshift-machine-api/zhsun1022-b96mx-worker-us-east-2c size to 4

Comment 32 sunzhaohua 2021-10-25 07:36:39 UTC
@Michael I tested again with instance type "g4dn.xlarge" with 1 gpu, and more nodes are still created. The clusterautoscaler gpu min/max is set to 1/2, so I expect 2 nodes at most, but it scales up to 5 nodes.

1. create a cluster
2. create a machineset with "g4dn.xlarge" 
  annotations:
    autoscaling.openshift.io/machineautoscaler: openshift-machine-api/machineautoscaler
    machine.openshift.io/GPU: "1"
    machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "5"
    machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"
  labels:
    cluster-api/accelerator: ""
3. deploy nfd and gpu operator refer to https://docs.nvidia.com/datacenter/cloud-native/openshift/cluster-entitlement.html
$ oc get pods,daemonset -n gpu-operator-resources                                                                              [15:33:00]
NAME                                           READY   STATUS      RESTARTS   AGE
pod/gpu-feature-discovery-9bm5w                1/1     Running     0          25m
pod/nvidia-container-toolkit-daemonset-zm9f9   1/1     Running     0          25m
pod/nvidia-cuda-validator-jvlb7                0/1     Completed   0          20m
pod/nvidia-dcgm-9b6kx                          1/1     Running     0          25m
pod/nvidia-dcgm-exporter-kbbw7                 1/1     Running     0          25m
pod/nvidia-device-plugin-daemonset-j2kds       1/1     Running     0          25m
pod/nvidia-device-plugin-validator-9vncc       0/1     Completed   0          19m
pod/nvidia-driver-daemonset-f65rw              1/1     Running     0          25m
pod/nvidia-node-status-exporter-dj4s4          1/1     Running     0          25m
pod/nvidia-operator-validator-hkhr9            1/1     Running     0          25m

NAME                                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
daemonset.apps/gpu-feature-discovery                1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   25m
daemonset.apps/nvidia-container-toolkit-daemonset   1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       25m
daemonset.apps/nvidia-dcgm                          1         1         1       1            1           nvidia.com/gpu.deploy.dcgm=true                    25m
daemonset.apps/nvidia-dcgm-exporter                 1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true           25m
daemonset.apps/nvidia-device-plugin-daemonset       1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true           25m
daemonset.apps/nvidia-driver-daemonset              1         1         1       1            1           nvidia.com/gpu.deploy.driver=true                  25m
daemonset.apps/nvidia-mig-manager                   0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             25m
daemonset.apps/nvidia-node-status-exporter          1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true    25m
daemonset.apps/nvidia-operator-validator            1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true

4. create clusterautoscaler
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 2
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s

5. create machineautoscaler                                                           
$ oc get machineautoscaler                                                                                                     [15:34:04]
NAME                REF KIND     REF NAME                            MIN   MAX   AGE
machineautoscaler   MachineSet   zhsun1025-snfmv-worker-us-east-2c   1     5     14m

6. add workload to scale up
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-sleep
spec:
  replicas: 10
  selector:
    matchLabels:
      app: gpu-sleep
  template:
    metadata:
      labels:
        app: gpu-sleep
    spec:
      containers:
      - name: gpu-sleep
        image: quay.io/elmiko/busybox
        resources:
          limits:
            nvidia.com/gpu: 1
        command:
          - sleep
          - "3600"
7. check autoscaler log and machines
I1025 07:31:11.458630       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsun1025-snfmv-worker-us-east-2c
I1025 07:31:11.458655       1 scale_up.go:472] Estimated 9 nodes needed in MachineSet/openshift-machine-api/zhsun1025-snfmv-worker-us-east-2c
I1025 07:31:11.653606       1 scale_up.go:586] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsun1025-snfmv-worker-us-east-2c 1->5 (max: 5)}]
I1025 07:31:11.653638       1 scale_up.go:675] Scale-up: setting group MachineSet/openshift-machine-api/zhsun1025-snfmv-worker-us-east-2c size to 5
I1025 07:31:23.268925       1 static_autoscaler.go:335] 4 unregistered nodes present

Comment 33 Michael McCune 2021-10-25 13:41:18 UTC
(In reply to sunzhaohua from comment #31)
> @Michael thank you. I tested again, and more nodes than needed are still
> created. I still set the gpu min/max to 1/16, and the instance type is
> "p2.8xlarge" with 8 gpus, so I expect two nodes at most, but it scales up
> to 4 nodes. Am I missing something?
> 

this actually appears to have created the appropriate number of machines.

your deployment is asking for 30 replicas, each one asking for a gpu.

the instances have 8 gpus each.

4 instances would be 32 gpus, which would fit the 30 replicas you asked for. it also does not create an extra node (5 would be the max for that node group)

Comment 34 Michael McCune 2021-10-25 13:45:22 UTC
(In reply to sunzhaohua from comment #32)
> @Michael I tested again with instance type "g4dn.xlarge" with 1 gpu, and more
> nodes are still created. The clusterautoscaler gpu min/max is set to 1/2, so I
> expect 2 nodes at most, but it scales up to 5 nodes.
> 

it appears that the autoscaler created the appropriate number of instances (within its limits)

your deployment asked for 10 replicas, each requesting 1 gpu.

each instance only has a single gpu, and the node group maximum is 5.

the autoscaler scaled up to 5 instances (its max), and it should have had 5 pods pending since it could not make more nodes.


i think the tests you have shown are both accurate in terms of the expected activity. if you want to craft another test though, i would suggest re-running the second test but set the replicas to 3 on the deployment. you should see 3 nodes in the autoscaler node group at the end, so it should create 2 if it starts with 1. with these levels you should be able to see that the autoscaler creates the appropriate number of instances without creating too many.

Comment 35 Michael McCune 2021-10-25 13:50:05 UTC
oops Zhaohua, i just noticed the min/max settings on that last run. my apologies. that does look like a bug, i'll have to investigate the min/max issue.

Comment 36 Michael McCune 2021-10-26 19:55:17 UTC
@zhsun would it be possible for you to run the last test with the autoscaler using `--v=4` and capture the log file ?

Comment 39 sunzhaohua 2021-10-27 05:49:41 UTC
must-gather: http://file.rdu.redhat.com/~zhsun/must-gather.local.7665988797093375454.tar.gz	
The machines keep getting created and then deleted, in a loop

Comment 40 Joel Speed 2021-11-26 12:19:48 UTC
Just had a quick look through what's going on in the logs, I think there's an issue with the cluster autoscaler configuration being used in this test.

  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s

Because we have unneededTime as 10s, I don't think the GPU operator is getting enough time to initialise the Nodes, which means the pods can't schedule, which causes the nodes to scale away because they are empty.

Can we try again but increase all of the scale down timings a bit, maybe make them all 120s to give some more time for things to settle?

Also, looking at the gpu sleep, you are requesting 10 replicas, each replica requests 1 gpu, so you will need to create 10 instances to fulfil this request.
Could we perhaps change this to 3 replicas instead? This would be in the middle of the scale up range, we expect it to scale to 3 replicas, but it could scale to 5 if there was a bug, so if it only creates 3 machines, we know it's not over creating.

Comment 41 Kevin Pouget 2021-11-26 12:34:34 UTC
just to clarify this point:

> Because we have unneededTime as 10s, I don't think the GPU operator is getting enough time to initialise the Nodes

in our nightly CI, it takes ~8min to wait for the full deployment of the GPU computing stack:

> Playbook run took 0 days, 0 hours, 8 minutes, 34 seconds

of which ~7min are spent waiting for the driver deployment

> Thursday 25 November 2021  23:53:52 +0000 (0:00:00.022)       0:00:06.943 ***** 
> TASK: gpu_operator_wait_deployment : Wait for the GPU Operator to validate the driver deployment

so 8 minutes seems to be a good estimate of the time between the moment the operator is deployed (or the node is ready) and the moment GPU Pods start to execute

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-psap-ci-artifacts-release-4.9-gpu-operator-e2e-18x/1464006155771580416/artifacts/gpu-operator-e2e-18x/nightly/artifacts/010__gpu_operator__wait_deployment/_ansible.log

Comment 42 Michael McCune 2021-11-30 19:16:21 UTC
sorry, i haven't had a chance to come back to this issue. i concur with what Joel is saying though, i have a feeling the failure we are seeing might be related to our specific configuration. if we run this test with values that a customer might use, or something closer to those values, i think we will see the tests pass.

Comment 43 sunzhaohua 2021-12-01 08:32:38 UTC
sorry for the confusion, I will try again with unneededTime: 10m and post result here.

Comment 44 sunzhaohua 2021-12-01 14:40:26 UTC
(In reply to Joel Speed from comment #40)
> Also, looking at the gpu sleep, you are requesting 10 replicas, each replica
> requests 1 gpu, so you will need to create 10 instances to fulfil this
> request.
> Could we perhaps change this to 3 replicas instead? This would be in the
> middle of the scale up range, we expect it to scale to 3 replicas, but it
> could scale to 5 if there was a bug, so if it only creates 3 machines, we
> know it's not over creating.

@Joel the clusterautoscaler gpu min/max setting is 1/2, so I think it should create 2 machines at most; if it creates 3 machines, it's still over-creating.
 
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 2

Comment 45 Joel Speed 2021-12-01 16:13:31 UTC
Right, ok, I misunderstood how the test was being controlled there, this makes more sense now.
Not sure why that's not working as expected, will do some investigation

Comment 46 Michael McCune 2022-01-04 21:39:25 UTC
@zhsun did you run the test again with the 10 minute unneededTime?

i'm just curious where we left off on this bug.

Comment 47 sunzhaohua 2022-01-05 14:02:24 UTC
@Michael I tested it with the clusterautoscaler below, with all scale down timings set to 10m. The pods can schedule, the machines are running and do not scale away, so this works as expected. The only issue is that the clusterautoscaler gpu min/max setting is 1/2, so it should have 2 machines at most, but it creates more machines than needed. For example, in my testing it created another 4 new machines.

apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 2
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 10m
    delayAfterFailure: 10m
    unneededTime: 10m

$ oc get machine                                                                                                                                                                                                     
NAME                                      PHASE     TYPE          REGION      ZONE         AGE
zhsunaws5-fzz8v-master-0                  Running   m6i.xlarge    us-east-2   us-east-2a   12h
zhsunaws5-fzz8v-master-1                  Running   m6i.xlarge    us-east-2   us-east-2b   12h
zhsunaws5-fzz8v-master-2                  Running   m6i.xlarge    us-east-2   us-east-2c   12h
zhsunaws5-fzz8v-worker-us-east-2a-rgkg9   Running   m6i.large     us-east-2   us-east-2a   12h
zhsunaws5-fzz8v-worker-us-east-2b-cl464   Running   m6i.large     us-east-2   us-east-2b   12h
zhsunaws5-fzz8v-worker-us-east-2c-486fz   Running   g4dn.xlarge   us-east-2   us-east-2c   19m
zhsunaws5-fzz8v-worker-us-east-2c-9lj8f   Running   g4dn.xlarge   us-east-2   us-east-2c   19m
zhsunaws5-fzz8v-worker-us-east-2c-gjhzz   Running   g4dn.xlarge   us-east-2   us-east-2c   19m
zhsunaws5-fzz8v-worker-us-east-2c-jbqgf   Running   g4dn.xlarge   us-east-2   us-east-2c   57m
zhsunaws5-fzz8v-worker-us-east-2c-sph7h   Running   g4dn.xlarge   us-east-2   us-east-2c   19m

Comment 48 Michael McCune 2022-01-06 18:12:21 UTC
thanks for the update Zhaohua!

i talked with Joel about this a little bit, and i have a suspicion that there might be a disconnect between how the autoscaler waits for the gpu nodes to become active (e.g. driver installed) and how it registers the nodes' resources with respect to the limits. we'll have to do some more debugging.

Comment 50 Michael McCune 2022-03-09 14:13:44 UTC
this bug is proving extremely difficult to solve. my current thinking is that there is an interaction happening in the core autoscaler between the way it calculates the maximum resources within the cluster and the way it plans to add new gpu nodes. due to the way the autoscaler must wait for the gpu nodes to have a driver ready before the pending pods will schedule to the node, it may not be aware that gpu nodes are being added when it calculates the maximums. i am continuing to investigate along these lines.
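
For illustration only: a toy Go sketch of the hypothesis above, not autoscaler code. With --gpu-total=nvidia.com/gpu:1:2 the autoscaler caps scale-up by the GPUs it can currently count, but freshly created GPU nodes report 0 GPUs until the driver is installed, so the computed headroom stays open; remainingGPUHeadroom is a hypothetical helper.

```
package main

import "fmt"

// remainingGPUHeadroom computes how many more GPUs the limit allows, based
// only on the GPU capacity each node currently reports.
func remainingGPUHeadroom(maxGPUs int, reportedGPUsPerNode []int) int {
	total := 0
	for _, gpus := range reportedGPUsPerNode {
		total += gpus
	}
	if total >= maxGPUs {
		return 0
	}
	return maxGPUs - total
}

func main() {
	// One ready node with its driver installed (1 GPU) plus two new nodes
	// that have joined but not yet reported any GPU capacity.
	fmt.Println(remainingGPUHeadroom(2, []int{1, 0, 0})) // 1: another scale-up still looks allowed
}
```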

Comment 51 Michael McCune 2022-04-22 13:12:38 UTC
no new progress here, i am still investigating.

Comment 52 Lucas López Montero 2022-05-13 14:04:13 UTC
Thank you for your update and your efforts on this, Michael

Comment 53 Joel Speed 2022-05-26 13:34:46 UTC
Still working on this, Mike got some inspiration last week at the sig-autoscaling meet up at Kubecon, looking to get back to this soon

Comment 60 Michael McCune 2022-10-19 21:58:59 UTC
@zhsun i've created a new PR for the CAO that i'm hoping will alleviate this issue in machinesets where there is at least 1 replica; i'm not 100% positive that it will solve the issue when scaling from/to zero.

i've changed the CAO to add the cluster-autoscaler accelerator label to the `machineset.spec.template.spec.metadata.labels`, which should cause all the machines and nodes made from that machineset to also have the accelerator label. my hope is that this will cause the autoscaler to see that new nodes will have a GPU before the capacity has been set on the node object. this pattern is inspired by actions that our users have taken to alleviate the root cause.

i'd be happy to know if this fixes the bug with respect to scaling from 1, and then in a separate test scaling from 0.

Comment 66 sunzhaohua 2022-11-08 05:51:24 UTC
Tried again, clusterpolicy still wasn't ready, clusterversion 4.12.0-0.ci.test-2022-11-08-004217-ci-ln-slqh742-latest

Checked "cluster-api/accelerator" label can be added to nodes automatically. 
Didn't install NFD and GPU operator, created clusterautoscaler, machineautoscaler and add workload to scale, more nodes than expected are created. Not sure if this test is ok for the bug, or must install nfd and gpu.

$ oc get node --show-labels  | grep accelerator                                   
ip-10-0-210-38.us-east-2.compute.internal    Ready    worker                 89m    v1.25.2+93b33ea   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g4dn.xlarge,beta.kubernetes.io/os=linux,cluster-api/accelerator=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-210-38.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=g4dn.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c


apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 2
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 10m
    delayAfterFailure: 10m
    unneededTime: 10m

$ oc get machineautoscaler                             
NAME                  REF KIND     REF NAME                             MIN   MAX   AGE
machineautoscaler-3   MachineSet   zhsun11811-b7tsc-worker-us-east-2c   1     5     102m

$ oc get machine                                                        
NAME                                       PHASE     TYPE          REGION      ZONE         AGE
zhsun11811-b7tsc-master-0                  Running   m6i.xlarge    us-east-2   us-east-2a   60m
zhsun11811-b7tsc-master-1                  Running   m6i.xlarge    us-east-2   us-east-2b   60m
zhsun11811-b7tsc-master-2                  Running   m6i.xlarge    us-east-2   us-east-2c   60m
zhsun11811-b7tsc-worker-us-east-2a-tjvx9   Running   m6i.xlarge    us-east-2   us-east-2a   57m
zhsun11811-b7tsc-worker-us-east-2b-vhzf9   Running   m6i.xlarge    us-east-2   us-east-2b   57m
zhsun11811-b7tsc-worker-us-east-2c-dxz6l   Running   g4dn.xlarge   us-east-2   us-east-2c   16m
zhsun11811-b7tsc-worker-us-east-2c-fv6z9   Running   g4dn.xlarge   us-east-2   us-east-2c   16m
zhsun11811-b7tsc-worker-us-east-2c-p5jdv   Running   g4dn.xlarge   us-east-2   us-east-2c   16m
zhsun11811-b7tsc-worker-us-east-2c-ppddb   Running   g4dn.xlarge   us-east-2   us-east-2c   16m
zhsun11811-b7tsc-worker-us-east-2c-xjmgx   Running   g4dn.xlarge   us-east-2   us-east-2c   20m

Comment 67 Michael McCune 2022-11-08 14:01:41 UTC
thanks Zhaohua, i'm glad to see that the label is present. i have a feeling that the nfd/gpu operator will be needed at some point as the autoscaler is waiting to see the capacity appear on the node for the gpu device. i have another idea about adding a fix to the autoscaler, i will investigate that as well.

Comment 69 Michael McCune 2022-11-28 20:14:56 UTC
@zhsun i know we are having some issues with the NFD operator currently, so i created a patch for the 4.11 release[0] and i wonder if we could test to see if this solution works on a 4.11 cluster?

my theory here is that this change should work on any version of openshift, and since it does not touch the autoscaler it should be safe to try this change on a previous cluster version (4.11). i mainly want to see if this solution is improving the situation for users. also, when we test we should be sure to use ClusterAutoscaler values that are similar to what users will use in production; this is to help ensure we aren't using values that are too short when waiting for machines to become nodes.

[0] https://github.com/openshift/cluster-autoscaler-operator/pull/257

Comment 70 sunzhaohua 2022-11-29 04:44:10 UTC
Michael, I set up a 4.11 cluster with https://github.com/openshift/cluster-autoscaler-operator/pull/257, and more nodes than expected are still created.
clusterversion: 4.11.0-0.ci.test-2022-11-29-012944-ci-ln-l4jhj3b-latest

These are my steps:
1. Create a machineset with "g4dn.xlarge"
metadata:
  annotations:
    autoscaling.openshift.io/machineautoscaler: openshift-machine-api/machineautoscaler-3
    machine.openshift.io/GPU: "1"
    machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "4"
    machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"

2. Create machineautoscaler
$ oc get machineautoscaler                                                                                                                                                                                                                         
NAME                  REF KIND     REF NAME                               MIN   MAX   AGE
machineautoscaler-3   MachineSet   zhsun1943194-2sh27-worker-us-east-2c   1     4     18m

3. Deploy NFD and the GPU operator
$ oc get pods,daemonset -n nvidia-gpu-operator     
NAME                                                      READY   STATUS      RESTARTS   AGE
pod/gpu-feature-discovery-258v5                           1/1     Running     0          22m
pod/gpu-operator-5dc8fc87c7-9gxpm                         1/1     Running     0          23m
pod/nvidia-container-toolkit-daemonset-dttnk              1/1     Running     0          22m
pod/nvidia-cuda-validator-skzgk                           0/1     Completed   0          17m
pod/nvidia-dcgm-exporter-s9rlk                            1/1     Running     0          22m
pod/nvidia-dcgm-rdq8g                                     1/1     Running     0          22m
pod/nvidia-device-plugin-daemonset-grwzf                  1/1     Running     0          22m
pod/nvidia-device-plugin-validator-dd4n5                  0/1     Completed   0          16m
pod/nvidia-driver-daemonset-411.86.202211232221-0-6st54   2/2     Running     0          22m
pod/nvidia-node-status-exporter-5gmd2                     1/1     Running     0          22m
pod/nvidia-operator-validator-zzm6d                       1/1     Running     0          21m

NAME                                                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                         AGE
daemonset.apps/gpu-feature-discovery                           1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                      22m
daemonset.apps/nvidia-container-toolkit-daemonset              1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true                                                                          22m
daemonset.apps/nvidia-dcgm                                     1         1         1       1            1           nvidia.com/gpu.deploy.dcgm=true                                                                                       22m
daemonset.apps/nvidia-dcgm-exporter                            1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                              22m
daemonset.apps/nvidia-device-plugin-daemonset                  1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true                                                                              22m
daemonset.apps/nvidia-driver-daemonset-411.86.202211232221-0   1         1         1       1            1           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202211232221-0,nvidia.com/gpu.deploy.driver=true   22m
daemonset.apps/nvidia-mig-manager                              0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                                                                22m
daemonset.apps/nvidia-node-status-exporter                     1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true                                                                       22m
daemonset.apps/nvidia-operator-validator                       1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true

4. Create clusterautoscaler
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 2
  scaleDown:
    enabled: true
    delayAfterAdd: 15m
    delayAfterDelete: 15m
    delayAfterFailure: 15m
    unneededTime: 15m

$ oc get deploy cluster-autoscaler-default -o yaml
    spec:
      containers:
      - args:
        - --logtostderr
        - --v=1
        - --cloud-provider=clusterapi
        - --namespace=openshift-machine-api
        - --leader-elect-lease-duration=137s
        - --leader-elect-renew-deadline=107s
        - --leader-elect-retry-period=26s
        - --gpu-total=nvidia.com/gpu:1:2
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=15m
        - --scale-down-delay-after-delete=15m
        - --scale-down-delay-after-failure=15m
        - --scale-down-unneeded-time=15m

5. Add workload to scale up
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-sleep
spec:
  replicas: 10
  selector:
    matchLabels:
      app: gpu-sleep
  template:
    metadata:
      labels:
        app: gpu-sleep
    spec:
      containers:
      - name: gpu-sleep
        image: quay.io/elmiko/busybox
        resources:
          limits:
            nvidia.com/gpu: 1
        command:
          - sleep
          - "3600"

6. check autoscaler log and machines and pods
I1129 04:18:03.639817       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsun1943194-2sh27-worker-us-east-2c
I1129 04:18:03.639840       1 scale_up.go:472] Estimated 9 nodes needed in MachineSet/openshift-machine-api/zhsun1943194-2sh27-worker-us-east-2c
I1129 04:18:03.834810       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsun1943194-2sh27-worker-us-east-2c 1->4 (max: 4)}]
I1129 04:18:03.834841       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsun1943194-2sh27-worker-us-east-2c size to 4

$ oc get machine     
NAME                                         PHASE     TYPE          REGION      ZONE         AGE
zhsun1943194-2sh27-master-0                  Running   m6i.xlarge    us-east-2   us-east-2a   151m
zhsun1943194-2sh27-master-1                  Running   m6i.xlarge    us-east-2   us-east-2b   151m
zhsun1943194-2sh27-master-2                  Running   m6i.xlarge    us-east-2   us-east-2c   151m
zhsun1943194-2sh27-worker-us-east-2a-xm478   Running   m6i.xlarge    us-east-2   us-east-2a   149m
zhsun1943194-2sh27-worker-us-east-2b-phxqr   Running   m6i.xlarge    us-east-2   us-east-2b   149m
zhsun1943194-2sh27-worker-us-east-2c-4vk9h   Running   g4dn.xlarge   us-east-2   us-east-2c   60m
zhsun1943194-2sh27-worker-us-east-2c-mdtzh   Running   g4dn.xlarge   us-east-2   us-east-2c   23m
zhsun1943194-2sh27-worker-us-east-2c-rrcpk   Running   g4dn.xlarge   us-east-2   us-east-2c   23m
zhsun1943194-2sh27-worker-us-east-2c-wcmtm   Running   g4dn.xlarge   us-east-2   us-east-2c   23m

$ oc get node --show-labels  | grep gpu         
ip-10-0-198-222.us-east-2.compute.internal   Ready    worker   19m    v1.24.6+5658434   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g4dn.xlarge,beta.kubernetes.io/os=linux,cluster-api/accelerator=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,feature.node.kubernetes.io/cpu-cpuid.ADX=true,feature.node.kubernetes.io/cpu-cpuid.AESNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX2=true,feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true,feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true,feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true,feature.node.kubernetes.io/cpu-cpuid.AVX512F=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX=true,feature.node.kubernetes.io/cpu-cpuid.FMA3=true,feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true,feature.node.kubernetes.io/cpu-cpuid.MPX=true,feature.node.kubernetes.io/cpu-hardware_multithreading=true,feature.node.kubernetes.io/kernel-config.NO_HZ=true,feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true,feature.node.kubernetes.io/kernel-selinux.enabled=true,feature.node.kubernetes.io/kernel-version.full=4.18.0-372.32.1.el8_6.x86_64,feature.node.kubernetes.io/kernel-version.major=4,feature.node.kubernetes.io/kernel-version.minor=18,feature.node.kubernetes.io/kernel-version.revision=0,feature.node.kubernetes.io/pci-10de.present=true,feature.node.kubernetes.io/pci-1d0f.present=true,feature.node.kubernetes.io/storage-nonrotationaldisk=true,feature.node.kubernetes.io/system-os_release.ID=rhcos,feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION=4.11,feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202211232221-0,feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.6,feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4,feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=11,feature.node.kubernetes.io/system-os_release.VERSION_ID=4.11,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-198-222.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=g4dn.xlarge,node.openshift.io/os_id=rhcos,nvidia.com/cuda.driver.major=515,nvidia.com/cuda.driver.minor=65,nvidia.com/cuda.driver.rev=01,nvidia.com/cuda.runtime.major=11,nvidia.com/cuda.runtime.minor=7,nvidia.com/gfd.timestamp=1669696050,nvidia.com/gpu.compute.major=7,nvidia.com/gpu.compute.minor=5,nvidia.com/gpu.count=1,nvidia.com/gpu.deploy.container-toolkit=true,nvidia.com/gpu.deploy.dcgm-exporter=true,nvidia.com/gpu.deploy.dcgm=true,nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.deploy.gpu-feature-discovery=true,nvidia.com/gpu.deploy.node-status-exporter=true,nvidia.com/gpu.deploy.nvsm=,nvidia.com/gpu.deploy.operator-validator=true,nvidia.com/gpu.family=turing,nvidia.com/gpu.machine=g4dn.xlarge,nvidia.com/gpu.memory=15360,nvidia.com/gpu.present=true,nvidia.com/gpu.product=Tesla-T4,nvidia.com/gpu.replicas=1,nvidia.com/mig.capable=false,nvidia.com/mig.strategy=single,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
ip-10-0-206-111.us-east-2.compute.internal   Ready    worker   56m    v1.24.6+5658434   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g4dn.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,feature.node.kubernetes.io/cpu-cpuid.ADX=true,feature.node.kubernetes.io/cpu-cpuid.AESNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX2=true,feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true,feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true,feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true,feature.node.kubernetes.io/cpu-cpuid.AVX512F=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX=true,feature.node.kubernetes.io/cpu-cpuid.FMA3=true,feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true,feature.node.kubernetes.io/cpu-cpuid.MPX=true,feature.node.kubernetes.io/cpu-hardware_multithreading=true,feature.node.kubernetes.io/kernel-config.NO_HZ=true,feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true,feature.node.kubernetes.io/kernel-selinux.enabled=true,feature.node.kubernetes.io/kernel-version.full=4.18.0-372.32.1.el8_6.x86_64,feature.node.kubernetes.io/kernel-version.major=4,feature.node.kubernetes.io/kernel-version.minor=18,feature.node.kubernetes.io/kernel-version.revision=0,feature.node.kubernetes.io/pci-10de.present=true,feature.node.kubernetes.io/pci-1d0f.present=true,feature.node.kubernetes.io/storage-nonrotationaldisk=true,feature.node.kubernetes.io/system-os_release.ID=rhcos,feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION=4.11,feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202211232221-0,feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.6,feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4,feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=11,feature.node.kubernetes.io/system-os_release.VERSION_ID=4.11,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-206-111.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=g4dn.xlarge,node.openshift.io/os_id=rhcos,nvidia.com/cuda.driver.major=515,nvidia.com/cuda.driver.minor=65,nvidia.com/cuda.driver.rev=01,nvidia.com/cuda.runtime.major=11,nvidia.com/cuda.runtime.minor=7,nvidia.com/gfd.timestamp=1669694211,nvidia.com/gpu.compute.major=7,nvidia.com/gpu.compute.minor=5,nvidia.com/gpu.count=1,nvidia.com/gpu.deploy.container-toolkit=true,nvidia.com/gpu.deploy.dcgm-exporter=true,nvidia.com/gpu.deploy.dcgm=true,nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.deploy.gpu-feature-discovery=true,nvidia.com/gpu.deploy.node-status-exporter=true,nvidia.com/gpu.deploy.nvsm=,nvidia.com/gpu.deploy.operator-validator=true,nvidia.com/gpu.family=turing,nvidia.com/gpu.machine=g4dn.xlarge,nvidia.com/gpu.memory=15360,nvidia.com/gpu.present=true,nvidia.com/gpu.product=Tesla-T4,nvidia.com/gpu.replicas=1,nvidia.com/mig.capable=false,nvidia.com/mig.strategy=single,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
ip-10-0-217-193.us-east-2.compute.internal   Ready    worker   19m    v1.24.6+5658434   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g4dn.xlarge,beta.kubernetes.io/os=linux,cluster-api/accelerator=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,feature.node.kubernetes.io/cpu-cpuid.ADX=true,feature.node.kubernetes.io/cpu-cpuid.AESNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX2=true,feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true,feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true,feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true,feature.node.kubernetes.io/cpu-cpuid.AVX512F=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX=true,feature.node.kubernetes.io/cpu-cpuid.FMA3=true,feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true,feature.node.kubernetes.io/cpu-cpuid.MPX=true,feature.node.kubernetes.io/cpu-hardware_multithreading=true,feature.node.kubernetes.io/kernel-config.NO_HZ=true,feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true,feature.node.kubernetes.io/kernel-selinux.enabled=true,feature.node.kubernetes.io/kernel-version.full=4.18.0-372.32.1.el8_6.x86_64,feature.node.kubernetes.io/kernel-version.major=4,feature.node.kubernetes.io/kernel-version.minor=18,feature.node.kubernetes.io/kernel-version.revision=0,feature.node.kubernetes.io/pci-10de.present=true,feature.node.kubernetes.io/pci-1d0f.present=true,feature.node.kubernetes.io/storage-nonrotationaldisk=true,feature.node.kubernetes.io/system-os_release.ID=rhcos,feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION=4.11,feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202211232221-0,feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.6,feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4,feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=11,feature.node.kubernetes.io/system-os_release.VERSION_ID=4.11,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-217-193.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=g4dn.xlarge,node.openshift.io/os_id=rhcos,nvidia.com/cuda.driver.major=515,nvidia.com/cuda.driver.minor=65,nvidia.com/cuda.driver.rev=01,nvidia.com/cuda.runtime.major=11,nvidia.com/cuda.runtime.minor=7,nvidia.com/gfd.timestamp=1669696062,nvidia.com/gpu.compute.major=7,nvidia.com/gpu.compute.minor=5,nvidia.com/gpu.count=1,nvidia.com/gpu.deploy.container-toolkit=true,nvidia.com/gpu.deploy.dcgm-exporter=true,nvidia.com/gpu.deploy.dcgm=true,nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.deploy.gpu-feature-discovery=true,nvidia.com/gpu.deploy.node-status-exporter=true,nvidia.com/gpu.deploy.nvsm=,nvidia.com/gpu.deploy.operator-validator=true,nvidia.com/gpu.family=turing,nvidia.com/gpu.machine=g4dn.xlarge,nvidia.com/gpu.memory=15360,nvidia.com/gpu.present=true,nvidia.com/gpu.product=Tesla-T4,nvidia.com/gpu.replicas=1,nvidia.com/mig.capable=false,nvidia.com/mig.strategy=single,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
ip-10-0-218-151.us-east-2.compute.internal   Ready    worker   19m    v1.24.6+5658434   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g4dn.xlarge,beta.kubernetes.io/os=linux,cluster-api/accelerator=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,feature.node.kubernetes.io/cpu-cpuid.ADX=true,feature.node.kubernetes.io/cpu-cpuid.AESNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX2=true,feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true,feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true,feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true,feature.node.kubernetes.io/cpu-cpuid.AVX512F=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX=true,feature.node.kubernetes.io/cpu-cpuid.FMA3=true,feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true,feature.node.kubernetes.io/cpu-cpuid.MPX=true,feature.node.kubernetes.io/cpu-hardware_multithreading=true,feature.node.kubernetes.io/kernel-config.NO_HZ=true,feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true,feature.node.kubernetes.io/kernel-selinux.enabled=true,feature.node.kubernetes.io/kernel-version.full=4.18.0-372.32.1.el8_6.x86_64,feature.node.kubernetes.io/kernel-version.major=4,feature.node.kubernetes.io/kernel-version.minor=18,feature.node.kubernetes.io/kernel-version.revision=0,feature.node.kubernetes.io/pci-10de.present=true,feature.node.kubernetes.io/pci-1d0f.present=true,feature.node.kubernetes.io/storage-nonrotationaldisk=true,feature.node.kubernetes.io/system-os_release.ID=rhcos,feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION=4.11,feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202211232221-0,feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.6,feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4,feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=11,feature.node.kubernetes.io/system-os_release.VERSION_ID=4.11,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-218-151.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=g4dn.xlarge,node.openshift.io/os_id=rhcos,nvidia.com/cuda.driver.major=515,nvidia.com/cuda.driver.minor=65,nvidia.com/cuda.driver.rev=01,nvidia.com/cuda.runtime.major=11,nvidia.com/cuda.runtime.minor=7,nvidia.com/gfd.timestamp=1669696035,nvidia.com/gpu.compute.major=7,nvidia.com/gpu.compute.minor=5,nvidia.com/gpu.count=1,nvidia.com/gpu.deploy.container-toolkit=true,nvidia.com/gpu.deploy.dcgm-exporter=true,nvidia.com/gpu.deploy.dcgm=true,nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.deploy.gpu-feature-discovery=true,nvidia.com/gpu.deploy.node-status-exporter=true,nvidia.com/gpu.deploy.nvsm=,nvidia.com/gpu.deploy.operator-validator=true,nvidia.com/gpu.family=turing,nvidia.com/gpu.machine=g4dn.xlarge,nvidia.com/gpu.memory=15360,nvidia.com/gpu.present=true,nvidia.com/gpu.product=Tesla-T4,nvidia.com/gpu.replicas=1,nvidia.com/mig.capable=false,nvidia.com/mig.strategy=single,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c

$ oc get po | grep Running                   
gpu-sleep-5f6b7684d9-c8v42                     1/1     Running   0          26m
gpu-sleep-5f6b7684d9-dszvf                     1/1     Running   0          26m
gpu-sleep-5f6b7684d9-rrvsq                     1/1     Running   0          26m
gpu-sleep-5f6b7684d9-v9drk                     1/1     Running   0          26m

Comment 71 sunzhaohua 2022-11-29 13:11:48 UTC
must-gather: https://drive.google.com/file/d/1I8EpH39TCaKSJlgxUNJvp61YoDcDT77d/view?usp=sharing

Michael, for the min/max issue, I noticed that for cpu and memory we have functions[0] to calculate Cores and Memory. I wonder if we need such methods for gpu too; I am not sure if this is the cause of this issue.

[0]https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/core/scale_up.go#L110
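
For context on the min/max limits discussed above, the sketch below shows where the cores, memory, and gpu limits sit in the ClusterAutoscaler resource; the values are illustrative placeholders, not taken from this cluster:

```
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    maxNodesTotal: 24
    cores:
      min: 8
      max: 128
    memory:
      min: 4
      max: 256
    gpus:                      # the limits at issue in this bug
      - type: nvidia.com/gpu
        min: 1
        max: 2
```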

Comment 72 Michael McCune 2022-11-29 15:42:18 UTC
thanks Zhaohua, this bug is really challenging!

re: min/max cpu and memory, the custom resources are handled a little differently and i think this function is being used[0].

i'm curious if you add the `cluster-api/accelerated` label to the machineset.spec.template.spec.metadata.labels would that change the results of this test?

i have another idea about how this might be misreported internally in the autoscaler; i will make a patch to test it and report back here once i have it ready.


[0] https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/core/scale_up.go#L142

Comment 73 Michael McCune 2022-11-29 16:37:51 UTC
> i'm curious if you add the `cluster-api/accelerated` label to the machineset.spec.template.spec.metadata.labels would that change the results of this test?

i just want to expand on this a little. i'm curious to add this label on the machineset because i want to see the nodes getting the label from the beginning of the cluster creation/scaling.

when i look at the must-gather, i see the node ip-10-0-193-180.us-east-2.compute.internal, which appears to be in the machineset containing the gpu nodes, but it has no labels from the NFD nor does it have the accelerated label. i'm curious why this node didn't get marked up by the NFD; i also note that there is no machine associated with this node. i'm wondering if this is causing some issue with the way the autoscaler predicts the capacity of the other nodes in the node group.

```
---
apiVersion: v1
kind: Node
metadata:
  annotations:
    cloud.network.openshift.io/egress-ipconfig: '[{"interface":"eni-0fd59257409d99a7d","ifaddr":{"ipv4":"10.0.192.0/19"},"capacity":{"ipv4":14,"ipv6":15}}]'
    csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-0638b4a343fdb5544"}'
    machine.openshift.io/machine: openshift-machine-api/zhsun1943194-2sh27-master-2
    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
    machineconfiguration.openshift.io/currentConfig: rendered-master-92f44cabc65149f5b1445e43bfc836fa
    machineconfiguration.openshift.io/desiredConfig: rendered-master-92f44cabc65149f5b1445e43bfc836fa
    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-master-92f44cabc65149f5b1445e43bfc836fa
    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-master-92f44cabc65149f5b1445e43bfc836fa
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/state: Done
    nfd.node.kubernetes.io/master.version: undefined
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2022-11-29T02:09:35Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m6i.xlarge
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: us-east-2
    failure-domain.beta.kubernetes.io/zone: us-east-2c
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-10-0-193-180.us-east-2.compute.internal
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
    node.kubernetes.io/instance-type: m6i.xlarge
    node.openshift.io/os_id: rhcos
    topology.ebs.csi.aws.com/zone: us-east-2c
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2c
  name: ip-10-0-193-180.us-east-2.compute.internal
  resourceVersion: "233949"
  uid: 3accde6a-3dd0-4758-af59-21ea88ba3bfa
spec:
  providerID: aws:///us-east-2c/i-0638b4a343fdb5544
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
status:
  addresses:
  - address: 10.0.193.180
    type: InternalIP
  - address: ip-10-0-193-180.us-east-2.compute.internal
    type: Hostname
  - address: ip-10-0-193-180.us-east-2.compute.internal
    type: InternalDNS
  allocatable:
    attachable-volumes-aws-ebs: "39"
    cpu: 3500m
    ephemeral-storage: "114396791822"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 14978424Ki
    pods: "250"
  capacity:
    attachable-volumes-aws-ebs: "39"
    cpu: "4"
    ephemeral-storage: 125293548Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16129400Ki
    pods: "250"
  conditions:
  - lastHeartbeatTime: "2022-11-29T10:09:13Z"
    lastTransitionTime: "2022-11-29T02:09:35Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2022-11-29T10:09:13Z"
    lastTransitionTime: "2022-11-29T02:09:35Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2022-11-29T10:09:13Z"
    lastTransitionTime: "2022-11-29T02:09:35Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2022-11-29T10:09:13Z"
    lastTransitionTime: "2022-11-29T02:10:36Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  nodeInfo:
    architecture: amd64
    bootID: 0c6287da-d00a-4501-a114-0f53064e9808
    containerRuntimeVersion: cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
    kernelVersion: 4.18.0-372.32.1.el8_6.x86_64
    kubeProxyVersion: v1.24.6+5658434
    kubeletVersion: v1.24.6+5658434
    machineID: ec213c2957969c8f9d0824db59c089d7
    operatingSystem: linux
    osImage: Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)
    systemUUID: ec213c29-5796-9c8f-9d08-24db59c089d7
```

Comment 74 sunzhaohua 2022-11-30 06:41:45 UTC
Michael, I manually added the `cluster-api/accelerated` label to machineset.spec.template.spec.metadata.labels; the result is the same as before.

must-gather: https://drive.google.com/file/d/1qQbTrThYSAozLqofag1N9ZewQlrcUWOx/view?usp=sharing

Comment 75 Michael McCune 2022-11-30 17:02:46 UTC
thanks Zhaohua, i appreciate all the extra testing =)

are you creating a new machineset for the gpu nodes or are you modifying one of the original machinesets?

the reason i ask is that i am wondering whether the autoscaler might be confused if we have a node from the original machineset which does not have gpu capability and the machineset later adds gpu capability; in that case it would see a node with no gpu capacity and no accelerated label. from the must-gather in comment 71 it looked like there was an old node in the 2c machineset which was created without a gpu, and this made me curious.

Comment 76 Michael McCune 2022-11-30 17:19:56 UTC
looking through the must-gather from comment 74, i can see that there is a node which belongs to machineset 2c but has no corresponding machine: ip-10-0-204-161.us-east-2.compute.internal

this node does not have gpu capacity and was created before the other nodes from machineset 2c; the other nodes do have gpu capacity

node / creation time
ip-10-0-204-161.us-east-2.compute.internal 2022-11-30T01:44:45Z
ip-10-0-218-222.us-east-2.compute.internal 2022-11-30T02:55:59Z
ip-10-0-215-109.us-east-2.compute.internal 2022-11-30T03:24:57Z
ip-10-0-194-174.us-east-2.compute.internal 2022-11-30T03:25:08Z
ip-10-0-221-5.us-east-2.compute.internal   2022-11-30T03:25:07Z

because the autoscaler looks at node objects within a node group (machineset in our case), i have a feeling it is seeing this node with no gpu capacity and it causes the autoscaler to become confused about what that node group produces.

we might need to test this with a fresh machineset so that there are no nodes from the old machines still in the system, or we need to make sure that none of the old node objects are still in the api server when we run the gpu test.

Comment 77 Michael McCune 2022-11-30 17:34:53 UTC
sorry Zhaohua, i missed that the node in question is part of the control plane

Comment 78 Michael McCune 2022-11-30 17:59:32 UTC
just to be clear, i don't think we need to retest this right now. we are going to prioritize this bug in our next sprint, so i will get set up to run tests locally.

Comment 79 sunzhaohua 2022-12-01 01:52:22 UTC
Thanks Michael. Yes, for the must-gather from comment 74 I was modifying one of the original machinesets: I updated the instanceType to g4dn.xlarge and then deleted the original machine.
After that I also tried creating a new machineset with instanceType g4dn.xlarge; the result is the same.

Before adding workload:
$ oc get machine                                                                                                            
NAME                                        PHASE     TYPE          REGION      ZONE         AGE
zhsun30-p92x6-master-0                      Running   m6i.xlarge    us-east-2   us-east-2a   4h6m
zhsun30-p92x6-master-1                      Running   m6i.xlarge    us-east-2   us-east-2b   4h6m
zhsun30-p92x6-master-2                      Running   m6i.xlarge    us-east-2   us-east-2c   4h6m
zhsun30-p92x6-worker-us-east-2a-tb6t9       Running   m6i.xlarge    us-east-2   us-east-2a   4h4m
zhsun30-p92x6-worker-us-east-2b-cb8ll       Running   m6i.xlarge    us-east-2   us-east-2b   4h4m
zhsun30-p92x6-worker-us-east-2b-gcp-vvqnn   Running   g4dn.xlarge   us-east-2   us-east-2b   40m
After adding workload:
$ oc get machine                                                                                                         
NAME                                        PHASE     TYPE          REGION      ZONE         AGE
zhsun30-p92x6-master-0                      Running   m6i.xlarge    us-east-2   us-east-2a   4h41m
zhsun30-p92x6-master-1                      Running   m6i.xlarge    us-east-2   us-east-2b   4h41m
zhsun30-p92x6-master-2                      Running   m6i.xlarge    us-east-2   us-east-2c   4h41m
zhsun30-p92x6-worker-us-east-2a-tb6t9       Running   m6i.xlarge    us-east-2   us-east-2a   4h38m
zhsun30-p92x6-worker-us-east-2b-cb8ll       Running   m6i.xlarge    us-east-2   us-east-2b   4h38m
zhsun30-p92x6-worker-us-east-2b-gcp-6wbvc   Running   g4dn.xlarge   us-east-2   us-east-2b   33m
zhsun30-p92x6-worker-us-east-2b-gcp-84v22   Running   g4dn.xlarge   us-east-2   us-east-2b   33m
zhsun30-p92x6-worker-us-east-2b-gcp-hpzzs   Running   g4dn.xlarge   us-east-2   us-east-2b   33m
zhsun30-p92x6-worker-us-east-2b-gcp-vvqnn   Running   g4dn.xlarge   us-east-2   us-east-2b   75m

Comment 80 Michael McCune 2022-12-01 14:13:13 UTC
thanks again Zhaohua, i appreciate your thoroughness. i'm planning to take a deeper look in the next sprint.

Comment 82 Michael McCune 2023-01-09 22:12:02 UTC
making a note here about my latest tests.

based on some of the feedback from Zhaohua, and an internal examination of the code, i am investigating how we propagate the GPU types internally in the autoscaler using this interface function[0]. in the core scaling routines, the autoscaler will attempt to match the requested resource with the available resources on the projected node. this is done as a string comparison against the name of the resource (eg "nvidia.com/gpu"), with a fallback to the value of "generic". given that we do not implement the interface function as described in the reference link, i am investigating whether the core autoscaler is not becoming aware of the specific GPU type until after it has already created too many nodes.

i will continue to investigate this, but it appears we are still having issues finding the root of this problem.


[0] https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_provider.go#L116

Comment 83 Michael McCune 2023-01-18 20:32:37 UTC
leaving an update: i am making some progress on this and found a section in the autoscaler code that observes the value of the "cluster-api/accelerator" label on the node objects. i think we will need to make sure that nodes are labeled with this key, but the label value must also contain the type of gpu being added by the node. so, for us this would look like "cluster-api/accelerator: nvidia.com/gpu".

i am working to confirm this and create a patch for the cluster-autoscaler-operator that will handle adding the label automatically. i will update when i have more information.

Comment 84 Michael McCune 2023-01-18 21:29:21 UTC
it appears that the value "nvidia.com/gpu" is not a valid label value, but afaict the autoscaler thinks it can match the label to the resource type. i'm going to need to dig deeper on this.

Comment 85 Michael McCune 2023-01-18 21:53:08 UTC
ok, i think i've found the root cause. it looks like the autoscaler will attempt to match the values from the command line flag for the limits (eg `--gpu-total`) against the value of the label `cluster-api/accelerator` when doing a scale up. during the scale up it will check the value of the label against the value specified in the command line flag to determine whether it can create the new resources.
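
For reference, the corrected example in comment 86 below shows that the flag has the general shape:

```
--gpu-total=<gpu_type>:<min>:<max>
```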

what is happening currently is that we specify the resources in our ClusterAutoscaler resource like this:

```
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 1
        max: 2
```

which becomes a command line flag to the autoscaler that looks like:

```
--gpu-total=nvidia.com/gpu
```

which means that we would need a label that looks like:

```
cluster-api/accelerator: nvidia.com/gpu
```

and herein lies the problem: that is not a valid label value.

to solve this problem, i changed the ClusterAutoscaler resource to look like this:

```
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com
        min: 1
        max: 2
```

and then changed the label to look like this:


```
cluster-api/accelerator: nvidia.com
```

which allowed the autoscaler to match the values properly; it observed the limits and did not create extra nodes.
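
As a sketch of the machineset side of this workaround (the machineset name below is a placeholder, and the field path is the one referenced in comment 72), the label lives under the template metadata so that new nodes come up with it, and its value has to line up with the resourceLimits.gpus type in the ClusterAutoscaler:

```
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <gpu-machineset>            # placeholder
  namespace: openshift-machine-api
spec:
  template:
    spec:
      metadata:
        labels:
          # must match the ClusterAutoscaler resourceLimits.gpus[].type value
          cluster-api/accelerator: nvidia.com
```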

so, what does this all mean for a solution? i have a couple of ideas:

1. create a KCS article to discuss this solution (or update if one exists)
2. change the cluster-autoscaler-operator to use the resourceLimits.gpus type values to create the labels for the machinesets. this will also involve some validation of those values.

Comment 86 Michael McCune 2023-01-18 22:01:47 UTC
small correction:

---

which becomes a command line flag to the autoscaler that looks like:

```
--gpu-total=nvidia.com/gpu:1:2
```

Comment 87 Michael McCune 2023-01-19 15:30:38 UTC
i have talked with the team about this a little further and we have an idea for an automated solution. we will start by writing the KCS article to inform our users about how to fix this when they encounter it, and also update our documentation to better reflect how the values in the ClusterAutoscaler resource should be used.

second, we will need to add some logic to the way the cluster-autoscaler-operator handles its API: most likely add a field to the MachineAutoscaler resource and add some validations to the field inputs on both the ClusterAutoscaler and MachineAutoscaler resources. in effect, the user will need to specify the limits for GPUs in the ClusterAutoscaler resource and also specify the associated resource in the MachineAutoscaler resource.

i will get started writing the KCS article soon, the code change will take us a little longer to design and plan.

Comment 88 Michael McCune 2023-01-20 15:44:36 UTC
i have proposed an update to the KCS article, https://access.redhat.com/solutions/6055181

i am also working with the team on a design to fix this; since it will require a change to our API it will take some time to implement, but i will update here when i have a better idea of scheduling.

Comment 89 Michael McCune 2023-01-26 17:10:35 UTC
quick update about progress on this bug,

i am closing some of the PRs i had open as we have settled on a solution to this. i am going to create a patch for the CAO that will validate the .spec.resourceLimits.gpus field to ensure that the values fall within the guidelines for the accelerator labels. i am also going to add documentation that describes how the user must modify their MachineSet resource to ensure that the GPU limits are respected.

we have chosen to go in this direction, as opposed to having the CAO make all the changes, because there is a necessary change that must happen to the MachineSet and this could be unexpected for some users. rather than hiding the automation of the accelerator label, we will make it explicit to make this process clearer to end users.

i will also talk with the team about adding validations to the .spec.template.spec.metadata.labels field of the MachineSet.
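
As a rough illustration of the manual MachineSet change being described here (not the operator automation; the machineset name is a placeholder), a user could patch the template labels directly; machines created after the patch should come up with the label, while existing nodes would need to be replaced or labeled separately:

```
$ oc -n openshift-machine-api patch machineset <gpu-machineset> --type merge \
    -p '{"spec":{"template":{"spec":{"metadata":{"labels":{"cluster-api/accelerator":"nvidia.com"}}}}}}'
```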

Comment 104 sunzhaohua 2023-03-03 05:04:49 UTC
I tested it locally, following the same steps as in comment#94; all works as expected. I will move this to Verified. Thanks Michael and Milind.

I0303 04:58:16.289194       1 scale_up.go:282] Best option to resize: MachineSet/openshift-machine-api/mihuang-hyci1010-sf9sl-worker-us-east-2c
I0303 04:58:16.289299       1 scale_up.go:286] Estimated 9 nodes needed in MachineSet/openshift-machine-api/mihuang-hyci1010-sf9sl-worker-us-east-2c
I0303 04:58:16.289356       1 resource_manager.go:171] Capping scale-up size due to limit for resource nvidia.com
I0303 04:58:16.289377       1 scale_up.go:405] Final scale-up plan: [{MachineSet/openshift-machine-api/mihuang-hyci1010-sf9sl-worker-us-east-2c 1->2 (max: 5)}]
I0303 04:58:16.289412       1 scale_up.go:608] Scale-up: setting group MachineSet/openshift-machine-api/mihuang-hyci1010-sf9sl-worker-us-east-2c size to 2

$ oc get machine                                                                                               
NAME                                             PHASE     TYPE          REGION      ZONE         AGE
mihuang-hyci1010-sf9sl-master-0                  Running   m6i.xlarge    us-east-2   us-east-2a   161m
mihuang-hyci1010-sf9sl-master-1                  Running   m6i.xlarge    us-east-2   us-east-2b   161m
mihuang-hyci1010-sf9sl-master-2                  Running   m6i.xlarge    us-east-2   us-east-2c   161m
mihuang-hyci1010-sf9sl-worker-us-east-2a-82h84   Running   m6i.xlarge    us-east-2   us-east-2a   156m
mihuang-hyci1010-sf9sl-worker-us-east-2b-xd2dk   Running   m6i.xlarge    us-east-2   us-east-2b   156m
mihuang-hyci1010-sf9sl-worker-us-east-2c-9m5jq   Running   g4dn.xlarge   us-east-2   us-east-2c   3m27s
mihuang-hyci1010-sf9sl-worker-us-east-2c-lvrt4   Running   g4dn.xlarge   us-east-2   us-east-2c   36m

Comment 107 errata-xmlrpc 2023-05-17 22:46:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.13.0 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:1326

