Bug 2017680 - [gcp] Couldn’t enable support for instances with GPUs on GCP
Summary: [gcp] Couldn’t enable support for instances with GPUs on GCP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Samuel Stuchly
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-10-27 08:25 UTC by sunzhaohua
Modified: 2022-03-10 16:22 UTC (History)
0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:22:12 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift api pull 1044 | 0 | None | open | update types_gcprovider.go | 2021-10-29 10:49:28 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:22:25 UTC

Description sunzhaohua 2021-10-27 08:25:29 UTC
Description of problem: 
GPU support cannot be enabled for instances on GCP: guestAccelerators set in a machineset do not appear on the machines it creates.


Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-10-25-190146

How reproducible:
always

Steps to Reproduce:
1. Create a new machineset with guestAccelerators set in the providerSpec:
      providerSpec:
        value:
          apiVersion: gcpprovider.openshift.io/v1beta1
...
          guestAccelerators:
          - acceleratorCount: 1
            acceleratorType: nvidia-tesla-p100
          kind: GCPMachineProviderSpec
          machineType: n1-standard-1

2. Check the newly created machines.

Actual results:
The guestAccelerators field is ignored and does not appear in the machine YAML:

  providerSpec:
    value:
      apiVersion: gcpprovider.openshift.io/v1beta1
      canIPForward: false
      credentialsSecret:
        name: gcp-cloud-credentials
      deletionProtection: false
      disks:
      - autoDelete: true
        boot: true
        image: projects/rhcos-cloud/global/images/rhcos-410-84-202110140201-0-gcp-x86-64
        labels: null
        sizeGb: 128
        type: pd-ssd
      kind: GCPMachineProviderSpec
      machineType: n1-standard-1
      metadata:
        creationTimestamp: null
      networkInterfaces:
      - network: wewang-gcp10-r5h4b-network
        subnetwork: wewang-gcp10-r5h4b-worker-subnet
      projectID: openshift-qe
      region: us-central1
      serviceAccounts:
      - email: wewang-gcp10-r5h4b-w.gserviceaccount.com
        scopes:
        - https://www.googleapis.com/auth/cloud-platform
      tags:
      - wewang-gcp10-r5h4b-worker
      userDataSecret:
        name: worker-user-data
      zone: us-central1-c
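
One plausible reading of the output above, in line with comment 2's pointer at the openshift/api types, is that the provider spec is decoded into a Go struct that has no field for guestAccelerators, so the key is silently dropped when the object is written back. The sketch below only illustrates that mechanism with encoding/json; `providerSpec` and `roundTrip` are hypothetical names, not the real GCPMachineProviderSpec or any machine-api code, and JSON is used in place of YAML to stay within the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// providerSpec is a hypothetical, trimmed-down stand-in for the provider
// spec type. It deliberately has no field for guestAccelerators.
type providerSpec struct {
	Kind        string `json:"kind"`
	MachineType string `json:"machineType"`
}

// roundTrip decodes raw JSON into providerSpec and encodes it again.
// encoding/json ignores unknown keys by default, so anything the struct
// does not declare is lost on the way back out.
func roundTrip(raw string) (string, error) {
	var spec providerSpec
	if err := json.Unmarshal([]byte(raw), &spec); err != nil {
		return "", err
	}
	out, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	raw := `{"kind":"GCPMachineProviderSpec","machineType":"n1-standard-1",` +
		`"guestAccelerators":[{"acceleratorCount":1,"acceleratorType":"nvidia-tesla-p100"}]}`
	out, err := roundTrip(raw)
	if err != nil {
		panic(err)
	}
	// prints {"kind":"GCPMachineProviderSpec","machineType":"n1-standard-1"}
	fmt.Println(out)
}
```

Whether this is what actually happened in the affected build is not confirmed by the report; it is only consistent with the symptom.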


Expected results:
Could create instances with GPU successfully

Additional info:
https://issues.redhat.com/browse/OCPCLOUD-812

Comment 1 Joel Speed 2021-10-27 09:25:14 UTC
@

Comment 2 Joel Speed 2021-10-27 09:26:31 UTC
@Sam, Please make sure that the accelerated network fields have been copied over to the openshift/api repo as part of the migration and that the MAO repo has the latest copy of the api dependency. If you have issues, please speak to Alex who has been working on this migration.

Comment 7 sunzhaohua 2021-12-20 05:59:40 UTC
Tested with nightly build 4.10.0-0.nightly-2021-12-18-034942; everything works as expected. Moving to VERIFIED.

$ oc get machine
NAME                               PHASE      TYPE            REGION        ZONE            AGE
zhsungcp201-r79l8-master-0         Running    n1-standard-4   us-central1   us-central1-a   169m
zhsungcp201-r79l8-master-1         Running    n1-standard-4   us-central1   us-central1-b   169m
zhsungcp201-r79l8-master-2         Running    n1-standard-4   us-central1   us-central1-c   169m
zhsungcp201-r79l8-worker-a-9knlf   Running    n1-standard-4   us-central1   us-central1-a   165m
zhsungcp201-r79l8-worker-b-xsflz   Running    n1-standard-4   us-central1   us-central1-b   165m
zhsungcp201-r79l8-worker-c-vcw54   Deleting   n1-standard-1   us-central1   us-central1-c   124m

$ oc edit machineset zhsungcp201-r79l8-worker-c
          gpus:
          - count: 1
            type: nvidia-tesla-p100
          kind: GCPMachineProviderSpec
          machineType: n1-standard-1
          onHostMaintenance: Terminate
          restartPolicy: Always
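
As a rough illustration of the field names used in the verified machineset above, the following sketch decodes a `gpus` list and checks it alongside `onHostMaintenance`. The struct and function names (`gpuConfig`, `gpuProviderSpec`, `checkGPUSpec`) are hypothetical and the checks are assumptions made for the example, not the validation performed by the machine API; the only outside fact relied on is that GCP does not live-migrate GPU instances, which is why the maintenance policy is set to Terminate. JSON stands in for YAML to keep the example within the standard library.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// gpuConfig mirrors the two keys shown under gpus in the comment above.
// The type itself is hypothetical.
type gpuConfig struct {
	Count int32  `json:"count"`
	Type  string `json:"type"`
}

// gpuProviderSpec is a hypothetical subset of the provider spec.
type gpuProviderSpec struct {
	MachineType       string      `json:"machineType"`
	GPUs              []gpuConfig `json:"gpus,omitempty"`
	OnHostMaintenance string      `json:"onHostMaintenance,omitempty"`
}

// checkGPUSpec decodes the spec and applies illustrative sanity checks.
func checkGPUSpec(raw string) (gpuProviderSpec, error) {
	var spec gpuProviderSpec
	if err := json.Unmarshal([]byte(raw), &spec); err != nil {
		return spec, err
	}
	for _, g := range spec.GPUs {
		if g.Count < 1 || g.Type == "" {
			return spec, errors.New("each gpus entry needs a positive count and a type")
		}
	}
	// Assumed check: GPU instances need a Terminate maintenance policy.
	if len(spec.GPUs) > 0 && spec.OnHostMaintenance != "Terminate" {
		return spec, errors.New("gpus set but onHostMaintenance is not Terminate")
	}
	return spec, nil
}

func main() {
	raw := `{"machineType":"n1-standard-1",` +
		`"gpus":[{"count":1,"type":"nvidia-tesla-p100"}],` +
		`"onHostMaintenance":"Terminate","restartPolicy":"Always"}`
	spec, err := checkGPUSpec(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d GPU entry, type %s\n", len(spec.GPUs), spec.GPUs[0].Type)
}
```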

Comment 10 errata-xmlrpc 2022-03-10 16:22:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

