Description of problem:
When the CPU Manager is enabled (https://docs.openshift.com/container-platform/4.6/scalability_and_performance/using-cpu-manager.html) and the NVIDIA gpu-operator is installed (tested with versions 1.3.1 and 1.4.0, https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html), the 'nvidia-driver-validation' pod always fails with the logs below:
---
# oc logs nvidia-driver-validation
Failed to allocate device vector A (error code no CUDA-capable device is detected)!
[Vector addition of 50000 elements]
---
Disabling the CPU Manager results in success and we can proceed.

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Easy

Steps to Reproduce:
1. Enable the CPU Manager on the node and install the gpu-operator.
2. Watch the nvidia-driver-validation pod logs.
3.

Actual results:
The nvidia-driver-validation pod fails to start.

Expected results:
The gpu-operator should install without any error.

Additional info:
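For reference, the reproduction relies on enabling the CPU Manager static policy through a KubeletConfig, as described in the linked documentation. A minimal sketch of that kind of KubeletConfig (the CR name and the custom-kubelet label value are illustrative, not necessarily the ones used on this cluster):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static           # pins whole CPUs to Guaranteed-QoS pods
    cpuManagerReconcilePeriod: 5s      # how often the kubelet reconciles CPU assignments

The matching label has to be applied to the worker MachineConfigPool (e.g. oc label machineconfigpool worker custom-kubelet=cpumanager-enabled) before the kubelet picks up the new config.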
Sending this over to the node team, which is responsible for this piece of the stack.
The last PR required to fix this issue has been posted on the GPU Operator repository: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/182
The PR has been merged into the GPU Operator master branch: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/182
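For context, the change matches the workaround documented for running the NVIDIA device plugin alongside the kubelet's static CPU Manager policy: the plugin passes the GPU device specs to the kubelet (--pass-device-specs=true); otherwise the kubelet's periodic container updates for CPU pinning can drop the containers' access to the GPU device nodes, which is what made the validation pod fail with "no CUDA-capable device is detected". On clusters still running a pre-fix operator, a rough sketch of applying the same argument by hand (assuming container index 0 is the device plugin container and that the operator does not immediately reconcile the DaemonSet back; the supported route is upgrading to an operator release that carries the fix):

oc -n gpu-operator-resources patch ds/nvidia-device-plugin-daemonset --type=json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--pass-device-specs=true"}]'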
Verified that the proposed workarounds are now incorporated on OCP 4.7.0 by deploying NVIDIA GPU Operator version 1.6.0 from OperatorHub with CPU Manager enabled on the GPU node:

# oc get pods -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-t4t5m                1/1     Running     5          37h
nvidia-container-toolkit-daemonset-j64c8   1/1     Running     0          37h
nvidia-dcgm-exporter-5q7f8                 1/1     Running     0          37h
nvidia-device-plugin-daemonset-97mkc       1/1     Running     0          37h
nvidia-device-plugin-validation            0/1     Completed   0          36h
nvidia-driver-daemonset-zkkg7              1/1     Running     0          37h

1. ds/nvidia-device-plugin-daemonset now shows "--pass-device-specs=true" in spec.template.spec.containers[0].args:

oc get ds/nvidia-device-plugin-daemonset -n gpu-operator-resources -o yaml
.
.
    spec:
      affinity: {}
      containers:
      - args:
        - --mig-strategy=single
        - --pass-device-specs=true
        - --fail-on-init-error=true
        - --device-list-strategy=envvar
        - --nvidia-driver-root=/run/nvidia/driver

2. In the same ds/nvidia-device-plugin-daemonset, spec.template.spec.initContainers[0].securityContext now sets privileged: true (the escalation: false setting was removed):

oc get ds/nvidia-device-plugin-daemonset -n gpu-operator-resources -o yaml
.
.
.
      initContainers:
      - args:
        - /tmp/vectorAdd
        command:
        - sh
        - -c
        image: nvcr.io/nvidia/k8s/cuda-sample@sha256:2a30fe7e23067bc2c3f8f62a6867702a016af2b80b9f6ce861f3fea4dfd85bc2
        imagePullPolicy: IfNotPresent
        name: toolkit-validation
        resources: {}
        securityContext:
          privileged: true

3. ds/nvidia-dcgm-exporter now sets spec.template.spec.initContainers[0].securityContext to privileged: true (the escalation: false setting was removed):

.
.
.
      initContainers:
      - args:
        - /tmp/vectorAdd
        command:
        - sh
        - -c
        image: nvcr.io/nvidia/k8s/cuda-sample@sha256:2a30fe7e23067bc2c3f8f62a6867702a016af2b80b9f6ce861f3fea4dfd85bc2
        imagePullPolicy: IfNotPresent
        name: toolkit-validation
        resources: {}
        securityContext:
          privileged: true

Also verified, with CPU Manager enabled, that we can run the openshift-psap gpu-burn workload derived from https://github.com/openshift-psap/gpu-burn:

# oc get pods
NAME                       READY   STATUS             RESTARTS   AGE
gpu-burn-daemonset-8w77b   0/1     CrashLoopBackOff   312        35h

# oc logs -f gpu-burn-daemonset-8w77b
GPU 0: Tesla T4 (UUID: GPU-711790a6-f758-de0a-62c3-ab6d9e7eb726)
10.7%   proc'd: 7542 (4136 Gflop/s)    errors: 0   temps: 43 C
	Summary at:   Wed Mar 3 16:11:09 UTC 2021
21.0%   proc'd: 15084 (4157 Gflop/s)   errors: 0   temps: 45 C
	Summary at:   Wed Mar 3 16:11:40 UTC 2021
31.3%   proc'd: 22626 (4153 Gflop/s)   errors: 0   temps: 46 C
	Summary at:   Wed Mar 3 16:12:11 UTC 2021
41.7%   proc'd: 30168 (4127 Gflop/s)   errors: 0   temps: 47 C
	Summary at:   Wed Mar 3 16:12:42 UTC 2021
52.3%   proc'd: 37710 (4094 Gflop/s)   errors: 0   temps: 49 C
	Summary at:   Wed Mar 3 16:13:14 UTC 2021
63.0%   proc'd: 45252 (4050 Gflop/s)   errors: 0   temps: 51 C
	Summary at:   Wed Mar 3 16:13:46 UTC 2021
73.3%   proc'd: 51956 (4008 Gflop/s)   errors: 0   temps: 53 C
	Summary at:   Wed Mar 3 16:14:17 UTC 2021
83.7%   proc'd: 59498 (3998 Gflop/s)   errors: 0   temps: 56 C
	Summary at:   Wed Mar 3 16:14:48 UTC 2021
94.3%   proc'd: 67040 (3903 Gflop/s)   errors: 0   temps: 60 C
	Summary at:   Wed Mar 3 16:15:20 UTC 2021
100.0%  proc'd: 70392 (3841 Gflop/s)   errors: 0   temps: 62 C
Killing processes.. done

Tested 1 GPUs:
	GPU 0: OK

# cat gpu-burn-resource.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-burn-entrypoint
data:
  entrypoint.sh: |-
    #!/bin/bash
    NUM_GPUS=$(nvidia-smi -L | wc -l)
    if [ $NUM_GPUS -eq 0 ]; then
      echo "ERROR No GPUs found"
      exit 1
    fi
    /usr/local/bin/gpu-burn 300
    if [ ! $? -eq 0 ]; then
      exit 1
    fi
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: gpu-burn-daemonset
  name: gpu-burn-daemonset
spec:
  selector:
    matchLabels:
      app: gpu-burn-daemonset
  template:
    metadata:
      labels:
        app: gpu-burn-daemonset
    spec:
      tolerations:
      - operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: quay.io/openshift-psap/gpu-burn
        imagePullPolicy: Always
        name: gpu-burn-ctr
        command: ["/bin/entrypoint.sh"]
        resources:
          limits:
            nvidia.com/gpu: 1 # requesting 1 GPU
        volumeMounts:
        - name: entrypoint
          mountPath: /bin/entrypoint.sh
          readOnly: true
          subPath: entrypoint.sh
      volumes:
      - name: entrypoint
        configMap:
          defaultMode: 0700
          name: gpu-burn-entrypoint
      nodeSelector:
        node-role.kubernetes.io/worker: ""
        feature.node.kubernetes.io/pci-10de.present: "true"

# oc describe node ip-10-0-129-108.us-east-2.compute.internal
Name:               ip-10-0-129-108.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=g4dn.xlarge
                    beta.kubernetes.io/os=linux
                    cpumanager=true
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
                    feature.node.kubernetes.io/cpu-cpuid.MPX=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/custom-rdma.available=true
                    feature.node.kubernetes.io/kernel-selinux.enabled=true
                    feature.node.kubernetes.io/kernel-version.full=4.18.0-240.10.1.el8_3.x86_64
                    feature.node.kubernetes.io/kernel-version.major=4
                    feature.node.kubernetes.io/kernel-version.minor=18
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-1d0f.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=rhcos
                    feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.3
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=4.7
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=7
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-129-108
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=g4dn.xlarge
                    node.openshift.io/os_id=rhcos
                    nvidia.com/cuda.driver.major=460
                    nvidia.com/cuda.driver.minor=32
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime.major=11
                    nvidia.com/cuda.runtime.minor=2
                    nvidia.com/gfd.timestamp=1614658952
                    nvidia.com/gpu.compute.major=7
                    nvidia.com/gpu.compute.minor=5
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.family=turing
                    nvidia.com/gpu.machine=g4dn.xlarge
                    nvidia.com/gpu.memory=15109
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=Tesla-T4
                    nvidia.com/mig.strategy=single
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0ca03a40567b86f76"}
                    machine.openshift.io/machine: openshift-machine-api/walid470a-mqnd8-worker-gpu-us-east-2a-zdsh2
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-2a9780d7015681c395469893e41b4f76
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-2a9780d7015681c395469893e41b4f76
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    nfd.node.kubernetes.io/extended-resources:
                    nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-...
                    nfd.node.kubernetes.io/worker.version: 1.15
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 02 Mar 2021 03:30:20 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-129-108.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 03 Mar 2021 17:09:18 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 03 Mar 2021 17:06:28 +0000   Tue, 02 Mar 2021 04:19:28 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 03 Mar 2021 17:06:28 +0000   Tue, 02 Mar 2021 04:19:28 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 03 Mar 2021 17:06:28 +0000   Tue, 02 Mar 2021 04:19:28 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 03 Mar 2021 17:06:28 +0000   Tue, 02 Mar 2021 04:19:28 +0000   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.129.108
  Hostname:     ip-10-0-129-108.us-east-2.compute.internal
  InternalDNS:  ip-10-0-129-108.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         4
  ephemeral-storage:           104322028Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16105772Ki
  nvidia.com/gpu:              1
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         3500m
  ephemeral-storage:           95069439022
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      14954796Ki
  nvidia.com/gpu:              1
  pods:                        250
System Info:
  Machine ID:                 ec2a7dcc527a2fd7d1945bfbeec1b7c1
  System UUID:                ec2a7dcc-527a-2fd7-d194-5bfbeec1b7c1
  Boot ID:                    c5d8a7f6-b9fc-4786-a0d0-df0a59c28bb4
  Kernel Version:             4.18.0-240.10.1.el8_3.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 47.83.202102090044-0 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.20.0-0.rhaos4.7.git8921e00.el8.51
  Kubelet Version:            v1.20.0+ba45583
  Kube-Proxy Version:         v1.20.0+ba45583
ProviderID:                   aws:///us-east-2a/i-0ca03a40567b86f76
Non-terminated Pods:          (21 in total)
  Namespace                               Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                                       ------------  ----------  ---------------  -------------  ---
  default                                 gpu-burn-daemonset-8w77b                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         36h
  default                                 pod1cpu2quay                               2 (57%)       2 (57%)     100Mi (0%)       100Mi (0%)     36h
  gpu-operator-resources                  gpu-feature-discovery-t4t5m                0 (0%)        0 (0%)      0 (0%)           0 (0%)         37h
  gpu-operator-resources                  nvidia-container-toolkit-daemonset-j64c8   0 (0%)        0 (0%)      0 (0%)           0 (0%)         37h
  gpu-operator-resources                  nvidia-dcgm-exporter-5q7f8                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         37h
  gpu-operator-resources                  nvidia-device-plugin-daemonset-97mkc       0 (0%)        0 (0%)      0 (0%)           0 (0%)         37h
  gpu-operator-resources                  nvidia-driver-daemonset-zkkg7              0 (0%)        0 (0%)      0 (0%)           0 (0%)         37h
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-55gpr              30m (0%)      0 (0%)      150Mi (1%)       0 (0%)         37h
  openshift-cluster-node-tuning-operator  tuned-9ldzp                                10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         37h
  openshift-dns                           dns-default-tncf6                          65m (1%)      0 (0%)      131Mi (0%)       0 (0%)         37h
  openshift-image-registry                node-ca-bv875                              10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         37h
  openshift-ingress-canary                ingress-canary-wl566                       10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         37h
  openshift-machine-config-operator       machine-config-daemon-n4fz5                40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         37h
  openshift-monitoring                    alertmanager-main-2                        8m (0%)       0 (0%)      270Mi (1%)       0 (0%)         36h
  openshift-monitoring                    node-exporter-4txp7                        9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         37h
  openshift-multus                        multus-5ldxg                               10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         37h
  openshift-multus                        network-metrics-daemon-q2thd               20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         37h
  openshift-network-diagnostics           network-check-target-pmh78                 10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         37h
  openshift-sdn                           ovs-m829x                                  15m (0%)      0 (0%)      400Mi (2%)       0 (0%)         37h
  openshift-sdn                           sdn-ktwxl                                  110m (3%)     0 (0%)      220Mi (1%)       0 (0%)         37h
  test-nfd                                nfd-worker-jkbmn                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         37h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         2347m (67%)   2 (57%)
  memory                      1946Mi (13%)  100Mi (0%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
  nvidia.com/gpu              1             1
Events:                       <none>

# oc exec -it nvidia-driver-daemonset-zkkg7 nvidia-smi -n gpu-operator-resources
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Wed Mar 3 17:15:28 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   61C    P0    69W /  70W |  13612MiB / 15109MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1515600      C   ./gpu_burn                      13609MiB |
+-----------------------------------------------------------------------------+
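For completeness, here is a minimal sketch of the kind of Guaranteed-QoS pod that exercises CPU pinning and GPU allocation together, which is the combination that originally failed. The pod name is illustrative; the image digest and vectorAdd command are the ones used by the validation containers above, and the cpumanager=true label comes from the node description:

apiVersion: v1
kind: Pod
metadata:
  name: cpumanager-gpu-test        # illustrative name
spec:
  restartPolicy: Never
  nodeSelector:
    cpumanager: "true"
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample@sha256:2a30fe7e23067bc2c3f8f62a6867702a016af2b80b9f6ce861f3fea4dfd85bc2
    command: ["sh", "-c", "/tmp/vectorAdd"]
    resources:
      # An integer CPU request equal to the limit puts the pod in the Guaranteed
      # QoS class, so the static CPU Manager policy pins it to an exclusive core
      # while the NVIDIA device plugin allocates the GPU.
      requests:
        cpu: "1"
        memory: 256Mi
        nvidia.com/gpu: 1
      limits:
        cpu: "1"
        memory: 256Mi
        nvidia.com/gpu: 1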
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438