Description of problem:
When the CPU Manager is enabled (https://docs.openshift.com/container-platform/4.6/scalability_and_performance/using-cpu-manager.html) and the NVIDIA gpu-operator is installed (tested with versions 1.3.1 and 1.4.0, https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html), the 'nvidia-driver-validation' pod always fails with the logs below:
---
# oc logs nvidia-driver-validation
Failed to allocate device vector A (error code no CUDA-capable device is detected)!
[Vector addition of 50000 elements]
---
Disabling the CPU Manager results in success and we can proceed.

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Easy

Steps to Reproduce:
1. Enable the CPU Manager on the node and install the gpu-operator.
2. Watch the nvidia-driver-validation pod logs.
3.

Actual results:
The nvidia-driver-validation pod fails to start.

Expected results:
The gpu-operator should install without any error.

Additional info:
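For reference, the reproduction relies on enabling the CPU Manager static policy through a KubeletConfig, as described in the linked documentation. A minimal sketch of that kind of KubeletConfig (the CR name and the custom-kubelet label value are illustrative, not necessarily the ones used on this cluster):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static           # pins whole CPUs to Guaranteed-QoS pods
    cpuManagerReconcilePeriod: 5s      # how often the kubelet reconciles CPU assignments

The matching label has to be applied to the worker MachineConfigPool (e.g. oc label machineconfigpool worker custom-kubelet=cpumanager-enabled) before the kubelet picks up the new config.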
Sending this over to the node team, which is responsible for this piece of the stack.
The last PR required to fix this issue has been posted on the GPU Operator repository: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/182
The PR has been merged into the GPU Operator master branch: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/182
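For context, the change matches the workaround documented for running the NVIDIA device plugin alongside the kubelet's static CPU Manager policy: the plugin passes the GPU device specs to the kubelet (--pass-device-specs=true); otherwise the kubelet's periodic container updates for CPU pinning can drop the containers' access to the GPU device nodes, which is what made the validation pod fail with "no CUDA-capable device is detected". On clusters still running a pre-fix operator, a rough sketch of applying the same argument by hand (assuming container index 0 is the device plugin container and that the operator does not immediately reconcile the DaemonSet back; the supported route is upgrading to an operator release that carries the fix):

oc -n gpu-operator-resources patch ds/nvidia-device-plugin-daemonset --type=json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--pass-device-specs=true"}]'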
Verified that the proposed workarounds are now incorporated on OCP 4.7.0 by deploying NVIDIA GPU Operator version 1.6.0 from OperatorHub with CPU Manager enabled on the GPU node:

# oc get pods -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-t4t5m                1/1     Running     5          37h
nvidia-container-toolkit-daemonset-j64c8   1/1     Running     0          37h
nvidia-dcgm-exporter-5q7f8                 1/1     Running     0          37h
nvidia-device-plugin-daemonset-97mkc       1/1     Running     0          37h
nvidia-device-plugin-validation            0/1     Completed   0          36h
nvidia-driver-daemonset-zkkg7              1/1     Running     0          37h

1. ds/nvidia-device-plugin-daemonset now shows "--pass-device-specs=true" in spec.template.spec.containers[0].args:

oc get ds/nvidia-device-plugin-daemonset -n gpu-operator-resources -o yaml
.
.
    spec:
      affinity: {}
      containers:
      - args:
        - --mig-strategy=single
        - --pass-device-specs=true
        - --fail-on-init-error=true
        - --device-list-strategy=envvar
        - --nvidia-driver-root=/run/nvidia/driver

2. In the same ds/nvidia-device-plugin-daemonset, spec.template.spec.initContainers[0].securityContext now sets privileged: true (the escalation: false setting was removed):

oc get ds/nvidia-device-plugin-daemonset -n gpu-operator-resources -o yaml
.
.
.
      initContainers:
      - args:
        - /tmp/vectorAdd
        command:
        - sh
        - -c
        image: nvcr.io/nvidia/k8s/cuda-sample@sha256:2a30fe7e23067bc2c3f8f62a6867702a016af2b80b9f6ce861f3fea4dfd85bc2
        imagePullPolicy: IfNotPresent
        name: toolkit-validation
        resources: {}
        securityContext:
          privileged: true

3. ds/nvidia-dcgm-exporter now sets spec.template.spec.initContainers[0].securityContext to privileged: true (the escalation: false setting was removed):

.
.
.
      initContainers:
      - args:
        - /tmp/vectorAdd
        command:
        - sh
        - -c
        image: nvcr.io/nvidia/k8s/cuda-sample@sha256:2a30fe7e23067bc2c3f8f62a6867702a016af2b80b9f6ce861f3fea4dfd85bc2
        imagePullPolicy: IfNotPresent
        name: toolkit-validation
        resources: {}
        securityContext:
          privileged: true

Also verified, with CPU Manager enabled, that we can run the openshift-psap gpu-burn workload derived from https://github.com/openshift-psap/gpu-burn:

# oc get pods
NAME                       READY   STATUS             RESTARTS   AGE
gpu-burn-daemonset-8w77b   0/1     CrashLoopBackOff   312        35h

# oc logs -f gpu-burn-daemonset-8w77b
GPU 0: Tesla T4 (UUID: GPU-711790a6-f758-de0a-62c3-ab6d9e7eb726)
10.7%   proc'd: 7542 (4136 Gflop/s)    errors: 0   temps: 43 C
	Summary at:   Wed Mar 3 16:11:09 UTC 2021
21.0%   proc'd: 15084 (4157 Gflop/s)   errors: 0   temps: 45 C
	Summary at:   Wed Mar 3 16:11:40 UTC 2021
31.3%   proc'd: 22626 (4153 Gflop/s)   errors: 0   temps: 46 C
	Summary at:   Wed Mar 3 16:12:11 UTC 2021
41.7%   proc'd: 30168 (4127 Gflop/s)   errors: 0   temps: 47 C
	Summary at:   Wed Mar 3 16:12:42 UTC 2021
52.3%   proc'd: 37710 (4094 Gflop/s)   errors: 0   temps: 49 C
	Summary at:   Wed Mar 3 16:13:14 UTC 2021
63.0%   proc'd: 45252 (4050 Gflop/s)   errors: 0   temps: 51 C
	Summary at:   Wed Mar 3 16:13:46 UTC 2021
73.3%   proc'd: 51956 (4008 Gflop/s)   errors: 0   temps: 53 C
	Summary at:   Wed Mar 3 16:14:17 UTC 2021
83.7%   proc'd: 59498 (3998 Gflop/s)   errors: 0   temps: 56 C
	Summary at:   Wed Mar 3 16:14:48 UTC 2021
94.3%   proc'd: 67040 (3903 Gflop/s)   errors: 0   temps: 60 C
	Summary at:   Wed Mar 3 16:15:20 UTC 2021
100.0%  proc'd: 70392 (3841 Gflop/s)   errors: 0   temps: 62 C
Killing processes.. done

Tested 1 GPUs:
	GPU 0: OK

# cat gpu-burn-resource.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-burn-entrypoint
data:
  entrypoint.sh: |-
    #!/bin/bash
    NUM_GPUS=$(nvidia-smi -L | wc -l)
    if [ $NUM_GPUS -eq 0 ]; then
      echo "ERROR No GPUs found"
      exit 1
    fi
    /usr/local/bin/gpu-burn 300
    if [ ! $? -eq 0 ]; then
      exit 1
    fi
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: gpu-burn-daemonset
  name: gpu-burn-daemonset
spec:
  selector:
    matchLabels:
      app: gpu-burn-daemonset
  template:
    metadata:
      labels:
        app: gpu-burn-daemonset
    spec:
      tolerations:
      - operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: quay.io/openshift-psap/gpu-burn
        imagePullPolicy: Always
        name: gpu-burn-ctr
        command: ["/bin/entrypoint.sh"]
        resources:
          limits:
            nvidia.com/gpu: 1 # requesting 1 GPU
        volumeMounts:
        - name: entrypoint
          mountPath: /bin/entrypoint.sh
          readOnly: true
          subPath: entrypoint.sh
      volumes:
      - name: entrypoint
        configMap:
          defaultMode: 0700
          name: gpu-burn-entrypoint
      nodeSelector:
        node-role.kubernetes.io/worker: ""
        feature.node.kubernetes.io/pci-10de.present: "true"

# oc describe node ip-10-0-129-108.us-east-2.compute.internal
Name:               ip-10-0-129-108.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=g4dn.xlarge
                    beta.kubernetes.io/os=linux
                    cpumanager=true
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
                    feature.node.kubernetes.io/cpu-cpuid.MPX=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/custom-rdma.available=true
                    feature.node.kubernetes.io/kernel-selinux.enabled=true
                    feature.node.kubernetes.io/kernel-version.full=4.18.0-240.10.1.el8_3.x86_64
                    feature.node.kubernetes.io/kernel-version.major=4
                    feature.node.kubernetes.io/kernel-version.minor=18
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-1d0f.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=rhcos
                    feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.3
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=4.7
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=7
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-129-108
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=g4dn.xlarge
                    node.openshift.io/os_id=rhcos
                    nvidia.com/cuda.driver.major=460
                    nvidia.com/cuda.driver.minor=32
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime.major=11
                    nvidia.com/cuda.runtime.minor=2
                    nvidia.com/gfd.timestamp=1614658952
                    nvidia.com/gpu.compute.major=7
                    nvidia.com/gpu.compute.minor=5
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.family=turing
                    nvidia.com/gpu.machine=g4dn.xlarge
                    nvidia.com/gpu.memory=15109
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=Tesla-T4
                    nvidia.com/mig.strategy=single
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0ca03a40567b86f76"}
                    machine.openshift.io/machine: openshift-machine-api/walid470a-mqnd8-worker-gpu-us-east-2a-zdsh2
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-2a9780d7015681c395469893e41b4f76
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-2a9780d7015681c395469893e41b4f76
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    nfd.node.kubernetes.io/extended-resources:
                    nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-...
                    nfd.node.kubernetes.io/worker.version: 1.15
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 02 Mar 2021 03:30:20 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-129-108.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 03 Mar 2021 17:09:18 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 03 Mar 2021 17:06:28 +0000   Tue, 02 Mar 2021 04:19:28 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 03 Mar 2021 17:06:28 +0000   Tue, 02 Mar 2021 04:19:28 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 03 Mar 2021 17:06:28 +0000   Tue, 02 Mar 2021 04:19:28 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 03 Mar 2021 17:06:28 +0000   Tue, 02 Mar 2021 04:19:28 +0000   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.129.108
  Hostname:     ip-10-0-129-108.us-east-2.compute.internal
  InternalDNS:  ip-10-0-129-108.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         4
  ephemeral-storage:           104322028Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16105772Ki
  nvidia.com/gpu:              1
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         3500m
  ephemeral-storage:           95069439022
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      14954796Ki
  nvidia.com/gpu:              1
  pods:                        250
System Info:
  Machine ID:                 ec2a7dcc527a2fd7d1945bfbeec1b7c1
  System UUID:                ec2a7dcc-527a-2fd7-d194-5bfbeec1b7c1
  Boot ID:                    c5d8a7f6-b9fc-4786-a0d0-df0a59c28bb4
  Kernel Version:             4.18.0-240.10.1.el8_3.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 47.83.202102090044-0 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.20.0-0.rhaos4.7.git8921e00.el8.51
  Kubelet Version:            v1.20.0+ba45583
  Kube-Proxy Version:         v1.20.0+ba45583
ProviderID:                   aws:///us-east-2a/i-0ca03a40567b86f76
Non-terminated Pods:          (21 in total)
  Namespace                               Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                                       ------------  ----------  ---------------  -------------  ---
  default                                 gpu-burn-daemonset-8w77b                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         36h
  default                                 pod1cpu2quay                               2 (57%)       2 (57%)     100Mi (0%)       100Mi (0%)     36h
  gpu-operator-resources                  gpu-feature-discovery-t4t5m                0 (0%)        0 (0%)      0 (0%)           0 (0%)         37h
  gpu-operator-resources                  nvidia-container-toolkit-daemonset-j64c8   0 (0%)        0 (0%)      0 (0%)           0 (0%)         37h
  gpu-operator-resources                  nvidia-dcgm-exporter-5q7f8                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         37h
  gpu-operator-resources                  nvidia-device-plugin-daemonset-97mkc       0 (0%)        0 (0%)      0 (0%)           0 (0%)         37h
  gpu-operator-resources                  nvidia-driver-daemonset-zkkg7              0 (0%)        0 (0%)      0 (0%)           0 (0%)         37h
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-55gpr              30m (0%)      0 (0%)      150Mi (1%)       0 (0%)         37h
  openshift-cluster-node-tuning-operator  tuned-9ldzp                                10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         37h
  openshift-dns                           dns-default-tncf6                          65m (1%)      0 (0%)      131Mi (0%)       0 (0%)         37h
  openshift-image-registry                node-ca-bv875                              10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         37h
  openshift-ingress-canary                ingress-canary-wl566                       10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         37h
  openshift-machine-config-operator       machine-config-daemon-n4fz5                40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         37h
  openshift-monitoring                    alertmanager-main-2                        8m (0%)       0 (0%)      270Mi (1%)       0 (0%)         36h
  openshift-monitoring                    node-exporter-4txp7                        9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         37h
  openshift-multus                        multus-5ldxg                               10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         37h
  openshift-multus                        network-metrics-daemon-q2thd               20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         37h
  openshift-network-diagnostics           network-check-target-pmh78                 10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         37h
  openshift-sdn                           ovs-m829x                                  15m (0%)      0 (0%)      400Mi (2%)       0 (0%)         37h
  openshift-sdn                           sdn-ktwxl                                  110m (3%)     0 (0%)      220Mi (1%)       0 (0%)         37h
  test-nfd                                nfd-worker-jkbmn                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         37h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         2347m (67%)   2 (57%)
  memory                      1946Mi (13%)  100Mi (0%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
  nvidia.com/gpu              1             1
Events:                       <none>

# oc exec -it nvidia-driver-daemonset-zkkg7 nvidia-smi -n gpu-operator-resources
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Wed Mar 3 17:15:28 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   61C    P0    69W /  70W |  13612MiB / 15109MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1515600      C   ./gpu_burn                      13609MiB |
+-----------------------------------------------------------------------------+
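For completeness, here is a minimal sketch of the kind of Guaranteed-QoS pod that exercises CPU pinning and GPU allocation together, which is the combination that originally failed. The pod name is illustrative; the image digest and vectorAdd command are the ones used by the validation containers above, and the cpumanager=true label comes from the node description:

apiVersion: v1
kind: Pod
metadata:
  name: cpumanager-gpu-test        # illustrative name
spec:
  restartPolicy: Never
  nodeSelector:
    cpumanager: "true"
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample@sha256:2a30fe7e23067bc2c3f8f62a6867702a016af2b80b9f6ce861f3fea4dfd85bc2
    command: ["sh", "-c", "/tmp/vectorAdd"]
    resources:
      # An integer CPU request equal to the limit puts the pod in the Guaranteed
      # QoS class, so the static CPU Manager policy pins it to an exclusive core
      # while the NVIDIA device plugin allocates the GPU.
      requests:
        cpu: "1"
        memory: 256Mi
        nvidia.com/gpu: 1
      limits:
        cpu: "1"
        memory: 256Mi
        nvidia.com/gpu: 1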
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438