Bug 1915693
| Summary: | Not able to install gpu-operator on cpumanager enabled node. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vinu K <vkochuku> |
| Component: | Node | Assignee: | Jiří Mencák <jmencak> |
| Node sub component: | CPU manager | QA Contact: | Walid A. <wabouham> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | akamra, aos-bugs, fromani, jinjli, jseunghw, kpouget, maszulik, mfojtik, nagrawal, ofamera, openshift-bugs-escalate, rpattath, wabouham, zkosic |
| Version: | 4.6 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-08-11 16:06:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
Vinu K
2021-01-13 09:07:18 UTC
Sending this over to the Node team, which is responsible for this piece of the stack.

The last PR required to fix this issue was posted on the GPU Operator repository and has since been merged into the GPU Operator master branch: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/182

Verified that the proposed workarounds are now incorporated on OCP 4.7.0 by deploying NVIDIA GPU Operator version 1.6.0 from OperatorHub on a GPU node with CPU Manager enabled (the CPU Manager setup is sketched below).
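For reference, CPU Manager is typically enabled on the GPU worker by labeling its MachineConfigPool (for example custom-kubelet=cpumanager-enabled) and applying a KubeletConfig along these lines; the label and resource name here are illustrative assumptions, not taken from this bug:

# cat cpumanager-kubeletconfig.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled   # the targeted MachineConfigPool must carry this label
  kubeletConfig:
    cpuManagerPolicy: static               # pin Guaranteed-QoS pods to exclusive cores
    cpuManagerReconcilePeriod: 5s

With CPU Manager enabled on the node, all GPU Operator pods come up cleanly: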
# oc get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-t4t5m 1/1 Running 5 37h
nvidia-container-toolkit-daemonset-j64c8 1/1 Running 0 37h
nvidia-dcgm-exporter-5q7f8 1/1 Running 0 37h
nvidia-device-plugin-daemonset-97mkc 1/1 Running 0 37h
nvidia-device-plugin-validation 0/1 Completed 0 36h
nvidia-driver-daemonset-zkkg7 1/1 Running 0 37h
1. ds/nvidia-device-plugin-daemonset now includes --pass-device-specs=true in spec.template.spec.containers[0].args:
# oc get ds/nvidia-device-plugin-daemonset -n gpu-operator-resources -o yaml
...
spec:
  affinity: {}
  containers:
  - args:
    - --mig-strategy=single
    - --pass-device-specs=true
    - --fail-on-init-error=true
    - --device-list-strategy=envvar
    - --nvidia-driver-root=/run/nvidia/driver
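The same check can be done without scrolling through the full object; this jsonpath query is an illustrative shortcut, not part of the original verification:

# oc get ds/nvidia-device-plugin-daemonset -n gpu-operator-resources \
    -o jsonpath='{.spec.template.spec.containers[0].args}{"\n"}'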
2. In the same ds/nvidia-device-plugin-daemonset, spec.template.spec.initContainers[0].securityContext now sets privileged: true (the former allowPrivilegeEscalation: false setting has been removed):
# oc get ds/nvidia-device-plugin-daemonset -n gpu-operator-resources -o yaml
...
initContainers:
- args:
  - /tmp/vectorAdd
  command:
  - sh
  - -c
  image: nvcr.io/nvidia/k8s/cuda-sample@sha256:2a30fe7e23067bc2c3f8f62a6867702a016af2b80b9f6ce861f3fea4dfd85bc2
  imagePullPolicy: IfNotPresent
  name: toolkit-validation
  resources: {}
  securityContext:
    privileged: true
3. Similarly, ds/nvidia-dcgm-exporter now sets spec.template.spec.initContainers[0].securityContext to privileged: true (again with the former allowPrivilegeEscalation: false setting removed):
...
initContainers:
- args:
  - /tmp/vectorAdd
  command:
  - sh
  - -c
  image: nvcr.io/nvidia/k8s/cuda-sample@sha256:2a30fe7e23067bc2c3f8f62a6867702a016af2b80b9f6ce861f3fea4dfd85bc2
  imagePullPolicy: IfNotPresent
  name: toolkit-validation
  resources: {}
  securityContext:
    privileged: true
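The init-container securityContext on both DaemonSets can be spot-checked the same way; these commands are illustrative and were not part of the original report:

# oc get ds/nvidia-device-plugin-daemonset -n gpu-operator-resources \
    -o jsonpath='{.spec.template.spec.initContainers[0].securityContext}{"\n"}'
# oc get ds/nvidia-dcgm-exporter -n gpu-operator-resources \
    -o jsonpath='{.spec.template.spec.initContainers[0].securityContext}{"\n"}'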
Also verified, with CPU Manager enabled, that the openshift-psap gpu-burn workload (derived from https://github.com/openshift-psap/gpu-burn) runs:
# oc get pods
NAME READY STATUS RESTARTS AGE
gpu-burn-daemonset-8w77b 0/1 CrashLoopBackOff 312 35h
# oc logs -f gpu-burn-daemonset-8w77b
GPU 0: Tesla T4 (UUID: GPU-711790a6-f758-de0a-62c3-ab6d9e7eb726)
10.7% proc'd: 7542 (4136 Gflop/s) errors: 0 temps: 43 C
Summary at: Wed Mar 3 16:11:09 UTC 2021
21.0% proc'd: 15084 (4157 Gflop/s) errors: 0 temps: 45 C
Summary at: Wed Mar 3 16:11:40 UTC 2021
31.3% proc'd: 22626 (4153 Gflop/s) errors: 0 temps: 46 C
Summary at: Wed Mar 3 16:12:11 UTC 2021
41.7% proc'd: 30168 (4127 Gflop/s) errors: 0 temps: 47 C
Summary at: Wed Mar 3 16:12:42 UTC 2021
52.3% proc'd: 37710 (4094 Gflop/s) errors: 0 temps: 49 C
Summary at: Wed Mar 3 16:13:14 UTC 2021
63.0% proc'd: 45252 (4050 Gflop/s) errors: 0 temps: 51 C
Summary at: Wed Mar 3 16:13:46 UTC 2021
73.3% proc'd: 51956 (4008 Gflop/s) errors: 0 temps: 53 C
Summary at: Wed Mar 3 16:14:17 UTC 2021
83.7% proc'd: 59498 (3998 Gflop/s) errors: 0 temps: 56 C
Summary at: Wed Mar 3 16:14:48 UTC 2021
94.3% proc'd: 67040 (3903 Gflop/s) errors: 0 temps: 60 C
Summary at: Wed Mar 3 16:15:20 UTC 2021
100.0% proc'd: 70392 (3841 Gflop/s) errors: 0 temps: 62 C
Killing processes.. done
Tested 1 GPUs:
GPU 0: OK
# cat gpu-burn-resource.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-burn-entrypoint
data:
  entrypoint.sh: |-
    #!/bin/bash
    NUM_GPUS=$(nvidia-smi -L | wc -l)
    if [ $NUM_GPUS -eq 0 ]; then
      echo "ERROR No GPUs found"
      exit 1
    fi
    /usr/local/bin/gpu-burn 300
    if [ ! $? -eq 0 ]; then
      exit 1
    fi
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: gpu-burn-daemonset
  name: gpu-burn-daemonset
spec:
  selector:
    matchLabels:
      app: gpu-burn-daemonset
  template:
    metadata:
      labels:
        app: gpu-burn-daemonset
    spec:
      tolerations:
      - operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: quay.io/openshift-psap/gpu-burn
        imagePullPolicy: Always
        name: gpu-burn-ctr
        command: ["/bin/entrypoint.sh"]
        resources:
          limits:
            nvidia.com/gpu: 1 # requesting 1 GPU
        volumeMounts:
        - name: entrypoint
          mountPath: /bin/entrypoint.sh
          readOnly: true
          subPath: entrypoint.sh
      volumes:
      - name: entrypoint
        configMap:
          defaultMode: 0700
          name: gpu-burn-entrypoint
      nodeSelector:
        node-role.kubernetes.io/worker: ""
        feature.node.kubernetes.io/pci-10de.present: "true"
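The node description below also lists a pod1cpu2quay pod whose whole-CPU requests equal its limits; a Guaranteed-QoS pod like that is what actually receives exclusive cores from CPU Manager alongside the GPU workload. A minimal sketch of such a pod follows (the name, image and resource values are illustrative assumptions, not the manifest actually used):

# cat pod1cpu2quay.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod1cpu2quay
spec:
  nodeSelector:
    cpumanager: "true"                           # land on the CPU Manager-enabled GPU node
  containers:
  - name: pinned
    image: registry.access.redhat.com/ubi8/ubi   # assumed image; any long-running workload works
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "2"                                 # whole cores, requests == limits -> Guaranteed QoS
        memory: 100Mi
      limits:
        cpu: "2"
        memory: 100Mi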
# oc describe node ip-10-0-129-108.us-east-2.compute.internal
Name: ip-10-0-129-108.us-east-2.compute.internal
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=g4dn.xlarge
beta.kubernetes.io/os=linux
cpumanager=true
failure-domain.beta.kubernetes.io/region=us-east-2
failure-domain.beta.kubernetes.io/zone=us-east-2a
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
feature.node.kubernetes.io/cpu-cpuid.MPX=true
feature.node.kubernetes.io/cpu-hardware_multithreading=true
feature.node.kubernetes.io/custom-rdma.available=true
feature.node.kubernetes.io/kernel-selinux.enabled=true
feature.node.kubernetes.io/kernel-version.full=4.18.0-240.10.1.el8_3.x86_64
feature.node.kubernetes.io/kernel-version.major=4
feature.node.kubernetes.io/kernel-version.minor=18
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-1d0f.present=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
feature.node.kubernetes.io/system-os_release.ID=rhcos
feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.3
feature.node.kubernetes.io/system-os_release.VERSION_ID=4.7
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=7
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-0-129-108
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
node.kubernetes.io/instance-type=g4dn.xlarge
node.openshift.io/os_id=rhcos
nvidia.com/cuda.driver.major=460
nvidia.com/cuda.driver.minor=32
nvidia.com/cuda.driver.rev=03
nvidia.com/cuda.runtime.major=11
nvidia.com/cuda.runtime.minor=2
nvidia.com/gfd.timestamp=1614658952
nvidia.com/gpu.compute.major=7
nvidia.com/gpu.compute.minor=5
nvidia.com/gpu.count=1
nvidia.com/gpu.family=turing
nvidia.com/gpu.machine=g4dn.xlarge
nvidia.com/gpu.memory=15109
nvidia.com/gpu.present=true
nvidia.com/gpu.product=Tesla-T4
nvidia.com/mig.strategy=single
topology.ebs.csi.aws.com/zone=us-east-2a
topology.kubernetes.io/region=us-east-2
topology.kubernetes.io/zone=us-east-2a
Annotations: csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0ca03a40567b86f76"}
machine.openshift.io/machine: openshift-machine-api/walid470a-mqnd8-worker-gpu-us-east-2a-zdsh2
machineconfiguration.openshift.io/currentConfig: rendered-worker-2a9780d7015681c395469893e41b4f76
machineconfiguration.openshift.io/desiredConfig: rendered-worker-2a9780d7015681c395469893e41b4f76
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
nfd.node.kubernetes.io/extended-resources:
nfd.node.kubernetes.io/feature-labels:
cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-...
nfd.node.kubernetes.io/worker.version: 1.15
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 02 Mar 2021 03:30:20 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: ip-10-0-129-108.us-east-2.compute.internal
AcquireTime: <unset>
RenewTime: Wed, 03 Mar 2021 17:09:18 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 03 Mar 2021 17:06:28 +0000 Tue, 02 Mar 2021 04:19:28 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 03 Mar 2021 17:06:28 +0000 Tue, 02 Mar 2021 04:19:28 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 03 Mar 2021 17:06:28 +0000 Tue, 02 Mar 2021 04:19:28 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 03 Mar 2021 17:06:28 +0000 Tue, 02 Mar 2021 04:19:28 +0000 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.0.129.108
Hostname: ip-10-0-129-108.us-east-2.compute.internal
InternalDNS: ip-10-0-129-108.us-east-2.compute.internal
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 4
ephemeral-storage: 104322028Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16105772Ki
nvidia.com/gpu: 1
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 3500m
ephemeral-storage: 95069439022
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 14954796Ki
nvidia.com/gpu: 1
pods: 250
System Info:
Machine ID: ec2a7dcc527a2fd7d1945bfbeec1b7c1
System UUID: ec2a7dcc-527a-2fd7-d194-5bfbeec1b7c1
Boot ID: c5d8a7f6-b9fc-4786-a0d0-df0a59c28bb4
Kernel Version: 4.18.0-240.10.1.el8_3.x86_64
OS Image: Red Hat Enterprise Linux CoreOS 47.83.202102090044-0 (Ootpa)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.20.0-0.rhaos4.7.git8921e00.el8.51
Kubelet Version: v1.20.0+ba45583
Kube-Proxy Version: v1.20.0+ba45583
ProviderID: aws:///us-east-2a/i-0ca03a40567b86f76
Non-terminated Pods: (21 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default gpu-burn-daemonset-8w77b 0 (0%) 0 (0%) 0 (0%) 0 (0%) 36h
default pod1cpu2quay 2 (57%) 2 (57%) 100Mi (0%) 100Mi (0%) 36h
gpu-operator-resources gpu-feature-discovery-t4t5m 0 (0%) 0 (0%) 0 (0%) 0 (0%) 37h
gpu-operator-resources nvidia-container-toolkit-daemonset-j64c8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 37h
gpu-operator-resources nvidia-dcgm-exporter-5q7f8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 37h
gpu-operator-resources nvidia-device-plugin-daemonset-97mkc 0 (0%) 0 (0%) 0 (0%) 0 (0%) 37h
gpu-operator-resources nvidia-driver-daemonset-zkkg7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 37h
openshift-cluster-csi-drivers aws-ebs-csi-driver-node-55gpr 30m (0%) 0 (0%) 150Mi (1%) 0 (0%) 37h
openshift-cluster-node-tuning-operator tuned-9ldzp 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 37h
openshift-dns dns-default-tncf6 65m (1%) 0 (0%) 131Mi (0%) 0 (0%) 37h
openshift-image-registry node-ca-bv875 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 37h
openshift-ingress-canary ingress-canary-wl566 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 37h
openshift-machine-config-operator machine-config-daemon-n4fz5 40m (1%) 0 (0%) 100Mi (0%) 0 (0%) 37h
openshift-monitoring alertmanager-main-2 8m (0%) 0 (0%) 270Mi (1%) 0 (0%) 36h
openshift-monitoring node-exporter-4txp7 9m (0%) 0 (0%) 210Mi (1%) 0 (0%) 37h
openshift-multus multus-5ldxg 10m (0%) 0 (0%) 150Mi (1%) 0 (0%) 37h
openshift-multus network-metrics-daemon-q2thd 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 37h
openshift-network-diagnostics network-check-target-pmh78 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 37h
openshift-sdn ovs-m829x 15m (0%) 0 (0%) 400Mi (2%) 0 (0%) 37h
openshift-sdn sdn-ktwxl 110m (3%) 0 (0%) 220Mi (1%) 0 (0%) 37h
test-nfd nfd-worker-jkbmn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 37h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2347m (67%) 2 (57%)
memory 1946Mi (13%) 100Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
nvidia.com/gpu 1 1
Events: <none>
# oc exec -it nvidia-driver-daemonset-zkkg7 nvidia-smi -n gpu-operator-resources
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Wed Mar 3 17:15:28 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 61C P0 69W / 70W | 13612MiB / 15109MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1515600 C ./gpu_burn 13609MiB |
+-----------------------------------------------------------------------------+
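Beyond nvidia-smi, the CPU Manager side can be inspected directly on the node through the kubelet's state file; this debug command is illustrative and was not part of the original verification:

# oc debug node/ip-10-0-129-108.us-east-2.compute.internal -- chroot /host cat /var/lib/kubelet/cpu_manager_state

With the static policy active, the state file reports "policyName":"static" along with the default CPU set and the exclusive CPUs assigned to Guaranteed-QoS containers.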
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438