Description of problem:

On an OCP 4.7.0-rc.0 IPI-installed cluster on AWS, with the NVIDIA GPU Operator installed via OperatorHub, we don't see the DCGM metrics in the OpenShift Console under the Monitoring -> Metrics tab. These metrics were available on OCP 4.6.15 and 4.6.16.

The nvidia-dcgm-exporter appears to expose these metrics, but the OpenShift Console Monitoring -> Metrics page is not picking them up:

# cat get-dcgm-metrics.sh
#!/bin/bash

DCGM_POD=$(oc get pods -lapp=nvidia-dcgm-exporter -oname -n gpu-operator-resources | head -1)
if [ -z "$DCGM_POD" ]; then
    echo "Failed to find a pod for nvidia-dcgm-exporter"
    exit 10
fi

DCGM_PORT=9400
LOCAL_PORT=9401
retry=5

timeout 10 oc port-forward ${DCGM_POD} ${LOCAL_PORT}:${DCGM_PORT} -n gpu-operator-resources &

while [ "$DCGM_OUTPUT" == "" ]; do
    sleep 1
    DCGM_OUTPUT=$(curl localhost:${LOCAL_PORT}/metrics 2>/dev/null)
    retry=$(($retry - 1))
    if [[ $retry == 0 ]]; then
        echo "Failed to get any output from DCGM/metrics ..."
        exit 11
    fi
done

grep "# TYPE DCGM_FI_DEV" <<< ${DCGM_OUTPUT}

# ./get-dcgm-metrics.sh
Forwarding from 127.0.0.1:9401 -> 9400
Forwarding from [::1]:9401 -> 9400
Handling connection for 9401
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter
# TYPE DCGM_FI_DEV_PCIE_TX_THROUGHPUT counter
# TYPE DCGM_FI_DEV_PCIE_RX_THROUGHPUT counter
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
# TYPE DCGM_FI_DEV_POWER_VIOLATION counter
# TYPE DCGM_FI_DEV_THERMAL_VIOLATION counter
# TYPE DCGM_FI_DEV_SYNC_BOOST_VIOLATION counter
# TYPE DCGM_FI_DEV_BOARD_LIMIT_VIOLATION counter
# TYPE DCGM_FI_DEV_LOW_UTIL_VIOLATION counter
# TYPE DCGM_FI_DEV_RELIABILITY_VIOLATION counter
# TYPE DCGM_FI_DEV_FB_FREE gauge
# TYPE DCGM_FI_DEV_FB_USED gauge
# TYPE DCGM_FI_DEV_ECC_SBE_VOL_TOTAL counter
# TYPE DCGM_FI_DEV_ECC_DBE_VOL_TOTAL counter
# TYPE DCGM_FI_DEV_ECC_SBE_AGG_TOTAL counter
# TYPE DCGM_FI_DEV_ECC_DBE_AGG_TOTAL counter
# TYPE DCGM_FI_DEV_RETIRED_SBE counter
# TYPE DCGM_FI_DEV_RETIRED_DBE counter
# TYPE DCGM_FI_DEV_RETIRED_PENDING counter
# TYPE DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL counter
# TYPE DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL counter
# TYPE DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL counter
# TYPE DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL counter
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_L0 counter
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge

-------

Also, we do see the Service and ServiceMonitor in the gpu-operator-resources namespace (see the Additional info section).

Version-Release number of selected component (if applicable):

Server Version: 4.7.0-rc.0
Kubernetes Version: v1.20.0+ba45583
NVIDIA GPU Operator version 1.5.2

How reproducible:

Always

Steps to Reproduce:
1. IPI install of OCP 4.7.0-rc.0 on AWS, with 3 workers and 3 masters
2. Create a new MachineSet to add a g4dn.xlarge instance as a 4th worker
3. Deploy the Node Feature Discovery (NFD) Operator from OperatorHub in a new namespace, and create a NodeFeatureDiscoveries instance
4. Set up cluster-wide entitlement following the procedure in https://www.openshift.com/blog/how-to-use-entitled-image-builds-to-build-drivercontainers-with-ubi-on-openshift
5. Create a new namespace called "gpu-operator-resources"
6. From the OpenShift Console, Operators -> OperatorHub, deploy the NVIDIA GPU Operator
7. Create a ClusterPolicy instance
8. Verify that the NVIDIA GPU Operator has been successfully deployed and that nvidia-device-plugin, nvidia-dcgm-exporter, and nvidia-driver-daemonset are running, along with all the other pods in the NVIDIA stack:

# oc get pods -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-qd5st                1/1     Running     0          110m
nvidia-container-toolkit-daemonset-4zk2j   1/1     Running     0          114m
nvidia-dcgm-exporter-7zsfl                 1/1     Running     0          111m
nvidia-device-plugin-daemonset-slrgl       1/1     Running     0          111m
nvidia-device-plugin-validation            0/1     Completed   0          111m
nvidia-driver-daemonset-gr2p5              1/1     Running     0          114m

9. Run the gpu-burn workload in its own namespace (https://github.com/openshift-psap/gpu-burn): oc create -f gpu-burn.yaml
10. Check that you are seeing GPU Utilization and GPU Temperature stats in the "oc logs -f" output of the created gpu-burn DaemonSet pods
11. From the OpenShift Console, Monitoring -> Metrics, check whether you can query for metrics starting with "DCGM"

Actual results:

No metrics starting with DCGM show up in the Metrics dropdown.

Expected results:

DCGM metrics should be available, and graphs of them can be created while running a GPU workload.

Additional info:

# oc get Service -n gpu-operator-resources nvidia-dcgm-exporter -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "true"
  creationTimestamp: "2021-02-10T03:22:17Z"
  labels:
    app: nvidia-dcgm-exporter
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:prometheus.io/scrape: {}
        f:labels:
          .: {}
          f:app: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"74e1a5e7-ef8e-47ca-a821-68f7917a1a90"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:ports:
          .: {}
          k:{"port":9400,"protocol":"TCP"}:
            .: {}
            f:name: {}
            f:port: {}
            f:protocol: {}
            f:targetPort: {}
        f:selector:
          .: {}
          f:app: {}
        f:sessionAffinity: {}
        f:type: {}
    manager: gpu-operator
    operation: Update
    time: "2021-02-10T03:22:17Z"
  name: nvidia-dcgm-exporter
  namespace: gpu-operator-resources
  ownerReferences:
  - apiVersion: nvidia.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterPolicy
    name: gpu-cluster-policy
    uid: 74e1a5e7-ef8e-47ca-a821-68f7917a1a90
  resourceVersion: "288918"
  selfLink: /api/v1/namespaces/gpu-operator-resources/services/nvidia-dcgm-exporter
  uid: 46585c3b-625d-4b35-add8-c3d4f56cb162
spec:
  clusterIP: 172.30.35.190
  clusterIPs:
  - 172.30.35.190
  ports:
  - name: gpu-metrics
    port: 9400
    protocol: TCP
    targetPort: 9400
  selector:
    app: nvidia-dcgm-exporter
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

# oc get ServiceMonitor -n gpu-operator-resources
NAME                   AGE
nvidia-dcgm-exporter   93m

# oc get ServiceMonitor -n gpu-operator-resources nvidia-dcgm-exporter -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2021-02-10T03:22:17Z"
  generation: 1
  labels:
    app: nvidia-dcgm-exporter
  managedFields:
  - apiVersion: monitoring.coreos.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:app: {}
        f:ownerReferences: {}
      f:spec:
        .: {}
        f:endpoints: {}
        f:jobLabel: {}
        f:namespaceSelector:
          .: {}
          f:matchNames: {}
        f:selector:
          .: {}
          f:matchLabels:
            .: {}
            f:app: {}
    manager: gpu-operator
    operation: Update
    time: "2021-02-10T03:22:17Z"
  name: nvidia-dcgm-exporter
  namespace: gpu-operator-resources
  ownerReferences:
  - apiVersion: nvidia.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterPolicy
    name: gpu-cluster-policy
    uid: 74e1a5e7-ef8e-47ca-a821-68f7917a1a90
  resourceVersion: "288920"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/gpu-operator-resources/servicemonitors/nvidia-dcgm-exporter
  uid: 3b07abfe-2ebe-49a5-8496-1cf83be2039f
spec:
  endpoints:
  - bearerTokenSecret:
      key: ""
    path: /metrics
    port: gpu-metrics
  jobLabel: app
  namespaceSelector:
    matchNames:
    - gpu-operator-resources
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
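To rule out a console-only display issue, the platform monitoring stack can also be queried from the CLI through the thanos-querier route in openshift-monitoring. This is only a rough sketch; DCGM_FI_DEV_GPU_UTIL is just an example metric taken from the list above:

# TOKEN=$(oc whoami -t)
# HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
# curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL"

An empty result set here, while the port-forwarded exporter above returns data, would confirm the metrics never reach the cluster Prometheus.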
Today I installed the GPU Operator 1.4.0 on OCP 4.7.0-0.nightly-2021-01-21-215614, and the DCGM metrics are correctly exposed on the OpenShift Console Monitoring -> Metrics page, so the issue must come from version 1.5.* of the GPU Operator.
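A quick way to double-check which GPU Operator version a given cluster is actually running is to list the installed CSVs; a sketch, and the grep pattern may need adjusting to the exact CSV name:

> oc get csv -A | grep -i gpu-operator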
The issue comes from the label "openshift.io/cluster-monitoring=true" missing on the gpu-operator-resources namespace.

When installing from OperatorHub, NVIDIA's documentation says that the namespace should be created manually (step 3 in https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/#openshift-gpu-support-install-via-operatorhub), but it does not mention the namespace label.

Side note: when installing from Helm, the namespace is automatically created, and the right label is set.

---

The issue has been reported upstream: https://github.com/NVIDIA/gpu-operator/issues/151

The solution to this issue is to manually label the namespace:

> oc label ns/gpu-operator-resources openshift.io/cluster-monitoring=true
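Until the label is documented, anyone creating the namespace by hand can set it at creation time. A minimal sketch (namespace name as in the NVIDIA docs):

> cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator-resources
  labels:
    openshift.io/cluster-monitoring: "true"
EOF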
The issue has been fixed upstream (https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/199) and should be included in the next release of the GPU Operator (1.7).
Waiting on GPU Operator version 1.7 availability
Walid, FYI, I can confirm that the current master branch of the GPU Operator (soon to be cut into the 1.7 release) fixes this issue:

1. The Operator is deployed from the bundle:

> operator-sdk run bundle -n openshift-operators quay.io/openshift-psap/ci-artifacts:gpu-operator_bundle_latest

The only thing that currently has to be fixed is the Operator image (the 1.7.0 image referenced in the bundle isn't published yet):

> oc set image deployment/gpu-operator gpu-operator=quay.io/openshift-psap/ci-artifacts:gpu-operator_operator_latest -n openshift-operators

--> https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-psap-ci-artifacts-release-4.7-gpu-operator-e2e-master/1388267006574202880/artifacts/gpu-operator-e2e-master/nightly/artifacts/235858__gpu-operator__deploy_from_operatorhub/_ansible.log

2. The gpu-operator-resources namespace is automatically created by the operator as part of its state deployment steps, with the right labels:

> <command> oc get ns -l openshift.io/cluster-monitoring -oname | grep gpu-operator-resources
> <stdout> namespace/gpu-operator-resources

--> https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-psap-ci-artifacts-release-4.7-gpu-operator-e2e-master/1388267006574202880/artifacts/gpu-operator-e2e-master/nightly/artifacts/235858__gpu-operator__deploy_from_operatorhub/_ansible.log
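As an extra verification that Cluster Monitoring really picked up the ServiceMonitor, one option is to port-forward to a platform Prometheus pod and look for the exporter among the active targets. This is only a debugging sketch and assumes pod/prometheus-k8s-0 exists and its port 9090 is reachable through the port-forward:

> oc -n openshift-monitoring port-forward pod/prometheus-k8s-0 9090:9090 &
> curl -s 'http://localhost:9090/api/v1/targets?state=active' | grep -o nvidia-dcgm-exporter | sort -u

If nothing is printed, the platform Prometheus did not load the ServiceMonitor, which is exactly what the missing namespace label causes.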
I was able to deploy the GPU Operator v1.7.0 via the "operator-sdk run bundle" command, and the DCGM metrics show up and are available from the OpenShift Console Monitoring tab. This is on 4.8.0-0.nightly-2021-06-01-043518.

However, I am seeing a shorter list of DCGM metrics, with some metrics missing compared to previous GPU Operator versions (see the list in the Description section).

# ./get-dcgm-metrics.sh
Forwarding from 127.0.0.1:9401 -> 9400
Forwarding from [::1]:9401 -> 9400
Handling connection for 9401
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
# TYPE DCGM_FI_DEV_FB_FREE gauge
# TYPE DCGM_FI_DEV_FB_USED gauge
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge
# TYPE DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS counter
# TYPE DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS counter
# TYPE DCGM_FI_DEV_ROW_REMAP_FAILURE gauge

# oc describe ns gpu-operator-resources
Name:         gpu-operator-resources
Labels:       app.kubernetes.io/component=gpu-operator
              kubernetes.io/metadata.name=gpu-operator-resources
              olm.operatorgroup.uid/20a06096-466e-4c7f-bbfd-45281f81f500=
              openshift.io/cluster-monitoring=true
Annotations:  openshift.io/sa.scc.mcs: s0:c26,c0
              openshift.io/sa.scc.supplemental-groups: 1000650000/10000
              openshift.io/sa.scc.uid-range: 1000650000/10000
Status:       Active

No resource quota.

No LimitRange resource.
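To pin down exactly which fields disappeared, one option is to capture the "# TYPE" list from each operator version and diff them; a sketch, with placeholder file names:

> ./get-dcgm-metrics.sh | grep '^# TYPE' | sort > dcgm-metrics-v1.5.2.txt    # on the v1.5.2 cluster
> ./get-dcgm-metrics.sh | grep '^# TYPE' | sort > dcgm-metrics-v1.7.0.txt    # on the v1.7.0 cluster
> diff dcgm-metrics-v1.5.2.txt dcgm-metrics-v1.7.0.txt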
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438