Bug 1927118 - OCP 4.7: NVIDIA GPU Operator DCGM metrics not displayed in OpenShift Console Monitoring Metrics page
Summary: OCP 4.7: NVIDIA GPU Operator DCGM metrics not displayed in OpenShift Console Monitoring Metrics page
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ISV Operators
Version: 4.7
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Kevin Pouget
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-10 05:43 UTC by Walid A.
Modified: 2021-07-27 22:44 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:43:43 UTC
Target Upstream Version:
Embargoed:


Links:
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:44:17 UTC)

Description Walid A. 2021-02-10 05:43:15 UTC
Description of problem:
On an OCP 4.7.0-rc.0 IPI-installed cluster on AWS, with the NVIDIA GPU Operator installed via OperatorHub, we don't see the DCGM metrics in the OpenShift Console under the Monitoring -> Metrics tab.  These metrics were available on OCP 4.6.15 and 4.6.16.

The nvidia-dcgm-exporter appears to expose these metrics, but the OpenShift Console Monitoring -> Metrics page is not picking them up:

# cat get-dcgm-metrics.sh
#!/bin/bash

# Port-forward to the first nvidia-dcgm-exporter pod and list the DCGM metric
# names exposed on its /metrics endpoint.

DCGM_POD=$(oc get pods -lapp=nvidia-dcgm-exporter -oname -n gpu-operator-resources | head -1)
if [ -z "$DCGM_POD" ]; then
  echo "Failed to find a pod for nvidia-dcgm-exporter"
  exit 10
fi

DCGM_PORT=9400
LOCAL_PORT=9401
retry=5

# Forward the exporter port locally (in the background, for at most 10 seconds).
timeout 10 oc port-forward ${DCGM_POD} ${LOCAL_PORT}:${DCGM_PORT} -n gpu-operator-resources &

# Poll the forwarded endpoint until it answers, or give up after $retry attempts.
while [ -z "$DCGM_OUTPUT" ]; do
  sleep 1
  DCGM_OUTPUT=$(curl localhost:${LOCAL_PORT}/metrics 2>/dev/null)
  retry=$((retry - 1))
  if [[ $retry == 0 ]]; then
    echo "Failed to get any output from DCGM /metrics ..."
    exit 11
  fi
done

# Quote the variable so the newlines are preserved for grep.
grep "# TYPE DCGM_FI_DEV" <<< "${DCGM_OUTPUT}"


# ./get-dcgm-metrics.sh
Forwarding from 127.0.0.1:9401 -> 9400
Forwarding from [::1]:9401 -> 9400
Handling connection for 9401
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter
# TYPE DCGM_FI_DEV_PCIE_TX_THROUGHPUT counter
# TYPE DCGM_FI_DEV_PCIE_RX_THROUGHPUT counter
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
# TYPE DCGM_FI_DEV_POWER_VIOLATION counter
# TYPE DCGM_FI_DEV_THERMAL_VIOLATION counter
# TYPE DCGM_FI_DEV_SYNC_BOOST_VIOLATION counter
# TYPE DCGM_FI_DEV_BOARD_LIMIT_VIOLATION counter
# TYPE DCGM_FI_DEV_LOW_UTIL_VIOLATION counter
# TYPE DCGM_FI_DEV_RELIABILITY_VIOLATION counter
# TYPE DCGM_FI_DEV_FB_FREE gauge
# TYPE DCGM_FI_DEV_FB_USED gauge
# TYPE DCGM_FI_DEV_ECC_SBE_VOL_TOTAL counter
# TYPE DCGM_FI_DEV_ECC_DBE_VOL_TOTAL counter
# TYPE DCGM_FI_DEV_ECC_SBE_AGG_TOTAL counter
# TYPE DCGM_FI_DEV_ECC_DBE_AGG_TOTAL counter
# TYPE DCGM_FI_DEV_RETIRED_SBE counter
# TYPE DCGM_FI_DEV_RETIRED_DBE counter
# TYPE DCGM_FI_DEV_RETIRED_PENDING counter
# TYPE DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL counter
# TYPE DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL counter
# TYPE DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL counter
# TYPE DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL counter
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_L0 counter
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge

-------
Also, we do see the Service and ServiceMonitor in the gpu-operator-resources namespace (see the Additional info section).

Version-Release number of selected component (if applicable):
Server Version: 4.7.0-rc.0
Kubernetes Version: v1.20.0+ba45583
NVIDIA GPU Operator version 1.5.2

How reproducible:
Always

Steps to Reproduce:
1.  IPI install of OCP 4.7.0-rc.0 on AWS, with 3 workers and 3 masters
2.  Create a new MachineSet to add a g4dn.xlarge instance as a 4th worker (see the sketch after this list)
3.  Deploy the Node Feature Discovery (NFD) Operator from OperatorHub, in a new namespace, and create a NodeFeatureDiscoveries instance
4.  Set up cluster-wide entitlement following the procedure in: https://www.openshift.com/blog/how-to-use-entitled-image-builds-to-build-drivercontainers-with-ubi-on-openshift
5.  Create a new namespace called "gpu-operator-resources"
6.  From OpenShift Console, Operators -> OperatorHub, deploy the NVIDIA GPU operator.
7.  Create a ClusterPolicy instance
8.  Verify that the NVIDIA GPU Operator has been successfully deployed and that the nvidia-device-plugin, nvidia-dcgm-exporter, and nvidia-driver-daemonset pods, along with all the other pods in the NVIDIA stack, are running:

    # oc get pods -n gpu-operator-resources
    NAME                                       READY   STATUS      RESTARTS   AGE
    gpu-feature-discovery-qd5st                1/1     Running     0          110m
    nvidia-container-toolkit-daemonset-4zk2j   1/1     Running     0          114m
    nvidia-dcgm-exporter-7zsfl                 1/1     Running     0          111m
    nvidia-device-plugin-daemonset-slrgl       1/1     Running     0          111m
    nvidia-device-plugin-validation            0/1     Completed   0          111m
    nvidia-driver-daemonset-gr2p5              1/1     Running     0          114m

9. Run the gpu-burn workload in its own namespace (https://github.com/openshift-psap/gpu-burn):
oc create -f gpu-burn.yaml

10. Check that GPU utilization and GPU temperature stats are shown in the oc logs -f output of the gpu-burn daemonset pods

11. From OpenShift Console, Monitoring -> Metrics: check whether you can query for metrics starting with "DCGM"
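
For step 2, a minimal sketch of one way to add the GPU node (illustrative commands, not from the original report; the MachineSet name is a placeholder) is to copy an existing worker MachineSet and switch its instance type:

# List the existing worker MachineSets and pick one to copy:
oc get machineset -n openshift-machine-api
oc get machineset <existing-worker-machineset> -n openshift-machine-api -o yaml > gpu-machineset.yaml
# Edit gpu-machineset.yaml: give it a new name (update metadata.name, the selector,
# and the template labels that reference the MachineSet name), set spec.replicas to 1,
# and change spec.template.spec.providerSpec.value.instanceType to g4dn.xlarge.
oc create -f gpu-machineset.yaml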


Actual results:
No metrics starting with DCGM show up in the Metrics dropdown.

Expected results:
DCGM metrics should be available, and it should be possible to graph them while running a GPU workload.

Additional info:

# oc get Service -n gpu-operator-resources nvidia-dcgm-exporter -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "true"
  creationTimestamp: "2021-02-10T03:22:17Z"
  labels:
    app: nvidia-dcgm-exporter
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:prometheus.io/scrape: {}
        f:labels:
          .: {}
          f:app: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"74e1a5e7-ef8e-47ca-a821-68f7917a1a90"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:ports:
          .: {}
          k:{"port":9400,"protocol":"TCP"}:
            .: {}
            f:name: {}
            f:port: {}
            f:protocol: {}
            f:targetPort: {}
        f:selector:
          .: {}
          f:app: {}
        f:sessionAffinity: {}
        f:type: {}
    manager: gpu-operator
    operation: Update
    time: "2021-02-10T03:22:17Z"
  name: nvidia-dcgm-exporter
  namespace: gpu-operator-resources
  ownerReferences:
  - apiVersion: nvidia.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterPolicy
    name: gpu-cluster-policy
    uid: 74e1a5e7-ef8e-47ca-a821-68f7917a1a90
  resourceVersion: "288918"
  selfLink: /api/v1/namespaces/gpu-operator-resources/services/nvidia-dcgm-exporter
  uid: 46585c3b-625d-4b35-add8-c3d4f56cb162
spec:
  clusterIP: 172.30.35.190
  clusterIPs:
  - 172.30.35.190
  ports:
  - name: gpu-metrics
    port: 9400
    protocol: TCP
    targetPort: 9400
  selector:
    app: nvidia-dcgm-exporter
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}


# oc get ServiceMonitor -n gpu-operator-resources
NAME                   AGE
nvidia-dcgm-exporter   93m

# oc get ServiceMonitor -n gpu-operator-resources nvidia-dcgm-exporter -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2021-02-10T03:22:17Z"
  generation: 1
  labels:
    app: nvidia-dcgm-exporter
  managedFields:
  - apiVersion: monitoring.coreos.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:app: {}
        f:ownerReferences: {}
      f:spec:
        .: {}
        f:endpoints: {}
        f:jobLabel: {}
        f:namespaceSelector:
          .: {}
          f:matchNames: {}
        f:selector:
          .: {}
          f:matchLabels:
            .: {}
            f:app: {}
    manager: gpu-operator
    operation: Update
    time: "2021-02-10T03:22:17Z"
  name: nvidia-dcgm-exporter
  namespace: gpu-operator-resources
  ownerReferences:
  - apiVersion: nvidia.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterPolicy
    name: gpu-cluster-policy
    uid: 74e1a5e7-ef8e-47ca-a821-68f7917a1a90
  resourceVersion: "288920"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/gpu-operator-resources/servicemonitors/nvidia-dcgm-exporter
  uid: 3b07abfe-2ebe-49a5-8496-1cf83be2039f
spec:
  endpoints:
  - bearerTokenSecret:
      key: ""
    path: /metrics
    port: gpu-metrics
  jobLabel: app
  namespaceSelector:
    matchNames:
    - gpu-operator-resources
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter

Comment 1 Kevin Pouget 2021-02-12 10:37:52 UTC
Today I installed the GPU Operator 1.4.0 on OCP 4.7.0-0.nightly-2021-01-21-215614, and the DCGM metrics are correctly exposed in the OpenShift Console Monitoring -> Metrics page,
so the issue must come from version 1.5.* of the GPU Operator.

Comment 2 Kevin Pouget 2021-02-16 15:44:33 UTC
The issue comes from the label "openshift.io/cluster-monitoring=true" missing in the gpu-operator-resources namespace.
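
As a hedged illustration (not from the original comment) of why the label matters: the cluster monitoring Prometheus only discovers ServiceMonitors in namespaces matching its namespace selector, which on this release is expected to select namespaces labeled openshift.io/cluster-monitoring=true. Both sides can be checked with:

oc -n openshift-monitoring get prometheus k8s -o jsonpath='{.spec.serviceMonitorNamespaceSelector}{"\n"}'
oc get ns gpu-operator-resources --show-labels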

When installing from OperatorHub, NVIDIA's documentation says the namespace should be created manually

(step 3 in https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/#openshift-gpu-support-install-via-operatorhub)

but it does not mention the namespace label.

Side note: when installing from helm, the namespace is automatically created, and the right label is set.

---

The issue has been reported upstream: https://github.com/NVIDIA/gpu-operator/issues/151

The solution to this issue is to manually label the namespace:

> oc label ns/gpu-operator-resources openshift.io/cluster-monitoring=true
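
One hedged way to verify the workaround (illustrative commands, not from the original comment) is to wait a few minutes for the scrape to happen and then query a DCGM metric through the thanos-querier route that backs the Console:

TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
# A non-empty "result" array in the JSON response means the metric now reaches cluster monitoring.
curl -sk -H "Authorization: Bearer ${TOKEN}" "https://${HOST}/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL"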

Comment 3 Kevin Pouget 2021-04-07 14:27:34 UTC
The issue has been fixed upstream (https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/199), and the fix should be available in the next release of the GPU Operator (1.7).

Comment 4 Walid A. 2021-04-20 16:13:28 UTC
Waiting on GPU Operator version 1.7 availability

Comment 5 Kevin Pouget 2021-05-01 07:03:48 UTC
Walid, FYI I can confirm that the current master branch of the GPU Operator (soon to be cut into the 1.7 release) fixes this issue:

1. The Operator is deployed from the bundle:

> operator-sdk run bundle -n openshift-operators quay.io/openshift-psap/ci-artifacts:gpu-operator_bundle_latest

The only thing that currently has to be fixed is the Operator image (the 1.7.0 image used in the bundle isn't published yet):

> oc set image deployment/gpu-operator gpu-operator=quay.io/openshift-psap/ci-artifacts:gpu-operator_operator_latest -n openshift-operators

--> https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-psap-ci-artifacts-release-4.7-gpu-operator-e2e-master/1388267006574202880/artifacts/gpu-operator-e2e-master/nightly/artifacts/235858__gpu-operator__deploy_from_operatorhub/_ansible.log

2. The gpu-operator-resources namespace is automatically created by the operator as part of its state deployment steps, with the right labels:

>  <command> oc get ns -l openshift.io/cluster-monitoring -oname | grep gpu-operator-resources
>  <stdout> namespace/gpu-operator-resources

--> https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-psap-ci-artifacts-release-4.7-gpu-operator-e2e-master/1388267006574202880/artifacts/gpu-operator-e2e-master/nightly/artifacts/235858__gpu-operator__deploy_from_operatorhub/_ansible.log

Comment 6 Walid A. 2021-06-02 17:10:31 UTC
I was able to deploy the GPU Operator v1.7.0 via the operator-sdk run bundle command, and the DCGM metrics show up and are available from the OpenShift Console Monitoring tab.
This is on 4.8.0-0.nightly-2021-06-01-043518.

However, I am seeing a shorter list of DCGM metrics, with some metrics missing compared to previous GPU Operator versions (see the list in the Description section); a quick way to compare the two lists is sketched after the output below.

# ./get-dcgm-metrics.sh
Forwarding from 127.0.0.1:9401 -> 9400
Forwarding from [::1]:9401 -> 9400
Handling connection for 9401
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
# TYPE DCGM_FI_DEV_FB_FREE gauge
# TYPE DCGM_FI_DEV_FB_USED gauge
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge
# TYPE DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS counter
# TYPE DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS counter
# TYPE DCGM_FI_DEV_ROW_REMAP_FAILURE gauge
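
A quick way to pin down which fields disappeared (illustrative only, not from the original comment) is to save the metric names from each run and diff them:

# Keep only the metric names ("# TYPE <name> <type>" -> <name>) from the 1.7.0 run:
./get-dcgm-metrics.sh | awk '/^# TYPE DCGM_FI_DEV/ {print $3}' | sort > dcgm-metrics-1.7.txt
# Do the same on the 1.5.x cluster (or paste the list from the Description) into dcgm-metrics-1.5.txt, then:
diff dcgm-metrics-1.5.txt dcgm-metrics-1.7.txt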


# oc describe ns gpu-operator-resources
Name:         gpu-operator-resources
Labels:       app.kubernetes.io/component=gpu-operator
              kubernetes.io/metadata.name=gpu-operator-resources
              olm.operatorgroup.uid/20a06096-466e-4c7f-bbfd-45281f81f500=
              openshift.io/cluster-monitoring=true
Annotations:  openshift.io/sa.scc.mcs: s0:c26,c0
              openshift.io/sa.scc.supplemental-groups: 1000650000/10000
              openshift.io/sa.scc.uid-range: 1000650000/10000
Status:       Active

No resource quota.

No LimitRange resource.

Comment 9 errata-xmlrpc 2021-07-27 22:43:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

