Bug 2190241 - nfs metric details are unavailable and server health is displaying as "Degraded" under Network file system tab in UI
Summary: nfs metric details are unavailable and server health is displaying as "Degraded" under Network file system tab in UI
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Shachar Sharon
QA Contact: Elad
URL:
Whiteboard:
Depends On: 2208029
Blocks:
 
Reported: 2023-04-27 16:28 UTC by Amrita Mahapatra
Modified: 2023-08-09 17:00 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Cause: Missing ServiceMonitor configuration for the nfs-ganesha metrics exporter. Consequence: Prometheus monitoring did not scrape and collect nfs-ganesha metrics. Fix: Add the appropriate ServiceMonitor. Result: NFS-Ganesha metrics are now visible via Prometheus monitoring.
Clone Of:
Environment:
Last Closed: 2023-06-21 15:25:28 UTC
Embargoed:
ssharon: needinfo+




Links
System ID | Status | Summary | Last Updated
Github red-hat-storage ocs-operator pull 2047 | Merged | nfs: servicemonitor for nfs-metrics | 2023-05-08 14:12:07 UTC
Github red-hat-storage ocs-operator pull 2050 | open | Bug 2190241: [release-4.13] nfs: servicemonitor for nfs-metrics | 2023-05-08 14:12:07 UTC
Red Hat Product Errata RHBA-2023:3742 | | | 2023-06-21 15:25:44 UTC

Description Amrita Mahapatra 2023-04-27 16:28:05 UTC
Description of problem (please be as detailed as possible and provide log snippets):
NFS metric details are unavailable in the Network file system tab in the UI. Even the empty graphs keep disappearing from time to time, and the server health is displayed as "Degraded" with an error icon and no further details.

However, if we manually create a ServiceMonitor so that Prometheus starts scraping the metrics from this endpoint, for example:
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: rook-ceph-nfs-metrics
    app.kubernetes.io/part-of: rook-ceph-nfs
  name: rook-ceph-nfs-metrics
  namespace: openshift-storage
spec:
  namespaceSelector:
    matchNames:
      - openshift-storage
  endpoints:
    - interval: 1m
      port: nfs-metrics
      path: /metrics
      scheme: http
  selector:
    matchLabels:
      app: rook-ceph-nfs
      app.kubernetes.io/component: cephnfses.ceph.rook.io
      app.kubernetes.io/created-by: rook-ceph-operator
      app.kubernetes.io/managed-by: rook-ceph-operator
      app.kubernetes.io/name: ceph-nfs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rook-ceph-nfs-metrics
  namespace: openshift-storage
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rook-ceph-nfs-metrics
  namespace: openshift-storage
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rook-ceph-nfs-metrics
subjects:
  - kind: ServiceAccount
    name: prometheus-k8s
    namespace: openshift-monitoring
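
As a minimal sketch, the three manifests above can be saved to a single file (nfs-metrics-servicemonitor.yaml is only an illustrative name) and then applied and checked with:

$ oc apply -f nfs-metrics-servicemonitor.yaml
$ oc get servicemonitor rook-ceph-nfs-metrics -n openshift-storage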


Then we are able to fetch the metrics directly from the Prometheus server.

Example:
$ oc exec -c prometheus prometheus-k8s-0 -n openshift-monitoring --  curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq -M | grep exposer
    "exposer_request_latencies",
    "exposer_request_latencies_count",
    "exposer_request_latencies_sum",
    "exposer_scrapes_total",
    "exposer_transferred_bytes_total",


$ oc exec -c prometheus prometheus-k8s-0 -n openshift-monitoring --  curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq -M | grep nfs_bytes
    "nfs_bytes_received_by_export_total",
    "nfs_bytes_received_total",
    "nfs_bytes_sent_by_export_total",
    "nfs_bytes_sent_total",

Version of all relevant components (if applicable):
ocs version: 4.13.0-175
OCP version: 4.13.0-0.nightly-2023-04-21-084440

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge? To my knowledge there is no workaround to be able to see the details in the UI.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 3


Is this issue reproducible? yes


Can this issue be reproduced from the UI? yes


If this is a regression, please provide more details to justify this: NA


Steps to Reproduce:
1. Enable the NFS feature from the CLI using the patch commands below, or from the UI while creating the StorageSystem.

oc patch -n openshift-storage storageclusters.ocs.openshift.io ocs-storagecluster --patch '{"spec": {"nfs":{"enable": true}}}' --type merge

oc patch cm rook-ceph-operator-config -n openshift-storage -p $'data:\n "ROOK_CSI_ENABLE_NFS":  "true"'

2. Verify that the NFS-Ganesha server is up and running (a quick CLI verification sketch follows these steps).

3. Check that the StorageSystems --> StorageSystem details dashboard shows the Network file system tab with the NFS server details.

4. Create nfs PVCs using storageclass: ocs-storagecluster-ceph-nfs

5. Mount the NFS PVC in-cluster and out-of-cluster, and check the server status, throughput, and performance details under the Network file system tab.
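
A minimal CLI sketch for verifying steps 2 and 3; the label selector below is borrowed from the ServiceMonitor manifest above, so adjust it if the pod/service labels differ in a given build:

$ oc get cephnfs -n openshift-storage                      # the CephNFS custom resource should exist
$ oc get pods -n openshift-storage -l app=rook-ceph-nfs    # the NFS-Ganesha server pod(s) should be Running
$ oc get svc -n openshift-storage -l app=rook-ceph-nfs     # the service that exposes the nfs-metrics port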


Actual results:
NFS metric details are unavailable in the Network file system tab in the UI, and the server health is displayed as "Degraded" with an error icon and no further details.

Expected results:
NFS metric details should be available in the Network file system tab in the UI when the NFS feature is enabled, and appropriate server health details should be displayed.



Additional info:

Comment 11 errata-xmlrpc 2023-06-21 15:25:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

