Bug 2190241

Summary: NFS metric details are unavailable and server health is displayed as "Degraded" under the Network file system tab in the UI
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Amrita Mahapatra <ammahapa>
Component: ocs-operator    Assignee: Shachar Sharon <ssharon>
Status: CLOSED ERRATA QA Contact: Elad <ebenahar>
Severity: high Docs Contact:
Priority: unspecified    
Version: unspecified    CC: kramdoss, ocs-bugs, odf-bz-bot, skatiyar, ssharon
Target Milestone: ---    Flags: ssharon: needinfo+
Target Release: ODF 4.13.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Cause: Missing servicemonitor configuration for nfs-ganesha metrics exporter. Consequence: Prometheus monitoring did not scrape and collect nfs-ganesha metrics. Fix: Add appropriate servicemonitor. Result: NFS-Ganesha metrics are now visible via Prometheus monitoring.
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-06-21 15:25:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2208029    
Bug Blocks:    

Description Amrita Mahapatra 2023-04-27 16:28:05 UTC
Description of problem (please be as detailed as possible and provide log
snippets): NFS metric details are unavailable in the Network file system tab in the UI. Even the empty graphs keep disappearing from time to time, and the server health is displayed as "Degraded" with an error sign and no further details.

However, if we manually create a ServiceMonitor so that Prometheus starts scraping metrics from this endpoint, for example:
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: rook-ceph-nfs-metrics
    app.kubernetes.io/part-of: rook-ceph-nfs
  name: rook-ceph-nfs-metrics
  namespace: openshift-storage
spec:
  namespaceSelector:
    matchNames:
      - openshift-storage
  endpoints:
    - interval: 1m
      port: nfs-metrics
      path: /metrics
      scheme: http
  selector:
    matchLabels:
      app: rook-ceph-nfs
      app.kubernetes.io/component: cephnfses.ceph.rook.io
      app.kubernetes.io/created-by: rook-ceph-operator
      app.kubernetes.io/managed-by: rook-ceph-operator
      app.kubernetes.io/name: ceph-nfs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rook-ceph-nfs-metrics
  namespace: openshift-storage
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rook-ceph-nfs-metrics
  namespace: openshift-storage
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rook-ceph-nfs-metrics
subjects:
  - kind: ServiceAccount
    name: prometheus-k8s
    namespace: openshift-monitoring
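
Once applied (for example with oc apply), Prometheus picks up the new scrape target. A minimal sketch, assuming the three manifests above are saved to a hypothetical file nfs-metrics-servicemonitor.yaml:

$ oc apply -f nfs-metrics-servicemonitor.yaml
$ oc get servicemonitor rook-ceph-nfs-metrics -n openshift-storage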


Then we are able to fetch the metrics directly from the Prometheus server.

Example:
$ oc exec -c prometheus prometheus-k8s-0 -n openshift-monitoring --  curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq -M | grep exposer
    "exposer_request_latencies",
    "exposer_request_latencies_count",
    "exposer_request_latencies_sum",
    "exposer_scrapes_total",
    "exposer_transferred_bytes_total",


$ oc exec -c prometheus prometheus-k8s-0 -n openshift-monitoring --  curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq -M | grep nfs_bytes
    "nfs_bytes_received_by_export_total",
    "nfs_bytes_received_total",
    "nfs_bytes_sent_by_export_total",
    "nfs_bytes_sent_total",

Version of all relevant components (if applicable):
ocs version: 4.13.0-175
OCP version: 4.13.0-0.nightly-2023-04-21-084440

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge? There is no workaround, to my knowledge, to be able to see the details in the UI.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 3


Can this issue reproducible? yes


Can this issue reproduce from the UI? yes


If this is a regression, please provide more details to justify this: NA


Steps to Reproduce:
1. Enable the NFS feature from the CLI using the patch commands below, or from the UI while creating the storage system.

oc patch -n openshift-storage storageclusters.ocs.openshift.io ocs-storagecluster --patch '{"spec": {"nfs":{"enable": true}}}' --type merge

oc patch cm rook-ceph-operator-config -n openshift-storage -p $'data:\n "ROOK_CSI_ENABLE_NFS":  "true"'

2. Verify that the NFS-Ganesha server is up and running.

3. Check that the Network file system tab is displayed with NFS server details under the StorageSystems --> StorageSystem details dashboard.

4. Create NFS PVCs using the storageclass ocs-storagecluster-ceph-nfs (a sample manifest is sketched after this list).

5. Mount the NFS PVC in-cluster and out-of-cluster, and check the server status, throughput, and performance details under the Network file system tab.
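
A minimal sketch of an NFS PVC and an in-cluster pod mounting it, for steps 4 and 5. Only the storageclass name comes from the steps above; the resource names, namespace, size, and image are illustrative:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc-1            # illustrative name
  namespace: openshift-storage
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: ocs-storagecluster-ceph-nfs
---
apiVersion: v1
kind: Pod
metadata:
  name: nfs-test-pod         # illustrative name
  namespace: openshift-storage
spec:
  containers:
    - name: app
      image: registry.access.redhat.com/ubi8/ubi-minimal
      command: ["sleep", "infinity"]
      volumeMounts:
        - mountPath: /mnt/nfs
          name: nfs-vol
  volumes:
    - name: nfs-vol
      persistentVolumeClaim:
        claimName: nfs-pvc-1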


Actual results:
NFS metric details are unavailable in the Network file system tab in the UI, and the server health is displayed as "Degraded" with an error sign and no further details.

Expected results:
NFS metric details should be available in the Network file system tab in the UI when the NFS feature is enabled, and appropriate server health details should be displayed.



Additional info:

Comment 11 errata-xmlrpc 2023-06-21 15:25:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742