Bug 2190241 - nfs metric details are unavailable and server health is displaying as "Degraded" under Network file system tab in UI
Summary: nfs metric details are unavailable and server health is displaying as "Degraded" under Network file system tab in UI
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Shachar Sharon
QA Contact: Elad
URL:
Whiteboard:
Depends On: 2208029
Blocks:
 
Reported: 2023-04-27 16:28 UTC by Amrita Mahapatra
Modified: 2023-08-09 17:00 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Cause: Missing ServiceMonitor configuration for the nfs-ganesha metrics exporter. Consequence: Prometheus monitoring did not scrape and collect nfs-ganesha metrics. Fix: Add the appropriate ServiceMonitor. Result: NFS-Ganesha metrics are now visible via Prometheus monitoring.
Clone Of:
Environment:
Last Closed: 2023-06-21 15:25:28 UTC
Embargoed:
ssharon: needinfo+




Links
System ID | Status | Summary | Last Updated
Github red-hat-storage ocs-operator pull 2047 | Merged | nfs: servicemonitor for nfs-metrics | 2023-05-08 14:12:07 UTC
Github red-hat-storage ocs-operator pull 2050 | open | Bug 2190241: [release-4.13] nfs: servicemonitor for nfs-metrics | 2023-05-08 14:12:07 UTC
Red Hat Product Errata RHBA-2023:3742 | | | 2023-06-21 15:25:44 UTC

Description Amrita Mahapatra 2023-04-27 16:28:05 UTC
Description of problem (please be as detailed as possible and provide log snippets):
NFS metric details are unavailable in the Network file system tab in the UI. Even the empty graphs keep disappearing from time to time, and the server health is displayed as "Degraded" with an error icon and no further details.

However, if we manually create a ServiceMonitor so that Prometheus starts scraping the metrics from this endpoint, for example:
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: rook-ceph-nfs-metrics
    app.kubernetes.io/part-of: rook-ceph-nfs
  name: rook-ceph-nfs-metrics
  namespace: openshift-storage
spec:
  namespaceSelector:
    matchNames:
      - openshift-storage
  endpoints:
    - interval: 1m
      port: nfs-metrics
      path: /metrics
      scheme: http
  selector:
    matchLabels:
      app: rook-ceph-nfs
      app.kubernetes.io/component: cephnfses.ceph.rook.io
      app.kubernetes.io/created-by: rook-ceph-operator
      app.kubernetes.io/managed-by: rook-ceph-operator
      app.kubernetes.io/name: ceph-nfs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rook-ceph-nfs-metrics
  namespace: openshift-storage
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rook-ceph-nfs-metrics
  namespace: openshift-storage
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rook-ceph-nfs-metrics
subjects:
  - kind: ServiceAccount
    name: prometheus-k8s
    namespace: openshift-monitoring
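
As a minimal sketch, the three manifests above can be saved to a single file (nfs-metrics-servicemonitor.yaml is only an illustrative name) and then applied and checked with:

$ oc apply -f nfs-metrics-servicemonitor.yaml
$ oc get servicemonitor rook-ceph-nfs-metrics -n openshift-storage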


Then we are able to fetch the metrics directly from the Prometheus server.

Example:
$ oc exec -c prometheus prometheus-k8s-0 -n openshift-monitoring --  curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq -M | grep exposer
    "exposer_request_latencies",
    "exposer_request_latencies_count",
    "exposer_request_latencies_sum",
    "exposer_scrapes_total",
    "exposer_transferred_bytes_total",


$ oc exec -c prometheus prometheus-k8s-0 -n openshift-monitoring --  curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq -M | grep nfs_bytes
    "nfs_bytes_received_by_export_total",
    "nfs_bytes_received_total",
    "nfs_bytes_sent_by_export_total",
    "nfs_bytes_sent_total",

Version of all relevant components (if applicable):
ocs version: 4.13.0-175
OCP version: 4.13.0-0.nightly-2023-04-21-084440

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge? To my knowledge there is no workaround to be able to see the details in the UI.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 3


Is this issue reproducible? yes


Can this issue be reproduced from the UI? yes


If this is a regression, please provide more details to justify this: NA


Steps to Reproduce:
1. Enable the NFS feature from the CLI using the patch commands below, or from the UI while creating the StorageSystem.

oc patch -n openshift-storage storageclusters.ocs.openshift.io ocs-storagecluster --patch '{"spec": {"nfs":{"enable": true}}}' --type merge

oc patch cm rook-ceph-operator-config -n openshift-storage -p $'data:\n "ROOK_CSI_ENABLE_NFS":  "true"'

2. Verify that the NFS-Ganesha server is up and running (a quick CLI verification sketch follows these steps).

3. Check that the StorageSystems --> StorageSystem details dashboard shows the Network file system tab with the NFS server details.

4. Create nfs PVCs using storageclass: ocs-storagecluster-ceph-nfs

5. Mount the NFS PVC in-cluster and out-of-cluster, and check the server status, throughput, and performance details under the Network file system tab.
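
A minimal CLI sketch for verifying steps 2 and 3; the label selector below is borrowed from the ServiceMonitor manifest above, so adjust it if the pod/service labels differ in a given build:

$ oc get cephnfs -n openshift-storage                      # the CephNFS custom resource should exist
$ oc get pods -n openshift-storage -l app=rook-ceph-nfs    # the NFS-Ganesha server pod(s) should be Running
$ oc get svc -n openshift-storage -l app=rook-ceph-nfs     # the service that exposes the nfs-metrics port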


Actual results:
NFS metric details are unavailable in the Network file system tab in the UI, and the server health is displayed as "Degraded" with an error icon and no further details.

Expected results:
NFS metric details should be available in the Network file system tab in the UI when the NFS feature is enabled, and appropriate server health details should be displayed.



Additional info:

Comment 11 errata-xmlrpc 2023-06-21 15:25:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

