Created attachment 2007833 [details]
alert

Description of problem (please be detailed as possible and provide log snippets):

The MDSCPUUsageHigh alert is fired when a Ceph metadata server pod (rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-8494565f5zs9m) has high CPU usage, but the alert does not provide clear instructions or steps to take in response. The alert should include a call to action, providing either steps to increase the number of active metadata servers or a link to documentation on what to do when the MDSCPUUsageHigh alert is received.

Alert: Ceph metadata server pod (rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-8494565f5zs9m) has high cpu usage. Please consider increasing the number of active metadata servers, it can be done by increasing the number of activeMetadataServers parameter in the StorageCluster CR.

Name: MDSCPUUsageHigh
Severity: Warning
Message: Ceph metadata server pod (rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-8494565f5zs9m) has high cpu usage

Version of all relevant components (if applicable):
ODF: 4.15.0-104.stable
OCP: 4.15.0-0.nightly-2024-01-06-062415

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible? Yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Create a 3-master, 3-worker OCP cluster [BM NVMe platform] and install ODF on it.
2. Create two or more CephFS PVCs with RWX access mode.
3. Run file-creator pods with a higher number of threads [e.g. 80]; the CPU load will go high and the alert will fire.
4. Go to the dashboard; the alert will be received, but it will not contain any instructions to perform the required action.

Actual results:
The MDSCPUUsageHigh alert is lacking a call to action.

Expected results:
The alert should include a call to action, providing either steps to increase the number of active metadata servers or a link to documentation on what to do when the MDSCPUUsageHigh alert is received.

Additional info:
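For triage, a minimal diagnostic sketch of how MDS pod CPU usage can be compared against the pod's CPU limit while this alert is firing. The namespace and the app=rook-ceph-mds label are assumptions based on a default ODF deployment, not something stated in the alert itself.

```bash
# Hedged diagnostic sketch (not part of the alert): compare MDS pod CPU usage
# against the configured CPU limit. Assumes the default openshift-storage
# namespace and the Rook label app=rook-ceph-mds; adjust if your deployment differs.
oc adm top pods -n openshift-storage -l app=rook-ceph-mds

# Show each MDS pod's configured CPU limit for comparison with the usage above.
oc get pods -n openshift-storage -l app=rook-ceph-mds \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.limits.cpu}{"\n"}{end}'
```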
*** This bug has been marked as a duplicate of bug 2256725 ***
Reopening this, as it is different from the bug mentioned above: this one is about CPU usage, while that one is about cache usage.
Hi Manish, any plans for this BZ in 4.15? Otherwise we can move it out to 4.16.
It should be fixed in a similar fashion to https://bugzilla.redhat.com/show_bug.cgi?id=2256725. Providing devel ack.
Verified with the below versions; the issue still persists. I didn't find any actionable link provided in the alert, please check once. Please find the attached screenshot for the same.

OCP: 4.15.0-0.nightly-2024-02-16-235514
ODF: 4.15.0-144
Increasing the severity to High as it is failing one of our test cases and blocking us from verifying the actionable link.
The rule was modified to tune the time for the test requirement. Used the latest Prometheus YAML file and retested; I am able to see the actionable link. Thanks!

One query regarding the feature: according to this feature, we are suggesting that the customer increase the number of MDS pods, right? Do we need to suggest a CPU increment as well?

RHSTOR-3865: Alert when CephFS MDS scaling is needed - More MDS pods are required --> Goal: Improve customer experience and alert if MDS scaling is needed

The below information was found in the article linked in the alert: we need to either increase the allocated CPU or run multiple active MDS. The below command describes how to set the amount of allocated CPU for the MDS server.

oc patch -n openshift-storage storagecluster ocs-storagecluster \
  --type merge \
  --patch '{"spec": {"resources": {"mds": {"limits": {"cpu": "8"}, "requests": {"cpu": "8"}}}}}'

In order to run multiple active MDS servers, use the below command. Make sure we have enough CPU provisioned for MDS depending on the load.

```bash
oc patch -n openshift-storage cephfilesystem ocs-storagecluster-cephfilesystem \
  --type merge \
  --patch '{"spec": {"metadataServer": {"activeCount": 2}}}'
```
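For completeness, a hedged way to confirm that the two patches above actually landed. The field paths are taken from the patch payloads above; the label used to list MDS pods is an assumption based on Rook defaults.

```bash
# Confirm the MDS CPU requests/limits set on the StorageCluster by the first patch.
oc get storagecluster ocs-storagecluster -n openshift-storage \
  -o jsonpath='{.spec.resources.mds}{"\n"}'

# Confirm the MDS pods after the second patch (expect more pods once activeCount
# is raised; label per Rook defaults, adjust if your deployment differs).
oc get pods -n openshift-storage -l app=rook-ceph-mds
```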
I have given steps to do both; it's their choice to choose one of the two remedies.
(In reply to Manish Yathnalli from comment #17)
> I have given steps to do both; it's their choice to choose one of the two
> remedies.

Eran and Bipin,

When the MDSCPUUsageHigh alert is received, the actionable link provides steps to either increase the allocated CPU by editing the ocs-storagecluster, or to run multiple active MDS by patching the cephfilesystem, increasing the metadataServer activeCount to 2.

I followed the steps in https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHigh.md to run multiple active metadata servers and observed that two active and two standby-replay MDS daemons were present after the scale up. Of the two active MDS, one was stopped very soon, and there was no load sharing between the two active MDS pods; the load continued only on the single active MDS's CPU. Hence, the alert still remains in the firing state.

Before scale up of MDS pods:

sh-5.1$ ceph fs status
ocs-storagecluster-cephfilesystem - 3 clients
=================================
RANK  STATE           MDS                                   ACTIVITY       DNS    INOS   DIRS  CAPS
 0    active          ocs-storagecluster-cephfilesystem-a   Reqs:  807 /s  2530k  2531k  1299  2065k
 0-s  standby-replay  ocs-storagecluster-cephfilesystem-b   Evts: 4529 /s  2862k  2862k  1288      0
                  POOL                        TYPE      USED   AVAIL
ocs-storagecluster-cephfilesystem-metadata    metadata  10.1G  742G
ocs-storagecluster-cephfilesystem-data0       data      62.2G  742G
MDS version: ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)

After scale up of MDS pods using the patch command given in the document provided above:

sh-5.1$ date
Wed Feb 21 06:22:11 UTC 2024
sh-5.1$ ceph fs status
ocs-storagecluster-cephfilesystem - 4 clients
=================================
RANK  STATE           MDS                                   ACTIVITY       DNS    INOS   DIRS  CAPS
 0    active          ocs-storagecluster-cephfilesystem-a   Reqs: 1668 /s  2536k  2536k  1299  2051k
 1    active          ocs-storagecluster-cephfilesystem-d   Reqs:    0 /s     12     15    13      3
 0-s  standby-replay  ocs-storagecluster-cephfilesystem-b   Evts: 9454 /s  2862k  2862k  1288      0
 1-s  standby-replay  ocs-storagecluster-cephfilesystem-c   Evts:    0 /s      0      3     1      0
                  POOL                        TYPE      USED   AVAIL
ocs-storagecluster-cephfilesystem-metadata    metadata  10.2G  741G
ocs-storagecluster-cephfilesystem-data0       data      63.8G  741G
MDS version: ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)

Waited for a few minutes; no load sharing between the active MDS. Found that one active MDS stopped and went back to standby.
sh-5.1$ date
Wed Feb 21 06:26:24 UTC 2024
sh-5.1$ ceph fs status
ocs-storagecluster-cephfilesystem - 3 clients
=================================
RANK  STATE           MDS                                   ACTIVITY       DNS    INOS   DIRS  CAPS
 0    active          ocs-storagecluster-cephfilesystem-a   Reqs: 1256 /s  2536k  2536k  1299  2064k
 0-s  standby-replay  ocs-storagecluster-cephfilesystem-b   Evts: 2888 /s  2864k  2864k  1288      0
                  POOL                        TYPE      USED   AVAIL
ocs-storagecluster-cephfilesystem-metadata    metadata  10.7G  740G
ocs-storagecluster-cephfilesystem-data0       data      67.4G  740G
STANDBY MDS
ocs-storagecluster-cephfilesystem-d
ocs-storagecluster-cephfilesystem-c
MDS version: ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)

------------------------------------------------

Below are the Ceph status logs monitored throughout the MDS scale-up procedure:

sh-5.1$ ceph -s -w
  cluster:
    id:     23416e5d-5223-492f-89e0-eefdebcb0193
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 12h)
    mgr: a(active, since 45h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 12h), 3 in (since 45h)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 5.43M objects, 17 GiB
    usage:   144 GiB used, 2.6 TiB / 2.7 TiB avail
    pgs:     169 active+clean

  io:
    client: 25 MiB/s rd, 17 MiB/s wr, 129 op/s rd, 6.74k op/s wr

2024-02-21T06:20:49.191293+0000 mon.a [WRN] Health check failed: 1 filesystem is online with fewer MDS than max_mds (MDS_UP_LESS_THAN_MAX)
2024-02-21T06:20:49.203774+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-d assigned to filesystem ocs-storagecluster-cephfilesystem as rank 1 (now has 2 ranks)
2024-02-21T06:20:49.203829+0000 mon.a [INF] Health check cleared: MDS_UP_LESS_THAN_MAX (was: 1 filesystem is online with fewer MDS than max_mds)
2024-02-21T06:20:49.203838+0000 mon.a [INF] Cluster is now healthy
2024-02-21T06:20:49.295776+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-d is now active in filesystem ocs-storagecluster-cephfilesystem as rank 1
2024-02-21T06:24:12.136660+0000 mon.a [INF] stopping daemon mds.ocs-storagecluster-cephfilesystem-d
2024-02-21T06:24:29.174132+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-d finished stopping rank 1 in filesystem ocs-storagecluster-cephfilesystem (now has 1 ranks)
2024-02-21T06:30:00.000141+0000 mon.a [INF] overall HEALTH_OK

Currently, I'm unsure if we have been recommending this procedure to customers to increase the active MDS count. Do you think we should include this procedure in the actionable link? Please share your thoughts.
Manish,

Of the two solutions provided by you in https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHigh.md, only the CPU increment is working fine: load sharing happens once the CPU increment is done, and the alert disappears.

The second solution, MDS pod scale-up, is not working as expected. Load sharing does not happen across the active MDS pods when the scale up is done. Observed that only one active MDS is available for a few minutes after the scale up; some time later, say 20 to 30 minutes, there is no active MDS available at all and the MDS daemon stays in the rejoin state forever. This observation is already captured in Comment 18.

@Manish, the MDS scale-up procedure does not appear to work and looks like a blocker for 4.15. Based on our test results, QE can agree with only the CPU increment.
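To make the observation above easier to reproduce, a rough diagnostic sketch run from the rook-ceph-tools pod, checking whether max_mds still reflects the scale up and what state each MDS rank is in. The namespace and the app=rook-ceph-tools label are assumptions based on a default ODF deployment.

```bash
# Hedged diagnostic sketch for the scale-up problem described above.
# Assumes the default openshift-storage namespace and the Rook tools pod label
# app=rook-ceph-tools; adjust if your deployment differs.
TOOLS=$(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)

# max_mds should stay at 2 after the scale up; if it drops back to 1, the
# CephFilesystem CR was likely reconciled back by the operator (see the later
# comments about patching via the StorageCluster instead).
oc rsh -n openshift-storage "$TOOLS" ceph fs get ocs-storagecluster-cephfilesystem | grep max_mds

# Watch the per-rank MDS states (active / standby-replay / rejoin) over time.
oc rsh -n openshift-storage "$TOOLS" ceph fs status ocs-storagecluster-cephfilesystem
```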
Yes, it's exposed here: https://github.com/red-hat-storage/ocs-operator/blob/f8a0c2c9fc43de45527a5ef892d682fa5e98f5c2/api/v1/storagecluster_types.go#L234

It can be done with oc patch like this:

oc patch storagecluster ocs-storagecluster -n openshift-storage --type json \
  --patch '[{ "op": "replace", "path": "/spec/managedResources/cephFilesystems/activeMetadataServers", "value": <> }]'
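As a small follow-up, assuming the patch above has been applied with an actual value, a hedged way to read the setting back and confirm the operator propagates it down to the CephFilesystem:

```bash
# Read the value back from the StorageCluster (path matches the patch above).
oc get storagecluster ocs-storagecluster -n openshift-storage \
  -o jsonpath='{.spec.managedResources.cephFilesystems.activeMetadataServers}{"\n"}'

# The operator should reconcile this into the CephFilesystem's metadataServer.activeCount.
oc get cephfilesystem ocs-storagecluster-cephfilesystem -n openshift-storage \
  -o jsonpath='{.spec.metadataServer.activeCount}{"\n"}'
```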
Yes Subham, directly modifying the Filesystem CR won't work as it will just be reconciled back. You have to patch via the storagecluster CR only, so the command should be changed in the runbook example.
Based on Comment 18, moving it back to Assigned state.
https://github.com/openshift/runbooks/pull/167
> This command is adding 1 active and 1 standby-replay MDS daemon, so there will be a total of 2 active and 2 standby-replay MDS after this patch command. But no load sharing is happening on the newly added MDS pods. Attached a snapshot of the metrics to this BZ.

In 4.15 clusters we have default CSI subvolume group pinning enabled with default settings, so we should see the load being shared. @Patrick, would you like to share some thoughts?

Nagendra, can you share the output of `oc get CephFilesystemSubVolumeGroup -o yaml`?
Created a separate BZ https://bugzilla.redhat.com/show_bug.cgi?id=2265987 for Comment 29.

@Manish, you can edit the document and remove the instructions for MDS scale up. We don't recommend suggesting MDS scale up officially until it is tested internally in an ODF environment.
(In reply to Parth Arora from comment #31)
> > This command is adding 1 active and 1 standby-replay MDS daemon, so there
> > will be a total of 2 active and 2 standby-replay MDS after this patch
> > command. But no load sharing is happening on the newly added MDS pods.
> > Attached a snapshot of the metrics to this BZ.
>
> In 4.15 clusters we have default CSI subvolume group pinning enabled with
> default settings, so we should see the load being shared. @Patrick, would
> you like to share some thoughts?
>
> Nagendra, can you share the output of `oc get CephFilesystemSubVolumeGroup -o yaml`?

Load sharing happened 7 hours after the scale up; created a separate BZ [comment 32] for that. We can discuss the load sharing issue on the new BZ. Let's close this BZ with the document modification.

sh-5.1$ ceph fs status
ocs-storagecluster-cephfilesystem - 2 clients
=================================
RANK  STATE           MDS                                   ACTIVITY       DNS    INOS  DIRS  CAPS
 0    active          ocs-storagecluster-cephfilesystem-c   Reqs: 1229 /s  7098   1459    54  1446
 1    active          ocs-storagecluster-cephfilesystem-d   Reqs:    0 /s    15     18    15     1
 0-s  standby-replay  ocs-storagecluster-cephfilesystem-b   Evts: 1370 /s  89.7k  1568    54     0
 1-s  standby-replay  ocs-storagecluster-cephfilesystem-a   Evts:    0 /s     5      8     5     0
                  POOL                        TYPE      USED   AVAIL
ocs-storagecluster-cephfilesystem-metadata    metadata  22.0G  707G
ocs-storagecluster-cephfilesystem-data0       data      60.3G  707G
MDS version: ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
sh-5.1$ exit
exit

oc get CephFilesystemSubVolumeGroup -o yaml
apiVersion: v1
items:
- apiVersion: ceph.rook.io/v1
  kind: CephFilesystemSubVolumeGroup
  metadata:
    creationTimestamp: "2024-02-22T06:08:51Z"
    finalizers:
    - cephfilesystemsubvolumegroup.ceph.rook.io
    generation: 1
    name: ocs-storagecluster-cephfilesystem-csi
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ocs.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: StorageCluster
      name: ocs-storagecluster
      uid: 8a7362b4-3284-4ea8-84da-2ead62d72179
    resourceVersion: "375147"
    uid: aabedaf8-8811-4dcc-9977-faa2f16f4b30
  spec:
    filesystemName: ocs-storagecluster-cephfilesystem
    name: csi
    pinning:
      distributed: 1
  status:
    info:
      clusterID: 5bb69c306a7d011c3e91c3cec112fb7a
    observedGeneration: 1
    phase: Ready
kind: List
metadata:
  resourceVersion: ""
We are removing the instructions to increase MDS for 4.15.

Harish, as discussed, please create a BZ for 4.16 to fix the MDS scale up/down part.
(In reply to Mudit Agarwal from comment #34) > We are removing the instructions to increase MDS for 4.15 > > Harish, as discussed please create a BZ for 4.16 to fix the MDS scale > up/down part. Mudit, Please find the BZ for MDS scale up https://bugzilla.redhat.com/show_bug.cgi?id=2265987
Verified with 4.15.0-150. Yes, the document is modified and only the CPU increment is suggested. But some important changes need to be made, as described below.

1. Please modify the alert description in the Prometheus rules to suggest the CPU increment, not an increase in the number of metadata servers (a quick way to check the deployed wording is sketched after this list):

- alert: MDSCPUUsageHigh
  annotations:
    description: |-
      Ceph metadata server pod ({{ $labels.pod }}) has high cpu usage. Please consider increasing the number of active metadata servers, it can be done by increasing the number of activeMetadataServers parameter in the StorageCluster CR.
    message: Ceph metadata server pod ({{ $labels.pod }}) has high cpu usage
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHigh.md
    severity_level: warning
  expr: |
    pod:container_cpu_usage:sum{pod=~"rook-ceph-mds.*"}/ on(pod) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-mds.*"} > 0.67
  for: 6h
  labels:
    severity: warning

2. Please remove "or run multiple active metadata servers" from the Impact section in https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHigh.md
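As referenced in item 1 above, a hedged way to pull the currently deployed wording out of the cluster and confirm whether the description was changed. The exact PrometheusRule object name varies between ODF versions, so this greps across all rules in the namespace; the namespace and the grep window are assumptions.

```bash
# Hedged check: dump the deployed PrometheusRule objects in the ODF namespace and
# show the MDSCPUUsageHigh block so the shipped description can be compared with
# the requested wording.
oc get prometheusrules -n openshift-storage -o yaml | grep -A 8 'alert: MDSCPUUsageHigh'
```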
See: https://bugzilla.redhat.com/show_bug.cgi?id=2265987#c3
Verified with fix, changes reflected in alert and runbook. Please find the snapshots for the same.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383