Bug 2277944 - [7.1z clone] Ceph health getting to warning state with "HEALTH_WARN 1 MDSs report slow requests" roughly 2 hours after cluster installation
Summary: [7.1z clone] Ceph health getting to warning state with "HEALTH_WARN 1 MDSs re...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 7.1
Assignee: Xiubo Li
QA Contact: Amarnath
URL:
Whiteboard:
Duplicates: 2277687
Depends On: 2277687
Blocks:
 
Reported: 2024-04-30 08:30 UTC by Mudit Agarwal
Modified: 2024-06-13 14:32 UTC
CC List: 11 users

Fixed In Version: ceph-18.2.1-177.el9cp
Doc Type: Bug Fix
Doc Text:
.No slow requests are caused by the batch operations
Previously, a regression introduced by the quiesce protocol code meant that, when killing client requests, the MDS skipped choosing a new batch head for the batch operations. The stale batch head requests therefore stayed in the MDS cache indefinitely and were eventually reported as slow requests. With this fix, a new batch head is chosen when requests are killed, and batch operations no longer cause slow requests. (A conceptual sketch of the hand-off follows the field list below.)
Clone Of: 2277687
Environment:
Last Closed: 2024-06-13 14:32:25 UTC
Embargoed:
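
The Doc Text above boils down to a hand-off rule: when the current batch head request is killed, one of the requests queued behind it must be promoted to become the new head, otherwise the batch can never complete and the orphaned head lingers in the MDS cache as a "slow request". Below is a minimal Python sketch of that rule; it is a conceptual model only (the actual fix lives in the C++ MDS quiesce/batch-op code) and every name in it is illustrative.

from collections import deque

class BatchOp:
    """Conceptual model of an MDS batch operation: one head request
    plus a queue of identical requests batched behind it."""

    def __init__(self, head, queued=()):
        self.head = head
        self.queued = deque(queued)

    def kill_request(self, request, promote_new_head=True):
        """Kill one request from the batch.

        promote_new_head=True models the fixed behaviour: the next
        queued request takes over as batch head. Passing False models
        the pre-fix regression, which left the batch without a head."""
        if request is not self.head:
            self.queued.remove(request)
            return
        if promote_new_head and self.queued:
            self.head = self.queued.popleft()
        else:
            self.head = None  # pre-fix: batch left headless

    def is_orphaned(self):
        """True if the batch still has queued work but no head to drive it;
        this is the state the MDS reported as a slow request before the fix."""
        return self.head is None and bool(self.queued)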


Attachments


Links
System                    ID              Last Updated
Ceph Project Bug Tracker  65536           2024-04-30 09:14:02 UTC
Red Hat Issue Tracker     RHCEPH-8908     2024-04-30 08:34:53 UTC
Red Hat Product Errata    RHSA-2024:3925  2024-06-13 14:32:28 UTC

Description Mudit Agarwal 2024-04-30 08:30:48 UTC
+++ This bug was initially created as a clone of Bug #2277687 +++

This bug was initially created as a copy of Bug #2274015

I am copying this bug because: 



Description of problem (please be detailed as possible and provide log
snippets):
It has been observed in at least 2 ODF 4.16 clusters that Ceph health enters the WARNING state with the following message:


  cluster:
    id:     45c47df0-e2fa-4931-9e45-b6c109ce5b69
    health: HEALTH_WARN
            1 MDSs report slow requests
 
  services:
    mon: 3 daemons, quorum a,b,c (age 6h)
    mgr: b(active, since 5h), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 26h), 3 in (since 30h)
 
  data:
    volumes: 1/1 healthy
    pools:   5 pools, 145 pgs
    objects: 226 objects, 577 MiB
    usage:   3.9 GiB used, 296 GiB / 300 GiB avail
    pgs:     145 active+clean
 
  io:
    client:   853 B/s rd, 3.3 KiB/s wr, 1 op/s rd, 0 op/s wr


This appears to impact CephFS functionality, with CephFS-backed PVCs failing to reach the Bound state (a minimal status-polling sketch follows the describe output below):

E           ocs_ci.ocs.exceptions.ResourceWrongStatusException: Resource pvc-test-87a4c63e82584fdabf50638748121fe describe output: Name:          pvc-test-87a4c63e82584fdabf50638748121fe
E           Namespace:     namespace-test-f41280f9e48b49a98deb0bc0f
E           StorageClass:  ocs-storagecluster-cephfs
E           Status:        Pending
E           Volume:        
E           Labels:        <none>
E           Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
E                          volume.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
E           Finalizers:    [kubernetes.io/pvc-protection]
E           Capacity:      
E           Access Modes:  
E           VolumeMode:    Filesystem
E           Used By:       <none>
E           Events:
E             Type    Reason                Age                From                                                                                                                      Message
E             ----    ------                ----               ----                                                                                                                      -------
E             Normal  Provisioning          76s                openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-675dbc77b8-xfzs4_9aae553a-5912-479b-95f2-1dcbe530dbdf  External provisioner is provisioning volume for claim "namespace-test-f41280f9e48b49a98deb0bc0f/pvc-test-87a4c63e82584fdabf50638748121fe"
E             Normal  ExternalProvisioning  11s (x6 over 76s)  persistentvolume-controller                                                                                               Waiting for a volume to be created either by the external provisioner 'openshift-storage.cephfs.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
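
The check that produced the failure above is simply waiting for the PVC to reach the Bound phase. A minimal polling sketch of that check follows, using the oc CLI through subprocess; it is an illustration, not the ocs-ci helper itself, and the PVC name and namespace are parameters to be supplied by the caller.

import subprocess
import time

def wait_for_pvc_bound(pvc_name, namespace, timeout=300, interval=10):
    """Poll the PVC's status.phase until it is Bound or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        phase = subprocess.run(
            ["oc", "get", "pvc", pvc_name, "-n", namespace,
             "-o", "jsonpath={.status.phase}"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if phase == "Bound":
            return True
        time.sleep(interval)
    return False

In the runs described here the PVC stays Pending for the whole timeout; the slow request shown in the MDS logs below is a lookup of the /csi directory, which is consistent with the provisioner itself being blocked.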


The following log entries appear in the MDS pod logs:


debug 2024-04-08T13:49:01.154+0000 7f08f1877640  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 15522.036438 secs
debug 2024-04-08T13:49:06.154+0000 7f08f1877640  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 15527.036638 secs
debug 2024-04-08T13:49:11.154+0000 7f08f1877640  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 15532.036821 secs
debug 2024-04-08T13:49:16.154+0000 7f08f1877640  0 log_channel(cluster) log [WRN] : 4 slow requests, 1 included below; oldest blocked for > 15537.036953 secs
debug 2024-04-08T13:49:16.154+0000 7f08f1877640  0 log_channel(cluster) log [WRN] : slow request 15363.773573 seconds old, received at 2024-04-08T09:33:12.382280+0000: client_request(client.271142:3 lookup #0x10000000000/csi 2024-04-08T09:33:12.380870+0000 caller_uid=0, caller_gid=0{}) currently cleaned up request

Version of all relevant components (if applicable):
ODF 4.16.0-69
Ceph Version	18.2.1-76.el9cp (2517f8a5ef5f5a6a22013b2fb11a591afd474668) reef (stable)
OCP 4.16.0-0.nightly-2024-04-06-020637


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
CephFS functionality seems to be impacted as described above


Is there any workaround available to the best of your knowledge?
Restarting one of the MDS pods brings Ceph health back to OK, but the issue recurs after 1-2 hours.
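
A sketch of that workaround as a script: it deletes the MDS pods so their owning deployments recreate them. The openshift-storage namespace and the app=rook-ceph-mds label selector are assumptions about a typical ODF/Rook deployment and should be adjusted for the cluster at hand; note that, per the report, the warning returns within 1-2 hours.

import subprocess

def restart_mds_pods(namespace="openshift-storage", selector="app=rook-ceph-mds"):
    """Delete the MDS pods matching the selector; the deployments recreate
    them, which clears the slow-request warning temporarily."""
    subprocess.run(
        ["oc", "delete", "pod", "-n", namespace, "-l", selector],
        check=True,
    )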


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
This is new in 4.16


Steps to Reproduce:
1. Deploy ODF 4.16. Wait for 1-2 hours and check Ceph health
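
A minimal sketch of the health check in step 1, assuming direct access to the ceph CLI (on ODF this would typically be run from the rook-ceph toolbox or operator pod); MDS_SLOW_REQUEST is the health-check key under which this warning is reported.

import json
import subprocess

def mds_slow_requests_reported():
    """Return True if 'ceph health detail' currently reports the
    'MDSs report slow requests' warning seen in this bug."""
    out = subprocess.run(
        ["ceph", "health", "detail", "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "MDS_SLOW_REQUEST" in json.loads(out).get("checks", {})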


Actual results:
Ceph health shows the aforementioned WARNING.


Expected results:
Ceph health should not degrade

Additional info:
Must-gather attached, with the MDS log level set to 20.

--- Additional comment from RHEL Program Management on 2024-04-29 07:37:20 UTC ---

Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 1 Venky Shankar 2024-04-30 09:17:39 UTC
*** Bug 2277687 has been marked as a duplicate of this bug. ***

Comment 2 Scott Ostapovicz 2024-05-02 13:23:06 UTC
This is mistargeted. We do not even have the 7.1 release done yet, and this is being targeted at 7.1 z2. Retargeting this to 7.1 for now.

Comment 11 errata-xmlrpc 2024-06-13 14:32:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925

