Bug 2277687 - Ceph health getting to warning state with "HEALTH_WARN 1 MDSs report slow requests" roughly 2 hours after cluster installation
Summary: Ceph health getting to warning state with "HEALTH_WARN 1 MDSs report slow requests" roughly 2 hours after cluster installation
Keywords:
Status: CLOSED DUPLICATE of bug 2277944
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Importance: unspecified high
Target Milestone: ---
Target Release: 8.0
Assignee: Xiubo Li
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On:
Blocks: 2277944
 
Reported: 2024-04-29 07:37 UTC by Xiubo Li
Modified: 2024-04-30 09:17 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2277944 (view as bug list)
Environment:
Last Closed: 2024-04-30 09:17:39 UTC
Embargoed:


Attachments


Links
Ceph Project Bug Tracker 65536 (last updated 2024-04-29 07:40:06 UTC)
Red Hat Issue Tracker RHCEPH-8896 (last updated 2024-04-29 07:45:22 UTC)

Description Xiubo Li 2024-04-29 07:37:12 UTC
This bug was initially created as a copy of Bug #2274015

I am copying this bug because: 



Description of problem (please be as detailed as possible and provide log snippets):
It has been observed on at least 2 ODF 4.16 clusters that Ceph health enters the WARNING state with the following message:


  cluster:
    id:     45c47df0-e2fa-4931-9e45-b6c109ce5b69
    health: HEALTH_WARN
            1 MDSs report slow requests
 
  services:
    mon: 3 daemons, quorum a,b,c (age 6h)
    mgr: b(active, since 5h), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 26h), 3 in (since 30h)
 
  data:
    volumes: 1/1 healthy
    pools:   5 pools, 145 pgs
    objects: 226 objects, 577 MiB
    usage:   3.9 GiB used, 296 GiB / 300 GiB avail
    pgs:     145 active+clean
 
  io:
    client:   853 B/s rd, 3.3 KiB/s wr, 1 op/s rd, 0 op/s wr


This seems to impact CephFS functionality, with CephFS-backed PVCs failing to reach the Bound state:

E           ocs_ci.ocs.exceptions.ResourceWrongStatusException: Resource pvc-test-87a4c63e82584fdabf50638748121fe describe output: Name:          pvc-test-87a4c63e82584fdabf50638748121fe
E           Namespace:     namespace-test-f41280f9e48b49a98deb0bc0f
E           StorageClass:  ocs-storagecluster-cephfs
E           Status:        Pending
E           Volume:        
E           Labels:        <none>
E           Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
E                          volume.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
E           Finalizers:    [kubernetes.io/pvc-protection]
E           Capacity:      
E           Access Modes:  
E           VolumeMode:    Filesystem
E           Used By:       <none>
E           Events:
E             Type    Reason                Age                From                                                                                                                      Message
E             ----    ------                ----               ----                                                                                                                      -------
E             Normal  Provisioning          76s                openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-675dbc77b8-xfzs4_9aae553a-5912-479b-95f2-1dcbe530dbdf  External provisioner is provisioning volume for claim "namespace-test-f41280f9e48b49a98deb0bc0f/pvc-test-87a4c63e82584fdabf50638748121fe"
E             Normal  ExternalProvisioning  11s (x6 over 76s)  persistentvolume-controller                                                                                               Waiting for a volume to be created either by the external provisioner 'openshift-storage.cephfs.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
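For reference, a minimal way to inspect a claim stuck like this from the oc client; the claim name and namespace below are taken from the failure above, and the provisioner label and container name assume the stock Rook/ODF CSI deployment, so adjust them to your environment:

  # Show the Pending claim and its events (names copied from the output above)
  oc describe pvc pvc-test-87a4c63e82584fdabf50638748121fe -n namespace-test-f41280f9e48b49a98deb0bc0f

  # Check the CephFS provisioner sidecar for the CreateVolume call that never completes
  # (label and container name are assumptions based on the default ODF CSI deployment)
  oc logs -n openshift-storage -l app=csi-cephfsplugin-provisioner -c csi-provisioner --tail=100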


The following entries are seen in the MDS pod logs:


debug 2024-04-08T13:49:01.154+0000 7f08f1877640  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 15522.036438 secs
debug 2024-04-08T13:49:06.154+0000 7f08f1877640  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 15527.036638 secs
debug 2024-04-08T13:49:11.154+0000 7f08f1877640  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 15532.036821 secs
debug 2024-04-08T13:49:16.154+0000 7f08f1877640  0 log_channel(cluster) log [WRN] : 4 slow requests, 1 included below; oldest blocked for > 15537.036953 secs
debug 2024-04-08T13:49:16.154+0000 7f08f1877640  0 log_channel(cluster) log [WRN] : slow request 15363.773573 seconds old, received at 2024-04-08T09:33:12.382280+0000: client_request(client.271142:3 lookup #0x10000000000/csi 2024-04-08T09:33:12.380870+0000 caller_uid=0, caller_gid=0{}) currently cleaned up request
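
These warnings can be drilled into further; a minimal sketch, assuming access to a pod with a Ceph admin keyring (for example the toolbox) and to the active MDS pod, where the daemon name below is a placeholder:

  # Identify which MDS is reporting slow requests and how long they have been blocked
  ceph health detail

  # From inside the active MDS pod: dump the client requests currently in flight / blocked
  ceph daemon mds.<active-mds-name> dump_ops_in_flight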

Version of all relevant components (if applicable):
ODF 4.16.0-69
Ceph Version: 18.2.1-76.el9cp (2517f8a5ef5f5a6a22013b2fb11a591afd474668) reef (stable)
OCP 4.16.0-0.nightly-2024-04-06-020637


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
CephFS functionality seems to be impacted as described above


Is there any workaround available to the best of your knowledge?
Restarting one of the MDS pods brings Ceph health back to OK, but the issue recurs after 1-2 hours.
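
A minimal sketch of that workaround, assuming the default openshift-storage namespace and the app=rook-ceph-mds label that Rook applies to MDS pods (adjust to your deployment):

  # List the MDS pods
  oc get pods -n openshift-storage -l app=rook-ceph-mds
  # Delete the active MDS pod; its deployment recreates it and the hot standby takes over
  oc delete pod <rook-ceph-mds-pod-name> -n openshift-storage

Note this only clears the warning temporarily; per the report above it returns within 1-2 hours.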


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
This is new in 4.16


Steps to Reproduce:
1. Deploy ODF 4.16, wait 1-2 hours, and check Ceph health (example commands below)
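
For example, assuming the Ceph toolbox is enabled in the cluster (the rook-ceph-tools deployment name is the default used by ODF; adjust if yours differs):

  # Run ceph status through the toolbox pod to check overall cluster health
  oc rsh -n openshift-storage deploy/rook-ceph-tools ceph status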


Actual results:
Ceph health showing the aforementioned WARNING


Expected results:
Ceph health should not degrade

Additional info:
Must-gather attached, with the MDS log level set to 20.

Comment 1 RHEL Program Management 2024-04-29 07:37:20 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Venky Shankar 2024-04-30 09:17:39 UTC

*** This bug has been marked as a duplicate of bug 2277944 ***

