Bug 2229151

Summary:	[GSS] Ceph MDS daemon crashing frequently
Product:	[Red Hat Storage] Red Hat OpenShift Data Foundation	Reporter:	Manjunatha <mmanjuna>
Component:	ceph	Assignee:	Xiubo Li <xiubli>
ceph sub component:	CephFS	QA Contact:	Elad <ebenahar>
Status:	NEW ---	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	amagrawa, bniver, dparmar, muagarwa, odf-bz-bot, sheggodu, sostapov, vshankar, vumrao, xiubli
Version:	4.10
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	All
Type:	Bug
Bug Depends On:	2242896
Bug Blocks:

Description Manjunatha 2023-08-04 12:55:04 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
We frequently get the CephClusterWarningState alert on our prod and non-prod clusters.
$ ceph -s
  cluster:
    id:     16fff585-704d-499b-9084-bc3c97534601
    health: HEALTH_WARN
            8 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,f,g (age 3d)
    mgr: a(active, since 3d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 3d), 6 in (since 2y)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 369 pgs
    objects: 7.90M objects, 4.7 TiB
    usage:   15 TiB used, 8.9 TiB / 24 TiB avail
    pgs:     369 active+clean

  io:
    client:   4.6 MiB/s rd, 43 MiB/s wr, 270 op/s rd, 640 op/s wr

sh-4.4$ ceph health detail
HEALTH_WARN 8 daemons have recently crashed
[WRN] RECENT_CRASH: 8 daemons have recently crashed
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-02T01:40:26.576663Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-02T07:00:52.824774Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-03T00:01:24.797894Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-03T00:01:41.095726Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-03T06:00:36.311469Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-03T09:00:56.592728Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-03T17:40:37.542708Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-04T01:00:43.379724Z
sh-4.4$
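
To make the crash pattern easier to see, the RECENT_CRASH lines above can be parsed and the time between consecutive crashes computed. The following is only a rough sketch, not part of the original report: the helper names (parse_crashes, gaps_in_hours) and the SAMPLE variable are hypothetical, and it assumes the line layout shown in the `ceph health detail` output above. SAMPLE holds just the first two crash lines from that output; paste the full output in its place to analyze all eight.

```python
import re
from datetime import datetime

# Assumed layout of a RECENT_CRASH detail line, as seen in the report:
#   <daemon> crashed on host <host> at <ISO timestamp ending in Z>
CRASH_RE = re.compile(r"^\s*(\S+) crashed on host (\S+) at (\S+)$")

# First two crash lines copied from the `ceph health detail` output above.
SAMPLE = """\
[WRN] RECENT_CRASH: 8 daemons have recently crashed
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-02T01:40:26.576663Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-02T07:00:52.824774Z
"""


def parse_crashes(text):
    """Return (daemon, host, datetime) tuples sorted by crash time."""
    crashes = []
    for line in text.splitlines():
        m = CRASH_RE.match(line)
        if m:
            daemon, host, stamp = m.groups()
            when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
            crashes.append((daemon, host, when))
    return sorted(crashes, key=lambda c: c[2])


def gaps_in_hours(crashes):
    """Hours elapsed between each pair of consecutive crashes."""
    return [
        round((later[2] - earlier[2]).total_seconds() / 3600, 2)
        for earlier, later in zip(crashes, crashes[1:])
    ]


crashes = parse_crashes(SAMPLE)
for daemon, host, when in crashes:
    print(daemon, when.strftime("%Y-%m-%d %H:%M:%S"))
print("gaps (hours):", gaps_in_hours(crashes))
```

Summary lines such as the "[WRN] RECENT_CRASH" header do not match the pattern and are skipped, so the whole command output can be passed in unmodified.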

Version of all relevant components (if applicable):
ODF 4.10

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. 

Is there any workaround available to the best of your knowledge?
No

Is this issue reproducible?
Yes,