Bug 2229151

Summary: [GSS] Ceph MDS daemon crashing frequently.
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Manjunatha <mmanjuna>
Component: ceph
Sub component: CephFS
Assignee: Xiubo Li <xiubli>
QA Contact: Elad <ebenahar>
Status: NEW
Severity: high
Priority: unspecified
CC: bniver, dparmar, muagarwa, odf-bz-bot, sostapov, vshankar, xiubli
Version: 4.10
Flags: xiubli: needinfo? (mmanjuna)
Target Milestone: ---
Target Release: ---
Hardware: All
OS: All
Type: Bug

Description Manjunatha 2023-08-04 12:55:04 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
We are frequently getting the CephClusterWarningState alert on our prod and non-prod clusters.
$ ceph -s
  cluster:
    id:     16fff585-704d-499b-9084-bc3c97534601
    health: HEALTH_WARN
            8 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,f,g (age 3d)
    mgr: a(active, since 3d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 3d), 6 in (since 2y)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 369 pgs
    objects: 7.90M objects, 4.7 TiB
    usage:   15 TiB used, 8.9 TiB / 24 TiB avail
    pgs:     369 active+clean

  io:
    client:   4.6 MiB/s rd, 43 MiB/s wr, 270 op/s rd, 640 op/s wr

sh-4.4$ ceph health detail
HEALTH_WARN 8 daemons have recently crashed
[WRN] RECENT_CRASH: 8 daemons have recently crashed
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-02T01:40:26.576663Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-02T07:00:52.824774Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-03T00:01:24.797894Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-03T00:01:41.095726Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-03T06:00:36.311469Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-03T09:00:56.592728Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-03T17:40:37.542708Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7959f576k7sqg at 2023-08-04T01:00:43.379724Z
sh-4.4$
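
A suggested next step (not part of the original report): the crash module can dump the full backtrace for each of the crashes listed above; the crash ID below is a placeholder, not a value from this cluster.

sh-4.4$ ceph crash ls                # one line per recent crash, with its crash ID
sh-4.4$ ceph crash info <crash-id>   # full metadata and stack trace for a single crash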

Version of all relevant components (if applicable):
ODF 4.10

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes. 

Is there any workaround available to the best of your knowledge?
No
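
(Triage note, not from the reporter: the RECENT_CRASH warning itself can be acknowledged from the toolbox, which silences the alert but does not fix the underlying MDS crashes:

sh-4.4$ ceph crash archive-all   # mark all recent crashes as archived; clears RECENT_CRASH)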

Is this issue reproducible?
Yes.