Bug 2262105

Summary: Stand-by MDS stuck in 'clientreplay' state when the active MDS is restarted.
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Nagendra Reddy <nagreddy>
Component: ceph Assignee: Kotresh HR <khiremat>
ceph sub component: CephFS QA Contact: Elad <ebenahar>
Status: ASSIGNED --- Docs Contact:
Severity: high    
Priority: unspecified CC: bniver, khiremat, muagarwa, sheggodu, sostapov
Version: 4.15   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Nagendra Reddy 2024-01-31 15:44:00 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
The stand-by MDS is stuck in the 'clientreplay' state forever when the active MDS is restarted and the stand-by is supposed to take over as active.

Version of all relevant components (if applicable):
4.15.0-126
4.15.0-0.nightly-2024-01-25-051548

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Yes
Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?

Yes
Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Run heavy IO against the active MDS so that it uses a large amount of cache.

2. Restart the active MDS pod.

3. The stand-by MDS is supposed to become active; instead, it gets stuck in the clientreplay state forever (a sketch of the commands used is shown below).

4. At this point, none of the MDS pods are active.
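
A rough sketch of the commands used for steps 2 and 3 (the namespace and pod names are assumptions based on a default ODF deployment and the pod list in the additional info below):

NS=openshift-storage
# Step 2: restart the active MDS pod (rank 0, daemon "a" in this cluster)
ACTIVE_MDS=$(oc -n "$NS" get pods -o name | grep rook-ceph-mds-ocs-storagecluster-cephfilesystem-a)
oc -n "$NS" delete "$ACTIVE_MDS"
# Step 3: watch the stand-by take over; it reaches clientreplay but never becomes active
oc -n "$NS" exec deploy/rook-ceph-tools -- ceph fs status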

Actual results:

The stand-by MDS pod is stuck in the 'clientreplay' state and never becomes active.

Expected results:
The stand-by MDS should become active when the active MDS fails or is restarted.
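
A simple check of the expected behaviour (a hedged sketch; the openshift-storage namespace and rook-ceph-tools deployment name are assumptions): within a few minutes of the restart, rank 0 is expected to progress through replay -> reconnect -> rejoin -> clientreplay -> active.

# confirm rank 0 reaches up:active after the failover
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph mds stat
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph fs status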


Additional info:

sh-5.1$ ceph -s
  cluster:
    id:     a622f0f3-09a6-412b-9b06-e651e1d75e7f
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow requests
            1 MDSs behind on trimming

  services:
    mon: 3 daemons, quorum a,b,c (age 21h)
    mgr: a(active, since 21h), standbys: b
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 3 up (since 21h), 3 in (since 6d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   12 pools, 169 pgs
    objects: 1.86M objects, 71 GiB
    usage:   237 GiB used, 1.3 TiB / 1.5 TiB avail
    pgs:     169 active+clean

  io:
    client:   195 KiB/s wr, 0 op/s rd, 3 op/s wr

sh-5.1$ ceph fs status
ocs-storagecluster-cephfilesystem - 5 clients
=================================
RANK     STATE                      MDS                  ACTIVITY   DNS    INOS   DIRS   CAPS
 0    clientreplay  ocs-storagecluster-cephfilesystem-a            1699k  1652k   727   13.4k
                   POOL                       TYPE     USED  AVAIL
ocs-storagecluster-cephfilesystem-metadata  metadata  9155M   356G
 ocs-storagecluster-cephfilesystem-data0      data    30.6G   356G
            STANDBY MDS
ocs-storagecluster-cephfilesystem-b
MDS version: ceph version 17.2.6-194.el9cp (d9f4aedda0fc0d99e7e0e06892a69523d2eb06dc) quincy (stable)
sh-5.1$
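
Possible additional diagnostics to collect from the rook-ceph-tools pod while the MDS is stuck (a hedged suggestion using standard Ceph admin commands; the MDS name is taken from the fs status output above):

ceph health detail                                                    # details behind the slow-requests / trimming warnings
ceph fs dump                                                          # full MDS map, including the stuck rank's state
ceph tell mds.ocs-storagecluster-cephfilesystem-a dump_ops_in_flight  # requests pending during client replay
ceph tell mds.ocs-storagecluster-cephfilesystem-a session ls          # client sessions being replayed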



-----------------------------------------------------------------
oc get pods
NAME                                                              READY   STATUS      RESTARTS          AGE
csi-addons-controller-manager-8649f7f85f-z77p5                    2/2     Running     0                 50s
csi-cephfsplugin-czhph                                            2/2     Running     13 (21h ago)      6d
csi-cephfsplugin-m4cwp                                            2/2     Running     40 (22h ago)      6d
csi-cephfsplugin-provisioner-7f87d9556b-dqwl6                     6/6     Running     40 (21h ago)      6d
csi-cephfsplugin-provisioner-7f87d9556b-gdgpp                     6/6     Running     64                6d
csi-cephfsplugin-rqf2k                                            2/2     Running     12 (21h ago)      6d
csi-rbdplugin-8x6j6                                               3/3     Running     54 (22h ago)      6d
csi-rbdplugin-bt5dp                                               3/3     Running     16 (21h ago)      6d
csi-rbdplugin-provisioner-78884f6f8c-jqhlz                        6/6     Running     62                6d
csi-rbdplugin-provisioner-78884f6f8c-lq8mg                        6/6     Running     79                6d
csi-rbdplugin-snhdl                                               3/3     Running     16 (21h ago)      6d
noobaa-core-0                                                     1/1     Running     3                 2d6h
noobaa-db-pg-0                                                    1/1     Running     3                 2d6h
noobaa-endpoint-5456dd8bd-4shm8                                   1/1     Running     1                 23h
noobaa-operator-54d5fc85b8-qsr5l                                  2/2     Running     74 (10h ago)      2d4h
ocs-metrics-exporter-b94d575ff-pjd6c                              1/1     Running     3                 6d
ocs-operator-d57b464dd-4szrv                                      1/1     Running     232 (8m25s ago)   6d
odf-console-6d664888c8-tbnqw                                      1/1     Running     3                 6d
odf-operator-controller-manager-67ff86cb69-2fwjx                  2/2     Running     207 (8m27s ago)   6d
rook-ceph-crashcollector-compute-0-5776bbfc8d-ll4gh               1/1     Running     0                 22h
rook-ceph-crashcollector-compute-1-7bb5565597-4pktq               1/1     Running     0                 22h
rook-ceph-crashcollector-compute-2-c4d75658b-l9frn                1/1     Running     0                 21h
rook-ceph-exporter-compute-0-d79bbf9b8-gmqs4                      1/1     Running     1 (21h ago)       22h
rook-ceph-exporter-compute-1-75fff6dcbf-tmrdc                     1/1     Running     0                 22h
rook-ceph-exporter-compute-2-5d7ffc454-767tc                      1/1     Running     0                 21h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-56c5cd89s6f9x   2/2     Running     1 (66m ago)       150m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-775ddcf88tv94   2/2     Running     1 (38m ago)       149m
rook-ceph-mgr-a-59dcf4bbd9-6ccvn                                  3/3     Running     4 (21h ago)       22h
rook-ceph-mgr-b-855b9c966b-gk57d                                  3/3     Running     1 (21h ago)       21h
rook-ceph-mon-a-6d8d6595bf-rdv6m                                  2/2     Running     0                 22h
rook-ceph-mon-b-7f7775b869-bc68t                                  2/2     Running     0                 22h
rook-ceph-mon-c-6cc496dfd9-kbg42                                  2/2     Running     0                 21h
rook-ceph-operator-5b5c5d9b76-qwdkp                               1/1     Running     9                 6d
rook-ceph-osd-0-76cc86458c-bmz6l                                  2/2     Running     0                 22h
rook-ceph-osd-1-6c469b7c87-krdc9                                  2/2     Running     0                 22h
rook-ceph-osd-2-6754d6657d-ws9vt                                  2/2     Running     0                 21h
rook-ceph-osd-prepare-ocs-deviceset-0-data-08bmpn-h4fbb           0/1     Completed   0                 6d
rook-ceph-osd-prepare-ocs-deviceset-1-data-0xztrp-srq9j           0/1     Completed   0                 6d
rook-ceph-osd-prepare-ocs-deviceset-2-data-0v4rs4-vplsv           0/1     Completed   0                 6d
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-c5b686bfgp7t   2/2     Running     0                 21h
rook-ceph-tools-6c854d5d84-jmv7m                                  1/1     Running     3                 6d
ux-backend-server-7d5f748f7c-6mwb7                                2/2     Running     6                 6d