Bug 2262105 - Stand-by MDS stuck in 'client replay' state when active MDS restarted.
Summary: Stand-by MDS stuck in 'client replay' state when active MDS restarted.
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Kotresh HR
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-01-31 15:44 UTC by Nagendra Reddy
Modified: 2024-09-11 13:21 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description Nagendra Reddy 2024-01-31 15:44:00 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
The stand-by MDS is stuck in the 'clientreplay' state forever when the active MDS is restarted and the stand-by is supposed to take over as active. (After a failover, the replacement MDS normally progresses through up:replay, up:reconnect, up:rejoin, and up:clientreplay to up:active; here rank 0 never leaves clientreplay, as shown in the 'ceph fs status' output below.)

Version of all relevant components (if applicable):
ODF: 4.15.0-126
OCP: 4.15.0-0.nightly-2024-01-25-051548

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes. The filesystem remains degraded with no active MDS, so CephFS clients cannot make progress.
Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?

Yes
Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Run heavy IO against the active MDS so that it builds up a large amount of cache (a reproduction sketch follows this list).
2. Restart the active MDS pod.
3. The stand-by MDS is supposed to become active; instead, it gets stuck in the clientreplay state forever.
4. At this point, none of the MDS pods are active.
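
The steps above can be driven roughly as follows. This is a minimal sketch, not the exact commands used for this report: the mount path and the fio parameters are assumed for illustration, while the pod names are taken from the 'oc get pods' listing further below.

# 1. Metadata-heavy IO on a CephFS mount so the active MDS caches many
#    inodes/dentries (mount path and fio job values are assumptions)
fio --name=mds-cache-load --directory=/mnt/cephfs --numjobs=8 --rw=randwrite \
    --bs=4k --size=64m --nrfiles=2000 --time_based --runtime=600

# 2. Restart the active MDS pod (rank 0 was mds 'a' at the time)
oc -n openshift-storage delete pod \
    rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-56c5cd89s6f9x

# 3. Watch the stand-by take over; it stalls in clientreplay instead of
#    reaching active
oc -n openshift-storage rsh rook-ceph-tools-6c854d5d84-jmv7m ceph fs status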

Actual results:

The stand-by MDS pod is stuck in the 'clientreplay' state.

Expected results:
The stand-by MDS should become active when the active MDS fails or is restarted.


Additional info:

sh-5.1$ ceph -s
  cluster:
    id:     a622f0f3-09a6-412b-9b06-e651e1d75e7f
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow requests
            1 MDSs behind on trimming

  services:
    mon: 3 daemons, quorum a,b,c (age 21h)
    mgr: a(active, since 21h), standbys: b
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 3 up (since 21h), 3 in (since 6d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   12 pools, 169 pgs
    objects: 1.86M objects, 71 GiB
    usage:   237 GiB used, 1.3 TiB / 1.5 TiB avail
    pgs:     169 active+clean

  io:
    client:   195 KiB/s wr, 0 op/s rd, 3 op/s wr

sh-5.1$ ceph fs status
ocs-storagecluster-cephfilesystem - 5 clients
=================================
RANK     STATE                      MDS                  ACTIVITY   DNS    INOS   DIRS   CAPS
 0    clientreplay  ocs-storagecluster-cephfilesystem-a            1699k  1652k   727   13.4k
                   POOL                       TYPE     USED  AVAIL
ocs-storagecluster-cephfilesystem-metadata  metadata  9155M   356G
 ocs-storagecluster-cephfilesystem-data0      data    30.6G   356G
            STANDBY MDS
ocs-storagecluster-cephfilesystem-b
MDS version: ceph version 17.2.6-194.el9cp (d9f4aedda0fc0d99e7e0e06892a69523d2eb06dc) quincy (stable)
sh-5.1$
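
For triage, a few standard commands (run from the toolbox pod and the oc client; the pod name below is the one from this cluster, and '-c mds' assumes the usual Rook MDS container name) can show why rank 0 never leaves clientreplay:

# Expand the slow-requests and behind-on-trimming warnings from 'ceph -s'
ceph health detail

# Full MDSMap, including the rank-0 state and any laggy/damaged flags
ceph fs dump

# MDS daemon log of the pod that is stuck in clientreplay
oc -n openshift-storage logs \
    rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-56c5cd89s6f9x -c mds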



-----------------------------------------------------------------
oc get pods
NAME                                                              READY   STATUS      RESTARTS          AGE
csi-addons-controller-manager-8649f7f85f-z77p5                    2/2     Running     0                 50s
csi-cephfsplugin-czhph                                            2/2     Running     13 (21h ago)      6d
csi-cephfsplugin-m4cwp                                            2/2     Running     40 (22h ago)      6d
csi-cephfsplugin-provisioner-7f87d9556b-dqwl6                     6/6     Running     40 (21h ago)      6d
csi-cephfsplugin-provisioner-7f87d9556b-gdgpp                     6/6     Running     64                6d
csi-cephfsplugin-rqf2k                                            2/2     Running     12 (21h ago)      6d
csi-rbdplugin-8x6j6                                               3/3     Running     54 (22h ago)      6d
csi-rbdplugin-bt5dp                                               3/3     Running     16 (21h ago)      6d
csi-rbdplugin-provisioner-78884f6f8c-jqhlz                        6/6     Running     62                6d
csi-rbdplugin-provisioner-78884f6f8c-lq8mg                        6/6     Running     79                6d
csi-rbdplugin-snhdl                                               3/3     Running     16 (21h ago)      6d
noobaa-core-0                                                     1/1     Running     3                 2d6h
noobaa-db-pg-0                                                    1/1     Running     3                 2d6h
noobaa-endpoint-5456dd8bd-4shm8                                   1/1     Running     1                 23h
noobaa-operator-54d5fc85b8-qsr5l                                  2/2     Running     74 (10h ago)      2d4h
ocs-metrics-exporter-b94d575ff-pjd6c                              1/1     Running     3                 6d
ocs-operator-d57b464dd-4szrv                                      1/1     Running     232 (8m25s ago)   6d
odf-console-6d664888c8-tbnqw                                      1/1     Running     3                 6d
odf-operator-controller-manager-67ff86cb69-2fwjx                  2/2     Running     207 (8m27s ago)   6d
rook-ceph-crashcollector-compute-0-5776bbfc8d-ll4gh               1/1     Running     0                 22h
rook-ceph-crashcollector-compute-1-7bb5565597-4pktq               1/1     Running     0                 22h
rook-ceph-crashcollector-compute-2-c4d75658b-l9frn                1/1     Running     0                 21h
rook-ceph-exporter-compute-0-d79bbf9b8-gmqs4                      1/1     Running     1 (21h ago)       22h
rook-ceph-exporter-compute-1-75fff6dcbf-tmrdc                     1/1     Running     0                 22h
rook-ceph-exporter-compute-2-5d7ffc454-767tc                      1/1     Running     0                 21h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-56c5cd89s6f9x   2/2     Running     1 (66m ago)       150m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-775ddcf88tv94   2/2     Running     1 (38m ago)       149m
rook-ceph-mgr-a-59dcf4bbd9-6ccvn                                  3/3     Running     4 (21h ago)       22h
rook-ceph-mgr-b-855b9c966b-gk57d                                  3/3     Running     1 (21h ago)       21h
rook-ceph-mon-a-6d8d6595bf-rdv6m                                  2/2     Running     0                 22h
rook-ceph-mon-b-7f7775b869-bc68t                                  2/2     Running     0                 22h
rook-ceph-mon-c-6cc496dfd9-kbg42                                  2/2     Running     0                 21h
rook-ceph-operator-5b5c5d9b76-qwdkp                               1/1     Running     9                 6d
rook-ceph-osd-0-76cc86458c-bmz6l                                  2/2     Running     0                 22h
rook-ceph-osd-1-6c469b7c87-krdc9                                  2/2     Running     0                 22h
rook-ceph-osd-2-6754d6657d-ws9vt                                  2/2     Running     0                 21h
rook-ceph-osd-prepare-ocs-deviceset-0-data-08bmpn-h4fbb           0/1     Completed   0                 6d
rook-ceph-osd-prepare-ocs-deviceset-1-data-0xztrp-srq9j           0/1     Completed   0                 6d
rook-ceph-osd-prepare-ocs-deviceset-2-data-0v4rs4-vplsv           0/1     Completed   0                 6d
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-c5b686bfgp7t   2/2     Running     0                 21h
rook-ceph-tools-6c854d5d84-jmv7m                                  1/1     Running     3                 6d
ux-backend-server-7d5f748f7c-6mwb7                                2/2     Running     6                 6d

