Bug 2177585

Summary: [cephfs] MDS standby-replay daemon removed by monitor repeatedly
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
Ceph sub component: CephFS
Reporter: James Biao <jbiao>
Assignee: Kotresh HR <khiremat>
QA Contact: Elad <ebenahar>
Status: ASSIGNED
Severity: high
Priority: high
CC: adolling, bhubbard, bniver, gfarnum, khiremat, khover, muagarwa, odf-bz-bot, sostapov
Version: 4.10
Hardware: All
OS: Linux
Type: Bug

Description James Biao 2023-03-13 03:36:27 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

The MDS in standby-replay state is removed by the monitor repeatedly.



Version of all relevant components (if applicable):

ODF 4.10


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes; it is persistent in the customer environment.

Can this issue be reproduced from the UI?
n/a

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:
The standby-replay MDS is removed by the mon during replay.

Expected results:
The standby-replay MDS finishes replay and joins the cluster.

Additional info:
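The monitor replaces an MDS (including one in standby-replay) once it has missed beacons for longer than mds_beacon_grace, which can happen while the daemon is busy replaying a large journal. A minimal sketch of how to inspect and mitigate this from the toolbox pod, assuming default upstream Ceph option names:

# show filesystem and daemon states, including standby-replay
ceph fs status
ceph fs dump | grep -i standby

# check and, if needed, raise the beacon grace period (default 15 s)
ceph config get mon mds_beacon_grace
ceph config set global mds_beacon_grace 60

Raising the grace period only buys the replaying daemon more time; the underlying slowness still needs to be addressed.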

Comment 10 khover 2023-04-03 13:22:46 UTC
Hi Kotresh,

The cluster is more stable now after migrating unused data and some workloads off CephFS.

Three areas are hindering ODF performance and contributing to slow Ceph recovery.

I continue to see a growing trend of customers deploying on rotational devices and running OCP with an MTU of 1500, and customers continue to put database workloads on CephFS.

An MTU of 1500 would fall into the network issues category, IMHO.



Rotational device configuration at the node layer:
sh-4.4# lsblk -t
NAME                         ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED       RQ-SIZE   RA WSAME
loop0                                0    512      0     512     512    0 mq-deadline     256  128    0B
loop1                                0    512      0     512     512    0 mq-deadline     256  128    0B
sda                                  0    512      0     512     512    1 mq-deadline     256 4096    4M
|-sda1                               0    512      0     512     512    1 mq-deadline     256 4096    4M
|-sda2                               0    512      0     512     512    1 mq-deadline     256 4096    4M
|-sda3                               0    512      0     512     512    1 mq-deadline     256 4096    4M
`-sda4                               0    512      0     512     512    1 mq-deadline     256 4096    4M
  `-coreos-luks-root-nocrypt         0    512      0     512     512    1                 128 4096    4M
sdb                                  0    512      0     512     512    1 mq-deadline     256 4096    4M
sdc                                  0    512      0     512     512    1 mq-deadline     256  128    4M
sdd                                  0    512      0     512     512    1 mq-deadline     256  128    4M
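A quick way to confirm the rotational flag per device (assuming node access, e.g. via oc debug node/<node>; 1 = rotational HDD, 0 = SSD/NVMe):

# summary view of rotational status
lsblk -d -o NAME,ROTA,SIZE,TYPE
# raw kernel flag per disk
grep -H . /sys/block/sd*/queue/rotational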



MTU size is 1500 


sh-4.4# ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:50:56:86:b8:b4 brd ff:ff:ff:ff:ff:ff
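Before considering a larger MTU, a rough way to verify what the path between nodes actually supports is a do-not-fragment ping sized for the candidate MTU (peer address is a placeholder):

# 1472 = 1500 MTU minus 28 bytes of IP/ICMP headers; 8972 would test MTU 9000
ping -c 3 -M do -s 1472 <peer-node-ip>
ping -c 3 -M do -s 8972 <peer-node-ip>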

DB workloads on the CephFS storage class, and the number of objects this has created in the cephfilesystem-metadata and data pools:


omg get pv | grep cephfs | grep db    (snip of a few entries, for example)

pvc-0d0299b2-da74-4e20-bf01-3c023a5adbc0  1Gi       RWO           Delete          Bound     sandbox-j16877r/mongodb                                                  ocs-storagecluster-cephfs            384d
pvc-1ab487bc-5e93-4ce3-92d9-528aa4b07b2e  8Gi       RWO           Delete          Bound     sqa-neoload-web-test/mongodb                                             ocs-storagecluster-cephfs            326d

   

sh-4.4$ ceph df
--- RAW STORAGE ---
CLASS    SIZE    AVAIL     USED  RAW USED  %RAW USED
ssd    12 TiB  4.2 TiB  7.8 TiB   7.8 TiB      65.06
TOTAL  12 TiB  4.2 TiB  7.8 TiB   7.8 TiB      65.06
 
--- POOLS ---
POOL                                                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.rgw.root                                               1    8  4.7 KiB       16  2.8 MiB      0    678 GiB
ocs-storagecluster-cephobjectstore.rgw.control          2    8      0 B        8      0 B      0    678 GiB
ocs-storagecluster-cephblockpool                        3  128  769 GiB  197.80k  2.3 TiB  53.16    678 GiB
ocs-storagecluster-cephfilesystem-metadata              4   32  236 GiB    8.16M  682 GiB  25.11    678 GiB
ocs-storagecluster-cephfilesystem-data0                 5  116  301 GiB   22.99M  4.7 TiB  70.36    678 GiB
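To break those pool numbers down further (object counts, omap and compression details, MDS-side load), a possible follow-up from the toolbox pod, with the filesystem name inferred from the pool names above:

# detailed per-pool usage
ceph df detail
rados df
# MDS-side view of the filesystem
ceph fs status ocs-storagecluster-cephfilesystem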


We also uncovered a large number of unneeded Kasten volume snapshots on CephFS, which the customer has since cleaned up.

$ less namespaces/openshift-storage/oc_output/volumesnapshot_-A | grep cephfs | wc -l
14072
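For cleanups like this, one possible way to pick candidates is to list CephFS-backed snapshots oldest first and delete the stale ones (names and namespaces below are placeholders):

# list snapshots oldest-first to identify cleanup candidates
oc get volumesnapshot -A --sort-by=.metadata.creationTimestamp | grep cephfs
# delete a specific stale snapshot
oc delete volumesnapshot <snapshot-name> -n <namespace>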