1713527 – [GSS] clients not able to mount cephfs and mds stuck in up:replay

Keywords:
Status:	CLOSED DUPLICATE of bug 1714810
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	CephFS
Sub Component:
Version:	3.1
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	3.3
Assignee:	Patrick Donnelly
QA Contact:	ceph-qe-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-05-24 01:26 UTC by Prashant Dhange
Modified:	2021-08-27 22:38 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-14 18:03:57 UTC
Embargoed:
Dependent Products:
Flags:	pdhange: automate_bug?

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHCEPH-1100	0	None	None	None	2021-08-27 22:38:34 UTC

Internal Links: 1714810 1714812 1714814

Comment 3 Yan, Zheng 2019-05-24 02:38:04 UTC

"Behind on trimming (105235/128)max_segments" can explain this. There are a lot of log segments, and replaying them takes a long time.

Ask the customer not to ignore the 'MDSs behind on trimming' health warning next time.
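
To make the size of the backlog concrete, here is a minimal Python sketch that pulls the two numbers out of the warning text quoted above and reports how far the segment count exceeds the limit. The regular expression only covers the fragment quoted in this comment; the full health message may carry more fields, and the helper name `trimming_backlog` is made up for illustration.

```python
import re

# Matches the "(num_segments/max_segments)" pair as quoted in this comment.
# Assumption: the real health-detail line may contain additional fields.
TRIM_RE = re.compile(r"Behind on trimming \((\d+)/(\d+)\)")


def trimming_backlog(message: str):
    """Return (num_segments, max_segments, ratio), or None if nothing matches."""
    match = TRIM_RE.search(message)
    if match is None:
        return None
    num_segments = int(match.group(1))
    max_segments = int(match.group(2))
    return num_segments, max_segments, num_segments / max_segments


result = trimming_backlog('"Behind on trimming (105235/128)max_segments"')
if result is not None:
    num_segments, max_segments, ratio = result
    print(f"{num_segments} segments against a limit of {max_segments} "
          f"(about {ratio:.0f}x over)")
```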

Comment 10 Yan, Zheng 2019-05-24 07:45:38 UTC

Connected clients do affect mds journal replay (it's unlikely that they do I/O on the metadata pool). The best solution for now is to wait until journal replay finishes, because resetting the journal and scanning the whole filesystem may also take a very long time.

Disabling all mds debug logging can speed up journal replay.

Comment 11 Yan, Zheng 2019-05-24 08:01:21 UTC

Sorry, I meant "Connected clients do not affect mds journal replay".

Comment 13 Yan, Zheng 2019-05-24 08:31:45 UTC

For /cases/02388834/ceph-mds.storageM3-STG-NGN1.log.tgz/ceph-mds.storageM3-STG-NGN1.log

The recovering mds reported "heartbeat map not healthy" while it was in the rejoin stage. It is likely that the mds was iterating over all inodes. To prevent the mds from being replaced by the monitor, set the monitor's mds_beacon_grace config to 300 or more.
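
The effect of raising mds_beacon_grace can be pictured with a deliberately simplified Python model: the monitor treats an mds as laggy once its last beacon is older than the grace period. This is a sketch of the idea only, not the monitor's actual logic. The 120-second stall is a hypothetical figure, and the 15-second value is assumed to be the default grace.

```python
def is_laggy(seconds_since_last_beacon: float, beacon_grace: float) -> bool:
    """Simplified model: laggy when the last beacon is older than the grace period."""
    return seconds_since_last_beacon > beacon_grace


# Hypothetical: an mds busy in rejoin sends no beacon for 120 seconds.
stall_seconds = 120.0

# 15 is assumed to be the default grace; 300 is the value suggested above.
for grace in (15.0, 300.0):
    verdict = "replaced" if is_laggy(stall_seconds, grace) else "kept"
    print(f"mds_beacon_grace={grace:.0f}: mds would be {verdict}")
```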

Comment 18 Yan, Zheng 2019-05-25 01:35:54 UTC

The default for mds_log_max_segments is 128. Decrease it by 100 every 10 seconds until it reaches 128.

There are a lot of log segments in this case. When the mds becomes active, it tries to trim all of them, which creates a lot of osd requests.
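
The gradual step-down described above can be sketched in Python as a schedule generator plus a dry run that only builds command strings and executes nothing. The starting value 428 and the mds name "a" are hypothetical. The `ceph tell ... injectargs` form is an assumption, so check it against the installed release before using it.

```python
from typing import Iterator, List

DEFAULT_MAX_SEGMENTS = 128  # default stated in the comment above
STEP = 100                  # decrease per step, from the comment above
INTERVAL_SECONDS = 10       # wait between steps, from the comment above


def step_down_schedule(start: int, target: int = DEFAULT_MAX_SEGMENTS,
                       step: int = STEP) -> Iterator[int]:
    """Yield successive settings, lowering by `step` and clamping at `target`."""
    value = start
    while value > target:
        value = max(value - step, target)
        yield value


def dry_run_commands(start: int, mds_name: str) -> List[str]:
    """Build (but do not execute) one command string per step.

    The command form is an assumption; verify it against the installed release.
    """
    return [
        f"ceph tell mds.{mds_name} injectargs '--mds_log_max_segments={value}'"
        for value in step_down_schedule(start)
    ]


if __name__ == "__main__":
    # Hypothetical starting value and mds name, for illustration only.
    commands = dry_run_commands(start=428, mds_name="a")
    for command in commands:
        print(command)
    print(f"{len(commands)} steps, about "
          f"{len(commands) * INTERVAL_SECONDS} s in total")
```

Stepping down in small increments spreads the trimming work over time, so the osd requests are not all issued at once. With a step of 100 and a 10-second interval, a hypothetical starting value of 105300 would take 1052 steps, or a little under three hours.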

Comment 32 Yan, Zheng 2019-05-27 13:25:24 UTC

No new findings from the log. It still looks like http://tracker.ceph.com/issues/40028