Bug 1714812 - MDS may fail heartbeats during up:replay if there are too many journal segments
Summary: MDS may fail heartbeats during up:replay if there are too many journal segments
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 3.1
Hardware: All
OS: All
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: 3.3
Assignee: Yan, Zheng
QA Contact: ceph-qe-bugs
URL:
Whiteboard: NeedsDev
Depends On:
Blocks:
Reported: 2019-05-28 23:09 UTC by Patrick Donnelly
Modified: 2019-05-30 18:36 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-30 18:36:12 UTC
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1713527 | 0 | urgent | CLOSED | [GSS] clients not able to mount cephfs and mds stuck in up:replay | 2021-08-27 22:40:18 UTC

Description Patrick Donnelly 2019-05-28 23:09:41 UTC
Description of problem:

If there are too many journal segments during recovery, the MDS will fail internal heartbeats.

See also: https://bugzilla.redhat.com/show_bug.cgi?id=1713527#c10

Version-Release number of selected component (if applicable):

3.1

How reproducible:

A test case needs to be written.

Comment 1 Patrick Donnelly 2019-05-28 23:10:59 UTC
The MDS should also detect this situation and explain what is delaying recovery during up:replay.

Comment 2 Yan, Zheng 2019-05-29 10:59:02 UTC
Replaying lots of segments does not cause an unhealthy heartbeat. The original issue is that the MDS trims lots of log segments after it has recovered, and that trimming causes the unhealthy heartbeat.

Comment 3 Patrick Donnelly 2019-05-29 23:28:09 UTC
(In reply to Yan, Zheng from comment #2)
> Replaying lots of segments does not cause unhealthy heartbeat. The origin
> issue is that trimming lots of log segments after mds recovered, which
> causes unhealthy heartbeat

But the trimming occurs during up:active? The original issue was that the MDS was stuck in up:replay. The trimming issue once the MDS hits up:active is bz1714814.

Comment 4 Yan, Zheng 2019-05-30 02:14:35 UTC
Yes, trimming happens when the MDS is active. The original issue is that there were lots of log segments, and replaying them took a long time. After journal replay finished, the MDS started to trim the log, which caused an unhealthy heartbeat. The MDS was then replaced by a new MDS.
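
The sequence described in this comment can be illustrated with a toy model. This is only a sketch and not Ceph code: the grace period, the per-segment cost, and every function and parameter name below are hypothetical. It shows why a large backlog of log segments, trimmed in one pass without the heartbeat being reset, can exceed a heartbeat grace period even though each segment is cheap on its own.

```python
# Toy model only. None of these names or values come from Ceph.
GRACE_MS = 15_000  # hypothetical heartbeat grace period, in simulated ms


def max_heartbeat_gap_ms(num_segments, cost_ms, segments_per_reset):
    """Longest simulated interval between two heartbeat resets.

    num_segments:       size of the backlog to trim
    cost_ms:            simulated cost of trimming one segment
    segments_per_reset: how many segments are trimmed between resets
    """
    clock = 0        # simulated time, no real sleeping
    last_reset = 0   # time of the most recent heartbeat reset
    max_gap = 0
    for done in range(1, num_segments + 1):
        clock += cost_ms  # trim one segment
        if done % segments_per_reset == 0 or done == num_segments:
            max_gap = max(max_gap, clock - last_reset)
            last_reset = clock  # heartbeat is reset here
    return max_gap


def heartbeat_healthy(num_segments, cost_ms, segments_per_reset):
    """True if no gap between resets exceeds the grace period."""
    gap = max_heartbeat_gap_ms(num_segments, cost_ms, segments_per_reset)
    return gap <= GRACE_MS


if __name__ == "__main__":
    # Whole backlog in one pass: one long gap between resets.
    print(heartbeat_healthy(10_000, 10, 10_000))
    # Same backlog, but the heartbeat is reset every 100 segments.
    print(heartbeat_healthy(10_000, 10, 100))
```

In this model the gap grows linearly with the number of segments handled between resets, so the same backlog is unhealthy in one pass and healthy when the resets are frequent.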

Comment 5 Patrick Donnelly 2019-05-30 18:36:12 UTC
Yes, you're right. The description of the issue in the case confused me.

