Bug 1601138 - MDS stuck in up:resolve during many concurrent failovers
Summary: MDS stuck in up:resolve during many concurrent failovers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: CephFS
Version: 3.0
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: z5
Target Release: 3.0
Assignee: Yan, Zheng
QA Contact: Vasu Kulkarni
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-14 03:43 UTC by Patrick Donnelly
Modified: 2018-08-13 20:58 UTC

Fixed In Version: RHEL: ceph-12.2.4-39.el7cp Ubuntu: ceph_12.2.4-44redhat1
Doc Type: Bug Fix
Doc Text:
Previously, in cluster configurations with multiple active metadata servers, it was possible for an MDS to become stuck in the "up:resolve" state during recovery. This generally happened when recoveries ran concurrently and an active MDS was stuck on a long-running operation, such as balancing metadata load. The problem could only be resolved by restarting the stuck MDS. With this update, the code has been fixed so that an MDS no longer misses the updates from the Monitors indicating that another MDS has failed. MDSs no longer become stuck in "up:resolve", and recovery proceeds normally.
Clone Of:
Environment:
Last Closed: 2018-08-09 18:27:13 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 25048 None None None 2018-07-21 04:04:41 UTC
Red Hat Bugzilla 1598443 None None None 2019-06-14 03:32:53 UTC
Red Hat Bugzilla 1607601 None CLOSED MDS should dump recent log messages in memory before respawn 2019-06-14 03:32:53 UTC
Red Hat Bugzilla 1607606 None CLOSED MDs should dump MDSMap epoch currently being processed at low debug level 2019-06-14 03:32:52 UTC
Red Hat Product Errata RHBA-2018:2375 None None None 2018-08-09 18:28:03 UTC

Internal Links: 1598443 1607601 1607606

Description Patrick Donnelly 2018-07-14 03:43:30 UTC
Description of problem:

When two or more active MDS repeatedly and concurrently fail over, it's possible for one MDS to become stuck in up:resolve state.
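A stuck MDS can be identified from the daemon states reported by the cluster. The commands below are a general sketch for diagnosing and working around the symptom (the daemon ID `a` is a placeholder; these commands require a running cluster and are not specific to this bug's reproducer):

```
# Show filesystem and MDS daemon states; a stuck daemon will
# remain in up:resolve instead of progressing toward up:active
ceph fs status
ceph mds stat

# Query the affected daemon directly via its admin socket
ceph daemon mds.a status

# Workaround prior to the fix: restart the stuck daemon on its
# host so it picks up the MDSMap updates it missed
systemctl restart ceph-mds@a
```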

Version-Release number of selected component (if applicable):

3.0

How reproducible:

Not yet known.

Steps to Reproduce:

No reproducer is known yet.

Comment 13 Vasu Kulkarni 2018-07-28 20:13:05 UTC
Cherry-picked https://github.com/ceph/ceph/pull/23169 and ran it on the downstream build; looks good.

http://pulpito.ceph.redhat.com/vasu-2018-07-27_19:33:34-fs-luminous-distro-basic-argo/

Comment 16 errata-xmlrpc 2018-08-09 18:27:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2375

