Bug 1601138 - MDS stuck in up:resolve during many concurrent failovers
Summary: MDS stuck in up:resolve during many concurrent failovers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: CephFS
Version: 3.0
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: z5
Target Release: 3.0
Assignee: Yan, Zheng
QA Contact: Vasu Kulkarni
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-14 03:43 UTC by Patrick Donnelly
Modified: 2018-08-13 20:58 UTC

Fixed In Version: RHEL: ceph-12.2.4-39.el7cp Ubuntu: ceph_12.2.4-44redhat1
Doc Type: Bug Fix
Doc Text:
Previously, in cluster configurations with multiple active metadata servers, it was possible for an MDS to become stuck in the "up:resolve" state during recovery. This generally happened when recoveries ran concurrently and an active MDS was stuck on a long-running operation, such as balancing metadata load. The problem could only be resolved by restarting the stuck MDS. With this update, the code has been fixed so that an MDS no longer misses the updates from the Monitors indicating that another MDS has failed. MDSs no longer become stuck in "up:resolve", and recovery proceeds normally.
Clone Of:
Environment:
Last Closed: 2018-08-09 18:27:13 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 25048 None None None 2018-07-21 04:04:41 UTC
Red Hat Bugzilla 1598443 None None None 2019-06-14 03:32:53 UTC
Red Hat Bugzilla 1607601 None CLOSED MDS should dump recent log messages in memory before respawn 2019-06-14 03:32:53 UTC
Red Hat Bugzilla 1607606 None CLOSED MDs should dump MDSMap epoch currently being processed at low debug level 2019-06-14 03:32:52 UTC
Red Hat Product Errata RHBA-2018:2375 None None None 2018-08-09 18:28:03 UTC

Internal Links: 1598443 1607601 1607606

Description Patrick Donnelly 2018-07-14 03:43:30 UTC
Description of problem:

When two or more active MDS repeatedly and concurrently fail over, it's possible for one MDS to become stuck in up:resolve state.
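A stuck MDS can be identified from the daemon states reported by the cluster. The commands below are a general sketch for diagnosing and working around the symptom (the daemon ID `a` is a placeholder; these commands require a running cluster and are not specific to this bug's reproducer):

```
# Show filesystem and MDS daemon states; a stuck daemon will
# remain in up:resolve instead of progressing toward up:active
ceph fs status
ceph mds stat

# Query the affected daemon directly via its admin socket
ceph daemon mds.a status

# Workaround prior to the fix: restart the stuck daemon on its
# host so it picks up the MDSMap updates it missed
systemctl restart ceph-mds@a
```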

Version-Release number of selected component (if applicable):

3.0

How reproducible:

Not yet known.

Steps to Reproduce:

No reproducer is known yet.

Comment 13 Vasu Kulkarni 2018-07-28 20:13:05 UTC
Cherry-picked https://github.com/ceph/ceph/pull/23169 and ran it on the downstream build; looks good.

http://pulpito.ceph.redhat.com/vasu-2018-07-27_19:33:34-fs-luminous-distro-basic-argo/

Comment 16 errata-xmlrpc 2018-08-09 18:27:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2375

