Bug 1601138

Summary: MDS stuck in up:resolve during many concurrent failovers
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Patrick Donnelly <pdonnell>
Component: CephFS
Assignee: Yan, Zheng <zyan>
Status: CLOSED ERRATA
QA Contact: Vasu Kulkarni <vakulkar>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.0
CC: anharris, ceph-eng-bugs, edonnell, john.spray, pdonnell, rperiyas, tchandra, tserlin, vumrao
Target Milestone: z5
Target Release: 3.0
Hardware: All
OS: All
Whiteboard:
Fixed In Version: RHEL: ceph-12.2.4-39.el7cp Ubuntu: ceph_12.2.4-44redhat1
Doc Type: Bug Fix
Doc Text:
Previously, in cluster configurations with multiple active Metadata Servers (MDSs), an MDS could become stuck in the "up:resolve" state during recovery. This generally happened in scenarios involving concurrent recovery while active MDSs were stuck on long-running operations such as balancing metadata load. The stuck MDS could be recovered only by restarting it. With this update, the underlying issue has been fixed: an MDS could miss updates from the Monitors that indicated another MDS had failed. MDSs no longer become stuck in "up:resolve" and continue recovery.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-08-09 18:27:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Patrick Donnelly 2018-07-14 03:43:30 UTC
Description of problem:

When two or more active MDSs repeatedly and concurrently fail over, it is possible for one MDS to become stuck in the up:resolve state.
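
For anyone trying to confirm the symptom, the following is a minimal sketch (not part of the original report) of a helper that lists MDS daemons reporting a given state. It assumes the input is the parsed JSON from `ceph fs dump --format json`, with a `filesystems` list whose `mdsmap.info` entries carry `name`, `rank`, and `state`; that layout should be checked against the release in use. The function name `find_mds_in_state` and the `SAMPLE_FS_DUMP` data are hypothetical and exist only for illustration.

```python
def find_mds_in_state(fs_dump, state="up:resolve"):
    """Return sorted (fs_name, rank, daemon_name) tuples for MDSs in `state`.

    `fs_dump` is assumed to be the parsed JSON of `ceph fs dump --format json`.
    Missing keys are tolerated so a partial dump does not raise.
    """
    matches = []
    for fs in fs_dump.get("filesystems", []):
        mdsmap = fs.get("mdsmap", {})
        fs_name = mdsmap.get("fs_name", "<unknown>")
        # "info" maps "gid_<N>" keys to one record per MDS daemon.
        for info in mdsmap.get("info", {}).values():
            if info.get("state") == state:
                matches.append((fs_name, info.get("rank"), info.get("name")))
    # Sort on string forms so a missing rank or name cannot break ordering.
    matches.sort(key=lambda m: (m[0], str(m[1]), str(m[2])))
    return matches


# Hypothetical, heavily trimmed dump: rank 0 is active, rank 1 is in resolve.
SAMPLE_FS_DUMP = {
    "filesystems": [
        {
            "mdsmap": {
                "fs_name": "cephfs",
                "info": {
                    "gid_4101": {"name": "a", "rank": 0, "state": "up:active"},
                    "gid_4102": {"name": "b", "rank": 1, "state": "up:resolve"},
                },
            }
        }
    ]
}

if __name__ == "__main__":
    for fs_name, rank, name in find_mds_in_state(SAMPLE_FS_DUMP):
        print(f"{fs_name}: mds.{name} (rank {rank}) is in up:resolve")
```

A single hit is not proof of this bug, since up:resolve is a normal transient state during multi-MDS recovery. The helper would need to be run repeatedly, and only a daemon that stays in the state across many polls matches the symptom described here.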

Version-Release number of selected component (if applicable):

3.0

How reproducible:

Not yet known.

Steps to Reproduce:

No reproducer is known yet.

Comment 13 Vasu Kulkarni 2018-07-28 20:13:05 UTC
Cherry-picked https://github.com/ceph/ceph/pull/23169 and ran on downstream, looks good

http://pulpito.ceph.redhat.com/vasu-2018-07-27_19:33:34-fs-luminous-distro-basic-argo/

Comment 16 errata-xmlrpc 2018-08-09 18:27:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2375