1566016 – [cephfs]: MDS asserted while in Starting/resolve state

Summary: [cephfs]: MDS asserted while in Starting/resolve state

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	CephFS
Sub Component:
Version:	3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	z4
Target Release:	3.0
Assignee:	Patrick Donnelly
QA Contact:	Ramakrishnan Periyasamy
Docs Contact:	Erin Donnelly
URL:
Whiteboard:
Duplicates (1):	1567030
Depends On:
Blocks:	1557269 1578142

Reported:	2018-04-11 11:02 UTC by Ramakrishnan Periyasamy
Modified:	2021-09-09 13:43 UTC
CC List:	12 users
Fixed In Version:	RHEL: ceph-12.2.4-15.el7cp Ubuntu: 12.2.4-19redhat1xenial
Doc Type:	Bug Fix
Doc Text:	Previously, when increasing "max_mds" from "1" to "2", if the Metadata Server (MDS) daemon stayed in the starting/resolve state for a long period of time, restarting the MDS daemon led to an assertion failure. This caused the Ceph File System (CephFS) to be in a degraded state. With this update, the underlying issue has been fixed, and increasing "max_mds" no longer causes CephFS to be in a degraded state.
Clone Of:
Clones:	1578142
Environment:
Last Closed:	2018-07-11 18:11:08 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	23812	None	None	None	2018-04-25 18:01:55 UTC
Ceph Project Bug Tracker	23813	None	None	None	2018-05-03 18:23:08 UTC
Ceph Project Bug Tracker	23919	None	None	None	2018-04-30 04:29:07 UTC
Red Hat Bugzilla	1567030	urgent	CLOSED	[Cephfs:Fuse]: Fuse service stopped and crefi IO failed during MDS in starting state.	2021-02-22 00:41:40 UTC
Red Hat Issue Tracker	RHCEPH-1565	None	None	None	2021-09-09 13:43:13 UTC
Red Hat Product Errata	RHSA-2018:2177	None	None	None	2018-07-11 18:11:55 UTC

Internal Links: 1567030

Description Ramakrishnan Periyasamy 2018-04-11 11:02:19 UTC

Description of problem:

/builddir/build/BUILD/ceph-12.2.4/src/mds/Locker.cc: 3793: FAILED assert(mds->is_rejoin() || mds->is_clientreplay() || mds->is_active() || mds->is_stopping())

 ceph version 12.2.4-6.el7cp (78f60b924802e34d44f7078029a40dbe6c0c922f) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55bba8bda2a0]
 2: (Locker::handle_lock(MLock*)+0x57) [0x55bba8a4a7c7]
 3: (Locker::dispatch(Message*)+0x85) [0x55bba8a565e5]
 4: (MDSRank::handle_deferrable_message(Message*)+0xbb4) [0x55bba88bd264]
 5: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55bba88cab33]
 6: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55bba88cb975]
 7: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55bba88b4593]
 8: (DispatchQueue::entry()+0x792) [0x55bba8ec3d32]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x55bba8c5fefd]
 10: (()+0x7dd5) [0x7f2098147dd5]
 11: (clone()+0x6d) [0x7f2097227b3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
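
The failed assert lists the only MDS states in which Locker::handle_lock expects to run. As a rough illustration, the check can be modelled in a few lines of Python. This is a simplified sketch, not Ceph code: the names MDSState, LOCK_MESSAGE_STATES, can_handle_lock and handle_lock below are hypothetical, and the reading that the lock message arrived while the rank was still in starting/resolve is inferred from the bug title and the trace.

```python
from enum import Enum


class MDSState(Enum):
    """Subset of MDS states mentioned in this report (hypothetical model)."""
    STARTING = "starting"
    RESOLVE = "resolve"
    REJOIN = "rejoin"
    CLIENTREPLAY = "clientreplay"
    ACTIVE = "active"
    STOPPING = "stopping"


# States accepted by the assert quoted from Locker.cc:3793:
# is_rejoin() || is_clientreplay() || is_active() || is_stopping()
LOCK_MESSAGE_STATES = frozenset({
    MDSState.REJOIN,
    MDSState.CLIENTREPLAY,
    MDSState.ACTIVE,
    MDSState.STOPPING,
})


def can_handle_lock(state: MDSState) -> bool:
    """Return True if the modelled assert would pass in this state."""
    return state in LOCK_MESSAGE_STATES


def handle_lock(state: MDSState) -> None:
    """Mimic the assert: fail when a lock message arrives in any other state."""
    assert can_handle_lock(state), (
        f"lock message received in state {state.value}"
    )
```

In this model, handle_lock(MDSState.RESOLVE) raises AssertionError, which matches the reported symptom of an MDS that receives a lock message before it reaches rejoin.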


Version-Release number of selected component (if applicable):
 ceph version 12.2.4-6.el7cp

How reproducible:
1/1

Steps to Reproduce:
1. Configure a cluster with 3 MDS daemons (2 active and 1 standby).
2. Reduce max_mds to 1.
3. Deactivate the rank 1 MDS and wait until it stops.
4. Stop the standby MDS daemon.
5. Restart the active MDS and wait until it becomes active again.
6. Start the standby MDS.
7. Increase max_mds to 2. The MDS will be in the starting state.

The MDS stayed in the starting state for more than 2 hours. I tried restarting the MDS that was in the starting state; after the service restart it was in the starting state again and then moved to the resolve state.

The MDS is still in the resolve state, and an assert was observed in the MDS log.
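
The report does not give the exact commands used, so the following is only a sketch of how steps 2 to 7 might map onto the Luminous-era CLI. The file system name, rank role and daemon IDs are hypothetical parameters, the helper repro_commands is made up for illustration, and the assumption that the daemons run as ceph-mds@<id> systemd units may not match the tested setup. The script only builds and prints the commands; it does not run them. Note that "ceph mds deactivate" applies to Luminous (12.2.x) and is an assumption about what was used here.

```python
def repro_commands(fs_name, rank1_role, active_id, standby_id):
    """Build the command list for steps 2-7 (step 1, cluster setup, is assumed done).

    All arguments are placeholders supplied by the caller; nothing is executed.
    """
    return [
        # Step 2: reduce max_mds to 1.
        ["ceph", "fs", "set", fs_name, "max_mds", "1"],
        # Step 3: deactivate rank 1, then wait until it has stopped.
        ["ceph", "mds", "deactivate", rank1_role],
        # Step 4: stop the standby MDS daemon.
        ["systemctl", "stop", f"ceph-mds@{standby_id}"],
        # Step 5: restart the active MDS, then wait until it is active again.
        ["systemctl", "restart", f"ceph-mds@{active_id}"],
        # Step 6: start the standby MDS.
        ["systemctl", "start", f"ceph-mds@{standby_id}"],
        # Step 7: increase max_mds to 2.
        ["ceph", "fs", "set", fs_name, "max_mds", "2"],
    ]


if __name__ == "__main__":
    # Hypothetical names: file system "cephfs", daemons "mds-a" and "mds-b".
    for cmd in repro_commands("cephfs", "cephfs:1", "mds-a", "mds-b"):
        print(" ".join(cmd))
```

The waits in steps 3 and 5 are left as comments because the report does not say how the MDS state was polled.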

Actual results:
MDS in resolve state, and fs state is degraded

Expected results:
MDS should become active and FS should be OK

Additional info:

Comment 12 Ramakrishnan Periyasamy 2018-04-20 13:46:33 UTC

Sorry, please ignore the previous comment; it was posted to the wrong BZ.

Comment 13 Ramakrishnan Periyasamy 2018-04-20 15:55:49 UTC

Added doc text

Comment 14 Yan, Zheng 2018-04-26 13:01:57 UTC

this one should be fixed by https://github.com/ceph/ceph/pull/21601

Comment 46 Patrick Donnelly 2018-06-05 19:52:16 UTC

*** Bug 1567030 has been marked as a duplicate of this bug. ***

Comment 47 Ramakrishnan Periyasamy 2018-06-19 08:15:01 UTC

Moving this bug to the verified state. No MDS assert was observed during testing.

Verified in ceph version 12.2.4-27.el7cp

CI Automation regression runs passed without any issues.

Comment 49 errata-xmlrpc 2018-07-11 18:11:08 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2177
