Bug 1485783 - [CephFS] Standby-Replay daemon is hanging in "resolve" state while trying to take over rank
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 3.0
Assignee: Patrick Donnelly
QA Contact: Ramakrishnan Periyasamy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-28 05:49 UTC by Ramakrishnan Periyasamy
Modified: 2017-12-05 23:41 UTC
CC List: 7 users

Fixed In Version: RHEL: ceph-12.2.1-1.el7cp Ubuntu: ceph_12.2.1-2redhat1xenial
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-05 23:41:09 UTC
Embargoed:




Links
Ceph Project Bug Tracker 21221 - 2017-09-04 18:33:26 UTC
Red Hat Product Errata RHBA-2017:3387 (normal, SHIPPED_LIVE): Red Hat Ceph Storage 3.0 bug fix and enhancement update - 2017-12-06 03:03:45 UTC

Description Ramakrishnan Periyasamy 2017-08-28 05:49:57 UTC
Description of problem:

The Standby-Replay daemon is not taking over the failed rank's daemon during client IO (the IO consists of MDS operations such as rename, chgrp, chown, setxattr, symlink, and hardlink of files).

Observed "MDS behind on trimming" health messages in the "ceph -w" console.

Copied "ceph -w" console logs in this link, https://privatebin-it-iso.int.open.paas.redhat.com/?e366844336dda378#zgP8h1akL+3GhQ6fx9l7fFrTZV8LCMs3cOLT720wj4w=

CephFS setup Details:
3 MDS daemons (2 active MDS and 1 Standby-Replay MDS for rank 1). When the active MDS holding rank 1 went down, the standby-replay daemon went to the resolve state but never became active. Saw "MDS daemon 'host061' is not responding, replacing it as rank 1 with standby 'host058'" WARN messages.


Version-Release number of selected component (if applicable):
ceph: ceph version 12.1.2-1.el7cp (b661348f156f148d764b998b65b90451f096cb27) luminous (rc)

How reproducible:
4/4

Steps to Reproduce:
1. Configure 3 MDS daemons (2 active and 1 standby-replay).
2. Create 50K directories and 1 million files.
3. Start client IO with "rename, chgrp, chown, setxattr, symlink, hardlink" operations and, while the IO is running, fail over the active MDS daemon (a hedged command sketch follows below).
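
For reference, a minimal command sketch of the failover in step 3, assuming a filesystem named "cephfs", a client mounted at /mnt/cephfs, and that rank 1 is the rank being failed (names and paths are illustrative, not taken from this report):

  # check which daemon currently holds each rank
  ceph mds stat

  # while the rename/chgrp/chown/setxattr/symlink/hardlink workload is running,
  # fail the active MDS holding rank 1 so the standby-replay daemon has to take over
  ceph mds fail 1

  # watch the takeover; the expected progression is replay -> resolve -> ... -> active
  ceph -w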

Actual results:
Standby-Replay fails to take over the failed active MDS daemon. No crashes were observed in the Standby-Replay MDS logs.

Expected results:
The Standby-Replay daemon should take over the failed rank and become active.

Additional info:

Comment 4 John Spray 2017-08-28 08:53:13 UTC
Since you have two active MDS daemons and one standby replay, the standby replay daemon will be arbitrarily picking one of the active ones to follow.  If the other active daemon is killed, then the standby replay will not replace it.
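
For completeness, a hedged sketch of how a standby-replay daemon can be tied to a specific rank in a Luminous-era ceph.conf so it follows rank 1 rather than an arbitrarily chosen active; the section name reuses the standby 'host058' from this report, and the exact layout may differ on your cluster:

  # append to /etc/ceph/ceph.conf on the standby node, then restart its ceph-mds
  cat >> /etc/ceph/ceph.conf <<'EOF'
  [mds.host058]
  mds_standby_replay = true
  mds_standby_for_rank = 1
  EOF
  systemctl restart ceph-mds@host058

With this, the standby-replay daemon only replays and takes over rank 1.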

Comment 5 Ramakrishnan Periyasamy 2017-08-28 12:41:07 UTC
John, 

The standby-replay daemon is configured for the rank 1 MDS. When the existing MDS node goes for a reboot, the standby-replay MDS does not replace the failed MDS. It moves from replay to resolve state but never reaches up:active, and the fs stays degraded.

Please check this pastebin link http://pastebin.test.redhat.com/511697

Comment 6 Patrick Donnelly 2017-08-28 17:20:27 UTC
Ramakrishnan, do you have logs for the MDS daemons?

Comment 8 Yan, Zheng 2017-09-04 09:04:37 UTC
The reason is that there were about 8k subtrees in directory /. The MDS calls MDCache::try_subtree_merge("root dirfrag") while processing the resolve message, and try_subtree_merge() calls MDCache::try_subtree_merge_at() for each subtree.
try_subtree_merge_at() calls MDCache::show_subtrees(15) just before it returns.
So MDCache::try_subtree_merge("root dirfrag") prints about 64M lines of messages (when debug_mds >= 15), and printing these messages took several minutes.

This issue happens only when there are lots of subtrees and debug_mds >= 10.

Did you use 'ceph.dir.pin', or were these subtrees automatically created by the balancer?


FYI: Please set debug_mds to 10 during MDS QE tests. 'debug_mds == 20' is too verbose and significantly slows down the MDS.
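
For reference, a hedged example of setting the recommended level on a running daemon; the daemon name 'host058' is reused from this report and is illustrative:

  # inject the new level into a running MDS
  ceph tell mds.host058 injectargs '--debug_mds 10'

  # or via the daemon's admin socket on its host
  ceph daemon mds.host058 config set debug_mds 10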

Comment 9 Yan, Zheng 2017-09-04 09:11:28 UTC
Opened upstream ticket: http://tracker.ceph.com/issues/21221

Comment 10 Ramakrishnan Periyasamy 2017-09-04 11:50:44 UTC
Yes, I've used "ceph.dir.pin". There are a total of 40k directories pinned across the 2 active MDS daemons (i.e. 20k pins on each MDS).

Comment 11 Yan, Zheng 2017-09-04 11:53:59 UTC
Please don't use ceph.dir.pin this way. It's better to create dir0 and dir1 and set ceph.dir.pin on dir0 and dir1 only, then create the lots of sub-directories inside dir0 and dir1.
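
A minimal sketch of that layout, assuming a CephFS mount at /mnt/cephfs (the path and directory counts are illustrative):

  # pin exactly two top-level directories, one per active rank
  mkdir /mnt/cephfs/dir0 /mnt/cephfs/dir1
  setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/dir0
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/dir1

  # create the many sub-directories inside dir0/dir1; they inherit the parent's pin,
  # so the MDS tracks two pinned subtrees instead of tens of thousands
  mkdir -p /mnt/cephfs/dir0/sub{1..1000} /mnt/cephfs/dir1/sub{1..1000}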

Comment 12 Patrick Donnelly 2017-09-04 18:44:46 UTC
https://github.com/ceph/ceph/pull/17456

Comment 16 Ramakrishnan Periyasamy 2017-10-17 03:51:35 UTC
Moving this bz to verified state.

verified in ceph version 12.2.1-10.el7cp (5ba1c3fa606d7bf16f72756b0026f04a40297673) luminous (stable)

Comment 19 errata-xmlrpc 2017-12-05 23:41:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387

