Bug 1485783
| Summary: | [CephFS] Standby-Replay daemon is hanging in "resolve" state while trying to take over rank | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Ramakrishnan Periyasamy <rperiyas> |
| Component: | CephFS | Assignee: | Patrick Donnelly <pdonnell> |
| Status: | CLOSED ERRATA | QA Contact: | Ramakrishnan Periyasamy <rperiyas> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.0 | CC: | ceph-eng-bugs, hnallurv, icolle, john.spray, kdreyer, rperiyas, zyan |
| Target Milestone: | rc | | |
| Target Release: | 3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | RHEL: ceph-12.2.1-1.el7cp Ubuntu: ceph_12.2.1-2redhat1xenial | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-12-05 23:41:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Ramakrishnan Periyasamy
2017-08-28 05:49:57 UTC
Since you have two active MDS daemons and one standby-replay, the standby-replay daemon will arbitrarily pick one of the active daemons to follow. If the other active daemon is killed, the standby-replay will not replace it.

John, the standby-replay daemon is configured for the rank 1 MDS. When the existing MDS node goes for a reboot, the standby-replay MDS does not replace the failed MDS: it goes replay --> resolve but never reaches up:active, and the fs stays degraded. Please check this pastebin link: http://pastebin.test.redhat.com/511697

Ramakrishnan, do you have logs for the MDS daemons?

The reason is that there were about 8k subtrees in directory /. The MDS calls MDCache::try_subtree_merge("root dirfrag") while processing the resolve message, and try_subtree_merge() calls MDCache::try_subtree_merge_at() for each subtree. try_subtree_merge_at() calls MDCache::show_subtrees(15) just before it returns, and each show_subtrees() call dumps the whole subtree map, so with about 8k subtrees each call prints on the order of 8k lines. As a result, MDCache::try_subtree_merge("root dirfrag") printed roughly 8k x 8k, about 64M, lines of messages (when debug_mds >= 15), and printing them took several minutes.
This issue happens only when there are lots of subtrees and debug_mds >= 10.
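For illustration, the slowdown is roughly quadratic in the number of subtrees: each try_subtree_merge_at() call ends by dumping the entire subtree map. The sketch below mimics that pattern with simplified stand-ins; the Subtree struct and the one-line-per-subtree output are assumptions for illustration, not the actual MDCache code.

```cpp
#include <cstdio>
#include <vector>

// Simplified stand-in for one MDCache subtree entry.
struct Subtree { int id = 0; };

// Mimics MDCache::show_subtrees(15): dumps the whole subtree map,
// i.e. roughly one line per existing subtree.
void show_subtrees(const std::vector<Subtree>& subtrees) {
    for (const auto& st : subtrees)
        std::printf("subtree %d\n", st.id);
}

// Mimics MDCache::try_subtree_merge_at(): does its merge work, then
// dumps the full subtree map just before returning (at debug_mds >= 15).
void try_subtree_merge_at(const Subtree& st, const std::vector<Subtree>& subtrees) {
    (void)st;                 // merge logic elided in this sketch
    show_subtrees(subtrees);
}

// Mimics MDCache::try_subtree_merge("root dirfrag") during resolve:
// called once per subtree, so the total output is ~N * N lines.
void try_subtree_merge(const std::vector<Subtree>& subtrees) {
    for (const auto& st : subtrees)
        try_subtree_merge_at(st, subtrees);
}

int main() {
    std::vector<Subtree> subtrees(8000);   // ~8k subtrees under /
    for (int i = 0; i < (int)subtrees.size(); ++i)
        subtrees[i].id = i;
    try_subtree_merge(subtrees);           // ~8000 * 8000 = 64M log lines
    return 0;
}
```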
Did you use 'ceph.dir.pin', or were these subtrees created automatically by the balancer?
FYI: please set debug_mds to 10 during MDS QE tests. 'debug_mds == 20' is too verbose; it significantly slows down the MDS.
Opened upstream ticket: http://tracker.ceph.com/issues/21221

Yes, I've used "ceph.dir.pin". There are a total of 40k directories pinned across the 2 active MDSs (i.e. 20k pins on each MDS).

Please don't use ceph.dir.pin this way. It's better to create dir0 and dir1, set ceph.dir.pin on dir0 and dir1, and then create lots of sub-directories inside dir0 and dir1 (see the sketch at the end of this report).

Moving this bz to verified state. Verified in ceph version 12.2.1-10.el7cp (5ba1c3fa606d7bf16f72756b0026f04a40297673) luminous (stable).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387
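To make the pinning recommendation above concrete, here is a minimal sketch of the suggested layout: two top-level directories, each pinned to one MDS rank, with the many test sub-directories created underneath them instead of being pinned individually. The /mnt/cephfs mount point, the dir0/dir1/subdir_N names, the pin_dir() helper, and the 20k-per-rank count are assumptions for illustration; setting the ceph.dir.pin extended attribute is the documented way to pin a directory to a rank.

```cpp
#include <cstdio>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/xattr.h>

// Pin a CephFS directory to an MDS rank by writing the ceph.dir.pin
// virtual extended attribute (the value is the rank as a decimal string).
static int pin_dir(const std::string& path, int rank) {
    const std::string value = std::to_string(rank);
    return setxattr(path.c_str(), "ceph.dir.pin",
                    value.c_str(), value.size(), 0);
}

int main() {
    const std::string base = "/mnt/cephfs";   // assumed kernel-client mount point

    // One top-level directory per active rank, pinned once.
    for (int rank = 0; rank < 2; ++rank) {
        const std::string dir = base + "/dir" + std::to_string(rank);
        mkdir(dir.c_str(), 0755);
        if (pin_dir(dir, rank) != 0)
            std::perror("setxattr ceph.dir.pin");

        // The many test sub-directories live under the pinned parent and
        // fall under its pin, rather than carrying 40k individual pins.
        for (int i = 0; i < 20000; ++i) {
            const std::string sub = dir + "/subdir_" + std::to_string(i);
            mkdir(sub.c_str(), 0755);
        }
    }
    return 0;
}
```

From a shell, the same pin can be set with setfattr -n ceph.dir.pin -v <rank> <dir> on the mounted filesystem.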