Description of problem:
The standby-replay daemon does not take over for the failed rank's daemon during client IO (the IO consists of MDS operations such as rename, chgrp, chown, setxattr, symlink, and hardlink of files). Observed "MDS behind on trimming" health messages in the "ceph -w" console. The "ceph -w" console logs are copied to this link:
https://privatebin-it-iso.int.open.paas.redhat.com/?e366844336dda378#zgP8h1akL+3GhQ6fx9l7fFrTZV8LCMs3cOLT720wj4w=

CephFS setup details:
3 MDS (2 active MDS and 1 standby-replay MDS for rank 1). When the active MDS with rank 1 went down, the standby-replay daemon went to the resolve state but did not become active. Saw "MDS daemon 'host061' is not responding, replacing it as rank 1 with standby 'host058'" WARN messages.

Version-Release number of selected component (if applicable):
ceph: ceph version 12.1.2-1.el7cp (b661348f156f148d764b998b65b90451f096cb27) luminous (rc)

How reproducible:
4/4

Steps to Reproduce:
1. Configure 3 MDS (2 active and 1 standby-replay).
2. Create 50K directories and 1 million files.
3. Start client IO with "rename, chgrp, chown, setxattr, symlink, hardlink" operations and, while the IO is running, fail over the active MDS daemon (see the sketch under Additional info).

Actual results:
The standby-replay daemon fails to take over for the failed active MDS daemon. No crashes are observed in the standby-replay MDS logs.

Expected results:
The standby-replay daemon should not fail to take over the failed active MDS.

Additional info:
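For reference, the client IO from step 3 can be approximated with a short script along the lines of the sketch below (hypothetical; the mount point, directory/file counts, and uid/gid are illustrative, not the exact values used in the test):

#!/usr/bin/env python3
# Hypothetical sketch of the step 3 workload: rename, chown/chgrp, setxattr,
# symlink and hardlink operations against files on a CephFS mount.
import os

MOUNT = "/mnt/cephfs"   # assumed CephFS mount point
NUM_DIRS = 100          # scaled down from the 50K dirs / 1M files in the test
FILES_PER_DIR = 20

for d in range(NUM_DIRS):
    dpath = os.path.join(MOUNT, "dir%05d" % d)
    os.makedirs(dpath, exist_ok=True)
    for f in range(FILES_PER_DIR):
        fpath = os.path.join(dpath, "file%05d" % f)
        with open(fpath, "w") as fh:
            fh.write("data\n")
        os.rename(fpath, fpath + ".renamed")                  # rename
        os.chown(fpath + ".renamed", 1000, 1000)              # chown/chgrp (uid/gid illustrative; needs privileges)
        os.setxattr(fpath + ".renamed", "user.test", b"1")    # setxattr
        os.symlink(fpath + ".renamed", fpath + ".sym")        # symlink
        os.link(fpath + ".renamed", fpath + ".hard")          # hardlink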
Since you have two active MDS daemons and one standby-replay, the standby-replay daemon arbitrarily picks one of the active daemons to follow. If the other active daemon is killed, the standby-replay will not replace it.
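For reference, if the intent is for the standby-replay daemon to always follow rank 1, the Luminous per-daemon standby options can be set for that daemon. A minimal ceph.conf sketch (the section name mds.host058 is assumed from the warning message above):

[mds.host058]
    mds_standby_replay = true
    mds_standby_for_rank = 1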
John, the standby-replay daemon is configured for the rank 1 MDS. When the existing MDS node is rebooted, the standby-replay MDS does not replace the failed MDS: it goes through the replay --> resolve states but does not move to up:active, and the fs stays degraded. Please check this pastebin link: http://pastebin.test.redhat.com/511697
Ramakrishnan, do you have logs for the MDS daemons?
The reason is that there were about 8k subtrees in directory /. The MDS calls MDCache::try_subtree_merge("root dirfrag") while processing the resolve message. try_subtree_merge() calls MDCache::try_subtree_merge_at() for each subtree, and try_subtree_merge_at() calls MDCache::show_subtrees(15) just before it returns. Since each show_subtrees() call dumps every subtree, roughly 8k calls x 8k subtrees means MDCache::try_subtree_merge("root dirfrag") prints about 64M lines of messages (when debug_mds >= 15), and printing them took several minutes. This issue happens only when there are lots of subtrees and debug_mds >= 10. Did you use 'ceph.dir.pin', or were these subtrees automatically created by the balancer?

FYI: please set debug_mds to 10 during MDS QE tests. 'debug_mds == 20' is too verbose and significantly slows down the MDS.
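To illustrate the blow-up with a simplified sketch (this is not the actual Ceph code, just the shape of the calls described above): every merge attempt dumps the full subtree map, so the log output grows quadratically with the number of subtrees.

NUM_SUBTREES = 8000  # roughly the number of subtrees observed under /

def show_subtrees(subtrees):
    # stands in for MDCache::show_subtrees(15): one log line per subtree
    # when debug_mds >= 15
    return len(subtrees)

def try_subtree_merge(subtrees):
    # stands in for MDCache::try_subtree_merge("root dirfrag"):
    # try_subtree_merge_at() runs once per subtree and dumps the map each time
    lines = 0
    for _ in subtrees:
        lines += show_subtrees(subtrees)
    return lines

print(try_subtree_merge(range(NUM_SUBTREES)))  # 64,000,000 log lines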
Opened ticket: http://tracker.ceph.com/issues/21221
Yes, I've used "ceph.dir.pin". There are 40k directories in total pinned across the 2 active MDS (i.e. 20k pins on each MDS).
Please don't use ceph.dir.pin this way. It's better to create dir0 and dir1, set ceph.dir.pin on dir0 and dir1, and then create lots of sub-directories inside dir0 and dir1.
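A minimal sketch of that layout, assuming a CephFS mount at /mnt/cephfs (the mount point and counts are illustrative): pin only the two top-level directories, one per active rank, and create the many sub-directories underneath them.

import os

MOUNT = "/mnt/cephfs"   # assumed mount point

for rank in (0, 1):
    top = os.path.join(MOUNT, "dir%d" % rank)
    os.makedirs(top, exist_ok=True)
    # ceph.dir.pin is the CephFS xattr for subtree pinning; the value is the
    # MDS rank that should stay authoritative for the directory's subtree.
    os.setxattr(top, "ceph.dir.pin", str(rank).encode())
    for i in range(20000):   # sub-directories inherit the pin of dir0/dir1
        os.makedirs(os.path.join(top, "subdir%05d" % i), exist_ok=True)

Sub-directories inherit the export pin of the closest pinned ancestor, so pinning only dir0 and dir1 avoids creating tens of thousands of separate subtrees.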
https://github.com/ceph/ceph/pull/17456
Moving this BZ to the verified state. Verified in ceph version 12.2.1-10.el7cp (5ba1c3fa606d7bf16f72756b0026f04a40297673) luminous (stable).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3387