Description of problem:
MDS may fail heartbeat checks while iterating inodes during up:rejoin.

Version-Release number of selected component (if applicable):
3.1 series.
Might be related to bz1614498. That fix would have been in their cluster.
Support comment to customer:

Hi Shri,

The MDS hung during up:rejoin and was also being removed from the MDSMap, apparently because it was busy iterating over inodes. Refer to BZ [1] and patch [2], which was cherry-picked into luminous. This has been fixed in RHCS 3.1, i.e. in ceph-mds-12.2.5-59.el7cp.x86_64.

We have opened BZ [3] to investigate why OSDs were flapping when mds_beacon_grace was set to 3600.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1714810
[2] https://github.com/ceph/ceph/pull/21366
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1714848

Let us know if you have any further queries.

-------------------------------

Customer response:

Hi,
we have specific hotfix version on our CEPH servers, does it have fixes for 1 &2?

ceph version 12.2.4-42.2.hotfix.nvidia.el7cp (2ae8fcd75c666ffc9badac24707996801ac24fd0) luminous (stable)

Thanks
Shri

------------------------------

Can engineering assist in validating the customer query above related to their specific hotfix version?
(In reply to Bob Emerson from comment #3)
> Hi,
> we have specific hotfix version on our CEPH servers, does it have fixes for
> 1 &2?
>
> ceph version 12.2.4-42.2.hotfix.nvidia.el7cp
> (2ae8fcd75c666ffc9badac24707996801ac24fd0) luminous (stable)

Yes, that release has a942cc479c0df10cefe08d1eefac8bee20a39a2e (the fix from [2]). This must be a different problem.
*** Bug 1713527 has been marked as a duplicate of this bug. ***
It's not easy to reproduce. You need to set up many CephFS clients, have each client open a large number of files, and then restart the MDS.
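A rough sketch of the per-client part of that reproduction, in case it helps anyone trying it. This is not a verified reproducer: the helper names (hold_open_files, close_all) are made up for illustration, and the directory passed in is assumed to be somewhere under a CephFS mount on the client. It only creates files and keeps the handles open, so that the client holds many open files at the moment the MDS is restarted.

```python
import os


def hold_open_files(directory, count):
    """Create `count` files under `directory` and return their open handles.

    `directory` is assumed to be on a CephFS mount (hypothetical path chosen
    by the caller). The handles are returned unclosed on purpose, so the
    client keeps the files open while the MDS is restarted.
    """
    os.makedirs(directory, exist_ok=True)
    handles = []
    for i in range(count):
        path = os.path.join(directory, f"file_{i:06d}")
        handles.append(open(path, "w"))
    return handles


def close_all(handles):
    """Close every handle once the MDS restart has been observed."""
    for fh in handles:
        fh.close()
```

Each client would call hold_open_files() with its own directory, keep the process alive while the MDS is restarted, and call close_all() afterwards. The count per process is bounded by the open-file limit (ulimit -n), so that limit may need raising, or several processes run per client.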
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:2538