Description of problem:

If there are a lot of dirs to be fetched, warnings like "MDS internal heartbeat is not healthy!" are emitted for a while, until the transition of prefetch_state to FILES_INODES is executed by the finisher thread. This timeout issue was hit on v14.2.19 and may also be reproducible in the latest version.

The logs:

2021-12-05 20:42:13.472 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:42:13.472 7f1d2863a700 0 mds.beacon.mds005 Skipping beacon heartbeat to monitors (last acked 3.99999s ago); MDS internal heartbeat is not healthy!
2021-12-05 20:42:13.972 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:42:13.972 7f1d2863a700 0 mds.beacon.mds005 Skipping beacon heartbeat to monitors (last acked 4.49999s ago); MDS internal heartbeat is not healthy!
2021-12-05 20:42:14.472 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:42:14.472 7f1d2863a700 0 mds.beacon.mds005 Skipping beacon heartbeat to monitors (last acked 4.99999s ago); MDS internal heartbeat is not healthy!
2021-12-05 20:42:14.972 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:43:36.787 7f1d2ae3f700 1 mds.mds005 Map removed me [mds.mds005{0:296030} state up:rejoin seq 2076 addr [] from cluster; respawning! See cluster/monitor logs for details.
...
2021-12-05 20:43:36.787 7f1d2ae3f700 1 mds.mds005 respawn!

The backtrace of the finisher thread:

#0  0x00007f122f75a1f0 in ceph::buffer::v14_2_0::list::iterator_impl<false>::advance (this=0x7f121d007460, o=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/buffer.cc:743
#1  0x00007f122f75a082 in ceph::buffer::v14_2_0::list::iterator_impl<false>::iterator_impl (this=0x7f121d007460, l=0x7f121d007988, o=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/buffer.cc:728
#2  0x00007f122f750f79 in ceph::buffer::v14_2_0::list::iterator::iterator (this=0x7f121d007460, l=0x7f121d007988, o=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/buffer.cc:956
#3  0x000055b8ad3f8630 in ceph::buffer::v14_2_0::list::begin (this=0x7f121d007988) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/buffer.h:1148
#4  0x000055b8ad3f84d0 in ceph::buffer::v14_2_0::list::clear (this=0x7f121d007988) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/buffer.h:1071
#5  0x000055b8ad43da92 in ceph::decode (s=..., p=...) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/encoding.h:293
#6  0x000055b8ad8096f2 in InodeStoreBase::decode_bare (this=0x7f121d007760, bl=..., snap_blob=..., struct_v=5 '\005') at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CInode.cc:1485
#7  0x000055b8ad7e4318 in InodeStore::decode_bare (this=0x7f121d007760, bl=...) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CInode.h:137
#8  0x000055b8ad7d160a in CDir::_load_dentry (this=0x55b8c79c8500, key=..., dname=..., last=..., bl=..., pos=1135, snaps=0x0, force_dirty=0x7f121d007bcc) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CDir.cc:1795
Python Exception <class 'gdb.error'> There is no member or method named _M_value_field.:
#9  0x000055b8ad7d3c85 in CDir::_omap_fetched (this=0x55b8c79c8500, hdrbl=..., omap=std::map with 2512 elements, complete=true, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CDir.cc:1995
#10 0x000055b8ad7e690f in C_IO_Dir_OMAP_Fetched::finish (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CDir.cc:1643
#11 0x000055b8ad3fa23d in Context::complete (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/Context.h:77
#12 0x000055b8ad8bbdee in MDSContext::complete (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/MDSContext.cc:29
#13 0x000055b8ad8bc577 in MDSIOContextBase::complete (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/MDSContext.cc:114
#14 0x00007f122f2675ad in Finisher::finisher_thread_entry (this=0x55b8afd03440) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Finisher.cc:67
#15 0x000055b8ad44435c in Finisher::FinisherThread::entry (this=0x55b8afd03530) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Finisher.h:62
#16 0x00007f122f2d93a8 in Thread::entry_wrapper (this=0x55b8afd03530) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Thread.cc:84
#17 0x00007f122f2d9326 in Thread::_entry_func (arg=0x55b8afd03530) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Thread.cc:71
#18 0x00007f122c19eea5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f122ae4b96d in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):
v14.2.19+

How reproducible:
1%

Steps to Reproduce:
1. Create a directory with a large number of dentries.
2. Let the standby-replay MDS take over from the active MDS.
3. Check the logs.

Actual results:
Health warning with "MDS internal heartbeat is not healthy!"

Expected results:
Health okay.

Additional info:
ceph tracker: https://tracker.ceph.com/issues/53521
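The backtrace shows the finisher thread stuck inside CDir::_omap_fetched(), decoding an omap result with 2512 entries in a single callback; while that loop runs, the MDS heartbeat is never reset, the beacon misses its 15s deadline, and the monitors eventually remove the rank from the map ("Map removed me ... respawning!"). A common way to avoid starving the heartbeat in a long-running loop is to reset it every N processed entries. Below is a minimal C++ sketch of that pattern; HeartbeatOwner, process_omap, and kBatchSize are hypothetical names for illustration, not the actual upstream fix (see the tracker link above for that):

  // Sketch only: reset the MDS heartbeat periodically while decoding a
  // large omap result in the finisher thread, instead of letting the
  // whole map be processed without a single reset.
  #include <cstddef>
  #include <map>
  #include <string>

  struct HeartbeatOwner {
    // Stand-in for the rank's heartbeat reset (the real MDS has
    // MDSRank::heartbeat_reset(), which touches the heartbeat map so
    // is_healthy() stops reporting "had timed out after 15").
    void heartbeat_reset() {}
  };

  void process_omap(HeartbeatOwner *owner,
                    const std::map<std::string, std::string> &omap) {
    constexpr std::size_t kBatchSize = 512;  // illustrative batch size
    std::size_t n = 0;
    for (const auto &[key, value] : omap) {
      // ... decode one dentry/inode from `value`, as CDir::_load_dentry()
      // does for each omap entry ...
      if (++n % kBatchSize == 0)
        owner->heartbeat_reset();  // keep the beacon healthy mid-loop
    }
  }

With the 2512-entry fetch seen in the backtrace, this would reset the heartbeat a handful of times per dirfrag, enough to keep the beacon inside the 15s MDSRank timeout even when many dirfrags are prefetched back to back.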
I am marking this fixed based on the discussion above, since this bug was filed by the devs anyway. The second issue has been cloned to https://bugzilla.redhat.com/show_bug.cgi?id=2060989.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1174