Bug 2041660

Summary: mds: reset heartbeat in each MDSContext complete()
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Xiubo Li <xiubli>
Component: CephFS
Assignee: Xiubo Li <xiubli>
Status: CLOSED ERRATA
QA Contact: Hemanth Kumar <hyelloji>
Severity: low
Priority: unspecified
Version: 5.1
CC: ceph-eng-bugs, ceph-qe-bugs, gfarnum, hyelloji, tserlin, vereddy, vshankar
Target Milestone: ---
Target Release: 5.1
Flags: hyelloji: needinfo-
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-16.2.7-54.el8cp
Doc Type: If docs needed, set a value
Cloned To: 2042863, 2060989
Last Closed: 2022-04-04 10:23:35 UTC
Type: Bug
Bug Blocks: 2042863

Description Xiubo Li 2022-01-18 02:16:33 UTC
Description of problem:

If there are a lot of directories to be fetched, warnings like "MDS internal heartbeat is not healthy!" are emitted for a while, until the prefetch_state transitions to FILES_INODES, which is driven by the finisher thread.

This timeout was hit with v14.2.19 and may also be reproducible in the latest version.
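The summary names the fix: reset the MDS heartbeat in each MDSContext::complete(), so that a finisher thread grinding through a large dirfrag fetch (see the backtrace below) keeps the heartbeat fresh instead of letting the beacon go stale. The following self-contained C++ sketch is illustrative only, not Ceph code, and all names in it are hypothetical; it shows why placing the reset inside the completion callback stops the watchdog warnings while a long completion queue drains.

// Illustrative sketch only (hypothetical names, not Ceph code): a watchdog
// thread flags a stale heartbeat, and each context completion resets it.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

std::atomic<Clock::time_point> last_heartbeat{Clock::now()};

void heartbeat_reset() { last_heartbeat.store(Clock::now()); }

struct Context {
  virtual ~Context() = default;
  virtual void finish(int r) = 0;
  void complete(int r) {
    heartbeat_reset();  // the fix: touch the heartbeat on every completion
    finish(r);
    delete this;
  }
};

// Stand-in for one CDir::_omap_fetched-style completion over many dentries.
struct SlowFetch : Context {
  void finish(int) override {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
  }
};

int main() {
  std::atomic<bool> done{false};

  // Watchdog thread, playing the role of heartbeat_map is_healthy 'MDSRank'.
  std::thread watchdog([&] {
    while (!done) {
      if (Clock::now() - last_heartbeat.load() > std::chrono::seconds(1))
        std::puts("internal heartbeat is not healthy, skipping beacon");
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
  });

  // Finisher draining a long queue of completions (about 4 s of work).
  // With heartbeat_reset() in complete() the watchdog stays quiet; remove
  // that line and it starts firing partway through the queue.
  for (int i = 0; i < 20; ++i)
    (new SlowFetch())->complete(0);

  done = true;
  watchdog.join();
  return 0;
}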

The logs:

2021-12-05 20:42:13.472 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:42:13.472 7f1d2863a700 0 mds.beacon.mds005 Skipping beacon heartbeat to monitors (last acked 3.99999s ago); MDS internal heartbeat is not healthy!
2021-12-05 20:42:13.972 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:42:13.972 7f1d2863a700 0 mds.beacon.mds005 Skipping beacon heartbeat to monitors (last acked 4.49999s ago); MDS internal heartbeat is not healthy!
2021-12-05 20:42:14.472 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:42:14.472 7f1d2863a700 0 mds.beacon.mds005 Skipping beacon heartbeat to monitors (last acked 4.99999s ago); MDS internal heartbeat is not healthy!
2021-12-05 20:42:14.972 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:43:36.787 7f1d2ae3f700 1 mds.mds005 Map removed me [mds.mds005{0:296030} state up:rejoin seq 2076 addr [] from cluster; respawning! See cluster/monitor logs for details.
...
2021-12-05 20:43:36.787 7f1d2ae3f700 1 mds.mds005 respawn!

#0 0x00007f122f75a1f0 in ceph::buffer::v14_2_0::list::iterator_impl<false>::advance (this=0x7f121d007460, o=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/buffer.cc:743
#1 0x00007f122f75a082 in ceph::buffer::v14_2_0::list::iterator_impl<false>::iterator_impl (this=0x7f121d007460, l=0x7f121d007988, o=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/buffer.cc:728
#2 0x00007f122f750f79 in ceph::buffer::v14_2_0::list::iterator::iterator (this=0x7f121d007460, l=0x7f121d007988, o=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/buffer.cc:956
#3 0x000055b8ad3f8630 in ceph::buffer::v14_2_0::list::begin (this=0x7f121d007988) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/buffer.h:1148
#4 0x000055b8ad3f84d0 in ceph::buffer::v14_2_0::list::clear (this=0x7f121d007988) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/buffer.h:1071
#5 0x000055b8ad43da92 in ceph::decode (s=..., p=...) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/encoding.h:293
#6 0x000055b8ad8096f2 in InodeStoreBase::decode_bare (this=0x7f121d007760, bl=..., snap_blob=..., struct_v=5 '\005') at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CInode.cc:1485
#7 0x000055b8ad7e4318 in InodeStore::decode_bare (this=0x7f121d007760, bl=...) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CInode.h:137
#8 0x000055b8ad7d160a in CDir::_load_dentry (this=0x55b8c79c8500, key=..., dname=..., last=..., bl=..., pos=1135, snaps=0x0, force_dirty=0x7f121d007bcc) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CDir.cc:1795
Python Exception <class 'gdb.error'> There is no member or method named _M_value_field.:
#9 0x000055b8ad7d3c85 in CDir::_omap_fetched (this=0x55b8c79c8500, hdrbl=..., omap=std::map with 2512 elements, complete=true, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CDir.cc:1995
#10 0x000055b8ad7e690f in C_IO_Dir_OMAP_Fetched::finish (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CDir.cc:1643
#11 0x000055b8ad3fa23d in Context::complete (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/Context.h:77
#12 0x000055b8ad8bbdee in MDSContext::complete (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/MDSContext.cc:29
#13 0x000055b8ad8bc577 in MDSIOContextBase::complete (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/MDSContext.cc:114
#14 0x00007f122f2675ad in Finisher::finisher_thread_entry (this=0x55b8afd03440) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Finisher.cc:67
#15 0x000055b8ad44435c in Finisher::FinisherThread::entry (this=0x55b8afd03530) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Finisher.h:62
#16 0x00007f122f2d93a8 in Thread::entry_wrapper (this=0x55b8afd03530) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Thread.cc:84
#17 0x00007f122f2d9326 in Thread::_entry_func (arg=0x55b8afd03530) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Thread.cc:71
#18 0x00007f122c19eea5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f122ae4b96d in clone () from /lib64/libc.so.6



Version-Release number of selected component (if applicable):

v14.2.19+


How reproducible:

1%

Steps to Reproduce:
1. create a directory with a large number of dentries
2. let the standby-replay MDS take over from the active MDS
3. check the logs

Actual results:
Health warning with "MDS internal heartbeat is not healthy!"


Expected results:
Health okay.

Additional info:

ceph tracker: https://tracker.ceph.com/issues/53521

Comment 11 Greg Farnum 2022-03-04 20:22:19 UTC
I am marking this fixed based on the discussion above, since this was created by the devs anyway. Cloned to https://bugzilla.redhat.com/show_bug.cgi?id=2060989 for the second issue.

Comment 13 errata-xmlrpc 2022-04-04 10:23:35 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1174