Description of problem:

If there are a lot of dirs to be fetched, warnings like "MDS internal heartbeat is not healthy!" are emitted for a while, until the transition of prefetch_state to FILES_INODES is executed by the finisher thread. This timeout issue was hit on v14.2.19 and may also be reproducible in the latest version.

The logs:

2021-12-05 20:42:13.472 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:42:13.472 7f1d2863a700 0 mds.beacon.mds005 Skipping beacon heartbeat to monitors (last acked 3.99999s ago); MDS internal heartbeat is not healthy!
2021-12-05 20:42:13.972 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:42:13.972 7f1d2863a700 0 mds.beacon.mds005 Skipping beacon heartbeat to monitors (last acked 4.49999s ago); MDS internal heartbeat is not healthy!
2021-12-05 20:42:14.472 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:42:14.472 7f1d2863a700 0 mds.beacon.mds005 Skipping beacon heartbeat to monitors (last acked 4.99999s ago); MDS internal heartbeat is not healthy!
2021-12-05 20:42:14.972 7f1d2863a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-12-05 20:43:36.787 7f1d2ae3f700 1 mds.mds005 Map removed me [mds.mds005{0:296030} state up:rejoin seq 2076 addr [] from cluster; respawning! See cluster/monitor logs for details.
...
2021-12-05 20:43:36.787 7f1d2ae3f700 1 mds.mds005 respawn!

The backtrace of the finisher thread:

#0  0x00007f122f75a1f0 in ceph::buffer::v14_2_0::list::iterator_impl<false>::advance (this=0x7f121d007460, o=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/buffer.cc:743
#1  0x00007f122f75a082 in ceph::buffer::v14_2_0::list::iterator_impl<false>::iterator_impl (this=0x7f121d007460, l=0x7f121d007988, o=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/buffer.cc:728
#2  0x00007f122f750f79 in ceph::buffer::v14_2_0::list::iterator::iterator (this=0x7f121d007460, l=0x7f121d007988, o=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/buffer.cc:956
#3  0x000055b8ad3f8630 in ceph::buffer::v14_2_0::list::begin (this=0x7f121d007988) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/buffer.h:1148
#4  0x000055b8ad3f84d0 in ceph::buffer::v14_2_0::list::clear (this=0x7f121d007988) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/buffer.h:1071
#5  0x000055b8ad43da92 in ceph::decode (s=..., p=...) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/encoding.h:293
#6  0x000055b8ad8096f2 in InodeStoreBase::decode_bare (this=0x7f121d007760, bl=..., snap_blob=..., struct_v=5 '\005') at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CInode.cc:1485
#7  0x000055b8ad7e4318 in InodeStore::decode_bare (this=0x7f121d007760, bl=...) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CInode.h:137
#8  0x000055b8ad7d160a in CDir::_load_dentry (this=0x55b8c79c8500, key=..., dname=..., last=..., bl=..., pos=1135, snaps=0x0, force_dirty=0x7f121d007bcc) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CDir.cc:1795
Python Exception <class 'gdb.error'> There is no member or method named _M_value_field.:
#9  0x000055b8ad7d3c85 in CDir::_omap_fetched (this=0x55b8c79c8500, hdrbl=..., omap=std::map with 2512 elements, complete=true, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CDir.cc:1995
#10 0x000055b8ad7e690f in C_IO_Dir_OMAP_Fetched::finish (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/CDir.cc:1643
#11 0x000055b8ad3fa23d in Context::complete (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/include/Context.h:77
#12 0x000055b8ad8bbdee in MDSContext::complete (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/MDSContext.cc:29
#13 0x000055b8ad8bc577 in MDSIOContextBase::complete (this=0x55b8cd02cd80, r=0) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/mds/MDSContext.cc:114
#14 0x00007f122f2675ad in Finisher::finisher_thread_entry (this=0x55b8afd03440) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Finisher.cc:67
#15 0x000055b8ad44435c in Finisher::FinisherThread::entry (this=0x55b8afd03530) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Finisher.h:62
#16 0x00007f122f2d93a8 in Thread::entry_wrapper (this=0x55b8afd03530) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Thread.cc:84
#17 0x00007f122f2d9326 in Thread::_entry_func (arg=0x55b8afd03530) at /usr/src/debug/ceph-14.2.19-307.g505bc2a/src/common/Thread.cc:71
#18 0x00007f122c19eea5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f122ae4b96d in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):
v14.2.19+

How reproducible:
1%

Steps to Reproduce:
1. Create a directory with a large number of dentries.
2. Let the standby-replay MDS take over from the active MDS.
3. Check the logs.

Actual results:
Health warning with "MDS internal heartbeat is not healthy!"

Expected results:
Health okay.

Additional info:
ceph tracker: https://tracker.ceph.com/issues/53521
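The backtrace shows the finisher thread stuck inside CDir::_omap_fetched(), decoding an omap result with 2512 entries in a single callback; while that loop runs, the MDS heartbeat is never reset, the beacon misses its 15s deadline, and the monitors eventually remove the rank from the map ("Map removed me ... respawning!"). A common way to avoid starving the heartbeat in a long-running loop is to reset it every N processed entries. Below is a minimal C++ sketch of that pattern; HeartbeatOwner, process_omap, and kBatchSize are hypothetical names for illustration, not the actual upstream fix (see the tracker link above for that):

  // Sketch only: reset the MDS heartbeat periodically while decoding a
  // large omap result in the finisher thread, instead of letting the
  // whole map be processed without a single reset.
  #include <cstddef>
  #include <map>
  #include <string>

  struct HeartbeatOwner {
    // Stand-in for the rank's heartbeat reset (the real MDS has
    // MDSRank::heartbeat_reset(), which touches the heartbeat map so
    // is_healthy() stops reporting "had timed out after 15").
    void heartbeat_reset() {}
  };

  void process_omap(HeartbeatOwner *owner,
                    const std::map<std::string, std::string> &omap) {
    constexpr std::size_t kBatchSize = 512;  // illustrative batch size
    std::size_t n = 0;
    for (const auto &[key, value] : omap) {
      // ... decode one dentry/inode from `value`, as CDir::_load_dentry()
      // does for each omap entry ...
      if (++n % kBatchSize == 0)
        owner->heartbeat_reset();  // keep the beacon healthy mid-loop
    }
  }

With the 2512-entry fetch seen in the backtrace, this would reset the heartbeat a handful of times per dirfrag, enough to keep the beacon inside the 15s MDSRank timeout even when many dirfrags are prefetched back to back.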
I am marking this fixed based on the discussion above, since this bug was filed by the devs anyway. The second issue has been cloned to https://bugzilla.redhat.com/show_bug.cgi?id=2060989.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1174