1714810 – MDS may hang during up:rejoin while iterating inodes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	CephFS
Sub Component:
Version:	3.1
Hardware:	All
OS:	All
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	3.3
Assignee:	Yan, Zheng
QA Contact:	subhash
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1713527
Depends On:
Blocks:	1726135

Reported:	2019-05-28 23:01 UTC by Patrick Donnelly
Modified:	2019-08-21 15:11 UTC
CC List:	9 users
Fixed In Version:	RHEL: ceph-12.2.12-23.el7cp Ubuntu: ceph_12.2.12-19redhat1xenial
Doc Type:	Bug Fix
Doc Text:	Heartbeat packets are reset as expected. Previously, the Ceph Metadata Server (MDS) did not reset heartbeat packets when it was busy in a large loop. This prevented the MDS from sending a beacon to the Monitor. With this update, the Monitor replaces the busy MDS, and the heartbeat packets are reset when the MDS is busy in a large loop.
Clone Of:
Environment:
Last Closed:	2019-08-21 15:11:09 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	40171	None	None	None	2019-06-05 09:31:52 UTC
Red Hat Bugzilla	1713527	urgent	CLOSED	[GSS] clients not able to mount cephfs and mds stuck in up:replay	2021-08-27 22:40:18 UTC
Red Hat Product Errata	RHSA-2019:2538	None	None	None	2019-08-21 15:11:26 UTC

Description Patrick Donnelly 2019-05-28 23:01:06 UTC

Description of problem:

MDS may fail heartbeat checks while iterating inodes during up:rejoin.

Version-Release number of selected component (if applicable):

3.1 series.
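
The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Ceph implementation: the `Heartbeat` class, the `iterate_inodes` function, and the `reset_every` parameter are hypothetical names chosen for illustration. The idea is that a long loop which never resets the heartbeat eventually exceeds the grace period, while the same loop with a periodic reset stays healthy.

```python
import time


class Heartbeat:
    """Toy stand-in for a daemon's internal heartbeat (hypothetical, not Ceph code)."""

    def __init__(self, grace_seconds, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self._clock = clock
        self._last_reset = clock()

    def reset(self):
        # Record that the daemon is still making progress.
        self._last_reset = self._clock()

    def is_healthy(self):
        # Healthy as long as the last reset is within the grace period.
        return self._clock() - self._last_reset <= self.grace_seconds


def iterate_inodes(inodes, heartbeat, process, reset_every=1000):
    """Process every inode, resetting the heartbeat every `reset_every` items."""
    for count, inode in enumerate(inodes, start=1):
        process(inode)
        if count % reset_every == 0:
            heartbeat.reset()
```

With a very large `reset_every`, the loop behaves like the pre-fix case: the heartbeat is never refreshed, so a health check made during a long iteration fails.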

Comment 2 Patrick Donnelly 2019-05-28 23:03:59 UTC

Might be related to bz1614498. That fix would have been in their cluster.

Comment 3 Bob Emerson 2019-05-30 16:33:02 UTC

Support comment to customer:

Hi Shri,

The MDS hung during up:rejoin and was also being removed from the MDSMap, apparently because it was busy iterating over inodes; see BZ [1] and patch [2], which was cherry-picked to luminous. This has been fixed in RHCS 3.1, i.e. in ceph-mds-12.2.5-59.el7cp.x86_64. We have opened BZ [3] to investigate why OSDs were flapping when mds_beacon_grace was set to 3600.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1714810
[2] https://github.com/ceph/ceph/pull/21366
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1714848

Let us know if you have any further queries.

-------------------------------

Customer response:

Hi,
we have specific hotfix version on our CEPH servers, does it have fixes for 1 &2?

ceph version 12.2.4-42.2.hotfix.nvidia.el7cp (2ae8fcd75c666ffc9badac24707996801ac24fd0) luminous (stable)


Thanks
Shri

------------------------------

Can engineering assist in validating the customer query above related to their specific hotfix version?

Comment 5 Patrick Donnelly 2019-05-30 19:09:19 UTC

(In reply to Bob Emerson from comment #3)
> Hi,
> we have specific hotfix version on our CEPH servers, does it have fixes for
> 1 &2?
> 
> ceph version 12.2.4-42.2.hotfix.nvidia.el7cp
> (2ae8fcd75c666ffc9badac24707996801ac24fd0) luminous (stable)

Yes, that release has a942cc479c0df10cefe08d1eefac8bee20a39a2e (the fix from [2]). This must be a different problem.

Comment 13 Patrick Donnelly 2019-06-14 18:03:57 UTC

*** Bug 1713527 has been marked as a duplicate of this bug. ***

Comment 18 Yan, Zheng 2019-07-25 03:01:18 UTC

It's not easy to reproduce. You need to set up lots of CephFS clients, have each client open lots of files, then restart the MDS.
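
The client side of the reproduction steps above might be sketched as follows. This is an assumption-laden illustration, not a tested reproducer: `hold_open_files` and `release` are hypothetical helper names, and the directory passed in is assumed to be inside a mounted CephFS file system. Each client would run something like this, keep the handles open, and the MDS would then be restarted by whatever means the deployment uses.

```python
import os


def hold_open_files(directory, count):
    """Create `count` files under `directory` and return them still open.

    `directory` is assumed to live on a CephFS mount; the open handles are
    what keeps state on the MDS for each file.
    """
    handles = []
    for i in range(count):
        path = os.path.join(directory, f"repro-{i:06d}")
        handles.append(open(path, "w"))
    return handles


def release(handles):
    """Close every handle returned by hold_open_files."""
    for handle in handles:
        handle.close()
```

The per-process open file limit caps `count`, so reaching "lots of files" in practice likely means raising that limit or running many client processes.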

Comment 21 errata-xmlrpc 2019-08-21 15:11:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2538