Bug 1614498 - MDS is removed from MDSMap if it is processing imported caps too long
Summary: MDS is removed from MDSMap if it is processing imported caps too long
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 3.0
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: z1
Target Release: 3.1
Assignee: Patrick Donnelly
QA Contact: Ramakrishnan Periyasamy
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Depends On:
Blocks: 1584264
Reported: 2018-08-09 18:27 UTC by Patrick Donnelly
Modified: 2021-12-10 17:02 UTC (History)
CC List: 9 users

Fixed In Version: RHEL: ceph-12.2.5-43.el7cp Ubuntu: ceph_12.2.5-27redhat1xenial
Doc Type: Bug Fix
Doc Text:
Monitors no longer remove MDSs from the MDS Map when processing imported capabilities for too long
The Metadata Servers (MDSs) did not reset the heartbeat packets while processing imported capabilities. Monitors interpreted this situation as MDSs being stuck and consequently removed them from the MDS Map. This behavior could cause the MDSs to flap when there were large numbers of inodes to be loaded into cache. This update provides a patch to fix this bug, and Monitors no longer remove MDSs from the MDS Map in this case.
Clone Of:
Environment:
Last Closed: 2018-11-09 00:59:21 UTC
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Ceph Project Bug Tracker | 23636 | 0 | None | None | None | 2018-08-09 18:27:49 UTC
Red Hat Issue Tracker | RHCEPH-2677 | 0 | None | None | None | 2021-12-10 17:02:29 UTC
Red Hat Product Errata | RHBA-2018:3530 | 0 | None | None | None | 2018-11-09 01:00:03 UTC

Description Patrick Donnelly 2018-08-09 18:27:49 UTC
Description of problem:

The MDS does not reset its heartbeat while processing imported caps. The mons interpret this as the MDS being stuck and consequently remove it from the MDSMap. This may cause the MDSs to "flap" when there are large numbers of inodes to be loaded into cache.
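
To illustrate the failure mode described above, here is a minimal simulation sketch. It is not Ceph code: the `Heartbeat` class, `process_caps`, the `reset_every` parameter, and the grace and timing values are all hypothetical and chosen only to show why a long loop without periodic heartbeat resets looks "stuck" to an observer, and why resetting inside the loop avoids that.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    """Hypothetical stand-in for an MDS heartbeat; not the Ceph API."""
    grace: float              # seconds tolerated without a reset (illustrative)
    last_reset: float = 0.0

    def reset(self, now: float) -> None:
        self.last_reset = now

    def is_stuck(self, now: float) -> bool:
        # Observer's view: no reset has been seen within the grace period.
        return now - self.last_reset > self.grace


def process_caps(num_caps, cost_per_cap, hb, reset_every=None):
    """Simulate a long cap-import loop on a fake clock.

    Returns True if the heartbeat ever looked stuck during the loop.
    """
    now = hb.last_reset
    flagged = False
    for i in range(1, num_caps + 1):
        now += cost_per_cap                    # pretend work on one cap
        if reset_every and i % reset_every == 0:
            hb.reset(now)                      # periodic reset inside the loop
        flagged = flagged or hb.is_stuck(now)
    return flagged


# Without resets, 1000 caps at 0.1 s each exceed a 15 s grace period.
print(process_caps(1000, 0.1, Heartbeat(grace=15.0)))
# With a reset every 100 caps, the gap between resets stays near 10 s.
print(process_caps(1000, 0.1, Heartbeat(grace=15.0), reset_every=100))
```

The first call reports the heartbeat as stuck and the second does not, which mirrors the behavior before and after resetting the heartbeat during cap processing. How often the real MDS resets its heartbeat is not stated in this report.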

Version-Release number of selected component (if applicable):

3.0

How reproducible:

Potentially difficult. It is necessary to have many clients with caps and millions of inodes in cache before testing failover.

Comment 22 Ramakrishnan Periyasamy 2018-10-25 11:59:33 UTC
Automation regression runs passed.

http://cistatus.ceph.redhat.com/ui/#cephci/launches/all%7Cpage.page=1&page.size=50&page.sort=start_time,number%2CDESC/5bc8bb4e36d1a000016d7470?page.page=1&page.size=50&page.sort=start_time%2CASC

username: ceph, passwd: ceph

Moving this bug to verified state.

Comment 24 errata-xmlrpc 2018-11-09 00:59:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3530

