Bug 1614498

Summary: MDS is removed from MDSMap if it is processing imported caps too long
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Patrick Donnelly <pdonnell>
Component: CephFS
Assignee: Patrick Donnelly <pdonnell>
Status: CLOSED ERRATA
QA Contact: Ramakrishnan Periyasamy <rperiyas>
Severity: urgent
Docs Contact: Bara Ancincova <bancinco>
Priority: urgent
Version: 3.0
CC: ceph-eng-bugs, john.spray, kdreyer, mhackett, mmuir, pdonnell, rperiyas, tchandra, tserlin
Target Milestone: z1   
Target Release: 3.1   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: RHEL: ceph-12.2.5-43.el7cp Ubuntu: ceph_12.2.5-27redhat1xenial
Doc Type: Bug Fix
Doc Text:
.Monitors no longer remove MDSs from the MDS Map when processing imported capabilities for too long

The Metadata Servers (MDSs) did not reset the heartbeat packets while processing imported capabilities. Monitors interpreted this situation as MDSs being stuck and consequently removed them from the MDS Map. This behavior could cause the MDSs to flap when there were large numbers of inodes to be loaded into cache. This update provides a patch to fix this bug, and Monitors no longer remove MDSs from the MDS Map in this case.
Last Closed: 2018-11-09 00:59:21 UTC
Type: Bug
Bug Blocks: 1584264    

Description Patrick Donnelly 2018-08-09 18:27:49 UTC
Description of problem:

The MDS is not resetting the heartbeat while processing imported caps. The mons interpret this as the MDS being stuck and consequently remove it from the MDSMap. This may cause the MDSs to "flap" when there are large numbers of inodes to be loaded into cache.
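
To illustrate the failure mode described above, here is a minimal sketch of a long-running loop that either does or does not reset a heartbeat while it works. It is not Ceph's code: `FakeClock`, `Heartbeat`, `process_imported_caps`, the 15-second grace period, and the per-cap cost are all hypothetical names and values chosen for illustration only.

```python
class FakeClock:
    """Simulated clock in milliseconds, so the example runs instantly."""

    def __init__(self):
        self.now_ms = 0

    def advance(self, ms):
        self.now_ms += ms


class Heartbeat:
    """Hypothetical heartbeat: healthy while reset within the grace period."""

    def __init__(self, clock, grace_ms):
        self.clock = clock
        self.grace_ms = grace_ms
        self.last_reset_ms = clock.now_ms

    def reset(self):
        self.last_reset_ms = self.clock.now_ms

    def is_healthy(self):
        return self.clock.now_ms - self.last_reset_ms <= self.grace_ms


def process_imported_caps(caps, clock, heartbeat, cost_ms, reset_every=None):
    """Process every cap; return True if the heartbeat never went stale.

    With reset_every=None the loop never resets the heartbeat, which models
    the behaviour reported in this bug. With reset_every=N the heartbeat is
    reset after every N caps, which models the general shape of a fix.
    """
    stayed_healthy = True
    for count, _cap in enumerate(caps, start=1):
        clock.advance(cost_ms)  # simulated work for one cap
        if reset_every and count % reset_every == 0:
            heartbeat.reset()
        if not heartbeat.is_healthy():
            # An observer (the mons, in this bug) would now treat us as stuck.
            stayed_healthy = False
    return stayed_healthy


if __name__ == "__main__":
    caps = list(range(1000))  # stand-in for a large batch of imported caps

    clock = FakeClock()
    hb = Heartbeat(clock, grace_ms=15_000)
    print("no resets:", process_imported_caps(caps, clock, hb, cost_ms=100))

    clock = FakeClock()
    hb = Heartbeat(clock, grace_ms=15_000)
    print("periodic resets:",
          process_imported_caps(caps, clock, hb, cost_ms=100, reset_every=100))
```

In this sketch the total work (1000 caps at 100 ms each) far exceeds the grace period, so the run without resets goes stale, while the run that resets every 100 caps never lets more than 10 simulated seconds pass between resets.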

Version-Release number of selected component (if applicable):

3.0

How reproducible:

Potentially difficult. It is necessary to have many clients with caps and millions of inodes in cache before testing failover.

Comment 22 Ramakrishnan Periyasamy 2018-10-25 11:59:33 UTC
Automation regression runs passed.

http://cistatus.ceph.redhat.com/ui/#cephci/launches/all%7Cpage.page=1&page.size=50&page.sort=start_time,number%2CDESC/5bc8bb4e36d1a000016d7470?page.page=1&page.size=50&page.sort=start_time%2CASC

username: ceph, passwd: ceph

Moving this bug to verified state.

Comment 24 errata-xmlrpc 2018-11-09 00:59:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3530