Bug 1664468

Summary: MDS hangs and is removed when doing a significant shrink of a large cache
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Patrick Donnelly <pdonnell>
Component: CephFS Assignee: Patrick Donnelly <pdonnell>
Status: CLOSED ERRATA QA Contact: Persona non grata <nobody+410372>
Severity: high Docs Contact: Bara Ancincova <bancinco>
Priority: high    
Version: 3.1 CC: ceph-eng-bugs, ceph-qe-bugs, edonnell, pdonnell, rperiyas, sweil, tchandra, tserlin
Target Milestone: z1   
Target Release: 3.2   
Hardware: All   
OS: All   
Fixed In Version: RHEL: ceph-2:12.2.8-72.el7cp Ubuntu: ceph_12.2.8-58redhat1 Doc Type: Bug Fix
Doc Text:
.Shrinking a large MDS cache no longer causes the MDS daemon to appear to hang
Previously, an attempt to shrink a large Metadata Server (MDS) cache caused the primary MDS daemon to become unresponsive. Consequently, Monitors removed the unresponsive MDS and a standby MDS became the primary MDS. With this update, shrinking a large MDS cache no longer causes the primary MDS daemon to hang.
Last Closed: 2019-03-07 15:51:27 UTC Type: Bug
Bug Blocks: 1629656    

Description Patrick Donnelly 2019-01-08 22:07:06 UTC
Description of problem:

When the MDS performs a significant shrink of a large cache, e.g. 96GB -> 64GB, it spins trying to trim cached objects and recall caps from clients. The monitors then remove the MDS because it misses its heartbeat beacons.
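
To illustrate the failure mode described above, here is a minimal Python sketch, not the MDS code. It models a daemon that can only send its beacon between trimming passes and compares one unbounded pass with passes capped at a fixed batch size. The item counts, the per-item cost, the batch cap, and the grace period (`BEACON_GRACE`) are all hypothetical values chosen for the example. The sketch does not claim to describe how the fix in the advisory works.

```python
from dataclasses import dataclass


@dataclass
class TrimResult:
    ticks: int            # number of passes needed to reach the target
    max_tick_cost: float  # longest single pass, in seconds of simulated work


def simulate_trim(cache_items, target_items, per_item_cost, max_items_per_tick=None):
    """Simulate trimming a cache from cache_items down to target_items.

    In each pass the daemon trims up to max_items_per_tick entries (or all
    excess entries when no cap is given). A beacon can only be sent between
    passes, so the longest pass bounds the gap between beacons.
    """
    ticks = 0
    max_tick_cost = 0.0
    while cache_items > target_items:
        excess = cache_items - target_items
        batch = excess if max_items_per_tick is None else min(excess, max_items_per_tick)
        cache_items -= batch
        ticks += 1
        max_tick_cost = max(max_tick_cost, batch * per_item_cost)
    return TrimResult(ticks, max_tick_cost)


# Hypothetical numbers for illustration only.
BEACON_GRACE = 15.0  # seconds the monitors are assumed to tolerate without a beacon

unbounded = simulate_trim(10_000_000, 6_000_000, per_item_cost=1e-5)
bounded = simulate_trim(10_000_000, 6_000_000, per_item_cost=1e-5,
                        max_items_per_tick=100_000)

print("unbounded misses beacon:", unbounded.max_tick_cost > BEACON_GRACE)
print("bounded misses beacon:", bounded.max_tick_cost > BEACON_GRACE)
```

With these assumed numbers the unbounded pass keeps the daemon busy for longer than the grace period, while the capped passes reach the same target over many short passes.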

Version-Release number of selected component (if applicable):

3.0

How reproducible:

100%

Steps to Reproduce:
1. Create a file system and fill it with ~10 million files. Then have 4-5 clients load those files into memory.
2. Reduce the MDS cache using the `config set` admin socket command.
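
Step 2 can be sketched as follows, assuming the limit being changed is `mds_cache_memory_limit` and that the MDS is reachable through `ceph daemon <name> config set`. The daemon name `mds.a` and the 8 GiB step size are hypothetical. The script only prints the commands; it lowers the limit in several steps rather than one jump, though this report does not establish whether a stepped reduction avoids the hang on an affected version.

```python
def stepped_limits(start_bytes, target_bytes, step_bytes):
    """Return the decreasing sequence of limits from start down to target."""
    if step_bytes <= 0:
        raise ValueError("step_bytes must be positive")
    if target_bytes > start_bytes:
        raise ValueError("target must not exceed start")
    limits = []
    current = start_bytes
    while current > target_bytes:
        # Never step below the target.
        current = max(current - step_bytes, target_bytes)
        limits.append(current)
    return limits


def config_set_command(daemon, limit_bytes):
    """Build the admin socket command that applies one cache limit."""
    return ["ceph", "daemon", daemon, "config", "set",
            "mds_cache_memory_limit", str(limit_bytes)]


GiB = 1024 ** 3

# "mds.a" is a placeholder daemon name; 96 GiB -> 64 GiB matches the report.
for limit in stepped_limits(96 * GiB, 64 * GiB, 8 * GiB):
    print(" ".join(config_set_command("mds.a", limit)))
```

To reproduce the bug as reported, a single `config set` from the large value straight to the small one is the relevant case; the stepped variant is shown only for comparison.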

Actual results:

MDS will be removed from the MDSMap and a standby will take over.

Expected results:

The MDS slowly reduces its cache without service interruption.

Comment 12 errata-xmlrpc 2019-03-07 15:51:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0475