Bug 2230652

Summary: MDS shutdown hung until "mds_bal_interval" was changed from 0 to something non-zero
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Manny <mcaldeir>
Component: CephFS Assignee: Venky Shankar <vshankar>
Status: NEW --- QA Contact: Hemanth Kumar <hyelloji>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 5.3 CC: ceph-eng-bugs, cephqe-warriors
Target Milestone: ---   
Target Release: 7.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Manny 2023-08-09 18:55:32 UTC
Description of problem:  MDS shutdown hung until "mds_bal_interval" was changed from 0 to something non-zero

DRW reported this during an upgrade to RHCS 5.3z4:
~~~
Problem Statement	MDS stuck in 'stopping' state during upgrade to 5.3Z4 from 5.3Z2
Description	
What are you experiencing? What are you expecting to happen?
Ceph upgrade in progress, MDS count reducing to 1 but MDS stuck in 'stopping' state for an hour.
~~~

The case was opened as a Sev 1, and shortly thereafter the customer (Tyler) reported this:
~~~
Set ceph config mds_bal_interval from 0 to 10 (default) and failed the MDS, it successfully exited and restarted then drained.    

Investigating if we can get all the way to 1 mds, will update case.
~~~
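
For reference, the workaround Tyler describes could be sketched roughly as below. This is an illustration of the steps as reported, not a verified procedure: `DRY_RUN` and `MDS_ID` are hypothetical variables introduced here, and `STUCK_MDS` is a placeholder for whichever MDS is stuck in 'stopping'. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the workaround the customer reported; not a verified procedure.
# DRY_RUN and MDS_ID are illustrative variables, not part of Ceph.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"          # 1 = only print the commands (default)
MDS_ID="${MDS_ID:-STUCK_MDS}"    # placeholder: name, rank, or GID of the stuck MDS

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Show the current balancer interval (the affected cluster had it set to 0).
run ceph config get mds mds_bal_interval

# 2. Set the interval back to 10, which the customer reports as the default.
run ceph config set mds mds_bal_interval 10

# 3. Fail the MDS that is stuck in 'stopping' so that it exits and restarts.
run ceph mds fail "$MDS_ID"

# 4. Check whether the rank drains and how many MDS daemons remain active.
run ceph fs status
```

To run the commands for real, set `DRY_RUN=0` and `MDS_ID` to the affected daemon. The dry-run default is deliberate, since failing an MDS on a production cluster should be a conscious step.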

Given that setting "mds_bal_interval" to zero at multi-MDS sites is fashionable now, I decided to open this BZ to involve RHCS Engineering.
The case was opened on 2023-07-29 @ 16:18, but Tyler provided the MDS logs only recently.

The logs are in Support Shell under case #03574915

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results: That the MDS would shut down without the need for this sort of intervention.


Additional info: