Bug 2230652 - MDS shutdown hung until "mds_bal_interval" was changed from 0 to something non-zero
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 7.1
Assignee: Venky Shankar
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-08-09 18:55 UTC by Manny
Modified: 2023-08-11 16:15 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-7187 0 None None None 2023-08-09 18:57:41 UTC

Description Manny 2023-08-09 18:55:32 UTC
Description of problem:  MDS shutdown hung until "mds_bal_interval" was changed from 0 to something non-zero

DRW reported this during an upgrade to RHCS 5.3z4:
~~~
Problem Statement	MDS stuck in 'stopping' state during upgrade to 5.3Z4 from 5.3Z2
Description	
What are you experiencing? What are you expecting to happen?
Ceph upgrade in progress, MDS count reducing to 1 but MDS stuck in 'stopping' state for an hour.
~~~

The case was opened as a Sev 1, and shortly thereafter the customer (Tyler) reported this:
~~~
Set ceph config mds_bal_interval from 0 to 10 (default) and failed the MDS, it successfully exited and restarted then drained.    

Investigating if we can get all the way to 1 mds, will update case.
~~~
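
For reference, the workaround in the customer's update can be sketched as a small Python dry-run helper. This is only an illustration under assumptions: the customer did not state the exact commands used, so the `mds` config section, the helper names `workaround_commands` and `apply_workaround`, and the `mds.a` target in the usage note are hypothetical and not taken from the case.

```python
import shlex
import subprocess


def workaround_commands(mds_target, bal_interval=10):
    """Build the two ceph CLI calls implied by the customer's update.

    mds_target is a hypothetical placeholder for the stuck MDS
    (name, rank, or GID); it is not taken from the case.
    """
    if bal_interval <= 0:
        raise ValueError("bal_interval must be non-zero for this workaround")
    return [
        # Assumption: the option is set at the generic "mds" level.
        ["ceph", "config", "set", "mds", "mds_bal_interval", str(bal_interval)],
        # Fail the MDS that is stuck in 'stopping' so it exits and restarts.
        ["ceph", "mds", "fail", mds_target],
    ]


def apply_workaround(mds_target, bal_interval=10, dry_run=True):
    """Print the commands (dry run) or execute them against the cluster."""
    cmds = workaround_commands(mds_target, bal_interval)
    for cmd in cmds:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return cmds
```

Calling `apply_workaround("mds.a")` only prints the two commands; passing `dry_run=False` would run them, which should be done only on a cluster where this intervention is actually intended.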

Given that changing "mds_bal_interval" to zero is fashionable now for multi-MDS sites, I decided to open this BZ to involve RHCS Engineering.
The case was opened (2023-07-29 @ 16:18), but Tyler provided the MDS logs only recently.

The logs are in Support Shell under case #03574915

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results: The MDS shuts down without the need for this sort of intervention.


Additional info:

