Bug 2135990

Summary: [CEE] MDS pods CrashLoopBackoff
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: James Biao <jbiao>
Component: ceph
Assignee: Venky Shankar <vshankar>
ceph sub component: CephFS
QA Contact: Elad <ebenahar>
Status: CLOSED NOTABUG
Docs Contact:
Severity: urgent
Priority: urgent
CC: bniver, gfarnum, hyelloji, madam, mmanjuna, muagarwa, ocs-bugs, odf-bz-bot, tnielsen, vshankar, xiubli
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-11-23 06:27:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description James Biao 2022-10-19 03:09:30 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
2 MDS pods are in CrashLoopBackOff. PVCs are inaccessible.


Version of all relevant components (if applicable):

OCS 4.9
ceph version 16.2.0-152


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
The filesystem is inaccessible.

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
no

Can this issue be reproduced from the UI?
no

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 11 Venky Shankar 2022-10-31 05:49:13 UTC
FWIW, in Server::reconnect_tick():

```
  for (auto session : remaining_sessions) {
    // Keep sessions that have specified timeout. These sessions will prevent
    // mds from going to active. MDS goes to active after they all have been
    // killed or reclaimed.
    if (session->info.client_metadata.find("timeout") !=
        session->info.client_metadata.end()) {
      dout(1) << "reconnect keeps " << session->info.inst
              << ", need to be reclaimed" << dendl;
      client_reclaim_gather.insert(session->get_client());
      continue;
    }

    dout(1) << "reconnect gives up on " << session->info.inst << dendl;

    mds->clog->warn() << "evicting unresponsive client " << *session
                      << ", after waiting " << elapse1
                      << " seconds during MDS startup";

```

Is the MDS waiting for sessions to be reclaimed?
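
To make the branch above easier to reason about, here is a minimal, self-contained sketch (simplified stand-in types and made-up client IDs, not actual Ceph code) of the decision reconnect_tick() takes: a session whose client metadata advertises a "timeout" is gathered into client_reclaim_gather and keeps the MDS from going active until it is reclaimed, while any other unresponsive session is simply evicted.

```
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Simplified stand-in for an MDS client session (illustration only).
struct Session {
  int client_id;
  std::map<std::string, std::string> client_metadata;
};

int main() {
  // Two hypothetical unresponsive sessions left over at reconnect time.
  std::vector<Session> remaining_sessions = {
      {100, {{"hostname", "node-a"}}},
      {101, {{"hostname", "node-b"}, {"timeout", "300"}}},
  };

  std::set<int> client_reclaim_gather;

  for (const auto& session : remaining_sessions) {
    if (session.client_metadata.count("timeout")) {
      // Kept: a client advertising a reclaim timeout blocks the MDS from
      // going active until it is reclaimed or its timeout expires.
      client_reclaim_gather.insert(session.client_id);
      std::cout << "reconnect keeps client." << session.client_id
                << ", needs to be reclaimed\n";
      continue;
    }
    // Evicted: an unresponsive client without a timeout does not block
    // the transition to active.
    std::cout << "reconnect gives up on client." << session.client_id << "\n";
  }

  std::cout << "clients blocking active: " << client_reclaim_gather.size() << "\n";
}
```

If clients that advertised such a timeout never come back to reclaim their sessions, the MDS stays out of active waiting on them, which is the scenario the question above is getting at.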