2135990 – [CEE] MDS pods CrashLoopBackoff

Bug 2135990 - [CEE] MDS pods CrashLoopBackoff

Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	ceph
Version:	4.9
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Venky Shankar
QA Contact:	Elad

Reported:	2022-10-19 03:09 UTC by James Biao
Modified:	2023-08-09 16:37 UTC
CC List:	11 users
Last Closed:	2022-11-23 06:27:39 UTC

Description James Biao 2022-10-19 03:09:30 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
Two MDS pods are in CrashLoopBackOff. PVCs are inaccessible.


Version of all relevant components (if applicable):

OCS 4.9
ceph version 16.2.0-152


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Filesystem inaccessible

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
no

Can this issue be reproduced from the UI?
no

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 11 Venky Shankar 2022-10-31 05:49:13 UTC

FWIW, in Server::reconnect_tick():

```
  for (auto session : remaining_sessions) {
    // Keep sessions that have specified timeout. These sessions will prevent
    // mds from going to active. MDS goes to active after they all have been
    // killed or reclaimed.
    if (session->info.client_metadata.find("timeout") !=
        session->info.client_metadata.end()) {
      dout(1) << "reconnect keeps " << session->info.inst
              << ", need to be reclaimed" << dendl;
      client_reclaim_gather.insert(session->get_client());
      continue;
    }

    dout(1) << "reconnect gives up on " << session->info.inst << dendl;

    mds->clog->warn() << "evicting unresponsive client " << *session
                      << ", after waiting " << elapse1
                      << " seconds during MDS startup";

```

Is the MDS waiting for a session to be reclaimed?
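
To make the branch in the excerpt above easier to reason about, here is a minimal standalone sketch of the same decision. It is not the Ceph implementation: `SessionSketch`, `ReconnectOutcome`, and `classify_remaining` are hypothetical names, and the only behaviour modelled is what the excerpt and its comment state, namely that a remaining session with a "timeout" key in its client metadata is kept for reclaim (and keeps the MDS from going active), while any other remaining session is given up on and evicted.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an MDS client session. This is NOT Ceph's
// Session class; it only carries what the excerpt inspects.
struct SessionSketch {
  std::string client;                                  // illustrative client id
  std::map<std::string, std::string> client_metadata;  // key/value pairs from the client
};

// Result of walking the sessions that did not reconnect in time.
struct ReconnectOutcome {
  std::vector<std::string> kept_for_reclaim;  // plays the role of client_reclaim_gather
  std::vector<std::string> evicted;           // sessions the MDS gives up on

  // Per the comment in the excerpt, kept sessions prevent the MDS from
  // going active until they have all been killed or reclaimed.
  bool blocks_active() const { return !kept_for_reclaim.empty(); }
};

// Mirrors the loop body of the excerpt: the presence of a "timeout"
// metadata key decides between "keep for reclaim" and "evict".
ReconnectOutcome classify_remaining(const std::vector<SessionSketch>& remaining) {
  ReconnectOutcome out;
  for (const auto& session : remaining) {
    if (session.client_metadata.find("timeout") !=
        session.client_metadata.end()) {
      out.kept_for_reclaim.push_back(session.client);
      continue;
    }
    out.evicted.push_back(session.client);
  }
  return out;
}
```

If `blocks_active()` is true for the sessions left over after the reconnect window, the corresponding MDS log would be expected to contain the "reconnect keeps ... need to be reclaimed" message from the excerpt rather than only "reconnect gives up on ...".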