Bug 2034261 - [GSS] Segmentation fault in MDS pod
Summary: [GSS] Segmentation fault in MDS pod
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Importance: medium unspecified
Target Milestone: ---
Assignee: Venky Shankar
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-12-20 14:18 UTC by Priya Pandey
Modified: 2023-08-09 16:37 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-31 16:17:08 UTC
Embargoed:



Description Priya Pandey 2021-12-20 14:18:20 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


- The customer is facing an issue where the MDS pod is restarting intermittently.

- The MDS pod recently crashed with a segmentation fault.

---------------------------------------

    "backtrace": [
        "(()+0x12b20) [0x7fef053e7b20]",
        "(PurgeQueue::_go_readonly(int)+0x46) [0x5613c071c1b6]",
        "(()+0x2e3673) [0x5613c0720673]",
        "(FunctionContext::finish(int)+0x30) [0x5613c0592720]",
        "(Context::complete(int)+0xd) [0x5613c05908ad]",
        "(Finisher::finisher_thread_entry()+0x18d) [0x7fef0768d41d]",
        "(()+0x814a) [0x7fef053dd14a]",
        "(clone()+0x43) [0x7fef03ef6dc3]"
---------------------------------------

Version of all relevant components (if applicable):

v4.8.5

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

- The OCS cluster is in WARN state and the MDS pods keep restarting.


Is there any workaround available to the best of your knowledge?
N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
N/A

Can this issue be reproduced?
No

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
N/A


Actual results:

The MDS daemon crashed with a segmentation fault.

MDS pod is restarting intermittently.


Expected results:

The MDS pod and daemon should be running without any issues.


Additional info:

In the next comments

Comment 6 Greg Farnum 2022-01-04 14:30:44 UTC
Venky, please check this out. :)

Comment 7 Venky Shankar 2022-01-05 07:06:35 UTC
(In reply to Greg Farnum from comment #6)
> Venky, please check this out. :)

ACK.

Comment 8 Venky Shankar 2022-01-05 12:20:32 UTC
The crash seems like a side effect of the MDS daemon getting terminated.

The MDS is processing a number of client unlink requests, thereby involving the stray manager and the purge queue. Then the active MDS gets a SIGTERM (why?):

```
 -2921> 2021-12-15 19:16:27.620 7fef00832700  5 mds.beacon.ocs-storagecluster-cephfilesystem-a received beacon reply up:active seq 256 rtt 0
 -2920> 2021-12-15 19:16:28.826 7feefe82e700  5 asok(0x5613c2aa8000) AdminSocket: request 'get_command_descriptions' 'escriptions"}' to 0x5613c2a20030 returned 5523 bytes
 -2919> 2021-12-15 19:16:28.837 7feefe82e700  1 mds.ocs-storagecluster-cephfilesystem-a asok_command: status (starting...)
 -2918> 2021-12-15 19:16:30.368 7feefe02d700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
 -2917> 2021-12-15 19:16:30.369 7feefe02d700 -1 mds.ocs-storagecluster-cephfilesystem-a *** got signal Terminated ***
```

There are a large number of completion-context messages (PurgeQueue related), but those look fine since the MDS is terminating:

```
    -7> 2021-12-15 19:16:35.623 7feef8021700  4 MDSIOContextBase::complete: dropping for stopping 21C_IO_PurgeStrayPurged
    -6> 2021-12-15 19:16:35.623 7feef8021700  4 MDSIOContextBase::complete: dropping for stopping 21C_IO_PurgeStrayPurged
    -5> 2021-12-15 19:16:35.623 7feef8021700  4 MDSIOContextBase::complete: dropping for stopping 21C_IO_PurgeStrayPurged
```
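
For illustration only, here is a minimal, hypothetical C++ sketch of the "drop on stopping" guard that these messages point at: a finisher thread discards a pending completion context once the daemon has started shutting down. The types MDSLike, IOContextBase and PurgeStrayPurged are made up for the example and are not Ceph's actual classes.

```
// Hypothetical sketch of the "drop on stopping" pattern suggested by the
// log lines above; names and structure are illustrative, not Ceph's code.
#include <atomic>
#include <iostream>
#include <string>

struct MDSLike {
  std::atomic<bool> stopping{false};
};

struct IOContextBase {
  virtual ~IOContextBase() = default;
  virtual std::string name() const = 0;
  virtual void finish(int r) = 0;

  // complete() runs on a finisher thread; if the daemon is already shutting
  // down, the context is dropped instead of being finished.
  void complete(MDSLike& mds, int r) {
    if (mds.stopping) {
      std::cout << "complete: dropping for stopping " << name() << "\n";
      delete this;
      return;
    }
    finish(r);
    delete this;
  }
};

struct PurgeStrayPurged : IOContextBase {
  std::string name() const override { return "C_IO_PurgeStrayPurged"; }
  void finish(int) override { /* would update stray-manager state here */ }
};

int main() {
  MDSLike mds;
  mds.stopping = true;                          // daemon received SIGTERM
  (new PurgeStrayPurged())->complete(mds, 0);   // prints the "dropping" line
}
```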

The PurgeQueue journaler hits a read error, causing the PurgeQueue to go read-only, and then we segfault somewhere in PurgeQueue::_go_readonly():

```
    -4> 2021-12-15 19:16:35.623 7feef8021700  0 mds.0.journaler.pq(rw) _finish_read got error -108
    -3> 2021-12-15 19:16:35.623 7feef8021700  1 mds.0.purge_queue _go_readonly: going readonly because internal IO failed: Cannot send after transport endpoint shutdown
```

We may have hit a bug at this point; however, I'm concerned about why the MDS got a termination signal in the first place. Was this due to OOM? Where can I check this in the case report?
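
To make the suspected failure shape concrete, here is an illustrative-only C++ sketch of the kind of shutdown race described above: a journal read error (-108, ESHUTDOWN) queues the read-only transition on a finisher thread while the daemon is already tearing itself down. The types (Journaler, PurgeQueueLike) and the guard are hypothetical and do not mirror Ceph's PurgeQueue implementation.

```
// Illustrative-only sketch of the suspected race, not Ceph code: the error
// callback runs after shutdown has started, so without a guard it would touch
// state that has already been released.
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

struct Journaler {
  void set_readonly() { std::cout << "journaler now read-only\n"; }
};

struct PurgeQueueLike {
  std::unique_ptr<Journaler> journaler = std::make_unique<Journaler>();
  bool stopping = false;

  void go_readonly(int err) {
    std::cout << "going readonly because internal IO failed: " << err << "\n";
    // Without this guard, dereferencing the journaler after teardown is the
    // kind of access that shows up as a segmentation fault in the
    // read-only transition.
    if (stopping || !journaler)
      return;
    journaler->set_readonly();
  }

  void shutdown() {
    stopping = true;
    journaler.reset();  // internal state is released when the daemon stops
  }
};

int main() {
  PurgeQueueLike pq;
  std::vector<std::function<void(int)>> finisher;

  // A journal read error (-108, ESHUTDOWN) schedules the read-only switch...
  finisher.push_back([&pq](int r) { pq.go_readonly(r); });

  // ...but the daemon is terminating, so teardown races with the callback.
  pq.shutdown();
  for (auto& cb : finisher)
    cb(-108);
}
```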

Comment 10 Venky Shankar 2022-01-06 09:24:01 UTC
Hey Priya,

How big are the nodes (in terms of memory, etc.)?

Also, are any MDS config values overridden (esp. mds_cache_memory_limit)?

Cheers,
Venky

