Description of problem (please be as detailed as possible and provide log snippets):
- The customer is facing an issue where the MDS pod is restarting intermittently.
- The MDS pod recently crashed with a segmentation fault.
---------------------------------------
"backtrace": [
    "(()+0x12b20) [0x7fef053e7b20]",
    "(PurgeQueue::_go_readonly(int)+0x46) [0x5613c071c1b6]",
    "(()+0x2e3673) [0x5613c0720673]",
    "(FunctionContext::finish(int)+0x30) [0x5613c0592720]",
    "(Context::complete(int)+0xd) [0x5613c05908ad]",
    "(Finisher::finisher_thread_entry()+0x18d) [0x7fef0768d41d]",
    "(()+0x814a) [0x7fef053dd14a]",
    "(clone()+0x43) [0x7fef03ef6dc3]"
---------------------------------------

Version of all relevant components (if applicable):
v4.8.5

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
- The OCS cluster is in WARN state and the MDS pods are restarting.

Is there any workaround available to the best of your knowledge?
N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
N/A

Can this issue be reproduced?
No

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
N/A

Actual results:
The MDS daemon crashed with a segmentation fault and the MDS pod is restarting intermittently.

Expected results:
The MDS pod and daemon should run without any issues.

Additional info:
In the next comments.
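For reference, the backtrace reads bottom-up: the Finisher worker thread pops a queued completion, Context::complete() dispatches to FunctionContext::finish(), and the wrapped callback lands in PurgeQueue::_go_readonly(), where the fault is reported. The minimal C++ sketch below only mirrors that dispatch shape for illustration; the class names echo the frames above, but the bodies are mock code, not Ceph's implementation.

```cpp
// Illustrative mock of the completion-context dispatch seen in the backtrace.
// Names mirror the frames (Finisher, Context, FunctionContext, PurgeQueue),
// but this is NOT Ceph code -- only the call shape is reproduced.
#include <functional>
#include <iostream>
#include <queue>
#include <utility>

struct Context {
  virtual ~Context() = default;
  virtual void finish(int r) = 0;
  // Frame: Context::complete(int)
  void complete(int r) { finish(r); delete this; }
};

// Frame: FunctionContext::finish(int) -- wraps an arbitrary callback.
struct FunctionContext : Context {
  std::function<void(int)> fn;
  explicit FunctionContext(std::function<void(int)> f) : fn(std::move(f)) {}
  void finish(int r) override { fn(r); }
};

// Hypothetical stand-in for the purge queue's error path.
struct PurgeQueue {
  bool readonly = false;
  // Frame: PurgeQueue::_go_readonly(int) -- where the segfault is reported.
  void _go_readonly(int r) {
    readonly = true;
    std::cout << "going readonly, error=" << r << "\n";
  }
};

// Frame: Finisher::finisher_thread_entry() -- drains queued completions.
struct Finisher {
  std::queue<Context*> q;
  void queue(Context* c) { q.push(c); }
  void drain(int r) {
    while (!q.empty()) { Context* c = q.front(); q.pop(); c->complete(r); }
  }
};

int main() {
  PurgeQueue pq;
  Finisher fin;
  // A journaler read error queues "go readonly" as a completion context...
  fin.queue(new FunctionContext([&pq](int r) { pq._go_readonly(r); }));
  // ...which later runs on the finisher thread (drained inline here).
  fin.drain(-108);  // -108 (ESHUTDOWN) is the error seen in the MDS log
  return 0;
}
```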
Venky, please check this out. :)
(In reply to Greg Farnum from comment #6)
> Venky, please check this out. :)

ACK.
The crash seems like a side effect of the MDS daemon getting terminated. The MDS is processing a number of client unlink requests, which involves the stray manager and the purge queue. Then the active MDS gets a SIGTERM (why?):

```
-2921> 2021-12-15 19:16:27.620 7fef00832700 5 mds.beacon.ocs-storagecluster-cephfilesystem-a received beacon reply up:active seq 256 rtt 0
-2920> 2021-12-15 19:16:28.826 7feefe82e700 5 asok(0x5613c2aa8000) AdminSocket: request 'get_command_descriptions' 'escriptions"}' to 0x5613c2a20030 returned 5523 bytes
-2919> 2021-12-15 19:16:28.837 7feefe82e700 1 mds.ocs-storagecluster-cephfilesystem-a asok_command: status (starting...)
-2918> 2021-12-15 19:16:30.368 7feefe02d700 -1 received signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
-2917> 2021-12-15 19:16:30.369 7feefe02d700 -1 mds.ocs-storagecluster-cephfilesystem-a *** got signal Terminated ***
```

There are a lot of completion-context messages (PurgeQueue related), but those look fine since the MDS is terminating:

```
-7> 2021-12-15 19:16:35.623 7feef8021700 4 MDSIOContextBase::complete: dropping for stopping 21C_IO_PurgeStrayPurged
-6> 2021-12-15 19:16:35.623 7feef8021700 4 MDSIOContextBase::complete: dropping for stopping 21C_IO_PurgeStrayPurged
-5> 2021-12-15 19:16:35.623 7feef8021700 4 MDSIOContextBase::complete: dropping for stopping 21C_IO_PurgeStrayPurged
```

The PurgeQueue journaler hits a read error, causing the PurgeQueue to go read-only, and then we segfault somewhere in PurgeQueue::_go_readonly():

```
-4> 2021-12-15 19:16:35.623 7feef8021700 0 mds.0.journaler.pq(rw) _finish_read got error -108
-3> 2021-12-15 19:16:35.623 7feef8021700 1 mds.0.purge_queue _go_readonly: going readonly because internal IO failed: Cannot send after transport endpoint shutdown
```

We may have hit a bug at this point; however, I'm concerned about why the MDS got a termination signal in the first place. Was this due to OOM? Where can I check this in the case report?
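To make the suspicion concrete: if a completion context queued before the SIGTERM still refers to PurgeQueue state that shutdown has already released, running it on the finisher thread during teardown would produce exactly this kind of fault. The sketch below is a hypothetical illustration of that ordering hazard and one defensive pattern (a weak-reference liveness guard); it is mock code and not a claim about where the actual bug in Ceph lies.

```cpp
// Hypothetical illustration of a shutdown-ordering hazard: a callback queued
// before teardown runs afterwards. A weak_ptr guard is one defensive pattern.
// This is mock code, not Ceph's implementation.
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

struct PurgeQueueState {
  bool readonly = false;
  void go_readonly(int r) {
    readonly = true;
    std::cout << "going readonly, error=" << r << "\n";
  }
};

int main() {
  auto pq = std::make_shared<PurgeQueueState>();
  std::vector<std::function<void(int)>> pending;

  // Queued while the MDS is still active: capture weakly so the callback can
  // detect that shutdown has already destroyed the object it refers to.
  pending.push_back([w = std::weak_ptr<PurgeQueueState>(pq)](int r) {
    if (auto p = w.lock()) {
      p->go_readonly(r);
    } else {
      std::cout << "dropping completion: purge queue already torn down\n";
    }
  });

  // SIGTERM arrives; shutdown releases the purge queue before the finisher
  // drains its queue. With a raw pointer capture, this drain would be a
  // use-after-free -- the shape of fault the backtrace suggests.
  pq.reset();
  for (auto& cb : pending) cb(-108);  // -ESHUTDOWN, as in the log
  return 0;
}
```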
Hey Priya,

How big are the nodes (in terms of memory, etc.)? Also, are any MDS config values overridden (especially mds_cache_memory_limit)?

Cheers,
Venky