Description of problem (please be as detailed as possible and provide log snippets):

One of the following two commands caused a severe, cluster-wide storage outage:

$ ceph --admin-daemon /var/run/ceph/<ceph-mds>.asok status
$ ceph tell mds.<mds-name> status

One of these commands, run either by the customer or by something in ODF, caused the MDSs to crash and sit in journal replay for an excessive amount of time. Ceph-backed workloads went down and were difficult to get back up and running. The MDS crashed while handling the status admin socket command:

2024-05-27T13:09:42.318488158Z debug     -8> 2024-05-27T13:09:42.222+0000 7f7dbf659640  5 mds.0.log _submit_thread 39083975766566~2061 : EUpdate openc [metablob 0x1005b7abe3a, 2 dirs]
2024-05-27T13:09:42.318488158Z debug     -7> 2024-05-27T13:09:42.222+0000 7f7dc6667640  4 mds.0.server handle_client_request client_request(client.37787558:180979905 create #0x10012d40363/_7zlb6.nvd 2024-05-27T13:09:40.965503+0000 caller_uid=1001000000, caller_gid=0{0,1001000000,}) v4
2024-05-27T13:09:42.318488158Z debug     -6> 2024-05-27T13:09:42.222+0000 7f7dbf659640  5 mds.0.log _submit_thread 39083975768647~2589 : EUpdate openc [metablob 0x10012d40362, 2 dirs]
2024-05-27T13:09:42.318507432Z debug     -5> 2024-05-27T13:09:42.222+0000 7f7dc6667640  3 mds.0.server handle_client_session client_session(request_renewcaps seq 324635) from client.38893383
2024-05-27T13:09:42.318507432Z debug     -4> 2024-05-27T13:09:42.222+0000 7f7dc6667640  3 mds.0.server handle_client_session client_session(request_renewcaps seq 324635) from client.38893386
2024-05-27T13:09:42.318507432Z debug     -3> 2024-05-27T13:09:42.222+0000 7f7dc6667640  3 mds.0.server handle_client_session client_session(request_renewcaps seq 324635) from client.38893389
2024-05-27T13:09:42.318527336Z debug     -2> 2024-05-27T13:09:42.222+0000 7f7dc6667640  3 mds.0.server handle_client_session client_session(request_renewcaps seq 324635) from client.39148864
2024-05-27T13:09:42.318527336Z debug     -1> 2024-05-27T13:09:42.222+0000 7f7dc6667640  4 mds.0.server handle_client_request client_request(client.39148870:24017996 getattr AsLsXsFs #0x100b88c1123 2024-05-27T13:09:40.311495+0000 caller_uid=1000980000, caller_gid=501{0,500,501,1000980000,}) v4
2024-05-27T13:09:42.318527336Z debug      0> 2024-05-27T13:09:42.222+0000 7f7dc866b640 -1 *** Caught signal (Segmentation fault) **
2024-05-27T13:09:42.318527336Z  in thread 7f7dc866b640 thread_name:admin_socket
2024-05-27T13:09:42.318527336Z
2024-05-27T13:09:42.318527336Z  ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)
2024-05-27T13:09:42.318527336Z  1: /lib64/libc.so.6(+0x54db0) [0x7f7dcb503db0]
2024-05-27T13:09:42.318527336Z  2: (MDSDaemon::dump_status(ceph::Formatter*)+0x2f6) [0x557b1b442826]
2024-05-27T13:09:42.318527336Z  3: (MDSDaemon::asok_command(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x590) [0x557b1b443f40]
2024-05-27T13:09:42.318527336Z  4: ceph-mds(+0x12f7f8) [0x557b1b4447f8]
2024-05-27T13:09:42.318527336Z  5: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x57a) [0x7f7dcbc56d0a]
2024-05-27T13:09:42.318527336Z  6: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, std::ostream&, ceph::buffer::v15_2_0::list*)+0x11a) [0x7f7dcbc577aa]
2024-05-27T13:09:42.318527336Z  7: (AdminSocket::do_accept()+0x2b6) [0x7f7dcbc5a976]
2024-05-27T13:09:42.318527336Z  8: (AdminSocket::entry()+0x488) [0x7f7dcbc5b7a8]
2024-05-27T13:09:42.318527336Z  9: /lib64/libstdc++.so.6(+0xdb924) [0x7f7dcb88b924]
2024-05-27T13:09:42.318527336Z  10: /lib64/libc.so.6(+0x9f802) [0x7f7dcb54e802]
2024-05-27T13:09:42.318527336Z  11: /lib64/libc.so.6(+0x3f450) [0x7f7dcb4ee450]
2024-05-27T13:09:42.318527336Z  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The cluster has since recovered; however, someone or something (the customer or an ODF resource) ran the admin socket command or `ceph tell mds.<name> status`, which caused this crash. The expected behavior is that the MDS should never crash while handling the status command, so this points to a defect in the status command handling.
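For illustration only (this is not the Ceph source): frame 2 of the backtrace puts the fault inside MDSDaemon::dump_status on the admin_socket thread, which would be consistent with the status handler dereferencing daemon state (for example, a rank/MDSMap pointer) that is null or being torn down at that moment. That cause is an assumption, not something confirmed from this log. The hypothetical sketch below (FakeMDSDaemon and MDSRankState are invented names) shows the defensive pattern such a handler would need: copy the pointer under the daemon lock and null-check it before formatting output.

// Hypothetical illustration only -- not the actual Ceph MDSDaemon code.
// It sketches the suspected failure mode: the admin_socket thread handling
// "status" reads per-rank state that another thread can reset concurrently,
// so the handler must take the daemon lock and null-check before use.
#include <cstdint>
#include <iostream>
#include <memory>
#include <mutex>
#include <string>

struct MDSRankState {               // stand-in for the state dump_status reads
  std::string state = "up:active";
  uint64_t osdmap_epoch = 42;
};

class FakeMDSDaemon {
  std::mutex lock;                        // analogue of the daemon-wide lock
  std::shared_ptr<MDSRankState> rank;     // may be null while booting/stopping

public:
  void set_rank(std::shared_ptr<MDSRankState> r) {
    std::lock_guard<std::mutex> g(lock);
    rank = std::move(r);
  }

  // Safe "status" handler: copy the pointer under the lock, then null-check.
  // Dereferencing `rank` without this check is the kind of crash the
  // segfault in MDSDaemon::dump_status suggests (hypothesis, not confirmed).
  std::string dump_status() {
    std::shared_ptr<MDSRankState> r;
    {
      std::lock_guard<std::mutex> g(lock);
      r = rank;
    }
    if (!r)
      return "{\"state\": \"up:boot\"}";  // degrade gracefully instead of crashing
    return "{\"state\": \"" + r->state +
           "\", \"osdmap_epoch\": " + std::to_string(r->osdmap_epoch) + "}";
  }
};

int main() {
  FakeMDSDaemon mds;
  std::cout << mds.dump_status() << "\n"; // rank not yet created: no crash
  mds.set_rank(std::make_shared<MDSRankState>());
  std::cout << mds.dump_status() << "\n";
}

The point of the sketch is only that a read-only introspection command should degrade gracefully when the daemon is mid-transition rather than segfault; the actual root cause needs to come from the coredump/objdump analysis noted above.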
Version of all relevant components (if applicable):

OCP:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.24   True        False         15d     Cluster version is 4.13.24

ODF:
NAME                                     DISPLAY                       VERSION        REPLACES                                  PHASE
mcg-operator.v4.13.7-rhodf               NooBaa Operator               4.13.7-rhodf   mcg-operator.v4.12.11-rhodf               Succeeded
ocs-operator.v4.13.7-rhodf               OpenShift Container Storage   4.13.7-rhodf   ocs-operator.v4.12.11-rhodf               Succeeded
odf-csi-addons-operator.v4.13.7-rhodf    CSI Addons                    4.13.7-rhodf   odf-csi-addons-operator.v4.12.11-rhodf    Succeeded
odf-operator.v4.13.7-rhodf               OpenShift Data Foundation     4.13.7-rhodf   odf-operator.v4.12.11-rhodf               Succeeded

Ceph:
{
    "mon": {
        "ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)": 1
    },
    "osd": {
        "ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)": 36
    },
    "mds": {
        "ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)": 2
    },
    "overall": {
        "ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)": 42
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
This was a heavily escalated production cluster. Even after the MDSs were stable again, so many workloads were affected that it took extensive manual intervention in the affected namespaces (deleting pods, scaling workloads, etc.) to get those workloads back online.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4

Additional info:
(See Private Comment)