Bug 2292738 - [GSS] The mds.<mds-name> status Command Caused MDSs to Crash and Enter up:replay Yielding Major Storage Outage [NEEDINFO]
Summary: [GSS] The mds.<mds-name> status Command Caused MDSs to Crash and Enter up:replay Yielding Major Storage Outage
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Venky Shankar
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-06-17 19:02 UTC by Craig Wayman
Modified: 2024-09-12 16:57 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
crwayman: needinfo-
muagarwa: needinfo? (vshankar)


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OCSBZM-8528 0 None None None 2024-07-17 00:23:36 UTC

Description Craig Wayman 2024-06-17 19:02:41 UTC
Description of problem (please be as detailed as possible and provide log snippets):

  One of the following two commands caused a severe, cluster-wide storage outage.

$ ceph --admin-daemon /var/run/ceph/<ceph-mds>.asok status


$ ceph tell mds.<mds-name> status

  The above two commands, run either by the customer or by something in ODF, caused the MDSs to crash and enter a journal replay state for an excessive amount of time, taking Ceph-backed workloads down and making them difficult to bring back up.
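
  For reference (an assumption about how the commands were issued, not confirmed in this report), in an ODF cluster these commands are typically run either from the rook-ceph-tools (toolbox) pod or from inside the MDS pod itself; the pod, container, and socket names below are illustrative placeholders:

# "ceph tell" variant, from the toolbox pod (assumes the default
# openshift-storage namespace and an enabled rook-ceph-tools deployment):
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph tell mds.<mds-name> status

# Admin-socket variant, from inside the MDS pod that hosts the daemon
# (container name "mds" is the usual Rook default; adjust if it differs):
$ oc -n openshift-storage exec <rook-ceph-mds-pod> -c mds -- \
    ceph --admin-daemon /var/run/ceph/<ceph-mds>.asok status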

The MDS crashed while handling the status admin socket command:

2024-05-27T13:09:42.318488158Z debug     -8> 2024-05-27T13:09:42.222+0000 7f7dbf659640  5 mds.0.log _submit_thread 39083975766566~2061 : EUpdate openc [metablob 0x1005b7abe3a, 2 dirs]
2024-05-27T13:09:42.318488158Z debug     -7> 2024-05-27T13:09:42.222+0000 7f7dc6667640  4 mds.0.server handle_client_request client_request(client.37787558:180979905 create #0x10012d40363/_7zlb6.nvd 2024-05-27T13:09:40.965503+0000 caller_uid=1001000000, caller_gid=0{0,1001000000,}) v4
2024-05-27T13:09:42.318488158Z debug     -6> 2024-05-27T13:09:42.222+0000 7f7dbf659640  5 mds.0.log _submit_thread 39083975768647~2589 : EUpdate openc [metablob 0x10012d40362, 2 dirs]
2024-05-27T13:09:42.318507432Z debug     -5> 2024-05-27T13:09:42.222+0000 7f7dc6667640  3 mds.0.server handle_client_session client_session(request_renewcaps seq 324635) from client.38893383
2024-05-27T13:09:42.318507432Z debug     -4> 2024-05-27T13:09:42.222+0000 7f7dc6667640  3 mds.0.server handle_client_session client_session(request_renewcaps seq 324635) from client.38893386
2024-05-27T13:09:42.318507432Z debug     -3> 2024-05-27T13:09:42.222+0000 7f7dc6667640  3 mds.0.server handle_client_session client_session(request_renewcaps seq 324635) from client.38893389
2024-05-27T13:09:42.318527336Z debug     -2> 2024-05-27T13:09:42.222+0000 7f7dc6667640  3 mds.0.server handle_client_session client_session(request_renewcaps seq 324635) from client.39148864
2024-05-27T13:09:42.318527336Z debug     -1> 2024-05-27T13:09:42.222+0000 7f7dc6667640  4 mds.0.server handle_client_request client_request(client.39148870:24017996 getattr AsLsXsFs #0x100b88c1123 2024-05-27T13:09:40.311495+0000 caller_uid=1000980000, caller_gid=501{0,500,501,1000980000,}) v4
2024-05-27T13:09:42.318527336Z debug      0> 2024-05-27T13:09:42.222+0000 7f7dc866b640 -1 *** Caught signal (Segmentation fault) **
2024-05-27T13:09:42.318527336Z  in thread 7f7dc866b640 thread_name:admin_socket
2024-05-27T13:09:42.318527336Z
2024-05-27T13:09:42.318527336Z  ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)
2024-05-27T13:09:42.318527336Z  1: /lib64/libc.so.6(+0x54db0) [0x7f7dcb503db0]
2024-05-27T13:09:42.318527336Z  2: (MDSDaemon::dump_status(ceph::Formatter*)+0x2f6) [0x557b1b442826]
2024-05-27T13:09:42.318527336Z  3: (MDSDaemon::asok_command(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x590) [0x557b1b443f40]
2024-05-27T13:09:42.318527336Z  4: ceph-mds(+0x12f7f8) [0x557b1b4447f8]
2024-05-27T13:09:42.318527336Z  5: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x57a) [0x7f7dcbc56d0a]
2024-05-27T13:09:42.318527336Z  6: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, std::ostream&, ceph::buffer::v15_2_0::list*)+0x11a) [0x7f7dcbc577aa]
2024-05-27T13:09:42.318527336Z  7: (AdminSocket::do_accept()+0x2b6) [0x7f7dcbc5a976]
2024-05-27T13:09:42.318527336Z  8: (AdminSocket::entry()+0x488) [0x7f7dcbc5b7a8]
2024-05-27T13:09:42.318527336Z  9: /lib64/libstdc++.so.6(+0xdb924) [0x7f7dcb88b924]
2024-05-27T13:09:42.318527336Z  10: /lib64/libc.so.6(+0x9f802) [0x7f7dcb54e802]
2024-05-27T13:09:42.318527336Z  11: /lib64/libc.so.6(+0x3f450) [0x7f7dcb4ee450]
2024-05-27T13:09:42.318527336Z  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

  The cluster has since recovered. However, someone or something (the customer or an ODF resource) ran the admin socket command or $ ceph tell mds.<name> status, which caused this crash. The expected behavior is that the MDS should not crash while handling the status command, so this indicates a defect in the status command handling.
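
  If the crash was captured by the Ceph crash module (an assumption; the report does not confirm this), the full crash metadata and backtrace can be pulled from the toolbox for triage; <crash-id> below is a placeholder:

# List crashes recorded by the crash module, then dump one in full:
$ ceph crash ls
$ ceph crash info <crash-id>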


Version of all relevant components (if applicable):

OCP:

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.24   True        False         15d     Cluster version is 4.13.24

ODF:

NAME                                    DISPLAY                       VERSION        REPLACES                                 PHASE
mcg-operator.v4.13.7-rhodf              NooBaa Operator               4.13.7-rhodf   mcg-operator.v4.12.11-rhodf              Succeeded
ocs-operator.v4.13.7-rhodf              OpenShift Container Storage   4.13.7-rhodf   ocs-operator.v4.12.11-rhodf              Succeeded
odf-csi-addons-operator.v4.13.7-rhodf   CSI Addons                    4.13.7-rhodf   odf-csi-addons-operator.v4.12.11-rhodf   Succeeded
odf-operator.v4.13.7-rhodf              OpenShift Data Foundation     4.13.7-rhodf   odf-operator.v4.12.11-rhodf              Succeeded


Ceph:

{
    "mon": {
        "ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)": 1
    },
    "osd": {
        "ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)": 36
    },
    "mds": {
        "ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)": 2
    },
    "overall": {
        "ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)": 42
    }
}
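
For reference, the version data above can be reproduced with the following commands (a sketch, assuming the default openshift-storage namespace and an enabled toolbox pod):

# OCP cluster version:
$ oc get clusterversion

# ODF operator CSVs:
$ oc -n openshift-storage get csv

# Ceph daemon versions, from the rook-ceph-tools (toolbox) pod:
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph versions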



Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

  This was a heavily escalated production cluster. Even after we were able to get the MDSs stable, so many workloads were affected that it took a lot of manual intervention in those namespaces (deleting pods, scaling workloads, etc.) to get them back online again.


Is there any workaround available to the best of your knowledge?

No


Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

4



Additional info:
(See Private Comment)

