Bug 2282346

| Summary: | [RDR] Multiple MDS crashes seen on the surviving cluster post hub recovery | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Aman Agrawal <amagrawa> |
| Component: | ceph | Assignee: | Venky Shankar <vshankar> |
| Ceph sub component: | CephFS | QA Contact: | Elad <ebenahar> |
| Status: | ASSIGNED | Docs Contact: | |
| Severity: | high | CC: | bniver, khiremat, muagarwa, sheggodu, sostapov, vshankar |
| Priority: | unspecified | Flags: | amagrawa: needinfo? (vshankar) |
| Version: | 4.16 | Target Milestone: | --- |
| Target Release: | --- | Hardware: | Unspecified |
| OS: | Unspecified | Whiteboard: | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Aman Agrawal, 2024-05-22 08:55:44 UTC)
As per offline discussion with Venky: since different crash info is seen for different crashes, we don't yet know which crash is causing the failover to remain stuck. We will use this BZ to investigate further and open new BZs if needed based on the findings.

A few crash outputs for reference:

bash-5.1$ ceph crash info 2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12
{
    "archived": "2024-05-14 19:09:40.190367",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f6d122fcdb0]",
        "pthread_getname_np()",
        "(ceph::logging::Log::dump_recent()+0x5c5) [0x7f6d12c84bf5]",
        "(MDSDaemon::respawn()+0x15a) [0x55d7aeeb0c7a]",
        "ceph-mds(+0x143e5d) [0x55d7aee98e5d]",
        "(MDSRank::handle_write_error(int)+0x1af) [0x55d7aeece3ef]",
        "ceph-mds(+0x1a88e4) [0x55d7aeefd8e4]",
        "ceph-mds(+0x143e5d) [0x55d7aee98e5d]",
        "(Finisher::finisher_thread_entry()+0x175) [0x7f6d12a31145]",
        "/lib64/libc.so.6(+0x9f802) [0x7f6d12347802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f6d122e7450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "238e13b05e2f9c033f80644a94b3324df6b15f78ec4772f7c900a997f8566e3f",
    "timestamp": "2024-05-14T18:52:04.860762Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd586658k6w8",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}

bash-5.1$ ceph crash info 2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7ff74acafdb0]",
        "ceph-mds(+0x22d2ed) [0x557c354f42ed]",
        "ceph-mds(+0x5a7d02) [0x557c3586ed02]",
        "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x51) [0x557c35778fc1]",
        "(EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x79e) [0x557c3577fb6e]",
        "(EOpen::replay(MDSRank*)+0x55) [0x557c3578c6c5]",
        "(MDLog::_replay_thread()+0x75e) [0x557c356ea52e]",
        "ceph-mds(+0x16cf21) [0x557c35433f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7ff74acfa802]",
        "/lib64/libc.so.6(+0x3f450) [0x7ff74ac9a450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "534595eadbe3cd5e36a861179a9d229df6085a48ed4bf3ee7982825650a239f5",
    "timestamp": "2024-05-17T10:56:29.510669Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}

bash-5.1$ ceph crash info 2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07
{
    "assert_condition": "(bool)_front == (bool)_size",
    "assert_file": "/builddir/build/BUILD/ceph-18.2.1/src/include/xlist.h",
    "assert_func": "size_t xlist<T>::size() const [with T = LRUObject*; size_t = long unsigned int]",
    "assert_line": 87,
    "assert_msg": "/builddir/build/BUILD/ceph-18.2.1/src/include/xlist.h: In function 'size_t xlist<T>::size() const [with T = LRUObject*; size_t = long unsigned int]' thread 7f3c3d3fc640 time 2024-05-17T20:34:57.806097+0000\n/builddir/build/BUILD/ceph-18.2.1/src/include/xlist.h: 87: FAILED ceph_assert((bool)_front == (bool)_size)\n",
    "assert_thread_name": "md_log_replay",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f3c49eccdb0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f3c49f1954c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f3c4a549068]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x1631cc) [0x7f3c4a5491cc]",
        "ceph-mds(+0x1479fe) [0x56357b3779fe]",
        "(CDir::add_null_dentry(std::basic_string_view<char, std::char_traits<char> >, snapid_t, snapid_t)+0x29a) [0x56357b5a226a]",
        "(EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x1853) [0x56357b6e9c23]",
        "(EOpen::replay(MDSRank*)+0x55) [0x56357b6f56c5]",
        "(MDLog::_replay_thread()+0x75e) [0x56357b65352e]",
        "ceph-mds(+0x16cf21) [0x56357b39cf21]",
        "/lib64/libc.so.6(+0x9f802) [0x7f3c49f17802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f3c49eb7450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "ea1ea719ad2b0630e8fcf810ee54c1655e5bc626e9caddabb987b8b82f28bfea",
    "timestamp": "2024-05-17T20:34:57.807764Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}

bash-5.1$ ceph crash info 2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7fba580ebdb0]",
        "ceph-mds(+0x2e5548) [0x5617387fd548]",
        "(MDCache::finish_uncommitted_fragment(dirfrag_t, int)+0x8c) [0x5617387e8c5c]",
        "(EFragment::replay(MDSRank*)+0x26b) [0x5617389e374b]",
        "(MDLog::_replay_thread()+0x75e) [0x56173893b52e]",
        "ceph-mds(+0x16cf21) [0x561738684f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7fba58136802]",
        "/lib64/libc.so.6(+0x3f450) [0x7fba580d6450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "9146a54bbc96beab126447db7eb36673a320813b826ed34354414b4148a7d86c",
    "timestamp": "2024-05-20T00:27:27.924980Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}

This covers all the crashes reported in the Description of problem above. Most of the crashes have similar crash info, so only the unique ones are listed here.
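For quick triage of how many distinct backtraces are in play, the crash list can be grouped by stack_sig, since each distinct stack_sig corresponds to one unique backtrace. A minimal sketch, run from the rook-ceph-tools pod and assuming jq is available there:

# List every recorded crash (ID, entity, NEW flag)
ceph crash ls

# Count crashes per unique stack signature
for id in $(ceph crash ls | awk 'NR>1 {print $1}'); do
  ceph crash info "$id" | jq -r .stack_sig
done | sort | uniq -c | sort -rn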
(In reply to Aman Agrawal from comment #3)
> bash-5.1$ ceph crash info 2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd
> "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x51) [0x557c35778fc1]",
> "(EOpen::replay(MDSRank*)+0x55) [0x557c3578c6c5]",

The above crash seems to have been initially reported as part of BZ 2218759, and there was an ask to reproduce it, hence I added a comment here: https://bugzilla.redhat.com/show_bug.cgi?id=2218759#c74

All other crashes still need investigation.
Hi Aman,

(In reply to Aman Agrawal from comment #3)
> bash-5.1$ ceph crash info 2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12
> "(MDSDaemon::respawn()+0x15a) [0x55d7aeeb0c7a]",
> "(MDSRank::handle_write_error(int)+0x1af) [0x55d7aeece3ef]",

We fixed a couple of bugs related to the above crash (backtrace). Do you have the MDS coredump?

> bash-5.1$ ceph crash info 2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07
> "assert_condition": "(bool)_front == (bool)_size",

This is (likely) a new crash - haven't seen this backtrace yet. Again, where can I find the MDS coredump?

> bash-5.1$ ceph crash info 2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109
> "(MDCache::finish_uncommitted_fragment(dirfrag_t, int)+0x8c) [0x5617387e8c5c]",

This crash looks new.

Wow - we are dealing with at least two new MDS crashes. The crash in the BZ description (got_journaled_ack) is a known issue that happens with a standby-replay MDS and is covered in https://tracker.ceph.com/issues/54741.

Aman, I assume the core dumps are still accessible since these crashes were seen sometime this week. Could you please find a way to share those with engineering?
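On the coredump question: whether a core was captured depends on the node's kernel.core_pattern. A possible way to check from the node hosting the crashed MDS pod (compute-2 here is one of the nodes running an MDS pod in this cluster; the commands assume the node routes cores to systemd-coredump, which may not be the case in this environment):

# Look for ceph-mds cores captured by systemd-coredump on the node
oc debug node/compute-2 -- chroot /host coredumpctl list ceph-mds

# If one is listed, write it out to a file on the host that can then be copied off
oc debug node/compute-2 -- chroot /host coredumpctl dump --output=/tmp/ceph-mds.core ceph-mds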
"2024-05-17T20:34:57.807764Z", > "utsname_hostname": > "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c", > "utsname_machine": "x86_64", > "utsname_release": "5.14.0-427.13.1.el9_4.x86_64", > "utsname_sysname": "Linux", > "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024" > } > > > > bash-5.1$ ceph crash info > 2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109 > { > "backtrace": [ > "/lib64/libc.so.6(+0x54db0) [0x7fba580ebdb0]", > "ceph-mds(+0x2e5548) [0x5617387fd548]", > "(MDCache::finish_uncommitted_fragment(dirfrag_t, int)+0x8c) > [0x5617387e8c5c]", This crash looks new. > "(EFragment::replay(MDSRank*)+0x26b) [0x5617389e374b]", > "(MDLog::_replay_thread()+0x75e) [0x56173893b52e]", > "ceph-mds(+0x16cf21) [0x561738684f21]", > "/lib64/libc.so.6(+0x9f802) [0x7fba58136802]", > "/lib64/libc.so.6(+0x3f450) [0x7fba580d6450]" > ], > "ceph_version": "18.2.1-136.el9cp", > "crash_id": > "2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109", > "entity_name": "mds.ocs-storagecluster-cephfilesystem-a", > "os_id": "rhel", > "os_name": "Red Hat Enterprise Linux", > "os_version": "9.3 (Plow)", > "os_version_id": "9.3", > "process_name": "ceph-mds", > "stack_sig": > "9146a54bbc96beab126447db7eb36673a320813b826ed34354414b4148a7d86c", > "timestamp": "2024-05-20T00:27:27.924980Z", > "utsname_hostname": > "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c", > "utsname_machine": "x86_64", > "utsname_release": "5.14.0-427.13.1.el9_4.x86_64", > "utsname_sysname": "Linux", > "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024" > } > Wow - we are dealing with at least two new MDS crashes. The crash in the BZ description (got_journaled_ack) is a known issue which happens with standby-replay MDS and that's covered in https://tracker.ceph.com/issues/54741. Aman, I assume the core dumps are still accessible since these crashes were seen sometime this week. Could you please find a way to share those with engineering. The failover issue mentioned in this bug with discussed with Ramen team in today's RDR triage meeting and it was concluded that the stuck failover is due to https://bugzilla.redhat.com/show_bug.cgi?id=2283038 and should not be because of MDS crashes reported in this BZ. However, MDS crash issue should still be prioritised and investigated. I am updating the bug title and details accordingly. Venky, must-gather logs shared above should have the coredumps, please check and confirm- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/20may24-mds-high-log-level/ We still have the setup in case logs are missing and we would need to re-visit it hence requesting an ack/nack on the logs provided earlier. I tried collecting pstack for both mds-a and mds-b by making them active with log level ceph config set mds debug_mds 20 ceph config set mds debug_ms 1 but I don't think much info. was collected. Refer- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/27may24/ Pls note, only 1 daemon was available so I had to respin the pods and thereafter collect the logs. 
(In reply to Aman Agrawal from comment #6)
> Venky, the must-gather logs shared above should have the coredumps, please check and confirm:
> http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/20may24-mds-high-log-level/

Thank you. I will have a look.

(In reply to Venky Shankar from comment #9)
> Thank you. I will have a look.

Please share the latest status.