Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)
OCP 4.16.0-0.nightly-2024-04-26-145258
ODF 4.16.0-89.stable
ACM 2.10.2
MCE 2.5.2

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
*****Active hub co-situated with primary managed cluster*****
1. On an RDR setup, perform a site failure by bringing the active hub and the primary managed cluster down, then move to the passive hub by performing hub recovery.
2. Fail over all the workloads running on the down managed cluster to the surviving managed cluster.
3. After successful failover, recover the down managed cluster.
4. Fail over one of the CephFS workloads where PeerReady is marked as true but the ReplicationDestination isn't created because of the eviction period (currently 24 hours). Successful failover should be possible after BZ2283038 is fixed.

During all these operations, keep an eye on the health of the MDS pods and on any crashes.

Actual results:
MDS crashes are seen on the surviving cluster C2, to which the workloads were failed over.

pods|grep mds
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c   2/2   Running   1023 (9m33s ago)   7d13h   10.128.2.63    compute-2   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp   2/2   Running   1006 (17m ago)     7d13h   10.131.0.241   compute-1   <none>   <none>

oc -n openshift-storage rsh "$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)"

ceph crash ls
ID                                                                ENTITY                                     NEW
2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669  mds.ocs-storagecluster-cephfilesystem-a
2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12  mds.ocs-storagecluster-cephfilesystem-b
2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3  mds.ocs-storagecluster-cephfilesystem-a    *
2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65  mds.ocs-storagecluster-cephfilesystem-b    *
2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f  mds.ocs-storagecluster-cephfilesystem-a    *
2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd  mds.ocs-storagecluster-cephfilesystem-b    *
2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07  mds.ocs-storagecluster-cephfilesystem-a    *
2024-05-17T21:56:39.158029Z_12c1efa9-ecfc-4c32-9024-77423ae09ecf  mds.ocs-storagecluster-cephfilesystem-b    *
2024-05-18T00:36:35.787255Z_40eb624a-a7ed-4415-b17f-4085bd9eac9b  mds.ocs-storagecluster-cephfilesystem-b    *
2024-05-18T03:32:14.163891Z_84c9b511-ac45-4bf2-9040-8014883e80a9  mds.ocs-storagecluster-cephfilesystem-b    *
2024-05-19T12:56:35.497622Z_b3c014e8-cba6-496d-a673-e2b2f026be21  mds.ocs-storagecluster-cephfilesystem-b    *
2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109  mds.ocs-storagecluster-cephfilesystem-a    *
2024-05-20T08:06:13.398353Z_5ebabaee-807c-4910-ba27-14eeee4b4fba  mds.ocs-storagecluster-cephfilesystem-a    *
2024-05-20T17:27:47.267271Z_2c553b1c-d6db-43e8-9421-2994c1745e40  mds.ocs-storagecluster-cephfilesystem-b    *
2024-05-20T18:18:11.530034Z_bf609bf3-25c1-4342-ad7b-f942579daeeb  mds.ocs-storagecluster-cephfilesystem-a    *
2024-05-21T22:21:43.087757Z_06bdcccb-e7f5-4f95-a310-a053aed2fc0a  mds.ocs-storagecluster-cephfilesystem-b    *
2024-05-22T01:07:49.214137Z_bd01b408-376c-4c80-822e-ba3ae76e371f  mds.ocs-storagecluster-cephfilesystem-b    *
2024-05-22T01:13:58.832386Z_9da0f105-acb7-4026-a497-0f0da1b77f08  mds.ocs-storagecluster-cephfilesystem-b    *
2024-05-22T03:54:30.740192Z_0dc95829-aa4a-4e34-aaeb-39cd3cf4d7c7  mds.ocs-storagecluster-cephfilesystem-a    *

bash-5.1$ ceph crash info 2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669
{
    "archived": "2024-05-14 19:09:39.585770",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f1a1abf3db0]",
        "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x56062dc65c43]",
        "(MDLog::_replay_thread()+0x75e) [0x56062dcb852e]",
        "ceph-mds(+0x16cf21) [0x56062da01f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7f1a1ac3e802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f1a1abde450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
    "timestamp": "2024-05-14T17:33:05.811016Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfd6f5m4",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}

bash-5.1$ ceph crash info 2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f60d8013db0]",
        "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x55daad309c43]",
        "(MDLog::_replay_thread()+0x75e) [0x55daad35c52e]",
        "ceph-mds(+0x16cf21) [0x55daad0a5f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7f60d805e802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f60d7ffe450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
    "timestamp": "2024-05-15T11:11:58.726060Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}

bash-5.1$ ceph crash info 2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f7cbf29bdb0]",
        "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x55bd68c5dc43]",
        "(MDLog::_replay_thread()+0x75e) [0x55bd68cb052e]",
        "ceph-mds(+0x16cf21) [0x55bd689f9f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7f7cbf2e6802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f7cbf286450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
"36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80", "timestamp": "2024-05-15T12:26:15.553277Z", "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp", "utsname_machine": "x86_64", "utsname_release": "5.14.0-427.13.1.el9_4.x86_64", "utsname_sysname": "Linux", "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024" bash-5.1$ ceph crash info 2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f { "backtrace": [ "/lib64/libc.so.6(+0x54db0) [0x7f8e450e2db0]", "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x5616ea8d8c43]", "(MDLog::_replay_thread()+0x75e) [0x5616ea92b52e]", "ceph-mds(+0x16cf21) [0x5616ea674f21]", "/lib64/libc.so.6(+0x9f802) [0x7f8e4512d802]", "/lib64/libc.so.6(+0x3f450) [0x7f8e450cd450]" ], "ceph_version": "18.2.1-136.el9cp", "crash_id": "2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f", "entity_name": "mds.ocs-storagecluster-cephfilesystem-a", "os_id": "rhel", "os_name": "Red Hat Enterprise Linux", "os_version": "9.3 (Plow)", "os_version_id": "9.3", "process_name": "ceph-mds", "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80", "timestamp": "2024-05-16T07:07:03.233208Z", "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c", "utsname_machine": "x86_64", "utsname_release": "5.14.0-427.13.1.el9_4.x86_64", "utsname_sysname": "Linux", "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024" } Must-gather logs from the cluster is kept here- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/20may24-mds-high-log-level/ Expected results: MDS shouldn't unexpectedly crash when cluster isn't heavily loaded and failover is performed. Additional info:
As per offline discussion with Venky:

Since the crashes show different backtraces, we don't yet know which crash is causing the failover to remain stuck. We will use this BZ to investigate further and open new BZs if needed based on the findings.

A few crash outputs for reference:

bash-5.1$ ceph crash info 2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12
{
    "archived": "2024-05-14 19:09:40.190367",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f6d122fcdb0]",
        "pthread_getname_np()",
        "(ceph::logging::Log::dump_recent()+0x5c5) [0x7f6d12c84bf5]",
        "(MDSDaemon::respawn()+0x15a) [0x55d7aeeb0c7a]",
        "ceph-mds(+0x143e5d) [0x55d7aee98e5d]",
        "(MDSRank::handle_write_error(int)+0x1af) [0x55d7aeece3ef]",
        "ceph-mds(+0x1a88e4) [0x55d7aeefd8e4]",
        "ceph-mds(+0x143e5d) [0x55d7aee98e5d]",
        "(Finisher::finisher_thread_entry()+0x175) [0x7f6d12a31145]",
        "/lib64/libc.so.6(+0x9f802) [0x7f6d12347802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f6d122e7450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "238e13b05e2f9c033f80644a94b3324df6b15f78ec4772f7c900a997f8566e3f",
    "timestamp": "2024-05-14T18:52:04.860762Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd586658k6w8",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}

bash-5.1$ ceph crash info 2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7ff74acafdb0]",
        "ceph-mds(+0x22d2ed) [0x557c354f42ed]",
        "ceph-mds(+0x5a7d02) [0x557c3586ed02]",
        "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x51) [0x557c35778fc1]",
        "(EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x79e) [0x557c3577fb6e]",
        "(EOpen::replay(MDSRank*)+0x55) [0x557c3578c6c5]",
        "(MDLog::_replay_thread()+0x75e) [0x557c356ea52e]",
        "ceph-mds(+0x16cf21) [0x557c35433f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7ff74acfa802]",
        "/lib64/libc.so.6(+0x3f450) [0x7ff74ac9a450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "534595eadbe3cd5e36a861179a9d229df6085a48ed4bf3ee7982825650a239f5",
    "timestamp": "2024-05-17T10:56:29.510669Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}

bash-5.1$ ceph crash info 2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07
{
    "assert_condition": "(bool)_front == (bool)_size",
    "assert_file": "/builddir/build/BUILD/ceph-18.2.1/src/include/xlist.h",
    "assert_func": "size_t xlist<T>::size() const [with T = LRUObject*; size_t = long unsigned int]",
    "assert_line": 87,
    "assert_msg": "/builddir/build/BUILD/ceph-18.2.1/src/include/xlist.h: In function 'size_t xlist<T>::size() const [with T = LRUObject*; size_t = long unsigned int]' thread 7f3c3d3fc640 time 2024-05-17T20:34:57.806097+0000\n/builddir/build/BUILD/ceph-18.2.1/src/include/xlist.h: 87: FAILED ceph_assert((bool)_front == (bool)_size)\n",
    "assert_thread_name": "md_log_replay",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f3c49eccdb0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f3c49f1954c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f3c4a549068]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x1631cc) [0x7f3c4a5491cc]",
        "ceph-mds(+0x1479fe) [0x56357b3779fe]",
        "(CDir::add_null_dentry(std::basic_string_view<char, std::char_traits<char> >, snapid_t, snapid_t)+0x29a) [0x56357b5a226a]",
        "(EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x1853) [0x56357b6e9c23]",
        "(EOpen::replay(MDSRank*)+0x55) [0x56357b6f56c5]",
        "(MDLog::_replay_thread()+0x75e) [0x56357b65352e]",
        "ceph-mds(+0x16cf21) [0x56357b39cf21]",
        "/lib64/libc.so.6(+0x9f802) [0x7f3c49f17802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f3c49eb7450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "ea1ea719ad2b0630e8fcf810ee54c1655e5bc626e9caddabb987b8b82f28bfea",
    "timestamp": "2024-05-17T20:34:57.807764Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}

bash-5.1$ ceph crash info 2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7fba580ebdb0]",
        "ceph-mds(+0x2e5548) [0x5617387fd548]",
        "(MDCache::finish_uncommitted_fragment(dirfrag_t, int)+0x8c) [0x5617387e8c5c]",
        "(EFragment::replay(MDSRank*)+0x26b) [0x5617389e374b]",
        "(MDLog::_replay_thread()+0x75e) [0x56173893b52e]",
        "ceph-mds(+0x16cf21) [0x561738684f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7fba58136802]",
        "/lib64/libc.so.6(+0x3f450) [0x7fba580d6450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "9146a54bbc96beab126447db7eb36673a320813b826ed34354414b4148a7d86c",
    "timestamp": "2024-05-20T00:27:27.924980Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}

This covers all the crashes reported in the Description of problem above. Most of the crashes have similar crash info, so only the unique ones are listed here.
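Since the unique crashes were identified by comparing backtraces, a quick way to group them is by stack_sig (identical signatures mean identical backtraces, so almost certainly the same underlying issue, e.g. 36c718b7... is shared by the got_journaled_ack crashes above). A small helper, run from the toolbox pod; this is a sketch and assumes jq is available in the toolbox image:

# print "<stack_sig> <crash_id>" for every recorded crash, then count occurrences per signature
for id in $(ceph crash ls | awk 'NR>1 {print $1}'); do
    printf '%s %s\n' "$(ceph crash info "$id" | jq -r '.stack_sig')" "$id"
done | sort | awk '{print $1}' | uniq -c | sort -rn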
(In reply to Aman Agrawal from comment #3)
> As per offline discussion with Venky:
> [...]
>
> bash-5.1$ ceph crash info 2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12
> [...]
>
> bash-5.1$ ceph crash info 2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd
> [...]

The above crash (2024-05-17T10:56:29) appears to have been initially reported as part of BZ2218759, and there was an ask to reproduce it, hence I added a comment there:
https://bugzilla.redhat.com/show_bug.cgi?id=2218759#c74

All other crashes still need investigation.

> bash-5.1$ ceph crash info 2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07
> [...]
>
> bash-5.1$ ceph crash info 2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109
> [...]
>
> This covers all the crashes reported in the Description of problem above.
> Most of the crashes have similar crash info, so only the unique ones are listed here.
Hi Aman,

(In reply to Aman Agrawal from comment #3)
> As per offline discussion with Venky:
> [...]
>
> bash-5.1$ ceph crash info 2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12
> {
>     "backtrace": [
>         "/lib64/libc.so.6(+0x54db0) [0x7f6d122fcdb0]",
>         "pthread_getname_np()",
>         "(ceph::logging::Log::dump_recent()+0x5c5) [0x7f6d12c84bf5]",
>         "(MDSDaemon::respawn()+0x15a) [0x55d7aeeb0c7a]",
>         "ceph-mds(+0x143e5d) [0x55d7aee98e5d]",
>         "(MDSRank::handle_write_error(int)+0x1af) [0x55d7aeece3ef]",
>         "ceph-mds(+0x1a88e4) [0x55d7aeefd8e4]",
>         "ceph-mds(+0x143e5d) [0x55d7aee98e5d]",
>         "(Finisher::finisher_thread_entry()+0x175) [0x7f6d12a31145]",
>         "/lib64/libc.so.6(+0x9f802) [0x7f6d12347802]",
>         "/lib64/libc.so.6(+0x3f450) [0x7f6d122e7450]"
>     ],
> [...]

We fixed a couple of bugs related to the above crash (backtrace). Do you have the MDS coredump?

> bash-5.1$ ceph crash info 2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd
> [...]
>
> bash-5.1$ ceph crash info 2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07
> {
>     "assert_condition": "(bool)_front == (bool)_size",

This is (likely) a new crash - haven't seen this backtrace yet. Again, where can I find the MDS coredump?

> [...]
>
> bash-5.1$ ceph crash info 2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109
> {
>     "backtrace": [
> [...]
>         "(MDCache::finish_uncommitted_fragment(dirfrag_t, int)+0x8c) [0x5617387e8c5c]",

This crash looks new.

> [...]

Wow - we are dealing with at least two new MDS crashes. The crash in the BZ description (got_journaled_ack) is a known issue which happens with standby-replay MDS and is covered in https://tracker.ceph.com/issues/54741.

Aman, I assume the core dumps are still accessible since these crashes were seen sometime this week. Could you please find a way to share those with engineering?
The failover issue mentioned in this bug was discussed with the Ramen team in today's RDR triage meeting, and it was concluded that the stuck failover is due to https://bugzilla.redhat.com/show_bug.cgi?id=2283038 and not because of the MDS crashes reported in this BZ.

However, the MDS crash issue should still be prioritised and investigated. I am updating the bug title and details accordingly.

Venky, the must-gather logs shared above should have the coredumps, please check and confirm:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/20may24-mds-high-log-level/
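In case they are missing from the must-gather, one way to check for MDS coredumps directly on the nodes is via systemd-coredump. This is a sketch; it assumes RHCOS nodes with the default core handling and uses the node names from the pod listing above:

# list any ceph-mds cores recorded on the node hosting mds-a
oc debug node/compute-2 -- chroot /host coredumpctl list ceph-mds

# the core files themselves, if kept, land under /var/lib/systemd/coredump on the node
oc debug node/compute-2 -- chroot /host ls -lh /var/lib/systemd/coredump/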
We still have the setup in case logs are missing and we need to revisit it, hence requesting an ack/nack on the logs provided earlier.
I tried collecting pstack for both mds-a and mds-b by making them active with the following log levels (a minimal version of the sequence is sketched after the outputs below):

ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

but I don't think much info was collected. Refer:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/27may24/

Please note, only one daemon was available, so I had to respin the pods and collect the logs afterwards.

Time: somewhere close to Mon May 27 17:32:39 UTC 2024

bash-5.1$ ceph -s
  cluster:
    id:     119cc23a-0ffa-4ed8-ab81-bf8f48c88b8c
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            1 MDSs behind on trimming
            18 daemons have recently crashed

  services:
    mon:        3 daemons, quorum d,f,g (age 4h)
    mgr:        b(active, since 12d), standbys: a
    mds:        1/1 daemons up
    osd:        3 osds: 3 up (since 12d), 3 in (since 3w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 524.65k objects, 37 GiB
    usage:   112 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     169 active+clean

  io:
    client: 1.1 MiB/s rd, 33 MiB/s wr, 194 op/s rd, 26 op/s wr

pods|grep mds
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c   2/2   Running   1164 (4d13h ago)   12d   10.128.2.63    compute-2   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp   2/2   Running   1843 (10m ago)     12d   10.131.0.241   compute-1   <none>   <none>
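For reference, a minimal version of the collection sequence. This is a sketch: the MDS pod name is taken from the listing above, while the "mds" container name and the presence of pstack/pidof inside the MDS container are assumptions:

# from the rook-ceph toolbox: raise the MDS debug levels
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

# from a workstation with oc access: grab a stack trace and the debug-level logs of the active MDS
MDS_POD=rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c
oc -n openshift-storage exec "$MDS_POD" -c mds -- bash -c 'pstack $(pidof ceph-mds)' > mds-a-pstack.txt
oc -n openshift-storage logs "$MDS_POD" -c mds > mds-a-debug.log

# revert the debug levels once done
ceph config rm mds debug_mds
ceph config rm mds debug_ms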
(In reply to Aman Agrawal from comment #6)
> [...]
>
> Venky, must-gather logs shared above should have the coredumps, please check and confirm:
> http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/20may24-mds-high-log-level/

Thank you. I will have a look.
(In reply to Venky Shankar from comment #9)
> (In reply to Aman Agrawal from comment #6)
> > [...]
>
> Thank you. I will have a look.

Please share the latest status.