Description of problem (please be as detailed as possible and provide log snippets):

On an RDR setup, while running tier4 node failure tests (one worker node failure at a time), it was observed that Ceph health did not recover at the end of the test due to a daemon crash:

sh-5.1$ ceph health
HEALTH_WARN 1 daemons have recently crashed

sh-5.1$ ceph crash ls
ID                                                                ENTITY                                   NEW
2024-02-20T05:07:42.092421Z_a7bef93f-0350-4d34-a7dc-c386dcb7e762  mds.ocs-storagecluster-cephfilesystem-b   *

sh-5.1$ ceph crash info 2024-02-20T05:07:42.092421Z_a7bef93f-0350-4d34-a7dc-c386dcb7e762
{
    "assert_condition": "segments.size() >= pre_segments_size",
    "assert_file": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc",
    "assert_func": "void MDLog::trim(int)",
    "assert_line": 651,
    "assert_msg": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: In function 'void MDLog::trim(int)' thread 7f8d48e72640 time 2024-02-20T05:07:42.091421+0000\n/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: 651: FAILED ceph_assert(segments.size() >= pre_segments_size)\n",
    "assert_thread_name": "safe_timer",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f8d4f511db0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f8d4f55e54c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f8d4fb6db4b]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x142caf) [0x7f8d4fb6dcaf]",
        "(MDLog::trim(int)+0xb06) [0x563488982f96]",
        "(MDSRankDispatcher::tick()+0x365) [0x563488705515]",
        "ceph-mds(+0x11c9bd) [0x5634886d79bd]",
        "(CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x15e) [0x7f8d4fc5749e]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x22cd91) [0x7f8d4fc57d91]",
        "/lib64/libc.so.6(+0x9f802) [0x7f8d4f55c802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f8d4f4fc450]"
    ],
    "ceph_version": "17.2.6-196.el9cp",
    "crash_id": "2024-02-20T05:07:42.092421Z_a7bef93f-0350-4d34-a7dc-c386dcb7e762",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "21cf82abf00a9a80ef194472005415a53e94d6965c4e910d756a9f711243f498",
    "timestamp": "2024-02-20T05:07:42.092421Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-69b58cdcwcpv2",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.52.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Tue Jan 30 08:35:38 EST 2024"
}

Version of all relevant components (if applicable):
ODF: 4.15.0-144.stable (ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable))
OCP: 4.15.0-0.nightly-2024-02-16-235514
ACM: 2.10.0-78 (2.10.0-DOWNSTREAM-2024-02-18-03-53-23)
Submariner: 0.17.0 (iib:666535)
VolSync: 0.8.0

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
Ceph health is in a warning state.

Is there any workaround available to the best of your knowledge?
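The health warning itself can be acknowledged by archiving the crash report once it has been triaged; this only clears the HEALTH_WARN and does not fix the underlying MDS assert. A minimal sketch, assuming the default ODF toolbox deployment (rook-ceph-tools in the openshift-storage namespace):

$ oc -n openshift-storage rsh deploy/rook-ceph-tools
# confirm the crash ID reported by "ceph crash ls", then archive it
sh-5.1$ ceph crash archive 2024-02-20T05:07:42.092421Z_a7bef93f-0350-4d34-a7dc-c386dcb7e762
# or archive every crash still marked as NEW
sh-5.1$ ceph crash archive-all
# health should return to HEALTH_OK once no recent crashes remain
sh-5.1$ ceph health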
Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
a. Deploy an RDR setup
b. Run the automated tier4 test tests/functional/disaster-recovery/regional-dr/test_managed_cluster_node_failure.py

The issue was hit during the test:
tests/functional/disaster-recovery/regional-dr/test_managed_cluster_node_failure.py::TestManagedClusterNodeFailure::test_single_managed_cluster_node_failure[rbd-mirror]

The automated test executes the steps below:
1. Deploy an application containing 20 PVCs/Pods on C1 (RBD based workloads)
2. Fail the C1 cluster node (power off the VM) where the rbd-mirror pod is running
3. Wait for the old rbd-mirror pod to be deleted and a new pod to start
4. Start the node and wait for it to come up
5. Wait for the ODF, DR and Submariner related pods to reach Running state
6. Check that the mirroring status is OK
7. Repeat steps 2 to 6 on cluster C2
8. Check ceph health on both clusters at the end (a manual check sketch is included under Additional info below)
9. Observed that Ceph health does not become OK on the C1 and C2 clusters

On C1, the health warning is due to another bug (2214499#c35).
On C2, the health warning is due to the ceph-mds crash (this bug).

Important node related events for C2 during the test:
05:02:05 - Powered off compute-2, where rook-ceph-rbd-mirror-a-6b9f797df9-r99n6 is hosted
05:02:47 - Node compute-2 reached status NotReady
05:04:24 - Powered on the compute-2 node
05:05:26 - Node reached Ready state

Test run console logs: https://url.corp.redhat.com/1cdcbc4

Actual results:
ceph-mds crash with "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f8d4fb6db4b]"

Expected results:
Ceph should remain healthy without any crashes.

Additional info:
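For reference, a hedged sketch of how the node selection (step 2), mirroring status (step 6) and final health check (step 8) can be verified manually on each managed cluster. The pod label, CephBlockPool name and status field path below are assumptions based on a default ODF/Rook deployment and may differ on other setups:

# find the node currently hosting the rbd-mirror pod before powering it off
$ oc -n openshift-storage get pods -l app=rook-ceph-rbd-mirror -o wide

# after the node is back, check mirroring health on the default block pool
$ oc -n openshift-storage get cephblockpool ocs-storagecluster-cephblockpool \
    -o jsonpath='{.status.mirroringStatus.summary.health}{"\n"}'

# final ceph health check from the toolbox
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health detail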
*** This bug has been marked as a duplicate of bug 2258950 ***