Bug 2265110 - [RDR] ceph-mds crash with "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f8d4fb6db4b]"
Summary: [RDR] ceph-mds crash with "(ceph::__ceph_assert_fail(char const*, char const*...
Keywords:
Status: CLOSED DUPLICATE of bug 2258950
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Venky Shankar
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-02-20 13:06 UTC by Sidhant Agrawal
Modified: 2024-02-20 13:24 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-02-20 13:24:14 UTC
Embargoed:



Description Sidhant Agrawal 2024-02-20 13:06:14 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
On an RDR setup, while running tier4 node failure tests (one worker node failure at a time), it was observed that Ceph health did not recover at the end of the test because of an MDS daemon crash.

sh-5.1$ ceph health
HEALTH_WARN 1 daemons have recently crashed

sh-5.1$ ceph crash ls
ID                                                                ENTITY                                   NEW
2024-02-20T05:07:42.092421Z_a7bef93f-0350-4d34-a7dc-c386dcb7e762  mds.ocs-storagecluster-cephfilesystem-b   *

sh-5.1$ ceph crash info 2024-02-20T05:07:42.092421Z_a7bef93f-0350-4d34-a7dc-c386dcb7e762
{
    "assert_condition": "segments.size() >= pre_segments_size",
    "assert_file": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc",
    "assert_func": "void MDLog::trim(int)",
    "assert_line": 651,
    "assert_msg": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: In function 'void MDLog::trim(int)' thread 7f8d48e72640 time 2024-02-20T05:07:42.091421+0000\n/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: 651: FAILED ceph_assert(segments.size() >= pre_segments_size)\n",
    "assert_thread_name": "safe_timer",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f8d4f511db0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f8d4f55e54c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f8d4fb6db4b]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x142caf) [0x7f8d4fb6dcaf]",
        "(MDLog::trim(int)+0xb06) [0x563488982f96]",
        "(MDSRankDispatcher::tick()+0x365) [0x563488705515]",
        "ceph-mds(+0x11c9bd) [0x5634886d79bd]",
        "(CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x15e) [0x7f8d4fc5749e]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x22cd91) [0x7f8d4fc57d91]",
        "/lib64/libc.so.6(+0x9f802) [0x7f8d4f55c802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f8d4f4fc450]"
    ],
    "ceph_version": "17.2.6-196.el9cp",
    "crash_id": "2024-02-20T05:07:42.092421Z_a7bef93f-0350-4d34-a7dc-c386dcb7e762",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "21cf82abf00a9a80ef194472005415a53e94d6965c4e910d756a9f711243f498",
    "timestamp": "2024-02-20T05:07:42.092421Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-69b58cdcwcpv2",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.52.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Tue Jan 30 08:35:38 EST 2024"
}
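
For context, the failed assertion means MDLog::trim() found fewer journal segments than the count it had recorded beforehand (pre_segments_size). Below is a minimal standalone C++ sketch of that assertion pattern, only to illustrate the invariant named in assert_condition above; it is not the actual MDLog.cc code, and everything other than the names segments, pre_segments_size and trim is a stand-in.

// Minimal standalone sketch of the assertion pattern from the crash above.
// This is NOT the real MDLog.cc code (that lives in src/mds/MDLog.cc in the
// Ceph source tree); the types and values here are illustrative stand-ins.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

struct LogSegment {};  // stand-in for the real MDS journal segment type

struct MDLogSketch {
    std::map<uint64_t, LogSegment> segments;  // journal segments keyed by offset
    std::size_t pre_segments_size = 0;        // segment count recorded earlier by the MDS

    void trim() {
        // The real trim() expires old journal segments and then re-checks its
        // bookkeeping with ceph_assert(); the crash shows that check failing,
        // i.e. the segment list shrank below the previously recorded count.
        assert(segments.size() >= pre_segments_size);  // stand-in for ceph_assert()
    }
};

int main() {
    MDLogSketch log;
    log.pre_segments_size = 2;  // hypothetical count recorded before trimming
    log.segments.clear();       // segments disappear -> invariant is violated
    log.trim();                 // aborts here, mirroring the ceph-mds crash
    return 0;
}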

Version of all relevant components (if applicable):
ODF: 4.15.0-144.stable
(ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable))
OCP: 4.15.0-0.nightly-2024-02-16-235514
ACM: 2.10.0-78 (2.10.0-DOWNSTREAM-2024-02-18-03-53-23)
Submariner: 0.17.0 (iib:666535)
VolSync: 0.8.0

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Ceph health is in a warning state.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
a. Deploy RDR setup
b. Run the automated tier4 test tests/functional/disaster-recovery/regional-dr/test_managed_cluster_node_failure.py

The issue was hit during the test tests/functional/disaster-recovery/regional-dr/test_managed_cluster_node_failure.py::TestManagedClusterNodeFailure::test_single_managed_cluster_node_failure[rbd-mirror]

The automated test executes the following steps:
1. Deploy an application containing 20 PVCs/Pods on C1 (RBD-based workloads)
2. Fail the C1 cluster node (power off the VM) where the rbd-mirror pod is running
3. Wait for the old rbd-mirror pod to be deleted and a new pod to start
4. Start the node and wait for the node to come up
5. Wait for ODF, DR and Submariner related pods to reach Running state
6. Check that the mirroring status is OK
7. Repeat steps 2 to 6 on cluster C2
8. Check Ceph health on both clusters at the end
9. Observed that Ceph health did not become OK on the C1 and C2 clusters

On C1, the health warning is due to another bug (bug 2214499, comment 35)
On C2, the health warning is due to the ceph-mds crash (this bug)

Important node-related events for C2 during the test:
05:02:05 - Power off compute-2 where rook-ceph-rbd-mirror-a-6b9f797df9-r99n6 is hosted
05:02:47 - Node compute-2 reached status NotReady
05:04:24 - Powered on compute-2 node
05:05:26 - Node reached Ready state

Testrun console logs: https://url.corp.redhat.com/1cdcbc4


Actual results:
ceph-mds crash with "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f8d4fb6db4b]"

Expected results:
Ceph should remain healthy without any crashes

Additional info:

Comment 3 Venky Shankar 2024-02-20 13:24:14 UTC

*** This bug has been marked as a duplicate of bug 2258950 ***

