Bug 2265110

Summary: [RDR] ceph-mds crash with "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f8d4fb6db4b]"
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Sidhant Agrawal <sagrawal>
Component: ceph
Assignee: Venky Shankar <vshankar>
ceph sub component: CephFS
QA Contact: Elad <ebenahar>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: high
Priority: unspecified
CC: bniver, kseeger, muagarwa, sostapov
Version: 4.15
Keywords: Automation
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-02-20 13:24:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sidhant Agrawal 2024-02-20 13:06:14 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
On an RDR setup, while running tier4 node failure tests (one worker node failure at a time), it was observed that Ceph health did not recover at the end of the run due to a daemon crash.

sh-5.1$ ceph health
HEALTH_WARN 1 daemons have recently crashed

sh-5.1$ ceph crash ls
ID                                                                ENTITY                                   NEW
2024-02-20T05:07:42.092421Z_a7bef93f-0350-4d34-a7dc-c386dcb7e762  mds.ocs-storagecluster-cephfilesystem-b   *

sh-5.1$ ceph crash info 2024-02-20T05:07:42.092421Z_a7bef93f-0350-4d34-a7dc-c386dcb7e762
{
    "assert_condition": "segments.size() >= pre_segments_size",
    "assert_file": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc",
    "assert_func": "void MDLog::trim(int)",
    "assert_line": 651,
    "assert_msg": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: In function 'void MDLog::trim(int)' thread 7f8d48e72640 time 2024-02-20T05:07:42.091421+0000\n/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: 651: FAILED ceph_assert(segments.size() >= pre_segments_size)\n",
    "assert_thread_name": "safe_timer",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f8d4f511db0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f8d4f55e54c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f8d4fb6db4b]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x142caf) [0x7f8d4fb6dcaf]",
        "(MDLog::trim(int)+0xb06) [0x563488982f96]",
        "(MDSRankDispatcher::tick()+0x365) [0x563488705515]",
        "ceph-mds(+0x11c9bd) [0x5634886d79bd]",
        "(CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x15e) [0x7f8d4fc5749e]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x22cd91) [0x7f8d4fc57d91]",
        "/lib64/libc.so.6(+0x9f802) [0x7f8d4f55c802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f8d4f4fc450]"
    ],
    "ceph_version": "17.2.6-196.el9cp",
    "crash_id": "2024-02-20T05:07:42.092421Z_a7bef93f-0350-4d34-a7dc-c386dcb7e762",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "21cf82abf00a9a80ef194472005415a53e94d6965c4e910d756a9f711243f498",
    "timestamp": "2024-02-20T05:07:42.092421Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-69b58cdcwcpv2",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.52.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Tue Jan 30 08:35:38 EST 2024"
}
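
For context, the crash was only surfaced by the post-test health verification. Below is a minimal, standalone sketch of such a check; it is not the ocs-ci implementation, and it assumes the ceph CLI is on PATH and talks to the affected cluster (for example when run inside the rook-ceph tools pod), with the JSON output of "ceph crash ls-new" assumed to carry the same metadata keys shown in the crash info above.

#!/usr/bin/env python3
"""Sketch: flag a cluster as unhealthy when ceph reports new daemon crashes."""
import json
import subprocess


def ceph(*args):
    """Run a ceph CLI subcommand and return its stdout as text."""
    result = subprocess.run(
        ("ceph", *args), check=True, capture_output=True, text=True
    )
    return result.stdout


def new_crashes():
    """Return the crashes ceph still marks as new (not archived)."""
    # `ceph crash ls-new` lists only un-archived crashes; like most ceph
    # commands it accepts --format json (assumed key names as in crash info).
    return json.loads(ceph("crash", "ls-new", "--format", "json"))


if __name__ == "__main__":
    health = ceph("health").strip()
    crashes = new_crashes()
    print(f"ceph health: {health}")
    for crash in crashes:
        print(f"new crash: {crash['crash_id']} ({crash['entity_name']})")
    if not health.startswith("HEALTH_OK") or crashes:
        raise SystemExit("cluster did not return to HEALTH_OK after the test")

In this run such a check reports the HEALTH_WARN shown above plus the single new mds crash.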

Version of all relevant components (if applicable):
ODF: 4.15.0-144.stable
(ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable))
OCP: 4.15.0-0.nightly-2024-02-16-235514
ACM: 2.10.0-78 (2.10.0-DOWNSTREAM-2024-02-18-03-53-23)
Submariner: 0.17.0 (iib:666535)
VolSync: 0.8.0

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Ceph health remains in a warning state (HEALTH_WARN).

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
a. Deploy an RDR setup
b. Run the automated tier4 test tests/functional/disaster-recovery/regional-dr/test_managed_cluster_node_failure.py

The issue was hit during the test tests/functional/disaster-recovery/regional-dr/test_managed_cluster_node_failure.py::TestManagedClusterNodeFailure::test_single_managed_cluster_node_failure[rbd-mirror]

The automated test executes the steps below:
1. Deploy an application containing 20 PVCs/Pods on C1 (RBD-based workloads)
2. Fail the C1 cluster node (power off the VM) hosting the rbd-mirror pod
3. Wait for the old rbd-mirror pod to be deleted and a new pod to start
4. Power the node back on and wait for it to come up
5. Wait for ODF, DR and Submariner related pods to reach Running state
6. Check that the mirroring status is OK
7. Repeat steps 2 to 6 on cluster C2
8. Check Ceph health on both clusters at the end (a simplified sketch of this check appears below)
9. Observed that Ceph health does not become OK on either the C1 or the C2 cluster

On C1, the health warning is due to another bug (bug 2214499#c35).
On C2, the health warning is due to the ceph-mds crash (this bug).
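
For reference, a simplified, standalone sketch of the kind of wait used in step 8 is shown below. This is not the actual ocs-ci code; the timeout and poll interval are arbitrary, and the ceph CLI is assumed to be reachable (e.g. from the rook-ceph tools pod).

import subprocess
import time


def wait_for_ceph_health_ok(timeout=600, interval=30):
    """Poll `ceph health` until it reports HEALTH_OK or the timeout expires.

    Sketch only: timeout/interval values are arbitrary, and the ceph CLI is
    assumed to be available in the environment.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        health = subprocess.run(
            ["ceph", "health"], capture_output=True, text=True
        ).stdout.strip()
        if health.startswith("HEALTH_OK"):
            return True
        print(f"ceph not healthy yet: {health}")
        time.sleep(interval)
    return False

In this run the wait never succeeded: C2 stayed in HEALTH_WARN because of the mds.ocs-storagecluster-cephfilesystem-b crash described above.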

Important node-related events on C2 during the test:
05:02:05 - Power off compute-2 where rook-ceph-rbd-mirror-a-6b9f797df9-r99n6 is hosted
05:02:47 - Node compute-2 reached status NotReady
05:04:24 - Powered on compute-2 node
05:05:26 - Node reached Ready state

Testrun console logs: https://url.corp.redhat.com/1cdcbc4


Actual results:
ceph-mds crash with "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f8d4fb6db4b]"

Expected results:
Ceph should remain healthy without any crashes

Additional info:

Comment 3 Venky Shankar 2024-02-20 13:24:14 UTC

*** This bug has been marked as a duplicate of bug 2258950 ***