Bug 2259180
| Summary: | [Ceph 6 clone] [GSS] mds crash: void MDLog::trim(int): assert(segments.size() >= pre_segments_size) | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Mudit Agarwal <muagarwa> |
| Component: | CephFS | Assignee: | Venky Shankar <vshankar> |
| Status: | CLOSED ERRATA | QA Contact: | Amarnath <amk> |
| Severity: | medium | Docs Contact: | Akash Raj <akraj> |
| Priority: | unspecified | | |
| Version: | 6.1 | CC: | akraj, bkunal, bniver, ceph-eng-bugs, cephqe-warriors, ebenahar, gfarnum, gjose, hyelloji, mcaldeir, muagarwa, nagreddy, sostapov, tserlin, vereddy, vshankar |
| Target Milestone: | --- | | |
| Target Release: | 6.1z5 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-17.2.6-202.el9cp | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 2258950 | Environment: | |
| Last Closed: | 2024-04-01 10:19:55 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2258950, 2259179, 2267617 | | |
Description
Mudit Agarwal
2024-01-19 11:12:58 UTC
Issue reproduced with the below versions. Logs are available at http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/sosreports/nagendra/2259180/

odf: 4.15.0-155
ocp: 4.15.0-0.nightly-2024-03-04-052802

--------------------------

I observed an MDS crash during node reboot.

Test case executed: tests/functional/workloads/ocp/registry/test_registry_reboot_node.py::TestRegistryRebootNode::test_registry_rolling_reboot_node[worker]

```
sh-5.1$ ceph crash ls
ID                                                                ENTITY                                    NEW
2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062  mds.ocs-storagecluster-cephfilesystem-a   *

sh-5.1$ ceph crash info 2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062
{
    "assert_condition": "segments.size() >= pre_segments_size",
    "assert_file": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc",
    "assert_func": "void MDLog::trim(int)",
    "assert_line": 651,
    "assert_msg": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: In function 'void MDLog::trim(int)' thread 7f97e7aec640 time 2024-03-07T12:11:09.750831+0000\n/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: 651: FAILED ceph_assert(segments.size() >= pre_segments_size)\n",
    "assert_thread_name": "safe_timer",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f97ee18bdb0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f97ee1d854c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f97ee7e7b4b]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x142caf) [0x7f97ee7e7caf]",
        "(MDLog::trim(int)+0xb06) [0x55797e08ef96]",
        "(MDSRankDispatcher::tick()+0x365) [0x55797de11515]",
        "ceph-mds(+0x11c9bd) [0x55797dde39bd]",
        "(CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x15e) [0x7f97ee8d149e]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x22cd91) [0x7f97ee8d1d91]",
        "/lib64/libc.so.6(+0x9f802) [0x7f97ee1d6802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f97ee176450]"
    ],
    "ceph_version": "17.2.6-196.el9cp",
    "crash_id": "2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "21cf82abf00a9a80ef194472005415a53e94d6965c4e910d756a9f711243f498",
    "timestamp": "2024-03-07T12:11:09.752163Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-575dbc6cvmd7v",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.55.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024"
}

sh-5.1$ date
Thu Mar 7 12:31:05 UTC 2024
sh-5.1$
```
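(Aside, not part of the original report: once such a crash has been triaged, the "daemons have recently crashed" health warning can be cleared by archiving the crash report from the toolbox pod. A minimal sketch assuming the standard `ceph crash` CLI; the crash ID is the one shown above.)

```
# List crash reports that have not been acknowledged yet
ceph crash ls-new

# Dump the full metadata for the MDS crash seen above
ceph crash info 2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062

# After triage, archive the crash so it no longer raises
# the "daemons have recently crashed" HEALTH_WARN
ceph crash archive 2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062

# Or acknowledge all pending crash reports at once
ceph crash archive-all
```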
```
17:56:08 - MainThread - ocs_ci.utility.retry - WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; 1 host (1 osds) down; 1 zone (1 osds) down; Degraded data redundancy: 3739976/11219928 objects degraded (33.333%), 113 pgs degraded, 113 pgs undersized; 1 daemons have recently crashed , Retrying in 30 seconds...
17:56:38 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod -n openshift-storage --selector=app=rook-ceph-tools -o yaml
17:56:39 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod -n openshift-storage --selector=app=rook-ceph-tools -o yaml
17:56:41 - MainThread - ocs_ci.ocs.resources.pod - INFO - These are the ceph tool box pods: ['rook-ceph-tools-dbddf8896-sbvbv']
17:56:41 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod rook-ceph-tools-dbddf8896-sbvbv -n openshift-storage
17:56:42 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod -n openshift-storage -o yaml
17:56:47 - MainThread - ocs_ci.ocs.resources.pod - INFO - Pod name: rook-ceph-tools-dbddf8896-sbvbv
17:56:47 - MainThread - ocs_ci.ocs.resources.pod - INFO - Pod status: Running
17:56:47 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage rsh rook-ceph-tools-dbddf8896-sbvbv ceph health
17:56:48 - MainThread - ocs_ci.utility.utils - INFO - searching for plugin: _n
17:56:51 - MainThread - ocs_ci.utility.retry - WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN 1 daemons have recently crashed
```

Hi All,

As per the comment https://bugzilla.redhat.com/show_bug.cgi?id=2259179#c6, we ran the upgrade suite from 5.3 (16.2.10-248.el8cp) --> 6.1 (17.2.6-205.el9cp).

Logs: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-95T6OH/

Regards,
Amarnath

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 Bug Fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:1580
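(Aside, not part of the original report: a rough sketch of the kind of post-upgrade check implied by the verification run above, assuming a toolbox shell on a cluster running the fixed build ceph-17.2.6-202.el9cp or later. These are standard Ceph CLI commands; adjust for the actual deployment.)

```
# Confirm every daemon reports the fixed build
ceph versions

# Confirm the filesystem is healthy, with active MDS ranks and standbys available
ceph fs status
ceph mds stat

# Confirm no new MDS crash reports appeared during the upgrade/reboot exercise
ceph crash ls-new
ceph health detail
```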