Bug 2259180 - [Ceph 6 clone] [GSS] mds crash: void MDLog::trim(int): assert(segments.size() >= pre_segments_size)
Summary: [Ceph 6 clone] [GSS] mds crash: void MDLog::trim(int): assert(segments.size(...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 6.1
Hardware: All
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 6.1z5
Assignee: Venky Shankar
QA Contact: Amarnath
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2258950 2259179 2267617
 
Reported: 2024-01-19 11:12 UTC by Mudit Agarwal
Modified: 2024-05-05 18:50 UTC
CC: 16 users

Fixed In Version: ceph-17.2.6-202.el9cp
Doc Type: No Doc Update
Doc Text:
Clone Of: 2258950
Environment:
Last Closed: 2024-04-01 10:19:55 UTC
Embargoed:




Links
System                               ID              Last Updated
Ceph Project Bug Tracker             59833           2024-01-22 10:12:31 UTC
Red Hat Issue Tracker                RHCEPH-8202     2024-01-19 11:18:02 UTC
Red Hat Knowledge Base (Solution)    7068654         2024-05-05 18:50:05 UTC
Red Hat Product Errata               RHBA-2024:1580  2024-04-01 10:20:05 UTC

Description Mudit Agarwal 2024-01-19 11:12:58 UTC
+++ This bug was initially created as a clone of Bug #2258950 +++

Description of problem (please be as detailed as possible and provide log
snippets):
---------------------------------------------------------------------
- The mds daemon is crashing with the following assert:
~~~
{
    "assert_condition": "segments.size() >= pre_segments_size",
    "assert_file": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc",
    "assert_func": "void MDLog::trim(int)",
    "assert_line": 651,
    "assert_msg": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: In function 'void MDLog::trim(int)' thread 7f814f2b7640 time 2024-01-16T05:59:33.686299+0000\n/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: 651: FAILED ceph_assert(segments.size() >= pre_segments_size)\n",
    "assert_thread_name": "safe_timer",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f8155956db0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f81559a354c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f8155fb2ae1]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x142c45) [0x7f8155fb2c45]",
        "(MDLog::trim(int)+0xb06) [0x558086dbb2a6]",
        "(MDSRankDispatcher::tick()+0x365) [0x558086b3dc65]",
        "ceph-mds(+0x11c71d) [0x558086b1071d]",
        "(CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x15e) [0x7f815609c4ae]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x22cda1) [0x7f815609cda1]",
        "/lib64/libc.so.6(+0x9f802) [0x7f81559a1802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f8155941450]"
    ],
    "ceph_version": "17.2.6-170.el9cp",
    "crash_id": "2024-01-16T05:59:33.687563Z_6f26298d-0162-4124-b2a7-06bbbc676df6",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "21cf82abf00a9a80ef194472005415a53e94d6965c4e910d756a9f711243f498",
    "timestamp": "2024-01-16T05:59:33.687563Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-69756fd5mdvcz",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.43.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Nov 23 09:44:01 EST 2023"
}
~~~

Version of all relevant components (if applicable):
--------------------------------------------------
- RHODF 4.14.3
- ceph version 17.2.6-170.el9cp / RHCS 6.1.z3 Async - 6.1.3 Async 

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
------------------------------------------------------------------------
N/A; as of now, the mds has crashed only once.

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------
N/A


Is this issue reproducible?
---------------------------
Customer specific.

Can this issue be reproduced from the UI?
-------------------------------------
N/A

Additional info:
---------------
- Upstream tracker: https://tracker.ceph.com/issues/59833

--- Additional comment from RHEL Program Management on 2024-01-18 07:48:26 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.15.0' has now been set to '?', proposing the bug to be fixed in the ODF 4.15.0 release. Note that the three acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have been reset, since acks must be set against a release flag.

--- Additional comment from Venky Shankar on 2024-01-18 10:20:59 UTC ---


This has been fixed in recent versions of ceph. See: https://tracker.ceph.com/issues/59833

--- Additional comment from Mudit Agarwal on 2024-01-19 08:54:31 UTC ---

Venky, which downstream version of ceph has this fix?

--- Additional comment from Venky Shankar on 2024-01-19 10:07:07 UTC ---

(In reply to Mudit Agarwal from comment #3)
> Venky, which downstream version of ceph has this fix?

The upstream backports are merged. The commits need to be ported downstream. Will push a MR for RHCS6/7.

Comment 5 Nagendra Reddy 2024-03-07 14:58:22 UTC
Issue reproduced with the versions below. Logs are available at http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/sosreports/nagendra/2259180/

odf:  4.15.0-155
ocp: 4.15.0-0.nightly-2024-03-04-052802

--------------------------
I observed an MDS crash during a node reboot.
 
Test case executed: tests/functional/workloads/ocp/registry/test_registry_reboot_node.py::TestRegistryRebootNode::test_registry_rolling_reboot_node[worker]
 
sh-5.1$ ceph crash ls
ID                                                                ENTITY                                   NEW
2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062  mds.ocs-storagecluster-cephfilesystem-a   *
sh-5.1$ ceph crash info 2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062
{
    "assert_condition": "segments.size() >= pre_segments_size",
    "assert_file": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc",
    "assert_func": "void MDLog::trim(int)",
    "assert_line": 651,
    "assert_msg": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: In function 'void MDLog::trim(int)' thread 7f97e7aec640 time 2024-03-07T12:11:09.750831+0000\n/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: 651: FAILED ceph_assert(segments.size() >= pre_segments_size)\n",
    "assert_thread_name": "safe_timer",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f97ee18bdb0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f97ee1d854c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f97ee7e7b4b]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x142caf) [0x7f97ee7e7caf]",
        "(MDLog::trim(int)+0xb06) [0x55797e08ef96]",
        "(MDSRankDispatcher::tick()+0x365) [0x55797de11515]",
        "ceph-mds(+0x11c9bd) [0x55797dde39bd]",
        "(CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x15e) [0x7f97ee8d149e]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x22cd91) [0x7f97ee8d1d91]",
        "/lib64/libc.so.6(+0x9f802) [0x7f97ee1d6802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f97ee176450]"
    ],
    "ceph_version": "17.2.6-196.el9cp",
    "crash_id": "2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "21cf82abf00a9a80ef194472005415a53e94d6965c4e910d756a9f711243f498",
    "timestamp": "2024-03-07T12:11:09.752163Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-575dbc6cvmd7v",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.55.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024"
}
sh-5.1$ date
Thu Mar  7 12:31:05 UTC 2024
sh-5.1$

17:56:08 - MainThread - ocs_ci.utility.retry - WARNING  - Ceph cluster health is not OK. Health: HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; 1 host (1 osds) down; 1 zone (1 osds) down; Degraded data redundancy: 3739976/11219928 objects degraded (33.333%), 113 pgs degraded, 113 pgs undersized; 1 daemons have recently crashed, Retrying in 30 seconds...
17:56:38 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod  -n openshift-storage --selector=app=rook-ceph-tools -o yaml
17:56:39 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod  -n openshift-storage --selector=app=rook-ceph-tools -o yaml
17:56:41 - MainThread - ocs_ci.ocs.resources.pod - INFO  - These are the ceph tool box pods: ['rook-ceph-tools-dbddf8896-sbvbv']
17:56:41 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod rook-ceph-tools-dbddf8896-sbvbv -n openshift-storage
17:56:42 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod  -n openshift-storage -o yaml
17:56:47 - MainThread - ocs_ci.ocs.resources.pod - INFO  - Pod name: rook-ceph-tools-dbddf8896-sbvbv
17:56:47 - MainThread - ocs_ci.ocs.resources.pod - INFO  - Pod status: Running
17:56:47 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage rsh rook-ceph-tools-dbddf8896-sbvbv ceph health
17:56:48 - MainThread - ocs_ci.utility.utils - INFO  - searching for plugin: _n
17:56:51 - MainThread - ocs_ci.utility.retry - WARNING  - Ceph cluster health is not OK. Health: HEALTH_WARN 1 daemons have recently crashed

Comment 12 Amarnath 2024-03-20 12:52:56 UTC
Hi All,

As per the comment https://bugzilla.redhat.com/show_bug.cgi?id=2259179#c6

We ran the upgrade suite from 5.3 (16.2.10-248.el8cp) to 6.1 (17.2.6-205.el9cp).

Logs : http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-95T6OH/

Regards,
Amarnath

Comment 14 errata-xmlrpc 2024-04-01 10:19:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:1580

