Bug 2259179 - [Ceph 7 clone] [CEE/SD][cephfs] mds crash: void MDLog::trim(int): assert(segments.size() >= pre_segments_size)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 6.1
Hardware: All
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 7.1
Assignee: Venky Shankar
QA Contact: Amarnath
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On: 2259180
Blocks: 2267614 2298578 2298579
 
Reported: 2024-01-19 11:12 UTC by Mudit Agarwal
Modified: 2024-07-18 07:59 UTC
CC: 14 users

Fixed In Version: ceph-18.2.1-1.el9cp
Doc Type: No Doc Update
Doc Text:
Clone Of: 2258950
Environment:
Last Closed: 2024-06-13 14:24:51 UTC
Embargoed:




Links
- Ceph Project Bug Tracker 59833 (last updated 2024-01-22 10:12:23 UTC)
- Red Hat Issue Tracker RHCEPH-8201 (last updated 2024-01-19 11:18:01 UTC)
- Red Hat Knowledge Base (Solution) 7068654 (last updated 2024-05-05 18:50:28 UTC)
- Red Hat Product Errata RHSA-2024:3925 (last updated 2024-06-13 14:24:57 UTC)

Description Mudit Agarwal 2024-01-19 11:12:14 UTC
+++ This bug was initially created as a clone of Bug #2258950 +++

Description of problem (please be as detailed as possible and provide log
snippets):
---------------------------------------------------------------------
- The mds daemon is crashing with the following assertion failure:
~~~
{
    "assert_condition": "segments.size() >= pre_segments_size",
    "assert_file": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc",
    "assert_func": "void MDLog::trim(int)",
    "assert_line": 651,
    "assert_msg": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: In function 'void MDLog::trim(int)' thread 7f814f2b7640 time 2024-01-16T05:59:33.686299+0000\n/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: 651: FAILED ceph_assert(segments.size() >= pre_segments_size)\n",
    "assert_thread_name": "safe_timer",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f8155956db0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f81559a354c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f8155fb2ae1]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x142c45) [0x7f8155fb2c45]",
        "(MDLog::trim(int)+0xb06) [0x558086dbb2a6]",
        "(MDSRankDispatcher::tick()+0x365) [0x558086b3dc65]",
        "ceph-mds(+0x11c71d) [0x558086b1071d]",
        "(CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x15e) [0x7f815609c4ae]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x22cda1) [0x7f815609cda1]",
        "/lib64/libc.so.6(+0x9f802) [0x7f81559a1802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f8155941450]"
    ],
    "ceph_version": "17.2.6-170.el9cp",
    "crash_id": "2024-01-16T05:59:33.687563Z_6f26298d-0162-4124-b2a7-06bbbc676df6",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "21cf82abf00a9a80ef194472005415a53e94d6965c4e910d756a9f711243f498",
    "timestamp": "2024-01-16T05:59:33.687563Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-69756fd5mdvcz",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.43.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Nov 23 09:44:01 EST 2023"
}
~~~
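
The report above is in the format produced by Ceph's crash-reporting ("crash") manager module. As a rough sketch, the snippet below shows how one might scan a cluster for further occurrences of this particular assert. It assumes the standard `ceph crash ls-new` / `ceph crash info` commands and the JSON field names shown above; the helper itself (find_mdlog_trim_crashes) is purely illustrative and not part of this bug.
~~~
#!/usr/bin/env python3
"""Illustrative sketch: flag unarchived crash reports that hit the
MDLog::trim assert (assumes `ceph` CLI access to the crash mgr module)."""
import json
import subprocess

TARGET_ASSERT = "segments.size() >= pre_segments_size"


def ceph_json(*args):
    # Run a ceph CLI command and parse its JSON output.
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out)


def find_mdlog_trim_crashes():
    matches = []
    # `ceph crash ls-new` lists crash reports that have not been archived yet.
    for entry in ceph_json("crash", "ls-new"):
        crash_id = entry["crash_id"]
        # `ceph crash info <id>` returns the full report, like the JSON above.
        info = ceph_json("crash", "info", crash_id)
        if TARGET_ASSERT in info.get("assert_condition", ""):
            matches.append((crash_id, info.get("entity_name"),
                            info.get("ceph_version")))
    return matches


if __name__ == "__main__":
    for crash_id, entity, version in find_mdlog_trim_crashes():
        print(f"{crash_id}  {entity}  {version}")
~~~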

Version of all relevant components (if applicable):
--------------------------------------------------
- RHODF 4.14.3
- ceph version 17.2.6-170.el9cp / RHCS 6.1.z3 Async - 6.1.3 Async 

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
------------------------------------------------------------------------
N/A. As of now, the mds has crashed only once.

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------
N/A


Is this issue reproducible?
---------------------------
Customer specific.

Can this issue reproduce from the UI?
-------------------------------------
N/A

Additional info:
---------------
- Upstream tracker: https://tracker.ceph.com/issues/59833

--- Additional comment from RHEL Program Management on 2024-01-18 07:48:26 UTC ---

This bug previously had no release flag set. The release flag 'odf-4.15.0' has now been set to '?', so the bug is proposed to be fixed in the ODF 4.15.0 release. Note that the three acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have been reset, since acks must be set against a release flag.

--- Additional comment from Venky Shankar on 2024-01-18 10:20:59 UTC ---


This has been fixed in recent versions of ceph. See: https://tracker.ceph.com/issues/59833

--- Additional comment from Mudit Agarwal on 2024-01-19 08:54:31 UTC ---

Venky, which downstream version of ceph has this fix?

--- Additional comment from Venky Shankar on 2024-01-19 10:07:07 UTC ---

(In reply to Mudit Agarwal from comment #3)
> Venky, which downstream version of ceph has this fix?

The upstream backports are merged. The commits need to be ported downstream; I will push an MR for RHCS 6/7.
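
For reference, the downstream build carrying the fix is ceph-18.2.1-1.el9cp (RHCS 7.1; see "Fixed In Version" above). Below is a rough sketch of how one might confirm that the mds daemons in a cluster are at or above that upstream base version, using the standard `ceph versions` command; the version parsing and comparison are illustrative assumptions, not part of the fix itself.
~~~
#!/usr/bin/env python3
"""Illustrative check: are all mds daemons at or above the fixed base version?

Assumes the usual `ceph versions` JSON output, whose keys are version banner
strings such as "ceph version 18.2.1-... reef (stable)".
"""
import json
import re
import subprocess

# Upstream base of the downstream build carrying the fix (ceph-18.2.1-1.el9cp).
FIXED_BASE = (18, 2, 1)


def daemon_versions():
    out = subprocess.check_output(["ceph", "versions", "--format", "json"])
    # Top-level keys are daemon types ("mon", "mds", ...) plus "overall";
    # values map a version banner string to a daemon count.
    return json.loads(out)


def parse_base(banner):
    # Extract "X.Y.Z" from a banner like "ceph version 18.2.1-194.el9cp ...".
    m = re.search(r"ceph version (\d+)\.(\d+)\.(\d+)", banner)
    return tuple(int(x) for x in m.groups()) if m else None


if __name__ == "__main__":
    for banner, count in daemon_versions().get("mds", {}).items():
        base = parse_base(banner)
        status = "ok" if base and base >= FIXED_BASE else "NEEDS UPDATE"
        print(f"{count} mds daemon(s) on {banner!r}: {status}")
~~~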

Comment 5 Amarnath 2024-02-20 05:24:27 UTC
Hi Venky, 

Could you please assist me with the steps required to reproduce this particular issue?

Regards,
Amarnath

Comment 6 Venky Shankar 2024-02-26 07:32:08 UTC
(In reply to Amarnath from comment #5)
> Hi Venky, 
> 
> Could you please assist me with the steps required to reproduce this
> particular issue?

Please run this through sanity.

Comment 7 Amarnath 2024-02-26 07:37:43 UTC
Hi,

As suggested, we ran sanity, regression, and weekly runs on builds newer than the one containing the fix.

Sanity :http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.1-14/Sanity/217/tier-0_fs/

Regression : http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.1-11/Regression/cephfs/53/

Weekly : http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.1-11/Weekly/cephfs/27/

We haven't seen anything unusual apart from the regular failures we normally observe.


Regards,
Amarnath

Comment 11 errata-xmlrpc 2024-06-13 14:24:51 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925

