+++ This bug was initially created as a clone of Bug #2258950 +++

Description of problem (please be detailed as possible and provide log snippets):
----------------------------------------------------------------------------------
- The MDS daemon is crashing with the following assertion failure:

~~~
{
    "assert_condition": "segments.size() >= pre_segments_size",
    "assert_file": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc",
    "assert_func": "void MDLog::trim(int)",
    "assert_line": 651,
    "assert_msg": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: In function 'void MDLog::trim(int)' thread 7f814f2b7640 time 2024-01-16T05:59:33.686299+0000\n/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: 651: FAILED ceph_assert(segments.size() >= pre_segments_size)\n",
    "assert_thread_name": "safe_timer",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f8155956db0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f81559a354c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f8155fb2ae1]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x142c45) [0x7f8155fb2c45]",
        "(MDLog::trim(int)+0xb06) [0x558086dbb2a6]",
        "(MDSRankDispatcher::tick()+0x365) [0x558086b3dc65]",
        "ceph-mds(+0x11c71d) [0x558086b1071d]",
        "(CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x15e) [0x7f815609c4ae]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x22cda1) [0x7f815609cda1]",
        "/lib64/libc.so.6(+0x9f802) [0x7f81559a1802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f8155941450]"
    ],
    "ceph_version": "17.2.6-170.el9cp",
    "crash_id": "2024-01-16T05:59:33.687563Z_6f26298d-0162-4124-b2a7-06bbbc676df6",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "21cf82abf00a9a80ef194472005415a53e94d6965c4e910d756a9f711243f498",
    "timestamp": "2024-01-16T05:59:33.687563Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-69756fd5mdvcz",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.43.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Nov 23 09:44:01 EST 2023"
}
~~~

Version of all relevant components (if applicable):
---------------------------------------------------
- RHODF 4.14.3
- ceph version 17.2.6-170.el9cp / RHCS 6.1.z3 Async - 6.1.3 Async

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
-----------------------------------------------------------------------------------------------------------------------------
N/A. As of now, the MDS has crashed only once.

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------
N/A

Is this issue reproducible?
---------------------------
Customer specific.

Can this issue be reproduced from the UI?
-----------------------------------------
N/A

Additional info:
----------------
- Upstream tracker: https://tracker.ceph.com/issues/59833

--- Additional comment from RHEL Program Management on 2024-01-18 07:48:26 UTC ---

This bug, having no release flag set previously, now has the release flag 'odf-4.15.0' set to '?', and so is proposed to be fixed in the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from Venky Shankar on 2024-01-18 10:20:59 UTC ---

This has been fixed in recent versions of ceph.
See: https://tracker.ceph.com/issues/59833

--- Additional comment from Mudit Agarwal on 2024-01-19 08:54:31 UTC ---

Venky, which downstream version of ceph has this fix?

--- Additional comment from Venky Shankar on 2024-01-19 10:07:07 UTC ---

(In reply to Mudit Agarwal from comment #3)
> Venky, which downstream version of ceph has this fix?

The upstream backports are merged. The commits need to be ported downstream; will push an MR for RHCS 6/7.
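For context on the backtrace above: the failed check compares the current number of journal segments against a count recorded earlier (`pre_segments_size`). The sketch below is a minimal illustration of that invariant, not the actual src/mds/MDLog.cc code; all names other than `segments`, `pre_segments_size`, and `trim()` are hypothetical. It only shows why the assert aborts the daemon if segments are removed out from under `trim()` after the baseline count was captured.

~~~
// Minimal sketch of the invariant behind the failed assertion.
// Illustrative only -- not the actual Ceph MDLog implementation; all names
// other than `segments`, `pre_segments_size`, and `trim` are hypothetical.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

struct LogSegment { /* journal segment contents elided */ };

class MDLogSketch {
public:
  std::map<uint64_t, LogSegment> segments;  // journal segments keyed by seq
  std::size_t pre_segments_size = 0;        // segment count captured earlier

  // Called at the point where the baseline count is recorded (hypothetical).
  void record_baseline() { pre_segments_size = segments.size(); }

  // Periodic trim, as driven by MDSRankDispatcher::tick() in the backtrace.
  void trim(std::size_t max_segments) {
    // If some other path erased segments after record_baseline() ran,
    // segments.size() can drop below pre_segments_size and this assert
    // aborts the daemon -- matching the ceph_assert in the crash dump.
    assert(segments.size() >= pre_segments_size);

    // Expire the oldest segments until we are back under the target,
    // but never below the recorded baseline.
    while (segments.size() > max_segments &&
           segments.size() > pre_segments_size) {
      segments.erase(segments.begin());
    }
  }
};
~~~

The upstream tracker linked above discusses the exact code path that breaks this invariant; the sketch is only meant to make the assert message easier to read.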
Issue reproduced with the below versions. Logs available at http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/sosreports/nagendra/2259180/

odf: 4.15.0-155
ocp: 4.15.0-0.nightly-2024-03-04-052802
--------------------------

I observed the MDS crash during a node reboot.

Test case executed:
tests/functional/workloads/ocp/registry/test_registry_reboot_node.py::TestRegistryRebootNode::test_registry_rolling_reboot_node[worker]

sh-5.1$ ceph crash ls
ID                                                                ENTITY                                   NEW
2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062  mds.ocs-storagecluster-cephfilesystem-a  *
sh-5.1$ ceph crash info 2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062
{
    "assert_condition": "segments.size() >= pre_segments_size",
    "assert_file": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc",
    "assert_func": "void MDLog::trim(int)",
    "assert_line": 651,
    "assert_msg": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: In function 'void MDLog::trim(int)' thread 7f97e7aec640 time 2024-03-07T12:11:09.750831+0000\n/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: 651: FAILED ceph_assert(segments.size() >= pre_segments_size)\n",
    "assert_thread_name": "safe_timer",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f97ee18bdb0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f97ee1d854c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f97ee7e7b4b]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x142caf) [0x7f97ee7e7caf]",
        "(MDLog::trim(int)+0xb06) [0x55797e08ef96]",
        "(MDSRankDispatcher::tick()+0x365) [0x55797de11515]",
        "ceph-mds(+0x11c9bd) [0x55797dde39bd]",
        "(CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x15e) [0x7f97ee8d149e]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x22cd91) [0x7f97ee8d1d91]",
        "/lib64/libc.so.6(+0x9f802) [0x7f97ee1d6802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f97ee176450]"
    ],
    "ceph_version": "17.2.6-196.el9cp",
    "crash_id": "2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "21cf82abf00a9a80ef194472005415a53e94d6965c4e910d756a9f711243f498",
    "timestamp": "2024-03-07T12:11:09.752163Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-575dbc6cvmd7v",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.55.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024"
}
sh-5.1$ date
Thu Mar 7 12:31:05 UTC 2024
sh-5.1$

17:56:08 - MainThread - ocs_ci.utility.retry - WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; 1 host (1 osds) down; 1 zone (1 osds) down; Degraded data redundancy: 3739976/11219928 objects degraded (33.333%), 113 pgs degraded, 113 pgs undersized; 1 daemons have recently crashed, Retrying in 30 seconds...
17:56:38 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod -n openshift-storage --selector=app=rook-ceph-tools -o yaml
17:56:39 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod -n openshift-storage --selector=app=rook-ceph-tools -o yaml
17:56:41 - MainThread - ocs_ci.ocs.resources.pod - INFO - These are the ceph tool box pods: ['rook-ceph-tools-dbddf8896-sbvbv']
17:56:41 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod rook-ceph-tools-dbddf8896-sbvbv -n openshift-storage
17:56:42 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod -n openshift-storage -o yaml
17:56:47 - MainThread - ocs_ci.ocs.resources.pod - INFO - Pod name: rook-ceph-tools-dbddf8896-sbvbv
17:56:47 - MainThread - ocs_ci.ocs.resources.pod - INFO - Pod status: Running
17:56:47 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage rsh rook-ceph-tools-dbddf8896-sbvbv ceph health
17:56:48 - MainThread - ocs_ci.utility.utils - INFO - searching for plugin: _n
17:56:51 - MainThread - ocs_ci.utility.retry - WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN 1 daemons have recently crashed
Hi All,

As per the comment https://bugzilla.redhat.com/show_bug.cgi?id=2259179#c6, we ran the upgrade suite from 5.3 (16.2.10-248.el8cp) --> 6.1 (17.2.6-205.el9cp).

Logs: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-95T6OH/

Regards,
Amarnath
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:1580