Bug 2259180
| Summary: | [Ceph 6 clone] [GSS] mds crash: void MDLog::trim(int): assert(segments.size() >= pre_segments_size) | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Mudit Agarwal <muagarwa> |
| Component: | CephFS | Assignee: | Venky Shankar <vshankar> |
| Status: | CLOSED ERRATA | QA Contact: | Amarnath <amk> |
| Severity: | medium | Docs Contact: | Akash Raj <akraj> |
| Priority: | unspecified | | |
| Version: | 6.1 | CC: | akraj, bkunal, bniver, ceph-eng-bugs, cephqe-warriors, ebenahar, gfarnum, gjose, hyelloji, mcaldeir, muagarwa, nagreddy, sostapov, tserlin, vereddy, vshankar |
| Target Milestone: | --- | | |
| Target Release: | 6.1z5 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-17.2.6-202.el9cp | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 2258950 | Environment: | |
| Last Closed: | 2024-04-01 10:19:55 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2258950, 2259179, 2267617 | | |
Description
Mudit Agarwal
2024-01-19 11:12:58 UTC
Issue reproduced with the below versions. Logs are available at http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/sosreports/nagendra/2259180/

odf: 4.15.0-155
ocp: 4.15.0-0.nightly-2024-03-04-052802

--------------------------

I observed an MDS crash during node reboot.

Test case executed: tests/functional/workloads/ocp/registry/test_registry_reboot_node.py::TestRegistryRebootNode::test_registry_rolling_reboot_node[worker]

```
sh-5.1$ ceph crash ls
ID                                                                ENTITY                                    NEW
2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062  mds.ocs-storagecluster-cephfilesystem-a   *

sh-5.1$ ceph crash info 2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062
{
    "assert_condition": "segments.size() >= pre_segments_size",
    "assert_file": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc",
    "assert_func": "void MDLog::trim(int)",
    "assert_line": 651,
    "assert_msg": "/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: In function 'void MDLog::trim(int)' thread 7f97e7aec640 time 2024-03-07T12:11:09.750831+0000\n/builddir/build/BUILD/ceph-17.2.6/src/mds/MDLog.cc: 651: FAILED ceph_assert(segments.size() >= pre_segments_size)\n",
    "assert_thread_name": "safe_timer",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f97ee18bdb0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f97ee1d854c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f97ee7e7b4b]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x142caf) [0x7f97ee7e7caf]",
        "(MDLog::trim(int)+0xb06) [0x55797e08ef96]",
        "(MDSRankDispatcher::tick()+0x365) [0x55797de11515]",
        "ceph-mds(+0x11c9bd) [0x55797dde39bd]",
        "(CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x15e) [0x7f97ee8d149e]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x22cd91) [0x7f97ee8d1d91]",
        "/lib64/libc.so.6(+0x9f802) [0x7f97ee1d6802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f97ee176450]"
    ],
    "ceph_version": "17.2.6-196.el9cp",
    "crash_id": "2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "21cf82abf00a9a80ef194472005415a53e94d6965c4e910d756a9f711243f498",
    "timestamp": "2024-03-07T12:11:09.752163Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-575dbc6cvmd7v",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.55.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024"
}

sh-5.1$ date
Thu Mar 7 12:31:05 UTC 2024
sh-5.1$
```
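(Aside, not part of the original report: once such a crash has been triaged, the "daemons have recently crashed" health warning can be cleared by archiving the crash report from the toolbox pod. A minimal sketch assuming the standard `ceph crash` CLI; the crash ID is the one shown above.)

```
# List crash reports that have not been acknowledged yet
ceph crash ls-new

# Dump the full metadata for the MDS crash seen above
ceph crash info 2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062

# After triage, archive the crash so it no longer raises
# the "daemons have recently crashed" HEALTH_WARN
ceph crash archive 2024-03-07T12:11:09.752163Z_b01a4e55-3d48-45aa-bf8b-f473e870b062

# Or acknowledge all pending crash reports at once
ceph crash archive-all
```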
```
17:56:08 - MainThread - ocs_ci.utility.retry - WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; 1 host (1 osds) down; 1 zone (1 osds) down; Degraded data redundancy: 3739976/11219928 objects degraded (33.333%), 113 pgs degraded, 113 pgs undersized; 1 daemons have recently crashed , Retrying in 30 seconds...
17:56:38 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod -n openshift-storage --selector=app=rook-ceph-tools -o yaml
17:56:39 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod -n openshift-storage --selector=app=rook-ceph-tools -o yaml
17:56:41 - MainThread - ocs_ci.ocs.resources.pod - INFO - These are the ceph tool box pods: ['rook-ceph-tools-dbddf8896-sbvbv']
17:56:41 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod rook-ceph-tools-dbddf8896-sbvbv -n openshift-storage
17:56:42 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /Users/nnagendravaraprasadreddy/cnv_bm/new2/auth/kubeconfig -n openshift-storage get Pod -n openshift-storage -o yaml
17:56:47 - MainThread - ocs_ci.ocs.resources.pod - INFO - Pod name: rook-ceph-tools-dbddf8896-sbvbv
17:56:47 - MainThread - ocs_ci.ocs.resources.pod - INFO - Pod status: Running
17:56:47 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage rsh rook-ceph-tools-dbddf8896-sbvbv ceph health
17:56:48 - MainThread - ocs_ci.utility.utils - INFO - searching for plugin: _n
17:56:51 - MainThread - ocs_ci.utility.retry - WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN 1 daemons have recently crashed
```

Hi All,

As per the comment https://bugzilla.redhat.com/show_bug.cgi?id=2259179#c6, we ran the upgrade suite from 5.3 (16.2.10-248.el8cp) --> 6.1 (17.2.6-205.el9cp).

Logs: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-95T6OH/

Regards,
Amarnath

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 Bug Fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:1580
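(Aside, not part of the original report: a rough sketch of the kind of post-upgrade check implied by the verification run above, assuming a toolbox shell on a cluster running the fixed build ceph-17.2.6-202.el9cp or later. These are standard Ceph CLI commands; adjust for the actual deployment.)

```
# Confirm every daemon reports the fixed build
ceph versions

# Confirm the filesystem is healthy, with active MDS ranks and standbys available
ceph fs status
ceph mds stat

# Confirm no new MDS crash reports appeared during the upgrade/reboot exercise
ceph crash ls-new
ceph health detail
```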