Description of problem:
====================
MDS daemons crash after creating a CephFS volume
--------------------
# ceph -s
  cluster:
    id:     1c537092-72b2-11ee-899e-fa163eb88291
    health: HEALTH_WARN
            2 failed cephadm daemon(s)
            insufficient standby MDS daemons available
            10 daemons have recently crashed

  services:
    mon: 3 daemons, quorum ceph-mani-o7fdxp-node1-installer,ceph-mani-o7fdxp-node3,ceph-mani-o7fdxp-node2 (age 20m)
    mgr: ceph-mani-o7fdxp-node1-installer.zidoxy(active, since 84m), standbys: ceph-mani-o7fdxp-node3.epdzad
    mds: 1/1 daemons up
    osd: 18 osds: 18 up (since 20m), 18 in (since 21m)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 49 pgs
    objects: 24 objects, 451 KiB
    usage:   494 MiB used, 269 GiB / 270 GiB avail
    pgs:     49 active+clean
-----------------
# ceph health detail
HEALTH_WARN 2 failed cephadm daemon(s); insufficient standby MDS daemons available; 10 daemons have recently crashed
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha on ceph-mani-o7fdxp-node1-installer is in error state
    daemon mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw on ceph-mani-o7fdxp-node3 is in error state
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
[WRN] RECENT_CRASH: 10 daemons have recently crashed
    mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha crashed on host ceph-mani-o7fdxp-node1-installer at 2023-10-24T22:19:14.007629Z
    mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha crashed on host ceph-mani-o7fdxp-node1-installer at 2023-10-24T22:19:39.938760Z
    mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha crashed on host ceph-mani-o7fdxp-node1-installer at 2023-10-24T22:19:55.901449Z
    mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha crashed on host ceph-mani-o7fdxp-node1-installer at 2023-10-24T22:20:11.644713Z
    mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha crashed on host ceph-mani-o7fdxp-node1-installer at 2023-10-24T22:20:27.396603Z
    mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw crashed on host ceph-mani-o7fdxp-node3 at 2023-10-24T22:19:21.148486Z
    mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw crashed on host ceph-mani-o7fdxp-node3 at 2023-10-24T22:20:47.368772Z
    mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw crashed on host ceph-mani-o7fdxp-node3 at 2023-10-24T22:21:03.613015Z
    mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw crashed on host ceph-mani-o7fdxp-node3 at 2023-10-24T22:21:19.828147Z
    mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw crashed on host ceph-mani-o7fdxp-node3 at 2023-10-24T22:21:36.090236Z
----------------
# ceph crash ls
ID                                                                ENTITY                                                NEW
2023-10-24T22:19:14.007629Z_7939c431-6264-4fe8-b1a3-90020172da3b  mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha  *
2023-10-24T22:19:21.148486Z_a049e9d6-20ac-4f97-aba7-c5b54dd7690c  mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw            *
2023-10-24T22:19:39.938760Z_9d3f892d-5eec-4020-b9af-a22f85d38c55  mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha  *
2023-10-24T22:19:55.901449Z_68e82f22-3d6e-44de-9bfa-6327571baa9f  mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha  *
2023-10-24T22:20:11.644713Z_63851c60-2b91-4f18-ac4b-67f57304c693  mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha  *
2023-10-24T22:20:27.396603Z_c18abc01-d54e-49c0-8269-a0f4a0557d05  mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha  *
2023-10-24T22:20:47.368772Z_daf6ef2c-7dd2-4317-a869-3be915537f5d  mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw            *
2023-10-24T22:21:03.613015Z_b1610818-b2ce-44ea-b1c4-c313ec28108d  mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw            *
2023-10-24T22:21:19.828147Z_96e1a843-7879-4835-b649-a42ebdcb0a76  mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw            *
2023-10-24T22:21:36.090236Z_b823b211-c44c-45fe-8f28-a0654c8088dd  mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw            *
---------
# ceph crash info 2023-10-24T22:20:27.396603Z_c18abc01-d54e-49c0-8269-a0f4a0557d05
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7fcc29b9fdf0]",
        "/lib64/libc.so.6(+0xa154c) [0x7fcc29bec54c]",
        "raise()",
        "abort()",
        "/lib64/libstdc++.so.6(+0xa1a01) [0x7fcc29eeda01]",
        "/lib64/libstdc++.so.6(+0xad37c) [0x7fcc29ef937c]",
        "/lib64/libstdc++.so.6(+0xad3e7) [0x7fcc29ef93e7]",
        "/lib64/libstdc++.so.6(+0xad649) [0x7fcc29ef9649]",
        "/usr/bin/ceph-mds(+0xc2c4f) [0x565170eeac4f]",
        "/usr/bin/ceph-mds(+0xc2c73) [0x565170eeac73]",
        "/usr/bin/ceph-mds(+0xc403f) [0x565170eec03f]",
        "(Server::find_idle_sessions()+0x1bb) [0x565170fc7c0b]",
        "(MDSRankDispatcher::tick()+0x25e) [0x565170f7690e]",
        "/usr/bin/ceph-mds(+0x1200fd) [0x565170f480fd]",
        "(CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x15e) [0x7fcc2a30bb7e]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x253441) [0x7fcc2a30c441]",
        "/lib64/libc.so.6(+0x9f802) [0x7fcc29bea802]",
        "/lib64/libc.so.6(+0x3f450) [0x7fcc29b8a450]"
    ],
    "ceph_version": "18.2.0-98.el9cp",
    "crash_id": "2023-10-24T22:20:27.396603Z_c18abc01-d54e-49c0-8269-a0f4a0557d05",
    "entity_name": "mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-mds",
    "stack_sig": "c6412d54539859c8fc7f05a3dc8e6e1bf6c32b0bdce9b27d39107a402f4a617a",
    "timestamp": "2023-10-24T22:20:27.396603Z",
    "utsname_hostname": "ceph-mani-o7fdxp-node1-installer",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.30.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Fri Aug 25 09:13:12 EDT 2023"
}
-------
# ceph orch ls
NAME                       PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager               ?:9093,9094      1/1  7m ago     88m  count:1
ceph-exporter                               3/3  7m ago     88m  *
crash                                       3/3  7m ago     88m  *
grafana                    ?:3000           1/1  7m ago     88m  count:1
mds.cephfs01                                0/2  7m ago     23m  count:2
mgr                                         2/2  7m ago     88m  count:2
mon                                         3/5  7m ago     88m  count:5
node-exporter              ?:9100           3/3  7m ago     88m  *
osd.all-available-devices                    18  7m ago     25m  *
prometheus                 ?:9095           1/1  7m ago     88m  count:1
--------
# ceph orch ps | grep mds
mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha  ceph-mani-o7fdxp-node1-installer  error  7m ago  23m  -  -  <unknown>  <unknown>  <unknown>
mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw            ceph-mani-o7fdxp-node3            error  5m ago  23m  -  -  <unknown>  <unknown>  <unknown>

Version-Release number of selected component (if applicable):
=======================
# ceph --version
ceph version 18.2.0-98.el9cp (3f7cc32d87ca3dcf8fd0ace7c3d6c63dd734c195) reef (stable)

How reproducible:
==============
2/2

Steps to Reproduce:
=============
1. Create a Ceph cluster
# ceph -s
  cluster:
    id:     1c537092-72b2-11ee-899e-fa163eb88291
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-mani-o7fdxp-node1-installer,ceph-mani-o7fdxp-node3,ceph-mani-o7fdxp-node2 (age 2s)
    mgr: ceph-mani-o7fdxp-node1-installer.zidoxy(active, since 64m), standbys: ceph-mani-o7fdxp-node3.epdzad
    osd: 18 osds: 18 up (since 2s), 18 in (since 46s)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   1.2 GiB used, 254 GiB / 255 GiB avail
    pgs:     1 active+clean

  io:
    recovery: 59 KiB/s, 0 objects/s
2. Create a CephFS volume
# ceph fs volume create cephfs01
# ceph fs volume ls
[
    {
        "name": "cephfs01"
    }
]

Actual results:
===========
The MDS daemons crash repeatedly and end up in the error state.

Expected results:
===========
No MDS crashes after creating the volume.

Additional info:
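If more detail is needed while reproducing, one option is to raise MDS debug logging before step 2 and then pull the crash metadata afterwards. This is only a sketch; the debug levels below are the usual verbose settings and can be tuned down, and the crash ID placeholder has to be taken from the actual "ceph crash ls" output:
# ceph config set mds debug_mds 20
# ceph config set mds debug_ms 1
  (reproduce: ceph fs volume create cephfs01, then collect)
# ceph crash ls
# ceph crash info <crash_id>
  (revert the verbose logging once done)
# ceph config rm mds debug_mds
# ceph config rm mds debug_ms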
We are hitting this issue frequently in our automation runs, including NFS testing: the crash occurs when creating a CephFS volume for NFS exports (roughly the sequence sketched below), so it is a critical test blocker for NFS.
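For context, the NFS flow looks roughly like this; it is a sketch, and the cluster ID and pseudo-path below are placeholders rather than the exact names used in the automation run:
# ceph fs volume create cephfs01          (MDS daemons crash at this step)
# ceph nfs cluster create nfs-c01
# ceph nfs export create cephfs --cluster-id nfs-c01 --pseudo-path /export1 --fsname cephfs01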
@pdonnell, @gfarnum: we need help with a fix or a workaround for https://bugzilla.redhat.com/show_bug.cgi?id=2246002. It is a test blocker for both NFS and CephFS testing.
Thanks to Amarnath, we have figured out how to install gdb and are working on this again.
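For anyone retracing the analysis, the rough flow is sketched below. It assumes the core was caught by systemd-coredump on the host and that debuginfo repositories are enabled; with a containerized cephadm deployment the gdb/debuginfo steps may need to run inside a matching container instead:
# coredumpctl list ceph-mds
# coredumpctl dump <PID> -o /tmp/ceph-mds.core
# dnf install -y gdb
# dnf debuginfo-install -y ceph-mds
# gdb /usr/bin/ceph-mds /tmp/ceph-mds.core -ex 'thread apply all bt' -ex quit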
This was caused by an incorrect conflict resolution tied to https://bugzilla.redhat.com/show_bug.cgi?id=2238663, so this report is resolved and the underlying problem is being tracked in that bug.