Description of problem:

The Manager daemon crashed on the cluster because the cephadm module crashed. The crash was observed while host drain operations were ongoing.

# ceph crash ls
ID                                                                 ENTITY                   NEW
2024-08-19T08:56:54.306261Z_6efc09d5-8852-4555-abee-1ff774ec3759   mgr.depressa002.vqyavc    *

# ceph crash info 2024-08-19T08:56:54.306261Z_6efc09d5-8852-4555-abee-1ff774ec3759
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/cephadm/module.py\", line 662, in __init__\n    self.to_remove_osds.load_from_store()",
        "  File \"/usr/share/ceph/mgr/cephadm/services/osd.py\", line 924, in load_from_store\n    osd_obj = OSD.from_json(osd, rm_util=self.rm_util)",
        "  File \"/usr/share/ceph/mgr/cephadm/services/osd.py\", line 789, in from_json\n    return cls(**inp)",
        "TypeError: __init__() got an unexpected keyword argument 'original_weight'"
    ],
    "ceph_version": "19.1.0-22.el9cp",
    "crash_id": "2024-08-19T08:56:54.306261Z_6efc09d5-8852-4555-abee-1ff774ec3759",
    "entity_name": "mgr.depressa002.vqyavc",
    "mgr_module": "cephadm",
    "mgr_module_caller": "ActivePyModule::load",
    "mgr_python_exception": "TypeError",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.4 (Plow)",
    "os_version_id": "9.4",
    "process_name": "ceph-mgr",
    "stack_sig": "318554a153e6097db1902dc0c5178844b1f60826897cacac30764136243e78a4",
    "timestamp": "2024-08-19T08:56:54.306261Z",
    "utsname_hostname": "depressa002",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.31.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Fri Aug 9 14:06:03 EDT 2024"
}

After the crash, none of the "ceph orch" commands work:

# ceph orch ls
Error ENOENT: Module not found
# ceph orch host ls
Error ENOENT: Module not found

I tried stopping and restarting the cephadm module, but the issue was not resolved. After stopping and starting the module, the crash was seen again.
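For illustration only, the failure in the backtrace boils down to the minimal Python sketch below. The OSD class here is a simplified stand-in, not the real cephadm class: a removal-queue entry that was persisted with an extra 'original_weight' field cannot be passed back into an __init__ that does not accept that keyword.

# Simplified stand-in for cephadm's OSD removal-queue entry (illustration only).
class OSD:
    def __init__(self, osd_id, draining=False):   # no 'original_weight' parameter
        self.osd_id = osd_id
        self.draining = draining

    @classmethod
    def from_json(cls, inp):
        # Same pattern as in the backtrace: every persisted key is passed
        # straight to __init__, so a single unknown key is fatal.
        return cls(**inp)

# An entry persisted by code that also recorded 'original_weight':
stored = {"osd_id": 12, "draining": True, "original_weight": 1.0}

try:
    OSD.from_json(stored)
except TypeError as exc:
    print(exc)   # unexpected keyword argument 'original_weight'

Because the exception is raised inside the module's __init__ while the removal queue is loaded from the mgr store, the cephadm module fails to load at all, which is consistent with every "ceph orch" command returning "Module not found" below.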
# ceph mgr module ls | more
MODULE
balancer              on (always on)
crash                 on (always on)
devicehealth          on (always on)
orchestrator          on (always on)
pg_autoscaler         on (always on)
progress              on (always on)
rbd_support           on (always on)
status                on (always on)
telemetry             on (always on)
volumes               on (always on)
cephadm               on
dashboard             on
iostat                on
nfs                   on
prometheus            on
restful               on
alerts                -
diskprediction_local  -
influx                -
insights              -
k8sevents             -
localpool             -
mds_autoscaler        -
mirroring             -
osd_perf_query        -
osd_support           -
rgw                   -
rook                  -
selftest              -
smb                   -
snap_schedule         -
stats                 -
telegraf              -
test_orchestrator     -
zabbix                -

# ceph mgr module disable cephadm
[root@depressa002 0b712884-5bbc-11ef-bd39-ac1f6b5628fe]# ceph mgr module enable cephadm
# ceph orch ls
Error ENOTSUP: Module 'orchestrator' is not enabled/loaded (required by command 'orch ls'): use `ceph mgr module enable orchestrator` to enable it
# ceph mgr module enable orchestrator
module 'orchestrator' is already enabled (always-on)
# ceph orch ls
Error ENOENT: Module not found

[root@depressa002 0b712884-5bbc-11ef-bd39-ac1f6b5628fe]# ceph -s
  cluster:
    id:     0b712884-5bbc-11ef-bd39-ac1f6b5628fe
    health: HEALTH_WARN
            12 osds down
            1 host (12 osds) down
            Degraded data redundancy: 370323/1883851 objects degraded (19.658%), 132 pgs degraded, 735 pgs undersized
            1 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum depressa002,depressa003,depressa007 (age 52m)
    mgr: depressa002.vqyavc(active, since 23s), standbys: depressa005.diszxt, depressa003.bkuwbu, depressa007.iefbva
    mds: 1/1 daemons up, 2 standby
    osd: 59 osds: 45 up (since 65m), 57 in (since 72m); 708 remapped pgs
    rgw: 4 daemons active (4 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 2353 pgs
    objects: 134.76k objects, 4.1 GiB
    usage:   2.5 TiB used, 63 TiB / 66 TiB avail
    pgs:     370323/1883851 objects degraded (19.658%)
             25358/1883851 objects misplaced (1.346%)
             980 active+clean
             638 active+clean+remapped
             603 active+undersized
             132 active+undersized+degraded

Version-Release number of selected component (if applicable):

# ceph version
ceph version 19.1.0-22.el9cp (e5b7dfedb7d8a66d166eb0f98361f71bdb7905ad) squid (rc)

How reproducible:
1/1

Steps to Reproduce:
1. Deploy a RHCS 8.0 cluster with user-created LVMs. OSD spec used for deployment:

service_type: osd
service_id: osds
placement:
  label: osd
spec:
  data_devices:
    paths: ['/dev/data-vg/data-lv1', '/dev/data-vg/data-lv2', '/dev/data-vg/data-lv3', '/dev/data-vg/data-lv4', '/dev/data-vg/data-lv5', '/dev/data-vg/data-lv6', '/dev/data-vg/data-lv7', '/dev/data-vg/data-lv8', '/dev/data-vg/data-lv9', '/dev/data-vg/data-lv10', '/dev/data-vg/data-lv11', '/dev/data-vg/data-lv12']
  db_devices:
    paths: ['/dev/db-vg/db-lv1', '/dev/db-vg/db-lv2', '/dev/db-vg/db-lv3', '/dev/db-vg/db-lv4', '/dev/db-vg/db-lv5', '/dev/db-vg/db-lv6', '/dev/db-vg/db-lv7', '/dev/db-vg/db-lv8', '/dev/db-vg/db-lv9', '/dev/db-vg/db-lv10', '/dev/db-vg/db-lv11', '/dev/db-vg/db-lv12']

2. Bring down 2 OSDs from 1 host. Wait for the OSDs to be marked out and for their PGs to be drained.
3. Start host removal, beginning with the host drain operation:
   ceph orch host drain depressa005 --zap-osd-devices
4. Sometime during the drain, observe that the mgr daemon has crashed and ceph orch commands are not working.

Actual results:
The cephadm ceph-mgr module crashed, and no ceph orch commands work on the cluster.

Expected results:
No crashes

Additional info:
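For illustration, one defensive pattern that would let from_json tolerate persisted fields the running build does not recognize (such as 'original_weight') is sketched below, again using a simplified stand-in class. This is an assumption-level sketch of a possible mitigation, not necessarily the change delivered in the errata.

import inspect

class OSD:
    """Simplified stand-in for cephadm's OSD removal-queue entry (illustration only)."""
    def __init__(self, osd_id, draining=False):
        self.osd_id = osd_id
        self.draining = draining

    @classmethod
    def from_json(cls, inp):
        # Keep only the keys this build's __init__ actually accepts, so
        # entries written by a newer or older build still deserialize.
        accepted = set(inspect.signature(cls.__init__).parameters) - {"self"}
        unknown = set(inp) - accepted
        if unknown:
            # The real mgr module would log this rather than print it.
            print("ignoring unknown OSD fields:", sorted(unknown))
        return cls(**{k: v for k, v in inp.items() if k in accepted})

stored = {"osd_id": 12, "draining": True, "original_weight": 1.0}
osd = OSD.from_json(stored)   # loads despite the extra 'original_weight' key

Filtering against the constructor signature avoids hard-coding field names, so stored entries remain loadable across builds that add or drop optional fields.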
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:10216