Bug 2305678 - Manager daemon crash observed with mgr module: "cephadm" during OSD removal
Summary: Manager daemon crash observed with mgr module: "cephadm" during OSD removal
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 8.0
Assignee: Adam King
QA Contact: Mohit Bisht
URL:
Whiteboard:
Depends On:
Blocks: 2305677 2317218
 
Reported: 2024-08-19 10:02 UTC by Pawan
Modified: 2024-11-25 09:06 UTC
CC List: 9 users

Fixed In Version: ceph-19.1.0-39
Doc Type: Bug Fix
Doc Text:
.The `original_weight` field is added as an attribute for the OSD removal queue
Previously, the cephadm OSD removal queue did not have a parameter for `original_weight`. As a result, the cephadm module would crash during OSD removal. With this fix, the `original_weight` field is added as an attribute of the OSD removal queue, and cephadm no longer crashes during OSD removal.
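
For illustration only, the shape of this fix can be sketched as follows (simplified, hypothetical names; not the actual patch shipped in ceph-19.1.0-39): the removal-queue entry class accepts, stores, and serializes the new field, so records persisted with 'original_weight' deserialize cleanly again.

# fix_shape_sketch.py - illustrative only, not cephadm source
import json
from typing import Optional

class OSD:
    def __init__(self, osd_id: int, draining: bool = False,
                 original_weight: Optional[float] = None):
        self.osd_id = osd_id
        self.draining = draining
        # Presumably kept so the previous CRUSH weight can be restored
        # if the removal/drain is cancelled.
        self.original_weight = original_weight

    def to_json(self) -> dict:
        return {
            "osd_id": self.osd_id,
            "draining": self.draining,
            "original_weight": self.original_weight,
        }

    @classmethod
    def from_json(cls, inp: dict) -> "OSD":
        return cls(**inp)

# Round-tripping now succeeds even when the stored record contains the new key.
stored = json.loads('{"osd_id": 12, "draining": true, "original_weight": 1.0}')
print(OSD.from_json(stored).to_json())
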
Clone Of:
Environment:
Last Closed: 2024-11-25 09:05:57 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-9459 0 None None None 2024-08-21 07:13:59 UTC
Red Hat Product Errata RHBA-2024:10216 0 None None None 2024-11-25 09:06:07 UTC

Description Pawan 2024-08-19 10:02:28 UTC
Description of problem:
Observed that the Manager daemon crashed on the cluster due to a crash in the cephadm mgr module. The crash occurred while host drain operations were in progress.

# ceph crash ls
ID                                                                ENTITY                  NEW
2024-08-19T08:56:54.306261Z_6efc09d5-8852-4555-abee-1ff774ec3759  mgr.depressa002.vqyavc   *

# ceph crash info 2024-08-19T08:56:54.306261Z_6efc09d5-8852-4555-abee-1ff774ec3759
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/cephadm/module.py\", line 662, in __init__\n    self.to_remove_osds.load_from_store()",
        "  File \"/usr/share/ceph/mgr/cephadm/services/osd.py\", line 924, in load_from_store\n    osd_obj = OSD.from_json(osd, rm_util=self.rm_util)",
        "  File \"/usr/share/ceph/mgr/cephadm/services/osd.py\", line 789, in from_json\n    return cls(**inp)",
        "TypeError: __init__() got an unexpected keyword argument 'original_weight'"
    ],
    "ceph_version": "19.1.0-22.el9cp",
    "crash_id": "2024-08-19T08:56:54.306261Z_6efc09d5-8852-4555-abee-1ff774ec3759",
    "entity_name": "mgr.depressa002.vqyavc",
    "mgr_module": "cephadm",
    "mgr_module_caller": "ActivePyModule::load",
    "mgr_python_exception": "TypeError",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.4 (Plow)",
    "os_version_id": "9.4",
    "process_name": "ceph-mgr",
    "stack_sig": "318554a153e6097db1902dc0c5178844b1f60826897cacac30764136243e78a4",
    "timestamp": "2024-08-19T08:56:54.306261Z",
    "utsname_hostname": "depressa002",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.31.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Fri Aug 9 14:06:03 EDT 2024"
}
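
For context, the traceback above matches a generic Python deserialization pitfall: the persisted removal-queue record carries an 'original_weight' key, but the constructor invoked through cls(**inp) does not accept it. A minimal standalone sketch of that failure mode (illustrative names only, not the actual cephadm code):

# failure_mode_sketch.py - standalone illustration, not cephadm source
import json

class OSD:
    # Trimmed-down stand-in for a removal-queue entry; note that __init__()
    # deliberately has no 'original_weight' parameter.
    def __init__(self, osd_id: int, draining: bool = False):
        self.osd_id = osd_id
        self.draining = draining

    @classmethod
    def from_json(cls, inp: dict) -> "OSD":
        return cls(**inp)  # fails if 'inp' carries an unknown key

# A record written by code that already persisted 'original_weight':
stored = json.loads('{"osd_id": 12, "draining": true, "original_weight": 1.0}')

try:
    OSD.from_json(stored)
except TypeError as e:
    # Prints the "unexpected keyword argument 'original_weight'" TypeError,
    # matching the backtrace in the crash info above.
    print(e)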

After the crash, none of the "orch" commands work.
# ceph orch ls
Error ENOENT: Module not found

# ceph orch host ls
Error ENOENT: Module not found

I tried stopping and restarting the cephadm module, but the issue was not resolved. After stopping and starting the module, the crash was seen again.

# ceph mgr module ls | more
MODULE
balancer              on (always on)
crash                 on (always on)
devicehealth          on (always on)
orchestrator          on (always on)
pg_autoscaler         on (always on)
progress              on (always on)
rbd_support           on (always on)
status                on (always on)
telemetry             on (always on)
volumes               on (always on)
cephadm               on
dashboard             on
iostat                on
nfs                   on
prometheus            on
restful               on
alerts                -
diskprediction_local  -
influx                -
insights              -
k8sevents             -
localpool             -
mds_autoscaler        -
mirroring             -
osd_perf_query        -
osd_support           -
rgw                   -
rook                  -
selftest              -
smb                   -
snap_schedule         -
stats                 -
telegraf              -
test_orchestrator     -
zabbix                -

# ceph mgr module disable cephadm
[root@depressa002 0b712884-5bbc-11ef-bd39-ac1f6b5628fe]# ceph mgr module enable cephadm

# ceph orch ls
Error ENOTSUP: Module 'orchestrator' is not enabled/loaded (required by command 'orch ls'): use `ceph mgr module enable orchestrator` to enable it

# ceph mgr module enable orchestrator
module 'orchestrator' is already enabled (always-on)

# ceph orch ls
Error ENOENT: Module not found

[root@depressa002 0b712884-5bbc-11ef-bd39-ac1f6b5628fe]# ceph -s
  cluster:
    id:     0b712884-5bbc-11ef-bd39-ac1f6b5628fe
    health: HEALTH_WARN
            12 osds down
            1 host (12 osds) down
            Degraded data redundancy: 370323/1883851 objects degraded (19.658%), 132 pgs degraded, 735 pgs undersized
            1 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum depressa002,depressa003,depressa007 (age 52m)
    mgr: depressa002.vqyavc(active, since 23s), standbys: depressa005.diszxt, depressa003.bkuwbu, depressa007.iefbva
    mds: 1/1 daemons up, 2 standby
    osd: 59 osds: 45 up (since 65m), 57 in (since 72m); 708 remapped pgs
    rgw: 4 daemons active (4 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 2353 pgs
    objects: 134.76k objects, 4.1 GiB
    usage:   2.5 TiB used, 63 TiB / 66 TiB avail
    pgs:     370323/1883851 objects degraded (19.658%)
             25358/1883851 objects misplaced (1.346%)
             980 active+clean
             638 active+clean+remapped
             603 active+undersized
             132 active+undersized+degraded




Version-Release number of selected component (if applicable):
# ceph version
ceph version 19.1.0-22.el9cp (e5b7dfedb7d8a66d166eb0f98361f71bdb7905ad) squid (rc)

How reproducible:
1/1

Steps to Reproduce:
1. Deploy an RHCS 8.0 cluster with user-created LVMs.

OSD spec used for deployment:
service_type: osd
service_id: osds
placement:
  label: osd
spec:
  data_devices:
    paths: ['/dev/data-vg/data-lv1', '/dev/data-vg/data-lv2', '/dev/data-vg/data-lv3', '/dev/data-vg/data-lv4', '/dev/data-vg/data-lv5', '/dev/data-vg/data-lv6', '/dev/data-vg/data-lv7', '/dev/data-vg/data-lv8', '/dev/data-vg/data-lv9', '/dev/data-vg/data-lv10', '/dev/data-vg/data-lv11', '/dev/data-vg/data-lv12']
  db_devices:
    paths: ['/dev/db-vg/db-lv1', '/dev/db-vg/db-lv2', '/dev/db-vg/db-lv3', '/dev/db-vg/db-lv4', '/dev/db-vg/db-lv5', '/dev/db-vg/db-lv6', '/dev/db-vg/db-lv7', '/dev/db-vg/db-lv8', '/dev/db-vg/db-lv9', '/dev/db-vg/db-lv10', '/dev/db-vg/db-lv11', '/dev/db-vg/db-lv12']

2. Bring down 2 OSDs on one host. Wait for the OSDs to be marked out and for the PGs to be drained from them.

3. Start host removal, beginning with the host drain operation:
ceph orch host drain depressa005 --zap-osd-devices

4. At some point during the drain, observe that the mgr daemon has crashed and that ceph orch commands no longer work.


Actual results:
The cephadm ceph-mgr module crashed, and no ceph orch commands work on the cluster.

Expected results:
No crashes

Additional info:

Comment 14 errata-xmlrpc 2024-11-25 09:05:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:10216

