Description of problem:
======================
- CEPHADM_STRAY_DAEMON warning while replacing an OSD

Version-Release number of selected component (if applicable):
============================================================
- RHCS 6.1z2 / 17.2.6-148.el9cp

How reproducible:
================
- During the OSD disk replacement activity ("ceph orch osd rm ${OSD} --zap --replace"), the cluster reports "HEALTH_WARN" due to "CEPHADM_STRAY_DAEMON". While this warning is active, other important health warnings can be missed.

Steps to Reproduce:
==================
1. On an RHCS 6 cluster, remove an OSD with the --replace option.
2. Check the ceph health status.

Actual results:
==============
- "HEALTH_WARN" is reported due to "CEPHADM_STRAY_DAEMON".

Expected results:
================
- During the disk replacement/OSD re-deployment, this warning should not appear.
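A minimal reproduction sketch based on the steps above; the OSD id (2) is a placeholder and must match an OSD that exists on the cluster under test, and the health output will vary with the cluster state:

```
# Reproduction sketch (OSD id 2 is a placeholder; run from a node with the admin keyring)
ceph orch osd rm 2 --zap --replace   # schedule the OSD for removal/replacement
ceph orch osd rm status              # wait until the removal completes
ceph osd tree                        # the OSD should now show as "destroyed"
ceph health detail                   # "1 stray daemon(s) not managed by cephadm" is reported here
```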
Additional Info
===============

### Workaround
- Use the mute functionality to ignore the CEPHADM_STRAY_DAEMON warning (a condensed sketch of the full sequence follows the lab steps below).

### Workaround from test lab:

1. Test cluster:
```
[ceph: root@rhcs6node1 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME              STATUS  REWEIGHT  PRI-AFF
-1         0.07794  root default
-9         0.01949      host rhcs6client
 3    hdd  0.01949          osd.3              up   1.00000  1.00000
-3         0.01949      host rhcs6node1
 0    hdd  0.01949          osd.0              up   1.00000  1.00000
-5         0.01949      host rhcs6node2
 1    hdd  0.01949          osd.1              up   1.00000  1.00000
-7         0.01949      host rhcs6node3
 2    hdd  0.01949          osd.2              up   1.00000  1.00000
[ceph: root@rhcs6node1 /]#
```

2. To simulate a disk error, the disk was removed at the SCSI layer:
```
[root@rhcs6node3 ~]# lvs -ao+devices | grep ceph
  osd-block-6e6869db-0bd8-4b5b-8409-21faf8b95900 ceph-8df43ddf-6a82-4fc2-a51f-59818837f2b6 -wi-ao---- <20.00g /dev/sda(0)
[root@rhcs6node3 ~]# lsblk /dev/sda
NAME                                                                                                  MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda                                                                                                     8:0    0  20G  0 disk
└─ceph--8df43ddf--6a82--4fc2--a51f--59818837f2b6-osd--block--6e6869db--0bd8--4b5b--8409--21faf8b95900  253:2   0  20G  0 lvm
[root@rhcs6node3 ~]# echo 1 > /sys/block/sda/device/delete
[root@rhcs6node3 ~]# lsblk /dev/sda
lsblk: /dev/sda: not a block device
[root@rhcs6node3 ~]#

[root@rhcs6node1 ~]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME              STATUS  REWEIGHT  PRI-AFF
-1         0.07794  root default
-9         0.01949      host rhcs6client
 3    hdd  0.01949          osd.3              up   1.00000  1.00000
-3         0.01949      host rhcs6node1
 0    hdd  0.01949          osd.0              up   1.00000  1.00000
-5         0.01949      host rhcs6node2
 1    hdd  0.01949          osd.1              up   1.00000  1.00000
-7         0.01949      host rhcs6node3
 2    hdd  0.01949          osd.2            down   1.00000  1.00000
[root@rhcs6node1 ~]#

[root@rhcs6node1 ~]# ceph -s
  cluster:
    id:     d6a48172-dc64-11ee-87e4-525400f15327
    health: HEALTH_WARN
            Failed to apply 1 service(s): osd.initial_osds
            1 failed cephadm daemon(s)
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum rhcs6node1,rhcs6node3,rhcs6node2 (age 86m)
    mgr: rhcs6node3.aquyuo(active, since 12m), standbys: rhcs6node2.rgadnw, rhcs6node1.vojptl
    osd: 4 osds: 3 up (since 12m), 3 in (since 2m)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 577 KiB
    usage:   65 MiB used, 60 GiB / 60 GiB avail
    pgs:     1 active+clean

[root@rhcs6node1 ~]#
```

3. Removing/replacing the disk:
```
[root@rhcs6node1 ~]# ceph orch osd rm 2 --zap --replace --force
Scheduled OSD(s) for removal.
[root@rhcs6node1 ~]# ceph orch osd rm status
OSD  HOST        STATE                    PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
2    rhcs6node3  done, waiting for purge    0  True     True   True
[root@rhcs6node1 ~]#
[root@rhcs6node1 ~]# ceph orch osd rm status
No OSD remove/replace operations reported
[root@rhcs6node1 ~]#
[root@rhcs6node1 ~]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME              STATUS     REWEIGHT  PRI-AFF
-1         0.07794  root default
-9         0.01949      host rhcs6client
 3    hdd  0.01949          osd.3                 up   1.00000  1.00000
-3         0.01949      host rhcs6node1
 0    hdd  0.01949          osd.0                 up   1.00000  1.00000
-5         0.01949      host rhcs6node2
 1    hdd  0.01949          osd.1                 up   1.00000  1.00000
-7         0.01949      host rhcs6node3
 2    hdd  0.01949          osd.2          destroyed         0  1.00000
[root@rhcs6node1 ~]#
```
4. To clear the health warnings, CEPHADM_STRAY_DAEMON was muted (for testing, muted for 60 minutes) and the crash reports were archived:
```
[root@rhcs6node1 ~]# ceph health
HEALTH_WARN Failed to apply 1 service(s): osd.initial_osds; 1 stray daemon(s) not managed by cephadm; 1 daemons have recently crashed
[root@rhcs6node1 ~]#
[root@rhcs6node1 ~]# ceph health mute CEPHADM_STRAY_DAEMON 60m
[root@rhcs6node1 ~]# ceph health
HEALTH_WARN Failed to apply 1 service(s): osd.initial_osds; 1 daemons have recently crashed; (muted: CEPHADM_STRAY_DAEMON(59m))
[root@rhcs6node1 ~]#
[root@rhcs6node1 ~]# ceph crash ls-new
ID                                                                ENTITY  NEW
2024-03-11T10:38:40.331866Z_77d8f5cc-dd10-4a4a-931b-2d9ace23b8fd  osd.2    *
[root@rhcs6node1 ~]# ceph crash archive-all
[root@rhcs6node1 ~]# ceph health
HEALTH_WARN Failed to apply 1 service(s): osd.initial_osds; (muted: CEPHADM_STRAY_DAEMON(57m))
[root@rhcs6node1 ~]#
```

5. After the maintenance/disk replacement activity, unmute:
```
[root@rhcs6node1 ~]# ceph health unmute CEPHADM_STRAY_DAEMON
[root@rhcs6node1 ~]# ceph health detail
HEALTH_OK
[root@rhcs6node1 ~]#
```
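For reference, the lab steps above condense into the following sequence. This is a sketch using the values from the test lab (osd.2, a 60-minute mute), not a general-purpose script; the mute TTL should be sized to the expected maintenance window:

```
# Condensed workaround sketch (osd.2 and the 60m TTL come from the test lab above)
ceph orch osd rm 2 --zap --replace --force   # schedule the failed OSD for replacement
ceph health mute CEPHADM_STRAY_DAEMON 60m    # hide the stray-daemon warning for the maintenance window
ceph crash archive-all                       # clear the "daemons have recently crashed" warning
# ... physically replace the disk; cephadm re-creates the OSD once a suitable device is available ...
ceph health unmute CEPHADM_STRAY_DAEMON      # re-enable the warning after the replacement
ceph health detail                           # expect HEALTH_OK once the new OSD is up
```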
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.1 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:9775