Bug 2269003 - [cee/sd][cephadm][RFE] CEPHADM_STRAY_DAEMON warning while replacing the osd
Summary: [cee/sd][cephadm][RFE] CEPHADM_STRAY_DAEMON warning while replacing the osd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 6.1
Hardware: x86_64
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 8.1
Assignee: Nitzan mordechai
QA Contact: DIVYA
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks: 2355044 2351689 2355037
Reported: 2024-03-11 13:07 UTC by Geo Jose
Modified: 2025-06-26 12:12 UTC
CC List: 15 users

Fixed In Version: ceph-19.2.1-92.el9cp
Doc Type: Bug Fix
Doc Text:
.Destroyed OSDs are no longer listed by the `ceph node ls` command
Previously, destroyed OSDs were listed without any indication of their status, leading to user confusion and causing cephadm to incorrectly report them as stray. With this fix, the command filters out destroyed OSDs by checking their status before displaying them, ensuring accurate and reliable output.
Clone Of:
Clones: 2355037 2355044
Environment:
Last Closed: 2025-06-26 12:12:12 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 70400 0 None None None 2025-03-11 15:32:45 UTC
Github ceph ceph pull 62327 0 None Merged squid: OSDMonitor: exclude destroyed OSDs from "ceph node ls" output 2025-04-01 15:28:59 UTC
Red Hat Issue Tracker RHCEPH-8487 0 None None None 2024-03-11 13:09:46 UTC
Red Hat Knowledge Base (Solution) 7080587 0 None None None 2024-07-26 04:17:45 UTC
Red Hat Product Errata RHSA-2025:9775 0 None None None 2025-06-26 12:12:40 UTC
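The linked pull request (Github ceph pull 62327) excludes destroyed OSDs from the `ceph node ls` output that cephadm compares against its managed daemons when flagging strays. As a rough, illustrative check of the behavior on a build containing the fix (ceph-19.2.1-92.el9cp or later), with output omitted:

```
# After an OSD has been removed with --replace and is left in the "destroyed" state:
ceph osd tree | grep destroyed   # the destroyed OSD is still shown here, by design
ceph node ls osd                 # on a fixed build, the destroyed OSD is no longer listed
ceph health detail               # no CEPHADM_STRAY_DAEMON warning should be raised
```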

Description Geo Jose 2024-03-11 13:07:53 UTC
Description of problem:
======================
- A CEPHADM_STRAY_DAEMON warning is raised while replacing an OSD.

Version-Release number of selected component (if applicable):
============================================================
- RHCS 6.1z2 / 17.2.6-148.el9cp

How reproducible:
================
- During the OSD disk replacement activity ("ceph orch osd rm ${OSD} --zap --replace"), the cluster reports "HEALTH_WARN" due to "CEPHADM_STRAY_DAEMON". Because of this health warning, other important warnings can be missed during this time.

Steps to Reproduce:
==================
1. On an RHCS 6 cluster, remove an OSD with the --replace option (see the example below).
2. Check the Ceph health status.
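
For reference, the reproduction comes down to the following commands (illustrative; OSD id 2 is only an example):

```
# 1. Schedule an OSD for replacement, keeping its id for re-use
ceph orch osd rm 2 --zap --replace

# 2. Once the OSD is in the "destroyed" state, check the cluster health
ceph orch osd rm status
ceph health detail    # reports CEPHADM_STRAY_DAEMON until the replacement OSD is deployed
```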


Actual results:
==============
- Seeing "HEALTH_WARN" due to "CEPHADM_STRAY_DAEMON".

Expected results:
================
- During the disk replacement/OSD redeployment, this warning should not be raised.

Comment 1 Geo Jose 2024-03-11 13:12:14 UTC
Additional Info
===============

### Workaround
- Use the mute functionality to ignore the CEPHADM_STRAY_DAEMON warning.

### Workaround from test lab:

1. Test cluster:

```
[ceph: root@rhcs6node1 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         0.07794  root default                                   
-9         0.01949      host rhcs6client                           
 3    hdd  0.01949          osd.3             up   1.00000  1.00000
-3         0.01949      host rhcs6node1                            
 0    hdd  0.01949          osd.0             up   1.00000  1.00000
-5         0.01949      host rhcs6node2                            
 1    hdd  0.01949          osd.1             up   1.00000  1.00000
-7         0.01949      host rhcs6node3                            
 2    hdd  0.01949          osd.2             up   1.00000  1.00000
[ceph: root@rhcs6node1 /]# 
```

2. To simulate a disk error, the disk was removed at the SCSI layer:

```
[root@rhcs6node3 ~]# lvs -ao+devices | grep ceph
  osd-block-6e6869db-0bd8-4b5b-8409-21faf8b95900 ceph-8df43ddf-6a82-4fc2-a51f-59818837f2b6 -wi-ao---- <20.00g                                                     /dev/sda(0)   
[root@rhcs6node3 ~]# lsblk /dev/sda 
NAME                                                                                                  MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda                                                                                                     8:0    0  20G  0 disk 
└─ceph--8df43ddf--6a82--4fc2--a51f--59818837f2b6-osd--block--6e6869db--0bd8--4b5b--8409--21faf8b95900 253:2    0  20G  0 lvm  
[root@rhcs6node3 ~]# echo 1 > /sys/block/sda/device/delete 
[root@rhcs6node3 ~]# lsblk /dev/sda 
lsblk: /dev/sda: not a block device
[root@rhcs6node3 ~]# 


[root@rhcs6node1 ~]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         0.07794  root default                                   
-9         0.01949      host rhcs6client                           
 3    hdd  0.01949          osd.3             up   1.00000  1.00000
-3         0.01949      host rhcs6node1                            
 0    hdd  0.01949          osd.0             up   1.00000  1.00000
-5         0.01949      host rhcs6node2                            
 1    hdd  0.01949          osd.1             up   1.00000  1.00000
-7         0.01949      host rhcs6node3                            
 2    hdd  0.01949          osd.2           down   1.00000  1.00000
[root@rhcs6node1 ~]# 


[root@rhcs6node1 ~]# ceph -s
  cluster:
    id:     d6a48172-dc64-11ee-87e4-525400f15327
    health: HEALTH_WARN
            Failed to apply 1 service(s): osd.initial_osds
            1 failed cephadm daemon(s)
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum rhcs6node1,rhcs6node3,rhcs6node2 (age 86m)
    mgr: rhcs6node3.aquyuo(active, since 12m), standbys: rhcs6node2.rgadnw, rhcs6node1.vojptl
    osd: 4 osds: 3 up (since 12m), 3 in (since 2m)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 577 KiB
    usage:   65 MiB used, 60 GiB / 60 GiB avail
    pgs:     1 active+clean
 
[root@rhcs6node1 ~]# 
```


3. Removing/Replacing the disk:

```
[root@rhcs6node1 ~]# ceph orch osd rm 2 --zap --replace --force
Scheduled OSD(s) for removal.
[root@rhcs6node1 ~]# ceph orch osd rm status
OSD  HOST        STATE                    PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT  
2    rhcs6node3  done, waiting for purge    0  True     True   True                    
[root@rhcs6node1 ~]#

[root@rhcs6node1 ~]# ceph orch osd rm status
No OSD remove/replace operations reported
[root@rhcs6node1 ~]# 
[root@rhcs6node1 ~]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS     REWEIGHT  PRI-AFF
-1         0.07794  root default                                      
-9         0.01949      host rhcs6client                              
 3    hdd  0.01949          osd.3                up   1.00000  1.00000
-3         0.01949      host rhcs6node1                               
 0    hdd  0.01949          osd.0                up   1.00000  1.00000
-5         0.01949      host rhcs6node2                               
 1    hdd  0.01949          osd.1                up   1.00000  1.00000
-7         0.01949      host rhcs6node3                               
 2    hdd  0.01949          osd.2         destroyed         0  1.00000
[root@rhcs6node1 ~]# 
```

4. To clear the health warning, the alert was muted (for 60 minutes, for testing) and the crashes were archived:

```
[root@rhcs6node1 ~]# ceph health
HEALTH_WARN Failed to apply 1 service(s): osd.initial_osds; 1 stray daemon(s) not managed by cephadm; 1 daemons have recently crashed
[root@rhcs6node1 ~]# 
[root@rhcs6node1 ~]# ceph health mute CEPHADM_STRAY_DAEMON 60m
[root@rhcs6node1 ~]# ceph health 
HEALTH_WARN Failed to apply 1 service(s): osd.initial_osds; 1 daemons have recently crashed; (muted: CEPHADM_STRAY_DAEMON(59m))
[root@rhcs6node1 ~]# 
[root@rhcs6node1 ~]# ceph crash ls-new
ID                                                                ENTITY  NEW  
2024-03-11T10:38:40.331866Z_77d8f5cc-dd10-4a4a-931b-2d9ace23b8fd  osd.2    *   
[root@rhcs6node1 ~]# ceph crash archive-all
[root@rhcs6node1 ~]# ceph health 
HEALTH_WARN Failed to apply 1 service(s): osd.initial_osds; (muted: CEPHADM_STRAY_DAEMON(57m))
[root@rhcs6node1 ~]# 
```

5. After the maintenance/disk replacement activity, unmute:

```
[root@rhcs6node1 ~]# ceph health unmute CEPHADM_STRAY_DAEMON
[root@rhcs6node1 ~]# ceph health detail
HEALTH_OK
[root@rhcs6node1 ~]# 
```
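
Putting the lab steps together, a wrapper for the workaround could look like the sketch below. This is only an illustration: the OSD id (2), the 2h mute window, and the 30-second polling interval are arbitrary examples, and the grep pattern assumes the plain-text `ceph orch osd rm status` output shown above.

```
#!/bin/bash
# Mute the expected warning for the maintenance window and schedule the replacement
ceph health mute CEPHADM_STRAY_DAEMON 2h
ceph orch osd rm 2 --zap --replace --force

# Wait until the removal/zap has finished for osd.2
while ceph orch osd rm status | grep -q '^2 '; do
    sleep 30
done

# Archive the crash entry left behind by the failed disk
ceph crash archive-all

# After the new disk is in place and the OSD has been redeployed, lift the mute:
#   ceph health unmute CEPHADM_STRAY_DAEMON
```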

Comment 25 errata-xmlrpc 2025-06-26 12:12:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.1 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:9775

