Bug 2269003

Summary: [cee/sd][cephadm][RFE] CEPHADM_STRAY_DAEMON warning while replacing the osd
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Geo Jose <gjose>
Component: RADOS    Assignee: Nitzan mordechai <nmordech>
Status: CLOSED ERRATA QA Contact: DIVYA <dpentako>
Severity: medium Docs Contact: Rivka Pollack <rpollack>
Priority: unspecified    
Version: 6.1    CC: adking, bhkaur, bhubbard, bkunal, ceph-eng-bugs, cephqe-warriors, lsanders, ngangadh, nmordech, nojha, rpollack, rsachere, rzarzyns, tserlin, vumrao
Target Milestone: ---    Keywords: FutureFeature, Reopened
Target Release: 8.1   
Hardware: x86_64   
OS: All   
Whiteboard:
Fixed In Version: ceph-19.2.1-92.el9cp Doc Type: Bug Fix
Doc Text:
.Destroyed OSDs are no longer listed by the `ceph node ls` command
Previously, destroyed OSDs were listed without any indication of their status, leading to user confusion and causing cephadm to incorrectly report them as stray. With this fix, the command filters out destroyed OSDs by checking their status before displaying them, ensuring accurate and reliable output.
Story Points: ---
Clone Of:
: 2355037 2355044 (view as bug list) Environment:
Last Closed: 2025-06-26 12:12:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2355044, 2351689, 2355037    
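
The Doc Text above describes how the fix works: `ceph node ls` now checks each OSD's status and filters out destroyed OSDs before displaying them, so an OSD awaiting replacement is no longer reported to cephadm as a stray daemon. A minimal verification sketch on a build containing the fix (ceph-19.2.1-92.el9cp or later); the exact output depends on the cluster:

```
# On a fixed build, a destroyed OSD (e.g. osd.2 in the transcript below)
# should no longer be listed by "ceph node ls", and therefore should not
# trigger CEPHADM_STRAY_DAEMON while the replacement is in progress.
ceph node ls
ceph health detail
```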

Description Geo Jose 2024-03-11 13:07:53 UTC
Description of problem:
======================
-  CEPHADM_STRAY_DAEMON warning while replacing the osd

Version-Release number of selected component (if applicable):
============================================================
- RHCS 6.1z2 / 17.2.6-148.el9cp

How reproducible:
================
- During OSD disk replacement ("ceph orch osd rm ${OSD} --zap --replace"), the cluster reports "HEALTH_WARN" due to "CEPHADM_STRAY_DAEMON". Because this warning persists for the duration of the activity, other important warnings can be missed during that time.

Steps to Reproduce:
==================
1. On an RHCS 6 cluster, remove an OSD with the --replace option (see the command sketch below).
2. Check the ceph health status.
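
A minimal command sketch of these steps; the OSD ID is a placeholder, and the full test-lab activity is shown in comment 1:

```
# Remove an OSD while keeping its ID reserved for replacement
ceph orch osd rm <OSD_ID> --zap --replace

# While the OSD remains in "destroyed" state, the stray-daemon warning appears
ceph health detail
```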


Actual results:
==============
- Seeing "HEALTH_WARN" due to "CEPHADM_STRAY_DAEMON".

Expected results:
================
- During disk replacement/OSD redeployment, this warning should not be raised.

Comment 1 Geo Jose 2024-03-11 13:12:14 UTC
Additional Info
===============

### Workaround
- Use the mute functionality to ignore the CEPHADM_STRAY_DAEMON warning until the replacement completes (example below).
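
For example (the 60-minute window is illustrative; pick a duration that covers the replacement, as in the test-lab run below):

```
ceph health mute CEPHADM_STRAY_DAEMON 60m
# ... perform the disk replacement ...
ceph health unmute CEPHADM_STRAY_DAEMON
```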

### Workaround from test lab:

1. Test cluster:

```
[ceph: root@rhcs6node1 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         0.07794  root default                                   
-9         0.01949      host rhcs6client                           
 3    hdd  0.01949          osd.3             up   1.00000  1.00000
-3         0.01949      host rhcs6node1                            
 0    hdd  0.01949          osd.0             up   1.00000  1.00000
-5         0.01949      host rhcs6node2                            
 1    hdd  0.01949          osd.1             up   1.00000  1.00000
-7         0.01949      host rhcs6node3                            
 2    hdd  0.01949          osd.2             up   1.00000  1.00000
[ceph: root@rhcs6node1 /]# 
```

2. To simulate a disk error, the disk was removed at the SCSI layer:

```
[root@rhcs6node3 ~]# lvs -ao+devices | grep ceph
  osd-block-6e6869db-0bd8-4b5b-8409-21faf8b95900 ceph-8df43ddf-6a82-4fc2-a51f-59818837f2b6 -wi-ao---- <20.00g                                                     /dev/sda(0)   
[root@rhcs6node3 ~]# lsblk /dev/sda 
NAME                                                                                                  MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda                                                                                                     8:0    0  20G  0 disk 
└─ceph--8df43ddf--6a82--4fc2--a51f--59818837f2b6-osd--block--6e6869db--0bd8--4b5b--8409--21faf8b95900 253:2    0  20G  0 lvm  
[root@rhcs6node3 ~]# echo 1 > /sys/block/sda/device/delete 
[root@rhcs6node3 ~]# lsblk /dev/sda 
lsblk: /dev/sda: not a block device
[root@rhcs6node3 ~]# 


[root@rhcs6node1 ~]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         0.07794  root default                                   
-9         0.01949      host rhcs6client                           
 3    hdd  0.01949          osd.3             up   1.00000  1.00000
-3         0.01949      host rhcs6node1                            
 0    hdd  0.01949          osd.0             up   1.00000  1.00000
-5         0.01949      host rhcs6node2                            
 1    hdd  0.01949          osd.1             up   1.00000  1.00000
-7         0.01949      host rhcs6node3                            
 2    hdd  0.01949          osd.2           down   1.00000  1.00000
[root@rhcs6node1 ~]# 


[root@rhcs6node1 ~]# ceph -s
  cluster:
    id:     d6a48172-dc64-11ee-87e4-525400f15327
    health: HEALTH_WARN
            Failed to apply 1 service(s): osd.initial_osds
            1 failed cephadm daemon(s)
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum rhcs6node1,rhcs6node3,rhcs6node2 (age 86m)
    mgr: rhcs6node3.aquyuo(active, since 12m), standbys: rhcs6node2.rgadnw, rhcs6node1.vojptl
    osd: 4 osds: 3 up (since 12m), 3 in (since 2m)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 577 KiB
    usage:   65 MiB used, 60 GiB / 60 GiB avail
    pgs:     1 active+clean
 
[root@rhcs6node1 ~]# 
```


3. Removing/Replacing the disk:

```
[root@rhcs6node1 ~]# ceph orch osd rm 2 --zap --replace --force
Scheduled OSD(s) for removal.
[root@rhcs6node1 ~]# ceph orch osd rm status
OSD  HOST        STATE                    PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT  
2    rhcs6node3  done, waiting for purge    0  True     True   True                    
[root@rhcs6node1 ~]#

[root@rhcs6node1 ~]# ceph orch osd rm status
No OSD remove/replace operations reported
[root@rhcs6node1 ~]# 
[root@rhcs6node1 ~]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS     REWEIGHT  PRI-AFF
-1         0.07794  root default                                      
-9         0.01949      host rhcs6client                              
 3    hdd  0.01949          osd.3                up   1.00000  1.00000
-3         0.01949      host rhcs6node1                               
 0    hdd  0.01949          osd.0                up   1.00000  1.00000
-5         0.01949      host rhcs6node2                               
 1    hdd  0.01949          osd.1                up   1.00000  1.00000
-7         0.01949      host rhcs6node3                               
 2    hdd  0.01949          osd.2         destroyed         0  1.00000
[root@rhcs6node1 ~]# 
```

4. To clear the health warnings, the CEPHADM_STRAY_DAEMON warning was muted (for testing, for 60 minutes) and the recent crash entries were archived:

```
[root@rhcs6node1 ~]# ceph health
HEALTH_WARN Failed to apply 1 service(s): osd.initial_osds; 1 stray daemon(s) not managed by cephadm; 1 daemons have recently crashed
[root@rhcs6node1 ~]# 
[root@rhcs6node1 ~]# ceph health mute CEPHADM_STRAY_DAEMON 60m
[root@rhcs6node1 ~]# ceph health 
HEALTH_WARN Failed to apply 1 service(s): osd.initial_osds; 1 daemons have recently crashed; (muted: CEPHADM_STRAY_DAEMON(59m))
[root@rhcs6node1 ~]# 
[root@rhcs6node1 ~]# ceph crash ls-new
ID                                                                ENTITY  NEW  
2024-03-11T10:38:40.331866Z_77d8f5cc-dd10-4a4a-931b-2d9ace23b8fd  osd.2    *   
[root@rhcs6node1 ~]# ceph crash archive-all
[root@rhcs6node1 ~]# ceph health 
HEALTH_WARN Failed to apply 1 service(s): osd.initial_osds; (muted: CEPHADM_STRAY_DAEMON(57m))
[root@rhcs6node1 ~]# 
```

5. After the maintenance/disk replacement activity, unmute the warning:

```
[root@rhcs6node1 ~]# ceph health unmute CEPHADM_STRAY_DAEMON
[root@rhcs6node1 ~]# ceph health detail
HEALTH_OK
[root@rhcs6node1 ~]# 
```

Comment 25 errata-xmlrpc 2025-06-26 12:12:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.1 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:9775