Bug 2269003
| Summary: | [cee/sd][cephadm][RFE] CEPHADM_STRAY_DAEMON warning while replacing the osd | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Geo Jose <gjose> |
| Component: | RADOS | Assignee: | Nitzan Mordechai <nmordech> |
| Status: | CLOSED ERRATA | QA Contact: | DIVYA <dpentako> |
| Severity: | medium | Docs Contact: | Rivka Pollack <rpollack> |
| Priority: | unspecified | | |
| Version: | 6.1 | CC: | adking, bhkaur, bhubbard, bkunal, ceph-eng-bugs, cephqe-warriors, lsanders, ngangadh, nmordech, nojha, rpollack, rsachere, rzarzyns, tserlin, vumrao |
| Target Milestone: | --- | Keywords: | FutureFeature, Reopened |
| Target Release: | 8.1 | | |
| Hardware: | x86_64 | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-19.2.1-92.el9cp | Doc Type: | Bug Fix |
| Doc Text: | .Destroyed OSDs are no longer listed by the `ceph node ls` command<br>Previously, destroyed OSDs were listed without any indication of their status, leading to user confusion and causing cephadm to incorrectly report them as stray.<br>With this fix, the command filters out destroyed OSDs by checking their status before displaying them, ensuring accurate and reliable output. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 2355037 2355044 (view as bug list) | Environment: | |
| Last Closed: | 2025-06-26 12:12:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2355044, 2351689, 2355037 | | |
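The Doc Text above describes the reporting-side fix: `ceph node ls` now checks each OSD's status and skips destroyed OSDs, so cephadm no longer flags a destroyed-but-not-yet-replaced OSD as a stray daemon. On a build without the fix, the mismatch can be inspected with the standard CLI commands below; this is a cross-check sketch, not the fix itself, and the exact output depends on the cluster.
```
# Hedged sketch: on an affected build, a destroyed OSD (for example osd.2 below)
# still appears under its host in "ceph node ls osd", which is the listing that
# cephadm reconciles against and then reports as CEPHADM_STRAY_DAEMON.
ceph node ls osd                   # host -> OSD id map as reported by the monitor
ceph osd tree | grep -w destroyed  # OSDs that are only placeholders awaiting replacement
```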
Additional Info
===============
### Workaround
- Use the `ceph health mute` functionality to silence the CEPHADM_STRAY_DAEMON warning while the replacement is in progress (a minimal sketch follows; a full test-lab walkthrough is in the next subsection).
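A minimal sketch of the mute-based workaround; the 4-hour TTL is only an example, and `--sticky` is optional but keeps the mute in place even if the warning briefly clears and re-appears during the replacement.
```
# Example only: mute the stray-daemon warning for the maintenance window.
ceph health mute CEPHADM_STRAY_DAEMON 4h --sticky

# Once the new disk is in place and the OSD has been redeployed:
ceph health unmute CEPHADM_STRAY_DAEMON
```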
### Workaround from test lab:
1. Test cluster:
```
[ceph: root@rhcs6node1 /]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.07794 root default
-9 0.01949 host rhcs6client
3 hdd 0.01949 osd.3 up 1.00000 1.00000
-3 0.01949 host rhcs6node1
0 hdd 0.01949 osd.0 up 1.00000 1.00000
-5 0.01949 host rhcs6node2
1 hdd 0.01949 osd.1 up 1.00000 1.00000
-7 0.01949 host rhcs6node3
2 hdd 0.01949 osd.2 up 1.00000 1.00000
[ceph: root@rhcs6node1 /]#
```
2. To simulate a disk error, the disk was removed at the SCSI layer (a note on reverting this simulated failure follows the output below):
```
[root@rhcs6node3 ~]# lvs -ao+devices | grep ceph
osd-block-6e6869db-0bd8-4b5b-8409-21faf8b95900 ceph-8df43ddf-6a82-4fc2-a51f-59818837f2b6 -wi-ao---- <20.00g /dev/sda(0)
[root@rhcs6node3 ~]# lsblk /dev/sda
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 20G 0 disk
└─ceph--8df43ddf--6a82--4fc2--a51f--59818837f2b6-osd--block--6e6869db--0bd8--4b5b--8409--21faf8b95900 253:2 0 20G 0 lvm
[root@rhcs6node3 ~]# echo 1 > /sys/block/sda/device/delete
[root@rhcs6node3 ~]# lsblk /dev/sda
lsblk: /dev/sda: not a block device
[root@rhcs6node3 ~]#
[root@rhcs6node1 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.07794 root default
-9 0.01949 host rhcs6client
3 hdd 0.01949 osd.3 up 1.00000 1.00000
-3 0.01949 host rhcs6node1
0 hdd 0.01949 osd.0 up 1.00000 1.00000
-5 0.01949 host rhcs6node2
1 hdd 0.01949 osd.1 up 1.00000 1.00000
-7 0.01949 host rhcs6node3
2 hdd 0.01949 osd.2 down 1.00000 1.00000
[root@rhcs6node1 ~]#
[root@rhcs6node1 ~]# ceph -s
cluster:
id: d6a48172-dc64-11ee-87e4-525400f15327
health: HEALTH_WARN
Failed to apply 1 service(s): osd.initial_osds
1 failed cephadm daemon(s)
1 daemons have recently crashed
services:
mon: 3 daemons, quorum rhcs6node1,rhcs6node3,rhcs6node2 (age 86m)
mgr: rhcs6node3.aquyuo(active, since 12m), standbys: rhcs6node2.rgadnw, rhcs6node1.vojptl
osd: 4 osds: 3 up (since 12m), 3 in (since 2m)
data:
pools: 1 pools, 1 pgs
objects: 2 objects, 577 KiB
usage: 65 MiB used, 60 GiB / 60 GiB avail
pgs: 1 active+clean
[root@rhcs6node1 ~]#
```
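In the test lab, the simulated failure can later be reverted (once the OSD has been removed in the next step) by rescanning the SCSI host so the disk re-appears; the host number below is an example and depends on the system, and cephadm would then typically redeploy the OSD according to the existing service spec.
```
# Example only: rescan the SCSI host so the "replacement" disk shows up again.
# Pick the correct hostN for your controller (see /sys/class/scsi_host/).
echo "- - -" > /sys/class/scsi_host/host0/scan
lsblk   # the disk should be visible again
```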
3. Remove the OSD and mark it for replacement:
```
[root@rhcs6node1 ~]# ceph orch osd rm 2 --zap --replace --force
Scheduled OSD(s) for removal.
[root@rhcs6node1 ~]# ceph orch osd rm status
OSD HOST STATE PGS REPLACE FORCE ZAP DRAIN STARTED AT
2 rhcs6node3 done, waiting for purge 0 True True True
[root@rhcs6node1 ~]#
[root@rhcs6node1 ~]# ceph orch osd rm status
No OSD remove/replace operations reported
[root@rhcs6node1 ~]#
[root@rhcs6node1 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.07794 root default
-9 0.01949 host rhcs6client
3 hdd 0.01949 osd.3 up 1.00000 1.00000
-3 0.01949 host rhcs6node1
0 hdd 0.01949 osd.0 up 1.00000 1.00000
-5 0.01949 host rhcs6node2
1 hdd 0.01949 osd.1 up 1.00000 1.00000
-7 0.01949 host rhcs6node3
2 hdd 0.01949 osd.2 destroyed 0 1.00000
[root@rhcs6node1 ~]#
```
4. To clear the health warnings, mute CEPHADM_STRAY_DAEMON (for this test, muted for 60 minutes) and archive the recent crashes:
```
[root@rhcs6node1 ~]# ceph health
HEALTH_WARN Failed to apply 1 service(s): osd.initial_osds; 1 stray daemon(s) not managed by cephadm; 1 daemons have recently crashed
[root@rhcs6node1 ~]#
[root@rhcs6node1 ~]# ceph health mute CEPHADM_STRAY_DAEMON 60m
[root@rhcs6node1 ~]# ceph health
HEALTH_WARN Failed to apply 1 service(s): osd.initial_osds; 1 daemons have recently crashed; (muted: CEPHADM_STRAY_DAEMON(59m))
[root@rhcs6node1 ~]#
[root@rhcs6node1 ~]# ceph crash ls-new
ID ENTITY NEW
2024-03-11T10:38:40.331866Z_77d8f5cc-dd10-4a4a-931b-2d9ace23b8fd osd.2 *
[root@rhcs6node1 ~]# ceph crash archive-all
[root@rhcs6node1 ~]# ceph health
HEALTH_WARN Failed to apply 1 service(s): osd.initial_osds; (muted: CEPHADM_STRAY_DAEMON(57m))
[root@rhcs6node1 ~]#
```
5. After the maintenance/disk replacement activity is complete, unmute the warning:
```
[root@rhcs6node1 ~]# ceph health unmute CEPHADM_STRAY_DAEMON
[root@rhcs6node1 ~]# ceph health detail
HEALTH_OK
[root@rhcs6node1 ~]#
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Ceph Storage 8.1 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:9775
Description of problem:
======================
- CEPHADM_STRAY_DAEMON warning while replacing the OSD.

Version-Release number of selected component (if applicable):
============================================================
- RHCS 6.1z2 / 17.2.6-148.el9cp

How reproducible:
================
- During the OSD disk replacement activity ("ceph orch osd rm ${OSD} --zap --replace"), the cluster reports "HEALTH_WARN" due to "CEPHADM_STRAY_DAEMON". Because of this health warning, other important warnings may be missed during this time.

Steps to Reproduce:
==================
1. On a RHCS 6 cluster, remove an OSD with the --replace option (a minimal sketch follows below).
2. Check the Ceph health status.

Actual results:
==============
- "HEALTH_WARN" is reported due to "CEPHADM_STRAY_DAEMON".

Expected results:
================
- During the disk replacement/OSD re-deployment, this warning should not be raised.
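A minimal reproduction sketch based on the steps above; the OSD id 2 is a placeholder, and on affected builds the stray-daemon warning appears once the OSD is destroyed and waiting for its replacement disk.
```
# Hypothetical OSD id used as an example.
ceph orch osd rm 2 --zap --replace
ceph orch osd rm status   # wait until the removal/replace operation completes
ceph health detail        # affected builds report CEPHADM_STRAY_DAEMON here
```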