Bug 2059027

Summary: Device Replacement with FORCE_OSD_REMOVAL, OSD moved to "destroyed" state.
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: rook
Version: 4.10
Reporter: Oded <oviner>
Assignee: Travis Nielsen <tnielsen>
QA Contact: Oded <oviner>
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Target Release: ODF 4.10.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.10.0-175
Doc Type: No Doc Update
Type: Bug
Last Closed: 2022-04-21 09:12:49 UTC
CC: hnallurv, madam, mmuench, muagarwa, nberry, ocs-bugs, odf-bz-bot

Description Oded 2022-02-27 23:46:44 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
After the OSD removal job got stuck, I deleted it and re-ran it with the FORCE_OSD_REMOVAL parameter.
The OSD removal job completed, but the OSD was only moved to the "destroyed" state instead of being removed.
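For reference, the leftover entry can be confirmed from the rook-ceph-tools pod (a sketch, assuming the toolbox is deployed in openshift-storage):

$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage ${TOOLS_POD}
sh-4.4$ ceph osd tree | grep destroyed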

Version of all relevant components (if applicable):
OCP Version:4.10.0-0.nightly-2022-02-22-093600
ODF Version: 4.10.0-166
LSO Version: local-storage-operator.4.11.0-202202221716
Provider: VMware

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Remove the old OSD from the cluster [without FORCE_OSD_REMOVAL] -> the ocs-osd-removal-job pod gets stuck in the Running state
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} | oc create -n openshift-storage -f -
2022-02-23 09:10:05.259280 I | cephosd: osd.2 is marked 'DOWN'
2022-02-23 09:10:05.259296 D | exec: Running command: ceph osd safe-to-destroy 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-02-23 09:10:05.649466 W | cephosd: osd.2 is NOT be ok to destroy, retrying in 1m until success
2022-02-23 09:11:05.650412 D | exec: Running command: ceph osd safe-to-destroy 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
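While the job keeps retrying here, the same check can be run manually from the rook-ceph-tools pod to see why Ceph refuses the removal (a sketch; a non-zero exit from safe-to-destroy usually means PGs still map to the OSD):

sh-4.4$ ceph osd safe-to-destroy 2
sh-4.4$ ceph pg ls-by-osd 2 | head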


2. Delete the ocs-osd-removal-job:
$ oc delete job ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted

 
3. Remove the old OSD from the cluster [with FORCE_OSD_REMOVAL] -> the ocs-osd-removal-job moved to the Completed state.
$ oc process -n openshift-storage ocs-osd-removal -p FORCE_OSD_REMOVAL=true -p FAILED_OSD_IDS=${osd_id_to_remove} |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created


$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job-xvsj2   0/1     Completed   0          6m57s

$  oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
2022-02-23 22:49:18.141374 I | rookcmd: starting Rook v4.10.0-0.e43e46bc94063280e8d782b01674a68cacc4e8bc with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=2 --force-osd-removal true'
2022-02-23 22:49:18.141476 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=DEBUG, --operator-image=, --osd-ids=2, --preserve-pvc=false, --service-account=
2022-02-23 22:49:18.141481 I | op-mon: parsing mon endpoints: b=172.30.150.138:6789,c=172.30.42.121:6789,a=172.30.150.9:6789
2022-02-23 22:49:19.222219 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2022-02-23 22:49:19.222386 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2022-02-23 22:49:19.222470 D | cephclient: config file @ /etc/ceph/ceph.conf:
[global]
fsid                        = 0149cef0-9902-4336-b08a-c825e0b56687
mon initial members         = b c a
mon host                    = [v2:172.30.150.138:3300,v1:172.30.150.138:6789],[v2:172.30.42.121:3300,v1:172.30.42.121:6789],[v2:172.30.150.9:3300,v1:172.30.150.9:6789]
bdev_flock_retry            = 20
mon_osd_full_ratio          = .85
mon_osd_backfillfull_ratio  = .8
mon_osd_nearfull_ratio      = .75
mon_max_pg_per_osd          = 600
mon_pg_warn_max_object_skew = 0
mon_data_avail_warn         = 15

[osd]
osd_memory_target_cgroup_limit_ratio = 0.8

[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring

2022-02-23 22:49:19.222489 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-02-23 22:49:19.580090 I | cephosd: validating status of osd.2
2022-02-23 22:49:19.580116 I | cephosd: osd.2 is marked 'DOWN'
2022-02-23 22:49:19.580131 D | exec: Running command: ceph osd safe-to-destroy 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-02-23 22:49:19.928676 I | cephosd: osd.2 is NOT be ok to destroy but force removal is enabled so proceeding with removal
2022-02-23 22:49:19.928727 D | exec: Running command: ceph osd find 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-02-23 22:49:20.256448 I | cephosd: marking osd.2 out
2022-02-23 22:49:20.256508 D | exec: Running command: ceph osd out osd.2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-02-23 22:49:20.756771 I | cephosd: removing the OSD deployment "rook-ceph-osd-2"
2022-02-23 22:49:20.756805 D | op-k8sutil: removing rook-ceph-osd-2 deployment if it exists
2022-02-23 22:49:20.756809 I | op-k8sutil: removing deployment rook-ceph-osd-2 if it exists
2022-02-23 22:49:20.769692 I | op-k8sutil: Removed deployment rook-ceph-osd-2
2022-02-23 22:49:20.779769 I | op-k8sutil: "rook-ceph-osd-2" still found. waiting...
2022-02-23 22:49:22.788297 I | op-k8sutil: confirmed rook-ceph-osd-2 does not exist
2022-02-23 22:49:22.798924 I | cephosd: removing the osd prepare job "rook-ceph-osd-prepare-b59c39dd57cd891848ca9de1e242595b"
2022-02-23 22:49:22.809935 I | cephosd: removing the OSD PVC "ocs-deviceset-localblock-0-data-2qqh27"
2022-02-23 22:49:22.826913 I | cephosd: destroying osd.2
2022-02-23 22:49:22.826951 D | exec: Running command: ceph osd destroy osd.2 --yes-i-really-mean-it --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-02-23 22:49:23.200485 I | cephosd: removing osd.2 from ceph
2022-02-23 22:49:23.200518 D | exec: Running command: ceph osd crush rm compute-0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-02-23 22:49:23.549683 E | cephosd: failed to remove CRUSH host "compute-0". exit status 39
2022-02-23 22:49:23.549715 D | exec: Running command: ceph crash ls --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-02-23 22:49:23.893552 I | cephosd: no ceph crash to silence
2022-02-23 22:49:23.893583 I | cephosd: completed removal of OSD 2
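Note that the job runs "ceph osd destroy" rather than "ceph osd purge": destroy only marks the OSD as destroyed (its ID and CRUSH entry are kept so the ID can be reused), while purge also removes the CRUSH entry, the auth key, and the OSD ID, so no leftover entry remains. A sketch of the difference, run from the toolbox against this OSD:

sh-4.4$ ceph osd destroy 2 --yes-i-really-mean-it   # leaves a 'destroyed' entry in the OSD tree
sh-4.4$ ceph osd purge 2 --yes-i-really-mean-it     # destroy + crush rm + auth del + osd rm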


4. Check Ceph status:
sh-4.4$ ceph osd tree 
ID  CLASS  WEIGHT   TYPE NAME           STATUS     REWEIGHT  PRI-AFF
-1         0.78149  root default                                    
-7         0.39075      host compute-0                              
 2    hdd  0.09769          osd.2       destroyed         0  1.00000
 3    hdd  0.09769          osd.3       destroyed         0  1.00000
 4    hdd  0.09769          osd.4              up   1.00000  1.00000
 5    hdd  0.09769          osd.5              up   1.00000  1.00000
-5         0.19537      host compute-1                              
 0    hdd  0.09769          osd.0              up   1.00000  1.00000
 6    hdd  0.09769          osd.6              up   1.00000  1.00000
-3         0.19537      host compute-2                              
 1    hdd  0.09769          osd.1              up   1.00000  1.00000
 7    hdd  0.09769          osd.7              up   1.00000  1.00000
 

sh-4.4$ ceph status
  cluster:
    id:     0149cef0-9902-4336-b08a-c825e0b56687
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 3d)
    mgr: a(active, since 3d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 8 osds: 6 up (since 3d), 6 in (since 3d)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   11 pools, 273 pgs
    objects: 27.22k objects, 104 GiB
    usage:   317 GiB used, 283 GiB / 600 GiB avail
    pgs:     273 active+clean
 
  io:
    client:   1.2 KiB/s rd, 9.7 KiB/s wr, 2 op/s rd, 1 op/s wr
 
sh-4.4$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS   
 2    hdd  0.09769         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0  destroyed
 3    hdd  0.09769         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0  destroyed
 4    hdd  0.09769   1.00000  100 GiB   50 GiB   50 GiB   32 KiB  797 MiB   50 GiB  50.31  0.95  124         up
 5    hdd  0.09769   1.00000  100 GiB   55 GiB   55 GiB   92 KiB  838 MiB   45 GiB  55.41  1.05  149         up
 0    hdd  0.09769   1.00000  100 GiB   50 GiB   50 GiB   78 KiB  994 MiB   50 GiB  50.50  0.96  135         up
 6    hdd  0.09769   1.00000  100 GiB   55 GiB   55 GiB  119 KiB  665 MiB   45 GiB  55.25  1.05  138         up
 1    hdd  0.09769   1.00000  100 GiB   51 GiB   50 GiB  130 KiB  728 MiB   49 GiB  50.87  0.96  138         up
 7    hdd  0.09769   1.00000  100 GiB   55 GiB   54 GiB    7 KiB  949 MiB   45 GiB  54.89  1.04  135         up
                       TOTAL  600 GiB  317 GiB  312 GiB  460 KiB  4.9 GiB  283 GiB  52.87                      
MIN/MAX VAR: 0.95/1.05  STDDEV: 2.32
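If the leftover entries need to be cleaned up manually (a sketch only; this permanently removes the destroyed IDs shown in the tree above), they can be purged from the toolbox:

sh-4.4$ ceph osd purge 2 --yes-i-really-mean-it
sh-4.4$ ceph osd purge 3 --yes-i-really-mean-it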

Actual results:
The old OSD moved to the "destroyed" state instead of being removed.

Expected results:
The old OSD is fully removed from the cluster (no "destroyed" entry left in the OSD tree).

Additional info:
MG:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2059027.tar.gz

Comment 4 Oded 2022-03-06 18:35:01 UTC
Device Replacement with FORCE_OSD_REMOVAL, old OSD deleted 

Setup:
Provider: VMware
OCP Version: 4.10.0-0.nightly-2022-03-05-023708
ODF Version: 4.10.0-177

Test Process:
1. Identify the OSD that needs to be replaced and the OpenShift Container Platform node that has the OSD scheduled on it.
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
NAME                               READY   STATUS    RESTARTS   AGE     IP            NODE        NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-6d87d76d7-p575f    2/2     Running   0          4h52m   10.129.2.26   compute-2   <none>           <none>
rook-ceph-osd-1-79666595d9-srvp7   2/2     Running   0          4h52m   10.128.2.23   compute-0   <none>           <none>
rook-ceph-osd-2-65c497cfc8-4sxnr   2/2     Running   0          4h52m   10.131.0.25   compute-1   <none>           <none>

$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
NAME                               READY   STATUS             RESTARTS     AGE     IP            NODE        NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-6d87d76d7-p575f    2/2     Running            0            4h54m   10.129.2.26   compute-2   <none>           <none>
rook-ceph-osd-1-79666595d9-srvp7   1/2     CrashLoopBackOff   1 (9s ago)   4h54m   10.128.2.23   compute-0   <none>           <none>
rook-ceph-osd-2-65c497cfc8-4sxnr   2/2     Running            0            4h54m   10.131.0.25   compute-1   <none>           <none>
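To map the failing OSD ID to its backing PVC before removal (a sketch; ceph.rook.io/pvc is the label Rook puts on PVC-backed OSD deployments, so this assumes a PVC-based deviceset as in this setup):

$ oc get -n openshift-storage deployment rook-ceph-osd-1 -o yaml | grep ceph.rook.io/pvc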

2. Scale down the OSD deployment for the OSD to be replaced.
$ osd_id_to_remove=1
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-1 scaled

3. Verify that the rook-ceph-osd pod is terminated.
$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
NAME                               READY   STATUS        RESTARTS   AGE
rook-ceph-osd-1-79666595d9-srvp7   0/2     Terminating   3          4h55m
$ oc delete pod rook-ceph-osd-1-79666595d9-srvp7 --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-1-79666595d9-srvp7" force deleted
$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
No resources found in openshift-storage namespace.

4. Remove the old OSD from the cluster so that you can add a new OSD.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

5. Check the status of the ocs-osd-removal pod.
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS    RESTARTS   AGE
ocs-osd-removal-job-8hdzv   1/1     Running   0          88s

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
2022-03-06 18:18:43.824586 I | rookcmd: starting Rook v4.10.0-0.4a36b5f4bbabe54c9dd2671886325a5771191b30 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=1 --force-osd-removal false'
2022-03-06 18:18:43.824742 I | rookcmd: flag values: --force-osd-removal=false, --help=false, --log-level=DEBUG, --operator-image=, --osd-ids=1, --preserve-pvc=false, --service-account=
2022-03-06 18:18:43.824747 I | op-mon: parsing mon endpoints: b=172.30.6.164:6789,c=172.30.75.171:6789,a=172.30.82.155:6789
2022-03-06 18:18:44.881080 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2022-03-06 18:18:44.881281 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2022-03-06 18:18:44.881395 D | cephclient: config file @ /etc/ceph/ceph.conf:
[global]
fsid                        = a983e53d-33d9-4c09-8d71-842aa48e2219
mon initial members         = a b c
mon host                    = [v2:172.30.82.155:3300,v1:172.30.82.155:6789],[v2:172.30.6.164:3300,v1:172.30.6.164:6789],[v2:172.30.75.171:3300,v1:172.30.75.171:6789]
bdev_flock_retry            = 20
mon_osd_full_ratio          = .85
mon_osd_backfillfull_ratio  = .8
mon_osd_nearfull_ratio      = .75
mon_max_pg_per_osd          = 600
mon_pg_warn_max_object_skew = 0
mon_data_avail_warn         = 15

[osd]
osd_memory_target_cgroup_limit_ratio = 0.8

[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring

2022-03-06 18:18:44.881443 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-03-06 18:18:45.258092 I | cephosd: validating status of osd.1
2022-03-06 18:18:45.258151 I | cephosd: osd.1 is marked 'DOWN'
2022-03-06 18:18:45.258195 D | exec: Running command: ceph osd safe-to-destroy 1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-03-06 18:18:45.639791 W | cephosd: osd.1 is NOT be ok to destroy, retrying in 1m until success
2022-03-06 18:19:45.640689 D | exec: Running command: ceph osd safe-to-destroy 1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-03-06 18:19:45.995720 W | cephosd: osd.1 is NOT be ok to destroy, retrying in 1m until success
2022-03-06 18:20:45.995993 D | exec: Running command: ceph osd safe-to-destroy 1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-03-06 18:20:46.381228 W | cephosd: osd.1 is NOT be ok to destroy, retrying in 1m until success

6. Delete the job:
$ oc delete job ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted

7. Run the OSD removal job with FORCE_OSD_REMOVAL:
$ oc process -n openshift-storage ocs-osd-removal -p FORCE_OSD_REMOVAL=true -p FAILED_OSD_IDS=${osd_id_to_remove} |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

8. Check the status of the ocs-osd-removal pod with FORCE_OSD_REMOVAL:
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job-gpr9x   0/1     Completed   0          37s

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
2022-03-06 18:23:59.509428 I | rookcmd: starting Rook v4.10.0-0.4a36b5f4bbabe54c9dd2671886325a5771191b30 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=1 --force-osd-removal true'
2022-03-06 18:23:59.509543 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=DEBUG, --operator-image=, --osd-ids=1, --preserve-pvc=false, --service-account=
2022-03-06 18:23:59.509546 I | op-mon: parsing mon endpoints: b=172.30.6.164:6789,c=172.30.75.171:6789,a=172.30.82.155:6789
2022-03-06 18:24:00.531096 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2022-03-06 18:24:00.531303 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2022-03-06 18:24:00.531426 D | cephclient: config file @ /etc/ceph/ceph.conf:
[global]
fsid                        = a983e53d-33d9-4c09-8d71-842aa48e2219
mon initial members         = b c a
mon host                    = [v2:172.30.6.164:3300,v1:172.30.6.164:6789],[v2:172.30.75.171:3300,v1:172.30.75.171:6789],[v2:172.30.82.155:3300,v1:172.30.82.155:6789]
bdev_flock_retry            = 20
mon_osd_full_ratio          = .85
mon_osd_backfillfull_ratio  = .8
mon_osd_nearfull_ratio      = .75
mon_max_pg_per_osd          = 600
mon_pg_warn_max_object_skew = 0
mon_data_avail_warn         = 15

[osd]
osd_memory_target_cgroup_limit_ratio = 0.8

[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring

2022-03-06 18:24:00.531455 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-03-06 18:24:00.997096 I | cephosd: validating status of osd.1
2022-03-06 18:24:00.997128 I | cephosd: osd.1 is marked 'DOWN'
2022-03-06 18:24:00.997142 D | exec: Running command: ceph osd safe-to-destroy 1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-03-06 18:24:01.341574 I | cephosd: osd.1 is NOT be ok to destroy but force removal is enabled so proceeding with removal
2022-03-06 18:24:01.341629 D | exec: Running command: ceph osd find 1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-03-06 18:24:01.714774 I | cephosd: marking osd.1 out
2022-03-06 18:24:01.714819 D | exec: Running command: ceph osd out osd.1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-03-06 18:24:02.713567 I | cephosd: removing the OSD deployment "rook-ceph-osd-1"
2022-03-06 18:24:02.713605 D | op-k8sutil: removing rook-ceph-osd-1 deployment if it exists
2022-03-06 18:24:02.713612 I | op-k8sutil: removing deployment rook-ceph-osd-1 if it exists
2022-03-06 18:24:02.731191 I | op-k8sutil: Removed deployment rook-ceph-osd-1
2022-03-06 18:24:02.737131 I | op-k8sutil: "rook-ceph-osd-1" still found. waiting...
2022-03-06 18:24:04.747394 I | op-k8sutil: confirmed rook-ceph-osd-1 does not exist
2022-03-06 18:24:04.759802 I | cephosd: removing the osd prepare job "rook-ceph-osd-prepare-ocs-deviceset-2-data-07z25p"
2022-03-06 18:24:04.789368 I | cephosd: removing the OSD PVC "ocs-deviceset-2-data-07z25p"
2022-03-06 18:24:04.797281 I | cephosd: purging osd.1
2022-03-06 18:24:04.797322 D | exec: Running command: ceph osd purge osd.1 --force --yes-i-really-mean-it --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-03-06 18:24:05.222582 I | cephosd: attempting to remove host '\x01' from crush map if not in use
2022-03-06 18:24:05.222653 D | exec: Running command: ceph osd crush rm ocs-deviceset-2-data-07z25p --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-03-06 18:24:06.227372 D | exec: Running command: ceph crash ls --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-03-06 18:24:06.612621 I | cephosd: no ceph crash to silence
2022-03-06 18:24:06.612655 I | cephosd: completed removal of OSD 1
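Compared with the run in comment 0, the job now purges the OSD instead of only destroying it, which is why no "destroyed" entry is left behind:

comment 0: ceph osd destroy osd.2 --yes-i-really-mean-it ...
comment 4: ceph osd purge osd.1 --force --yes-i-really-mean-it ...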

9. Check OSD pod status:
$ oc get -n openshift-storage pods -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-6d87d76d7-p575f    2/2     Running   0          5h5m
rook-ceph-osd-1-67d9dcddf9-sw7qv   2/2     Running   0          117s
rook-ceph-osd-2-65c497cfc8-4sxnr   2/2     Running   0          5h5m
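The replacement OSD is backed by a new PVC; this can be cross-checked, and any old PV left in the Released state cleaned up afterwards (a sketch, using the deviceset naming from the logs above):

$ oc get -n openshift-storage pvc | grep ocs-deviceset
$ oc get pv | grep Released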

10. Check Ceph status:
sh-4.4$ ceph status
  cluster:
    id:     a983e53d-33d9-4c09-8d71-842aa48e2219
    health: HEALTH_WARN
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,b,c (age 5h)
    mgr: a(active, since 5h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 4m), 3 in (since 5m)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 1.46k objects, 3.5 GiB
    usage:   11 GiB used, 289 GiB / 300 GiB avail
    pgs:     177 active+clean
 
  io:
    client:   1.7 KiB/s rd, 12 KiB/s wr, 2 op/s rd, 1 op/s wr
 
sh-4.4$ ceph crash ls 
ID                                                                ENTITY  NEW  
2022-03-06T18:15:26.619459Z_f294c3ce-f5a0-48ed-9a7a-be7058ddfb01  osd.1    *  
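Before archiving, the crash entry can be inspected to confirm it belongs to the replaced OSD (a sketch, using the ID reported by ceph crash ls above):

sh-4.4$ ceph crash info 2022-03-06T18:15:26.619459Z_f294c3ce-f5a0-48ed-9a7a-be7058ddfb01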

Delete the crash item:
sh-4.4$ ceph crash archive-all

Check Ceph health:
sh-4.4$ ceph health   
HEALTH_OK

11. Check the Ceph OSD tree:
sh-4.4$ ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                                 STATUS  REWEIGHT  PRI-AFF
 -1         0.29306  root default                                                       
 -8         0.09769      rack rack0                                                     
 -7         0.09769          host ocs-deviceset-2-data-0bzqhb                           
  1    hdd  0.09769              osd.1                             up   1.00000  1.00000
-12         0.09769      rack rack1                                                     
-11         0.09769          host ocs-deviceset-1-data-075jmc                           
  2    hdd  0.09769              osd.2                             up   1.00000  1.00000
 -4         0.09769      rack rack2                                                     
 -3         0.09769          host ocs-deviceset-0-data-0dtsbc                           
  0    hdd  0.09769              osd.0                             up   1.00000  1.00000
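As a final check (a sketch), confirm that no OSD entries remain in the destroyed state:

sh-4.4$ ceph osd tree | grep -c destroyed
0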