Bug 2128966

Summary: In the ocs-osd-removal job, osd should be marked out before checking osd is safe-to-destroy
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Rachael <rgeorge>
Component: rook Assignee: Subham Rai <srai>
Status: CLOSED CURRENTRELEASE QA Contact: Rachael <rgeorge>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.12 CC: mmuench, muagarwa, ocs-bugs, odf-bz-bot, srai, tnielsen
Target Milestone: ---   
Target Release: ODF 4.12.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.12.0-100 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-02-08 14:06:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Rachael 2022-09-22 06:39:03 UTC
Description of problem (please be detailed as possible and provide log
snippets):

In the current implementation of the ocs-osd-removal job, the osd safe-to-destroy check is made before the OSD is marked OUT. This prevents Ceph from rebalancing the data from the failed/down OSD to the remaining OSDs (where applicable).

As a result, the ocs-osd-removal job has to be run with FORCE_OSD_REMOVAL=true even when the cluster has sufficient space to restore the data after the OSD is removed.
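
In other words, the expected order of operations is roughly the following (illustrated here with the Ceph CLI and osd.2 as an example; the job itself drives these calls through Rook):

sh-4.4$ ceph osd out 2              # mark the OSD out first so Ceph starts backfilling its PGs elsewhere
sh-4.4$ ceph osd safe-to-destroy 2  # then poll until the OSD no longer holds PGs and reports safe to destroy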

sh-4.4$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP  META     AVAIL    %USE  VAR   PGS  STATUS
 2    hdd  0.50000   1.00000  512 GiB  7.0 GiB  6.8 GiB   0 B  202 MiB  505 GiB  1.36  0.89    0    down
 4    hdd  0.50000   1.00000  512 GiB  8.8 GiB  8.7 GiB   0 B   45 MiB  503 GiB  1.71  1.12  161      up
 5    hdd  0.50000   1.00000  512 GiB  7.1 GiB  7.1 GiB   0 B   48 MiB  505 GiB  1.39  0.91  175      up
 1    hdd  0.50000   1.00000  512 GiB  8.6 GiB  8.4 GiB   0 B  191 MiB  503 GiB  1.68  1.10  178      up
 3    hdd  0.50000   1.00000  512 GiB  8.0 GiB  7.9 GiB   0 B   40 MiB  504 GiB  1.55  1.02  180      up
 0    hdd  0.50000   1.00000  512 GiB  7.6 GiB  7.6 GiB   0 B   69 MiB  504 GiB  1.49  0.97  174      up
                       TOTAL  3.0 TiB   47 GiB   46 GiB   0 B  595 MiB  3.0 TiB  1.53                   
MIN/MAX VAR: 0.89/1.12  STDDEV: 0.13

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=2 FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

$ oc logs ocs-osd-removal-job-x7bcx -f
2022-09-22 06:12:22.688551 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:12:23.109739 I | cephosd: validating status of osd.2
2022-09-22 06:12:23.109764 I | cephosd: osd.2 is marked 'DOWN'
2022-09-22 06:12:23.109778 D | exec: Running command: ceph osd safe-to-destroy 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:12:23.522885 W | cephosd: osd.2 is NOT be ok to destroy, retrying in 1m until success


Version of all relevant components (if applicable):
---------------------------------------------------
ODF: odf-operator.v4.12.0    full_version=4.12.0-59
OCP: 4.12.0-0.nightly-2022-09-20-095559


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
Yes, manually marking the OSD out from the ceph toolbox.

sh-4.4$ ceph osd out 2
marked out osd.2. 
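
The same workaround can also be applied without opening an interactive toolbox shell, assuming the usual rook-ceph-tools deployment name (adjust the OSD ID as needed):

$ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph osd out 2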


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
No


Steps to Reproduce:
-------------------
1. Ensure that the cluster has sufficient free space and enough OSDs (more than 3) to restore the data from the failed OSD. Add capacity if required.
2. Perform device replacement as documented here with FORCE_OSD_REMOVAL=false: 
   https://dxp-docp-prod.apps.ext-waf.spoke.prod.us-west-2.aws.paas.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.11/html-single/replacing_devices/index#replacing-operational-or-failed-storage-devices-on-vmware-infrastructure_rhodf
3. Check the logs of the ocs-osd-removal-job pod (consolidated commands shown after this list).
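
For reference, the commands used above for steps 2 and 3 (the job pod name varies per run, so a label selector is one way to follow the logs):

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=2 FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
$ oc logs -n openshift-storage -l job-name=ocs-osd-removal-job -f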



Actual results:
---------------
The ocs-osd-removal job keeps waiting for the OSD to become safe-to-destroy and does not complete.


Expected results:
-----------------
If there is sufficient space to restore the data, the ocs-osd-removal job should complete without enabling FORCE_OSD_REMOVAL.



Additional info:
----------------

Once the OSD is marked out manually, the ocs-osd-removal job eventually succeeds after the data has been restored to available OSDs.

2022-09-22 06:16:25.200261 W | cephosd: osd.2 is NOT be ok to destroy, retrying in 1m until success
2022-09-22 06:17:25.202342 D | exec: Running command: ceph osd safe-to-destroy 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:17:25.642281 W | cephosd: osd.2 is NOT be ok to destroy, retrying in 1m until success
2022-09-22 06:18:25.642476 D | exec: Running command: ceph osd safe-to-destroy 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:18:26.048276 I | cephosd: osd.2 is safe to destroy, proceeding
2022-09-22 06:18:26.048334 D | exec: Running command: ceph osd find 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:18:26.482261 I | cephosd: marking osd.2 out
2022-09-22 06:18:26.482303 D | exec: Running command: ceph osd out osd.2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:18:26.919903 I | cephosd: removing the OSD deployment "rook-ceph-osd-2"
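
For completeness, once the rebalance finishes and the removal proceeds, the job result can be confirmed with standard commands, e.g.:

$ oc get job -n openshift-storage ocs-osd-removal-job
$ oc logs -n openshift-storage -l job-name=ocs-osd-removal-job --tail=5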