Bug 2128966 - In the ocs-osd-removal job, osd should be marked out before checking osd is safe-to-destroy
Summary: In the ocs-osd-removal job, osd should be marked out before checking osd is safe-to-destroy
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Subham Rai
QA Contact: Rachael
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-09-22 06:39 UTC by Rachael
Modified: 2023-08-09 17:03 UTC
CC List: 6 users

Fixed In Version: 4.12.0-100
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-02-08 14:06:28 UTC
Embargoed:




Links
System   ID                               Status   Summary                                                             Last Updated
Github   red-hat-storage/rook pull 425    open     Bug 2128966: osd: osdSafeToDestroy check should be after osd out   2022-10-24 17:26:45 UTC
Github   rook/rook pull 11138             open     osd: osdSafeToDestroy check should be after osd out                2022-10-11 10:56:30 UTC

Description Rachael 2022-09-22 06:39:03 UTC
Description of problem (please be as detailed as possible and provide log snippets):

In the current implementation of the ocs-osd-removal job, the 'osd safe-to-destroy' check is made before the OSD is marked OUT. This prevents Ceph from rebalancing the data from the failed/down OSD to the other OSDs (where applicable).

As a result, the ocs-osd-removal job has to be run with FORCE_OSD_REMOVAL=true even when the cluster has sufficient space to restore the data after the OSD is removed. A sketch of the intended order of operations follows the log excerpt below.

sh-4.4$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP  META     AVAIL    %USE  VAR   PGS  STATUS
 2    hdd  0.50000   1.00000  512 GiB  7.0 GiB  6.8 GiB   0 B  202 MiB  505 GiB  1.36  0.89    0    down
 4    hdd  0.50000   1.00000  512 GiB  8.8 GiB  8.7 GiB   0 B   45 MiB  503 GiB  1.71  1.12  161      up
 5    hdd  0.50000   1.00000  512 GiB  7.1 GiB  7.1 GiB   0 B   48 MiB  505 GiB  1.39  0.91  175      up
 1    hdd  0.50000   1.00000  512 GiB  8.6 GiB  8.4 GiB   0 B  191 MiB  503 GiB  1.68  1.10  178      up
 3    hdd  0.50000   1.00000  512 GiB  8.0 GiB  7.9 GiB   0 B   40 MiB  504 GiB  1.55  1.02  180      up
 0    hdd  0.50000   1.00000  512 GiB  7.6 GiB  7.6 GiB   0 B   69 MiB  504 GiB  1.49  0.97  174      up
                       TOTAL  3.0 TiB   47 GiB   46 GiB   0 B  595 MiB  3.0 TiB  1.53                   
MIN/MAX VAR: 0.89/1.12  STDDEV: 0.13

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=2 FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

$ oc logs ocs-osd-removal-job-x7bcx -f
2022-09-22 06:12:22.688551 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:12:23.109739 I | cephosd: validating status of osd.2
2022-09-22 06:12:23.109764 I | cephosd: osd.2 is marked 'DOWN'
2022-09-22 06:12:23.109778 D | exec: Running command: ceph osd safe-to-destroy 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:12:23.522885 W | cephosd: osd.2 is NOT be ok to destroy, retrying in 1m until success
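
For context, the order of operations the job needs (and that the linked PRs switch to) can be sketched with plain ceph CLI commands from the toolbox. osd.2 is simply the OSD from this report, and the retry loop only mimics the job's polling behaviour; this is an illustration, not the actual Rook code:

sh-4.4$ ceph osd out osd.2                                    # mark the failed OSD out first, so Ceph starts backfilling its PGs to the remaining OSDs
sh-4.4$ until ceph osd safe-to-destroy 2; do sleep 60; done   # only then poll safe-to-destroy; it succeeds once backfill has drained the OSD
sh-4.4$ ceph osd purge 2 --yes-i-really-mean-it               # finally remove the OSD from the cluster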


Version of all relevant components (if applicable):
---------------------------------------------------
ODF: odf-operator.v4.12.0    full_version=4.12.0-59
OCP: 4.12.0-0.nightly-2022-09-20-095559


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
Yes, manually marking the OSD out from the ceph toolbox:

sh-4.4$ ceph osd out 2
marked out osd.2. 
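
When using this workaround, the resulting backfill can be watched from the toolbox before re-checking the removal job, for example:

sh-4.4$ ceph status    # shows recovery/backfill progress
sh-4.4$ ceph osd df    # the PG count of osd.2 should drop to 0 as data moves off it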


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
No


Steps to Reproduce:
-------------------
1. Ensure that the cluster has sufficient free space and enough OSDs (more than 3) to restore the data from the failed OSD. Add capacity if required.
2. Perform device replacement as documented here with FORCE_OSD_REMOVAL=false: 
   https://dxp-docp-prod.apps.ext-waf.spoke.prod.us-west-2.aws.paas.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.11/html-single/replacing_devices/index#replacing-operational-or-failed-storage-devices-on-vmware-infrastructure_rhodf
3. Check the logs of the ocs-osd-removal-job pod (example commands below).
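
   A minimal way to follow the job without looking up the generated pod name, assuming the default job name created by the template (ocs-osd-removal-job, as seen above):

   $ oc logs -n openshift-storage -l job-name=ocs-osd-removal-job -f
   $ oc get job -n openshift-storage ocs-osd-removal-job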



Actual results:
---------------
The job waits indefinitely for the OSD to become safe-to-destroy, because the down OSD is never marked out and its data is never rebalanced.


Expected results:
-----------------
If there is sufficient space to restore the data, the ocs-osd-removal job should complete without enabling FORCE_OSD_REMOVAL.



Additional info:
----------------

Once the OSD is marked out manually, the ocs-osd-removal job eventually succeeds after the data has been restored to available OSDs.

2022-09-22 06:16:25.200261 W | cephosd: osd.2 is NOT be ok to destroy, retrying in 1m until success
2022-09-22 06:17:25.202342 D | exec: Running command: ceph osd safe-to-destroy 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:17:25.642281 W | cephosd: osd.2 is NOT be ok to destroy, retrying in 1m until success
2022-09-22 06:18:25.642476 D | exec: Running command: ceph osd safe-to-destroy 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:18:26.048276 I | cephosd: osd.2 is safe to destroy, proceeding
2022-09-22 06:18:26.048334 D | exec: Running command: ceph osd find 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:18:26.482261 I | cephosd: marking osd.2 out
2022-09-22 06:18:26.482303 D | exec: Running command: ceph osd out osd.2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:18:26.919903 I | cephosd: removing the OSD deployment "rook-ceph-osd-2"
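
For completeness, once the job finishes, the removal can be confirmed from the toolbox and the job status checked, for example:

sh-4.4$ ceph osd tree                                    # osd.2 should no longer be listed
$ oc get job -n openshift-storage ocs-osd-removal-job    # COMPLETIONS should show 1/1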

