Bug 2128966 - In the ocs-osd-removal job, osd should be marked out before checking osd is safe-to-destroy
Summary: In the ocs-osd-removal job, osd should be marked out before checking osd is safe-to-destroy
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Subham Rai
QA Contact: Rachael
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-09-22 06:39 UTC by Rachael
Modified: 2023-08-09 17:03 UTC
CC List: 6 users

Fixed In Version: 4.12.0-100
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-02-08 14:06:28 UTC
Embargoed:




Links
System   ID                               Status   Summary                                                             Last Updated
Github   red-hat-storage/rook pull 425    open     Bug 2128966: osd: osdSafeToDestroy check should be after osd out   2022-10-24 17:26:45 UTC
Github   rook/rook pull 11138             open     osd: osdSafeToDestroy check should be after osd out                2022-10-11 10:56:30 UTC

Description Rachael 2022-09-22 06:39:03 UTC
Description of problem (please be as detailed as possible and provide log snippets):

In the current implementation of the ocs-osd-removal job, the 'osd safe-to-destroy' check is made before the OSD is marked OUT. This prevents Ceph from rebalancing the data from the failed/down OSD to the other OSDs (where applicable).

As a result, the ocs-osd-removal job has to be run with FORCE_OSD_REMOVAL=true even when the cluster has sufficient space to restore the data after the OSD is removed. A sketch of the intended order of operations follows the log excerpt below.

sh-4.4$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP  META     AVAIL    %USE  VAR   PGS  STATUS
 2    hdd  0.50000   1.00000  512 GiB  7.0 GiB  6.8 GiB   0 B  202 MiB  505 GiB  1.36  0.89    0    down
 4    hdd  0.50000   1.00000  512 GiB  8.8 GiB  8.7 GiB   0 B   45 MiB  503 GiB  1.71  1.12  161      up
 5    hdd  0.50000   1.00000  512 GiB  7.1 GiB  7.1 GiB   0 B   48 MiB  505 GiB  1.39  0.91  175      up
 1    hdd  0.50000   1.00000  512 GiB  8.6 GiB  8.4 GiB   0 B  191 MiB  503 GiB  1.68  1.10  178      up
 3    hdd  0.50000   1.00000  512 GiB  8.0 GiB  7.9 GiB   0 B   40 MiB  504 GiB  1.55  1.02  180      up
 0    hdd  0.50000   1.00000  512 GiB  7.6 GiB  7.6 GiB   0 B   69 MiB  504 GiB  1.49  0.97  174      up
                       TOTAL  3.0 TiB   47 GiB   46 GiB   0 B  595 MiB  3.0 TiB  1.53                   
MIN/MAX VAR: 0.89/1.12  STDDEV: 0.13

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=2 FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

$ oc logs ocs-osd-removal-job-x7bcx -f
2022-09-22 06:12:22.688551 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:12:23.109739 I | cephosd: validating status of osd.2
2022-09-22 06:12:23.109764 I | cephosd: osd.2 is marked 'DOWN'
2022-09-22 06:12:23.109778 D | exec: Running command: ceph osd safe-to-destroy 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:12:23.522885 W | cephosd: osd.2 is NOT be ok to destroy, retrying in 1m until success
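
For context, the order of operations the job needs (and that the linked PRs switch to) can be sketched with plain ceph CLI commands from the toolbox. osd.2 is simply the OSD from this report, and the retry loop only mimics the job's polling behaviour; this is an illustration, not the actual Rook code:

sh-4.4$ ceph osd out osd.2                                    # mark the failed OSD out first, so Ceph starts backfilling its PGs to the remaining OSDs
sh-4.4$ until ceph osd safe-to-destroy 2; do sleep 60; done   # only then poll safe-to-destroy; it succeeds once backfill has drained the OSD
sh-4.4$ ceph osd purge 2 --yes-i-really-mean-it               # finally remove the OSD from the cluster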


Version of all relevant components (if applicable):
---------------------------------------------------
ODF: odf-operator.v4.12.0    full_version=4.12.0-59
OCP: 4.12.0-0.nightly-2022-09-20-095559


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
Yes, manually marking the OSD out from the ceph toolbox:

sh-4.4$ ceph osd out 2
marked out osd.2. 
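
When using this workaround, the resulting backfill can be watched from the toolbox before re-checking the removal job, for example:

sh-4.4$ ceph status    # shows recovery/backfill progress
sh-4.4$ ceph osd df    # the PG count of osd.2 should drop to 0 as data moves off it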


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
No


Steps to Reproduce:
-------------------
1. Ensure that the cluster has sufficient free space and enough OSDs (more than 3) to restore the data from the failed OSD. Add capacity if required.
2. Perform device replacement as documented here with FORCE_OSD_REMOVAL=false: 
   https://dxp-docp-prod.apps.ext-waf.spoke.prod.us-west-2.aws.paas.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.11/html-single/replacing_devices/index#replacing-operational-or-failed-storage-devices-on-vmware-infrastructure_rhodf
3. Check the logs of the ocs-osd-removal-job pod (example commands below).
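
   A minimal way to follow the job without looking up the generated pod name, assuming the default job name created by the template (ocs-osd-removal-job, as seen above):

   $ oc logs -n openshift-storage -l job-name=ocs-osd-removal-job -f
   $ oc get job -n openshift-storage ocs-osd-removal-job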



Actual results:
---------------
The job waits indefinitely for the OSD to become safe-to-destroy, because the down OSD is never marked out and its data is never rebalanced.


Expected results:
-----------------
If there is sufficient space to restore the data, the ocs-osd-removal job should complete without enabling FORCE_OSD_REMOVAL.



Additional info:
----------------

Once the OSD is marked out manually, the ocs-osd-removal job eventually succeeds after the data has been restored to available OSDs.

2022-09-22 06:16:25.200261 W | cephosd: osd.2 is NOT be ok to destroy, retrying in 1m until success
2022-09-22 06:17:25.202342 D | exec: Running command: ceph osd safe-to-destroy 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:17:25.642281 W | cephosd: osd.2 is NOT be ok to destroy, retrying in 1m until success
2022-09-22 06:18:25.642476 D | exec: Running command: ceph osd safe-to-destroy 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:18:26.048276 I | cephosd: osd.2 is safe to destroy, proceeding
2022-09-22 06:18:26.048334 D | exec: Running command: ceph osd find 2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:18:26.482261 I | cephosd: marking osd.2 out
2022-09-22 06:18:26.482303 D | exec: Running command: ceph osd out osd.2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-09-22 06:18:26.919903 I | cephosd: removing the OSD deployment "rook-ceph-osd-2"
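
For completeness, once the job finishes, the removal can be confirmed from the toolbox and the job status checked, for example:

sh-4.4$ ceph osd tree                                    # osd.2 should no longer be listed
$ oc get job -n openshift-storage ocs-osd-removal-job    # COMPLETIONS should show 1/1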

