Bug 2027826

Summary: OSD Removal template needs to expose option to force remove the OSD
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Travis Nielsen <tnielsen>
Component: ocs-operator
Assignee: Subham Rai <srai>
Status: CLOSED ERRATA
QA Contact: Itzhak <ikave>
Severity: high
Priority: unspecified
Version: 4.8
CC: jarrpa, madam, mmuench, muagarwa, nberry, ocs-bugs, odf-bz-bot, sostapov, srai
Target Milestone: ---
Keywords: AutomationBackLog
Target Release: ODF 4.10.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.10.0-163
Doc Type: No Doc Update
Last Closed: 2022-04-13 18:50:37 UTC
Bug Depends On: 2027396    
Bug Blocks: 2054518, 2120260    

Description Travis Nielsen 2021-11-30 19:31:37 UTC
This bug was initially created as a copy of Bug #2026007

I am copying this bug because: 

An OCS operator update is needed to expose an option to force removal of an OSD if Ceph indicates the OSD is not safe-to-destroy.

If https://bugzilla.redhat.com/show_bug.cgi?id=2027396 is approved for 4.9.z, we will also need this for 4.9.z.


Description of problem (please be as detailed as possible and provide log
snippets):
Use the ceph 'osd safe-to-destroy' and 'osd ok-to-stop' checks in the OSD purge job.

[1] mgr: implement 'osd safe-to-destroy' and 'osd ok-to-stop' commands
    https://github.com/ceph/ceph/pull/16976

An OSD is safe to destroy if:
- we have osd_stat for it
- osd_stat indicates no PGs are stored on it
- all PGs are known
- no PGs map to it
i.e., overall data durability will not be affected.

An OSD is ok to stop if:
- we have the PG stats we need
- no PGs will drop below min_size
i.e., availability won't be immediately compromised.
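
For reference, both checks can be run manually before purging an OSD. A minimal sketch, assuming the ceph toolbox deployment is named rook-ceph-tools (adjust to your environment):

# availability check: would stopping this OSD drop any PG below min_size?
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd ok-to-stop 0

# durability check: are the OSD's PGs drained, known, and mapped elsewhere?
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd safe-to-destroy 0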

Comment 1 Travis Nielsen 2021-11-30 19:34:40 UTC
The OSD removal job now has the following option as seen in https://github.com/red-hat-storage/rook/pull/313:

--force-osd-removal [true | false]

The OSD removal job template created by the OCS operator needs to expose a variable for this option. Otherwise, the OSD cannot be removed in small clusters where the PGs are never reported safe-to-destroy because no other OSD is available to backfill them to.

This can be moved to 4.10 if https://bugzilla.redhat.com/show_bug.cgi?id=2027396 is not approved for 4.9.z.
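
For illustration, once the template exposes the variable it would be consumed through oc process in the same way as the existing FAILED_OSD_IDS parameter. A usage sketch (parameter names FAILED_OSD_IDS and FORCE_OSD_REMOVAL as verified later in this bug; <failed-osd-id> is a placeholder):

# List the parameters the ocs-osd-removal template exposes
oc -n openshift-storage process ocs-osd-removal --parameters

# Render the template with force removal enabled and create the job
oc -n openshift-storage process ocs-osd-removal \
    -p FAILED_OSD_IDS=<failed-osd-id> -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -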

Comment 2 Jose A. Rivera 2022-02-07 16:38:01 UTC
Since this hasn't been looked at or even prioritized, moving this out to ODF 4.11. The change itself should be fairly trivial. Same for the backport, if approved.

Comment 4 Travis Nielsen 2022-02-07 17:30:40 UTC
Is it just a matter of finding someone to fix it? If so, I'll get someone from Rook to do it in 4.10. The osd removal job will really not be usable in some scenarios without this option exposed in the template.

Comment 5 Mudit Agarwal 2022-02-09 03:56:14 UTC
Bringing it back to 4.10 and adding the devel ack.
Travis, is this really a feature? It looks like a bug fix to me.

PS: RFEs are not allowed at this point in 4.10

Comment 6 Jose A. Rivera 2022-02-09 16:16:04 UTC
I'm removing the RFE tag, as this is indeed a proper bug fix.

Comment 7 Travis Nielsen 2022-02-09 23:21:46 UTC
Agreed, a bug fix rather than RFE, thanks.

Comment 13 Itzhak 2022-03-13 13:22:17 UTC
Should I add the 'osd safe-to-destroy' and 'osd ok-to-stop' parameters to the osd removal job?
Please provide more details about the exact steps needed to test it.

Comment 14 Subham Rai 2022-03-14 09:55:03 UTC
I think you first need to check whether the OSD is safe to destroy, and then pass the flag accordingly in the oc process command.
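
For illustration only, a hypothetical snippet of what "accordingly" could look like, deriving the flag value from Ceph's own verdict (it assumes a rook-ceph-tools toolbox deployment; this is not the verified procedure, see comment 16 for that):

# Decide FORCE_OSD_REMOVAL from the safe-to-destroy check (non-zero exit means not safe)
osd_id=0
if oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd safe-to-destroy ${osd_id}; then
  force=false   # OSD is already drained; a normal removal is enough
else
  force=true    # e.g. EBUSY/EAGAIN when PGs cannot be backfilled to another OSD
fi
oc -n openshift-storage process ocs-osd-removal \
    -p FAILED_OSD_IDS=${osd_id} -p FORCE_OSD_REMOVAL=${force} | oc create -n openshift-storage -f -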

Comment 15 Travis Nielsen 2022-03-14 14:43:12 UTC
Please see these instructions for testing: https://docs.google.com/document/d/1WHxEdmwTn1EmrNujjOzBGp6R-IHbwcdsYPkG0nrIW5o/edit

Comment 16 Itzhak 2022-03-14 16:33:36 UTC
I tested it with vSphere 4.10 dynamic clusters.

Steps:

1. I checked the output of the commands 'ceph osd safe-to-destroy' and 'ceph osd ok-to-stop' on osd 0:

sh-4.4$ ceph osd safe-to-destroy 0
Error EBUSY: OSD(s) 0 have 177 pgs currently mapped to them. 
sh-4.4$ ceph osd ok-to-stop 0
{"ok_to_stop":true,"osds":[0],"num_ok_pgs":177,"num_not_ok_pgs":0,"ok_become_degraded":["1.0","1.1","1.2","1.3","1.4","1.5","1.6","1.7","1.8","1.9","1.a","1.b","1.c","1.d","1.e","1.f","1.10","1.11","1.12","1.13","1.14","1.15","1.16","1.17","1.18","1.19","1.1a","1.1b","1.1c","1.1d","1.1e","1.1f","2.0","2.1","2.2","2.3","2.4","2.5","2.6","2.7","3.0","3.1","3.2","3.3","3.4","3.5","3.6","3.7","4.0","4.1","4.2","4.3","4.4","4.5","4.6","4.7","5.0","5.1","5.2","5.3","5.4","5.5","5.6","5.7","6.0","6.1","6.2","6.3","6.4","6.5","6.6","6.7","7.0","7.1","7.2","7.3","7.4","7.5","7.6","7.7","8.0","9.0","9.1","9.2","9.3","9.4","9.5","9.6","9.7","9.8","9.9","9.a","9.b","9.c","9.d","9.e","9.f","9.10","9.11","9.12","9.13","9.14","9.15","9.16","9.17","9.18","9.19","9.1a","9.1b","9.1c","9.1d","9.1e","9.1f","10.0","10.1","10.2","10.3","10.4","10.5","10.6","10.7","10.8","10.9","10.a","10.b","10.c","10.d","10.e","10.f","10.10","10.11","10.12","10.13","10.14","10.15","10.16","10.17","10.18","10.19","10.1a","10.1b","10.1c","10.1d","10.1e","10.1f","11.0","11.1","11.2","11.3","11.4","11.5","11.6","11.7","11.8","11.9","11.a","11.b","11.c","11.d","11.e","11.f","11.10","11.11","11.12","11.13","11.14","11.15","11.16","11.17","11.18","11.19","11.1a","11.1b","11.1c","11.1d","11.1e","11.1f"]}
sh-4.4$ 


2. Delete the disk on the vSphere side that is associated with osd-0.
The osd-0 pod went into "CrashLoopBackOff" state:
$ oc get pods | grep osd
rook-ceph-osd-0-6b7957b4bd-5fh4t                                  1/2     CrashLoopBackOff   1 (13s ago)   3h17m
rook-ceph-osd-1-68f5f9f9bc-fp7mm                                  2/2     Running            0             3h17m
rook-ceph-osd-2-78fcfb4748-drknv                                  2/2     Running            0             3h17m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0jsqcp-nhs7w           0/1     Completed          0             3h17m
rook-ceph-osd-prepare-ocs-deviceset-1-data-045v6j-tlr8l           0/1     Completed          0             3h17m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0fwmcx-8dp8b           0/1     Completed          0             3h17m

3. Scale down the osd-0 deployment to zero replicas:
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-0 scaled

4. Delete the osd-0 pod:
$ oc delete pod rook-ceph-osd-0-6b7957b4bd-5fh4t --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-6b7957b4bd-5fh4t" force deleted

5. Check the output of the commands 'ceph osd safe-to-destroy' and 'ceph osd ok-to-stop' on osd 0:
sh-4.4$ ceph osd ok-to-stop 0
{"ok_to_stop":true,"osds":[0],"num_ok_pgs":0,"num_not_ok_pgs":0}
sh-4.4$ ceph osd safe-to-destroy 0
Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.

6. Run the osd removal job without the "FORCE_OSD_REMOVAL" param:
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

7. Check the osd removal job and see that it is stuck in the Running state:
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS    RESTARTS   AGE
ocs-osd-removal-job-x8bh6   1/1     Running   0          12m

When looking at the osd removal job logs, I saw these messages repeatedly:
2022-03-14 15:20:06.805613 W | cephosd: osd.0 is NOT be ok to destroy, retrying in 1m until success
2022-03-14 15:21:06.806068 D | exec: Running command: ceph osd safe-to-destroy 0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json

8. Delete the current osd removal job
$ oc delete jobs ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted

9. Create a new one instead with the "FORCE_OSD_REMOVAL" param:
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 -p FORCE_OSD_REMOVAL=true |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

10. Check the osd removal job and see that it completed successfully:
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job-9lq2k   0/1     Completed   0          72s

I also checked the osd removal job logs and saw these messages:
2022-03-14 15:35:34.459609 I | cephosd: validating status of osd.0
2022-03-14 15:35:34.459626 I | cephosd: osd.0 is marked 'DOWN'
2022-03-14 15:35:34.459645 D | exec: Running command: ceph osd safe-to-destroy 0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-03-14 15:35:34.735903 I | cephosd: osd.0 is NOT be ok to destroy but force removal is enabled so proceeding with removal


11. Delete the osd removal job: 
$ oc delete job ocs-osd-removal-job 
job.batch "ocs-osd-removal-job" deleted

12. Check the osd pods and see the new osd-0 pod running:
$ oc get pods | grep osd
rook-ceph-osd-0-585b9b6f6d-6plp2                                  2/2     Running     0          5m13s
rook-ceph-osd-1-68f5f9f9bc-fp7mm                                  2/2     Running     0          4h8m
rook-ceph-osd-2-78fcfb4748-drknv                                  2/2     Running     0          4h8m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0xv8x8-5fskd           0/1     Completed   0          5m46s
rook-ceph-osd-prepare-ocs-deviceset-1-data-045v6j-tlr8l           0/1     Completed   0          4h8m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0fwmcx-8dp8b           0/1     Completed   0          4h8m

13. Check the PVC state:
$ oc get pvc | grep ocs-deviceset
ocs-deviceset-0-data-0xv8x8   Bound    pvc-18f210c1-dc33-4912-b8c9-7056883dc105   256Gi      RWO            thin                          6m20s
ocs-deviceset-1-data-045v6j   Bound    pvc-a17d78f3-812d-494b-9e57-e20b0e62a15d   256Gi      RWO            thin                          4h9m
ocs-deviceset-2-data-0fwmcx   Bound    pvc-6734a2f5-5981-462b-b003-952cdfa8e324   256Gi      RWO            thin                          4h9m

14. Silence the osd crash warning:
ceph crash archive 2022-03-14T14:50:18.260644Z_dbd5458b-fea0-47b1-94eb-4d3335bb7913

15. Verify that Ceph health is OK:
sh-4.4$ ceph health      
HEALTH_OK

Comment 17 Itzhak 2022-03-14 16:35:12 UTC
Additional info:

OCP version:
Client Version: 4.10.0-0.nightly-2022-03-05-023708
Server Version: 4.10.0-0.nightly-2022-03-13-040322
Kubernetes Version: v1.23.3+e419edf

OCS version:
ocs-operator.v4.10.0              OpenShift Container Storage   4.10.0               Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-03-13-040322   True        False         4h20m   Cluster version is 4.10.0-0.nightly-2022-03-13-040322

Rook version
rook: v4.10.0-0.2285b5b9c4a9993456f0b78b7b23a7399ca98731
go: go1.16.12

Ceph version
ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)


Link to Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10780/

Comment 18 Itzhak 2022-03-14 16:57:09 UTC
I think that, based on the results above, we can move this bug to Verified.
Please let me know what you think.

Comment 19 Travis Nielsen 2022-03-14 17:01:43 UTC
Yes, all the steps above look as expected; sounds good to move it to Verified.

Comment 22 Itzhak 2022-03-15 10:00:15 UTC
According to the conclusions above, I am moving the bug to Verified.

Comment 24 errata-xmlrpc 2022-04-13 18:50:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372