Bug 2153257 - [IBM Z] - ocs-osd-removal-job not getting completed when trying to replace failed node
Summary: [IBM Z] - ocs-osd-removal-job not getting completed when trying to replace failed node
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.12
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Travis Nielsen
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-12-14 11:06 UTC by Abdul Kandathil (IBM)
Modified: 2023-08-09 17:03 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-10 16:44:43 UTC
Embargoed:



Description Abdul Kandathil (IBM) 2022-12-14 11:06:02 UTC
Description of problem (please be detailed as possible and provide log
snippets):
While testing failed node replacement, the ocs-osd-removal-job never completes.

[root@m4204001 ~]# oc -n openshift-storage get pod ocs-osd-removal-job-7k5kv
NAME                        READY   STATUS    RESTARTS   AGE
ocs-osd-removal-job-7k5kv   1/1     Running   0          141m
[root@m4204001 ~]#
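
For reference, the log snippet below can be pulled from the job pod with something like the following (assuming the default job-name label):
```
$ oc logs -n openshift-storage -l job-name=ocs-osd-removal-job
```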

Logs:

2022-12-14 08:35:06.698509 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-12-14 08:35:07.167520 I | cephosd: validating status of osd.1
2022-12-14 08:35:07.167559 I | cephosd: osd.1 is marked 'DOWN'
2022-12-14 08:35:07.167581 D | exec: Running command: ceph osd find 1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-12-14 08:35:07.953915 I | cephosd: marking osd.1 out
2022-12-14 08:35:07.953971 D | exec: Running command: ceph osd out osd.1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-12-14 08:35:08.689277 D | exec: Running command: ceph osd safe-to-destroy 1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-12-14 08:35:09.214305 W | cephosd: osd.1 is NOT be ok to destroy, retrying in 1m until success
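
To see why the safe-to-destroy check keeps failing, a minimal sketch of checks from the rook-ceph toolbox pod (assuming the rook-ceph-tools toolbox deployment has been enabled) would be:
```
$ oc -n openshift-storage rsh deploy/rook-ceph-tools
sh-4.4$ ceph status                  # overall health and PG states
sh-4.4$ ceph osd tree                # which OSDs are up/down and on which host
sh-4.4$ ceph osd safe-to-destroy 1   # the same check the removal job keeps retrying
```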


Version of all relevant components (if applicable):
ODF 4.12.0-140

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy ODF 4.12 on OCP 4.12
2. Fail one of the ODF nodes
3. Follow the documented steps to replace the failed node


Actual results:
The ocs-osd-removal-job stays in Running status for more than an hour and never completes.


Expected results:
The ocs-osd-removal-job reaches the Completed state.

Additional info:

Must gather logs:

https://drive.google.com/file/d/1GQI6vs7L0pt9D_Na1m8uSM93DQHdMFcE/view?usp=sharing

Comment 2 Travis Nielsen 2022-12-14 21:17:42 UTC
When removing an OSD, Rook queries Ceph to determine whether the OSD is safe to destroy:

2022-12-14 08:35:09.214305 W | cephosd: osd.1 is NOT be ok to destroy, retrying in 1m until success

Since the OSD is not safe to destroy, Rook will wait and continue checking indefinitely until it is safe.
An OSD is safe to destroy when all the PGs have been moved to other OSDs in the cluster such
that the PGs are active+clean. This means all replicas of the data are replicated safely and there
is no risk of data loss.
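
Roughly, the same check can be reproduced by hand from the toolbox pod with a loop like the sketch below (the 1-minute interval mirrors the log message above; osd.1 is the OSD from this report):
```
# Sketch of the check the removal job keeps retrying (run inside the toolbox pod):
while ! ceph osd safe-to-destroy 1; do
  ceph pg stat      # progress: waiting for all PGs to report active+clean
  sleep 60          # the job retries every minute, per the log above
done
echo "osd.1 is now safe to destroy"
```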

If there are no spare OSDs where the PGs can be moved, an OSD will never be safe to destroy.
For example, if there are 3 OSDs in the cluster and pools with replica 3, a lost OSD
will never be safe to destroy.
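
A quick way to confirm whether that is the situation here (a sketch, run from the toolbox pod; pool names and sizes are whatever the cluster actually has):
```
sh-4.4$ ceph osd stat                        # e.g. "3 osds: 2 up, 3 in" after a node failure
sh-4.4$ ceph osd pool ls detail | grep size  # replica size of each pool (typically 3 in ODF)
```
If the number of up OSDs is smaller than the pool replica size, the PGs cannot become active+clean anywhere else, so safe-to-destroy will keep failing until a replacement OSD is added or the removal is forced.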

If you need to remove an OSD even when it is not safe to destroy, you will need to
force the removal by passing the FORCE_OSD_REMOVAL flag to the removal template.
See the OSD removal documentation for more details.

Comment 3 Oded 2022-12-21 10:51:36 UTC
Hi,

You need to re-run the removal job with the FORCE_OSD_REMOVAL flag; the full command is in step 2 below.

Steps:
1. Delete the old ocs-osd-removal job:
$ oc delete job ocs-osd-removal-job -n openshift-storage

2. Run the ocs-osd-removal job with FORCE_OSD_REMOVAL=true (one way to find ${osd_id_to_remove} is sketched after these steps):
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -

3. Verify that the job reaches the Completed state:
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
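
For reference, ${osd_id_to_remove} in step 2 is the ID of the failed OSD; one way to find it, assuming the standard app=rook-ceph-osd label and the toolbox pod, is sketched here:
```
# Down OSDs as seen by Ceph (from the toolbox pod):
sh-4.4$ ceph osd tree | grep down

# Or from the OCP side: OSD pods that are no longer Running (the number in the
# pod/deployment name, e.g. rook-ceph-osd-1-..., is the OSD ID):
$ oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide | grep -v Running
$ osd_id_to_remove=1    # example value; use the ID reported down above
```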


From the toolbox pod, there is a way to check whether we need to add the FORCE_OSD_REMOVAL flag:
```
sh-4.4$ ceph osd ok-to-stop 0
{"ok_to_stop":true,"osds":[0],"num_ok_pgs":0,"num_not_ok_pgs":0}
sh-4.4$ ceph osd safe-to-destroy 0
Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
```
If `ceph osd safe-to-destroy <osd-id>` returns an error, the FORCE_OSD_REMOVAL flag is needed; otherwise it is not.

I opened a doc BZ to add the `-p` flag to the command: https://bugzilla.redhat.com/show_bug.cgi?id=2139406
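
Putting the check above together, a minimal shell sketch of that decision (assumes it is run from the toolbox pod and that osd_id holds the ID of the OSD being removed):
```
osd_id=1    # example value: the OSD being removed
if ceph osd safe-to-destroy "${osd_id}"; then
  echo "osd.${osd_id} is safe to destroy; the removal job should complete without the force flag"
else
  echo "osd.${osd_id} is NOT safe to destroy; re-run the job with -p FORCE_OSD_REMOVAL=true"
fi
```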

Comment 8 Travis Nielsen 2023-01-10 16:44:43 UTC
If this is not resolved by the force flag, please reopen the issue.

