Bug 2153257
| Summary: | [IBM Z] - ocs-osd-removal-job not getting completed when trying to replace failed node | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Abdul Kandathil (IBM) <akandath> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED NOTABUG | QA Contact: | Neha Berry <nberry> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.12 | CC: | brgardne, madam, ocs-bugs, odf-bz-bot, oviner |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | s390x | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-10 16:44:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Abdul Kandathil (IBM)
2022-12-14 11:06:02 UTC
When removing an OSD, Rook will query Ceph to know if it is safe to destroy the OSD:

```
2022-12-14 08:35:09.214305 W | cephosd: osd.1 is NOT be ok to destroy, retrying in 1m until success
```

Since the OSD is not safe to destroy, Rook will wait and continue checking indefinitely until it is safe. An OSD is safe to destroy when all the PGs have been moved to other OSDs in the cluster such that the PGs are active+clean. This means all replicas of the data are replicated safely and there is no risk of data loss.

If there are no spare OSDs where the PGs can be moved, an OSD will never be safe to destroy. For example, if there are 3 OSDs in the cluster and pools with replica 3, a lost OSD will never be safe to destroy. If you need to remove an OSD even when it is not safe to destroy, you will need to force the removal on the removal template with the flag FORCE_OSD_REMOVAL. See the docs on OSD removal for more details.
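As a rough illustration of that check, the PG and OSD state can be inspected from the rook-ceph tools pod before deciding on a forced removal; the commands below are standard Ceph CLI calls and the OSD id 1 is just an example:

```
# Run inside the rook-ceph tools pod (openshift-storage namespace).
# An OSD only becomes safe to destroy once all PGs are active+clean.
ceph pg stat

# Show which OSDs are up/in and how they map to hosts.
ceph osd tree

# Ask Ceph directly whether osd.1 (example id) is safe to destroy;
# a non-zero exit (EAGAIN) means it is not yet safe.
ceph osd safe-to-destroy 1
```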
Hi,
You need to add the FORCE_OSD_REMOVAL flag to the removal job:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=true |oc create -n openshift-storage -f -
Steps:
1. Delete the old ocs-osd-removal job:
$ oc delete job ocs-osd-removal -n openshift-storage
2. Run the ocs-osd-removal job with FORCE_OSD_REMOVAL=true:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=true |oc create -n openshift-storage -f -
3. Verify the job has moved to the Completed state (see the log check after these steps):
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
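As an additional check, not part of the steps above, the removal job's log can be inspected once the pod reports Completed; the label selector matches the job created in step 2:

```
# Print the full log of the removal job's pod.
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
```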
We have a command (run from the tools pod) to check whether we need to add the FORCE_OSD_REMOVAL flag:
```
sh-4.4$ ceph osd ok-to-stop 0
{"ok_to_stop":true,"osds":[0],"num_ok_pgs":0,"num_not_ok_pgs":0}
sh-4.4$ ceph osd safe-to-destroy 0
Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
```
if "ceph osd safe-to-destroy <osd-di>" return error -> we need to add the FORCE_FLAG, else we don't need to add the force flag.
I opened a doc BZ to add the `-p` string to the command: https://bugzilla.redhat.com/show_bug.cgi?id=2139406
If this is not resolved by the force flag, please reopen the issue.