Description of problem (please be as detailed as possible and provide log snippets):
When executing the ocs-osd-removal job with more than one OSD ID, we get an "Invalid value" error.

Version of all relevant components (if applicable):

OCP version:
Client Version: 4.6.0-0.nightly-2020-12-08-021151
Server Version: 4.6.0-0.nightly-2020-12-16-010206
Kubernetes Version: v1.19.0+7070803

OCS version:
ocs-operator.v4.6.0-195.ci   OpenShift Container Storage   4.6.0-195.ci   Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-12-16-010206   True        False         24h     Cluster version is 4.6.0-0.nightly-2020-12-16-010206

Rook version:
rook: 4.6-80.1ae5ac6a.release_4.6
go: go1.15.2

Ceph version:
ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. If we want to remove more than one failed OSD, we cannot create multiple ocs-osd-removal jobs at once.

Is there any workaround available to the best of your knowledge?
Yes. Execute a separate ocs-osd-removal job for every OSD ID.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes.

Can this issue be reproduced from the UI?
No.

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
Execute the command:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0,1 | oc create -n openshift-storage -f -
(0,1 are just example OSD IDs; you can pick other OSD IDs as well)

Actual results:
The Job "ocs-osd-removal-0,1" fails with an "Invalid value" error.

Expected results:
The job should finish successfully and create 2 ocs-osd-removal jobs: "ocs-osd-removal-0" and "ocs-osd-removal-1".

Additional info:
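For reference, one way to see where the "Invalid value" comes from without actually creating anything is to render the template and do a server-side dry run of the create. This is a minimal sketch only; it assumes, as the job name "ocs-osd-removal-0,1" in the error suggests, that the template substitutes the raw FAILED_OSD_IDS string into the Job's metadata.name, which the API server rejects because a resource name must be a valid DNS-1123 subdomain and cannot contain commas. The exact error text may differ.

# sketch: inspect the rendered Job name and validate it without persisting the Job
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0,1 -o yaml | grep " name:"
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0,1 | oc create -n openshift-storage --dry-run=server -f -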
One other note about this BZ: when I executed 2 ocs-osd-removal jobs separately at the same time, both finished successfully. I tested it with a dynamic cluster on vSphere. I deleted 2 hard drives from the worker nodes 'compute-0' and 'compute-1', whose corresponding OSDs were 0 and 2, then ran the commands:

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 | oc create -n openshift-storage -f -
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=2 | oc create -n openshift-storage -f -

and deleted their corresponding PVs. The process finished successfully, with all 3 OSDs up and running and Ceph health OK.
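For what it's worth, until the fix is available the per-OSD workaround above can be wrapped in a small shell loop. This is a sketch only; it assumes the failed OSD IDs (here 0 and 2) are already known:

# sketch: create one ocs-osd-removal job per failed OSD ID
$ for id in 0 2; do
    oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${id} | oc create -n openshift-storage -f -
  done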
@Neha Sure, I will take a look.
Moving out of 4.6.0 since it's not a blocking issue. At least there is a known workaround to remove OSDs individually.
The PR is created here: https://github.com/openshift/ocs-operator/pull/969/
Providing QE ack, see reproducer from the bug description.
https://github.com/openshift/ocs-operator/pull/1003/ is merged.
Will this be tested on both Dynamic and LSO clusters, or testing it on a dynamic cluster would suffice?
This fix can be verified on any cluster. In general, the OSD removal job does need to be tested on both dynamic and LSO clusters, but this fix does not require both types.
I tested it with a vSphere 4.7 dynamic cluster. The steps I followed to reproduce the bug and verify the fix:

1. Go to the vSphere platform where the cluster is located and delete 2 disks.

2. Look at the terminal and see that 2 of the OSDs are down:
$ oc get pods -n openshift-storage | grep osd
rook-ceph-osd-0-7855c957d-6ps45                           2/2   Running            0   21h
rook-ceph-osd-1-78ffdc9644-hc2gc                          1/2   CrashLoopBackOff   5   20h
rook-ceph-osd-2-b68f4c767-4hgrg                           1/2   CrashLoopBackOff   4   21h
rook-ceph-osd-prepare-ocs-deviceset-1-data-08489z-pq65k   0/1   Completed          0   21h
rook-ceph-osd-prepare-ocs-deviceset-2-data-02bwcc-jkttr   0/1   Completed          0   21h

3. Delete OSD 1:
$ osd_id_to_remove=1
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-1 scaled
$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
NAME                               READY   STATUS        RESTARTS   AGE
rook-ceph-osd-1-78ffdc9644-hc2gc   0/2     Terminating   6          20h
$ oc project openshift-storage
Now using project "openshift-storage" on server "https://api.ikave-vm47-feb17.qe.rh-ocs.com:6443".
$ oc delete pod rook-ceph-osd-1-78ffdc9644-hc2gc
pod "rook-ceph-osd-1-78ffdc9644-hc2gc" deleted
$ oc delete pod rook-ceph-osd-1-78ffdc9644-hc2gc --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-1-78ffdc9644-hc2gc" force deleted

4. Delete OSD 2 with similar steps.

5. Execute the ocs-osd-removal job (a sketch for confirming job completion follows the version details at the end of this comment):
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=1,2 | oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job-r5wh5   0/1     Completed   0          28s

6. Check the PVs' status:
$ oc get pv | grep 100Gi | grep openshift-storage
pvc-3ce32a3f-6786-4e5d-ab1c-73a905381ed3   100Gi   RWO   Delete   Bound    openshift-storage/ocs-deviceset-0-data-0hnnwc   thin   56s
pvc-a8c94f98-dccf-4504-a2d3-828d3adc8173   100Gi   RWO   Delete   Bound    openshift-storage/ocs-deviceset-1-data-0qsxkm   thin   55s
pvc-accf611b-39ab-4c9d-8ecc-958e384e9959   100Gi   RWO   Delete   Failed   openshift-storage/ocs-deviceset-1-data-08489z   thin   21h
pvc-d9e34905-4ba0-4368-8f60-3aa51efc2931   100Gi   RWO   Delete   Bound    openshift-storage/ocs-deviceset-2-data-02bwcc   thin   21h
pvc-e86e797e-f62f-453a-9a78-c8f62f3b18f5   100Gi   RWO   Delete   Failed   openshift-storage/ocs-deviceset-0-data-0dq4g5   thin   21h

7. There are 2 PVs in Failed status and 3 PVs in Bound status. Delete the Failed PVs:
$ oc delete pv pvc-accf611b-39ab-4c9d-8ecc-958e384e9959
persistentvolume "pvc-accf611b-39ab-4c9d-8ecc-958e384e9959" deleted
$ oc delete pv pvc-e86e797e-f62f-453a-9a78-c8f62f3b18f5
persistentvolume "pvc-e86e797e-f62f-453a-9a78-c8f62f3b18f5" deleted

8. Delete the ocs-osd-removal job:
$ oc delete jobs.batch ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted

9. Check the OSD pods:
$ oc get pods -n openshift-storage | grep osd
rook-ceph-osd-0-7855c957d-6ps45                           2/2   Running     0   22h
rook-ceph-osd-1-559bf6578c-nxq6p                          2/2   Running     0   69m
rook-ceph-osd-2-cf6bf47bc-8lcfk                           2/2   Running     0   69m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0hnnwc-cjc7p   0/1   Completed   0   70m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0qsxkm-4vggb   0/1   Completed   0   70m
rook-ceph-osd-prepare-ocs-deviceset-2-data-02bwcc-jkttr   0/1   Completed   0   22h
10. Silence the Ceph warnings about the OSD crashes:
$ oc rsh rook-ceph-tools-5c6ddd4df9-9v2dk
sh-4.4# ceph crash ls-new
ID                                                                 ENTITY   NEW
2021-02-18_11:00:51.223639Z_c7fa00d1-fd63-4dab-bb8d-831f205d91e8   osd.1    *
2021-02-18_11:04:02.151486Z_9964cf00-e986-48c1-91e1-210273c2f9c7   osd.2    *
sh-4.4# ceph crash archive 2021-02-18_11:00:51.223639Z_c7fa00d1-fd63-4dab-bb8d-831f205d91e8
sh-4.4# ceph crash archive 2021-02-18_11:04:02.151486Z_9964cf00-e986-48c1-91e1-210273c2f9c7

11. After approximately 40 minutes, Ceph health returned to HEALTH_OK.

Versions:

OCP version:
Client Version: 4.6.0-0.nightly-2021-01-12-112514
Server Version: 4.7.0-0.nightly-2021-02-13-071408
Kubernetes Version: v1.20.0+bd9e442

OCS version:
ocs-operator.v4.7.0-263.ci   OpenShift Container Storage   4.7.0-263.ci   Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-02-13-071408   True        False         27h     Cluster version is 4.7.0-0.nightly-2021-02-13-071408

Rook version:
rook: 4.7-94.16bbf3806.release_4.7
go: go1.15.5

Ceph version:
ceph version 14.2.11-112.el8cp (f00060cb2688083840d657432768de1f6609767e) nautilus (stable)
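Two small follow-up sketches for the steps above. First, for step 5, a minimal way to confirm the removal job has actually completed before cleaning up the PVs; this assumes the Job created by the fixed template is named ocs-osd-removal-job, as in the output above:

# sketch: wait for the removal job to finish, then check its log tail
$ oc wait --for=condition=complete job/ocs-osd-removal-job -n openshift-storage --timeout=600s
$ oc logs -n openshift-storage -l job-name=ocs-osd-removal-job --tail=20

Second, for step 10, archiving each crash ID individually can likely be replaced with a single command, assuming the Ceph release in the toolbox supports it:

# sketch: archive all new crash reports in one go instead of one by one
sh-4.4# ceph crash archive-all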
According to the steps above, the ocs-osd-removal job worked with multiple OSD IDs, so I am moving the bug to VERIFIED.
Thank you for the doc text, Servesha. Could you take a look at my edited version and let me know if it looks OK?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041