Description of problem (please be as detailed as possible and provide log snippets):
The `ocs-osd-removal-job` should generate a unique name for each job. This would eliminate the need to delete the older/existing job when triggering the job a second time; the user could proceed directly to the next disk replacement.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?
In the UI, only one OSD can be replaced at a time. With unique job names, one would not have to delete the older job after the first OSD replacement.

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
1. Once this feature is added, the admin can run the `oc process` command to trigger the job without deleting the older/existing jobs in the cluster (see the sketch below).
2. One template (ocs-osd-removal) will create multiple jobs.
3. This feature will need documentation effort for the relevant OCS versions (below OCS 4.9).
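For context, this is the flow the RFE would simplify. Today the previous completed job has to be removed before the template can be processed again. A minimal sketch of the current procedure (the OSD ID is illustrative; the template name and parameter are the existing ones):

$ failed_osd_id=1                                              # illustrative OSD ID
$ oc delete -n openshift-storage job ocs-osd-removal-job       # required today before re-running
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${failed_osd_id} | oc create -n openshift-storage -f -

With unique job names, the delete step in the middle would no longer be needed.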
No RFE in 4.9 at this point. Eran, should this be added to our backlog?
Not a blocker, and we have proper documentation instructing customers to remove the existing job before creating a new one. Moving it out to 4.14.
Hi Servesha,

The BZ is now addressed. Instead of generating unique names every time, we went with deleting the completed jobs so they can be rerun. Can you please look at the doc changes that might be required for this? As Neha is not available now, Itzhak from QE can collaborate with you.

References:
https://bugzilla.redhat.com/show_bug.cgi?id=1970939#c3
https://chat.google.com/room/AAAAK3AjBdo/Jl1UYkoqLFE
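For the doc update, here is a rough sketch of how the new flow could be verified, assuming completed removal jobs are now cleaned up automatically so re-processing the template no longer conflicts with an old job (OSD IDs are illustrative):

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 | oc create -n openshift-storage -f -
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage      # wait for the pod to reach Completed
# later, trigger a second removal without deleting the earlier job by hand
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=1 | oc create -n openshift-storage -f -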
I tested it with a vSphere UPI 4.14 cluster. Unfortunately, the ocs-osd-removal job failed to complete successfully.

I did the following steps:

1. Delete the disk from the vSphere side.

2. Check the OSDs:
$ oc get pods -n openshift-storage | grep osd
ocs-osd-removal-job-mwtl8                                 0/1   Completed          0             2m3s
rook-ceph-osd-0-544bcf44c-5t48x                           2/2   Running            0             17m
rook-ceph-osd-1-594976b949-r7ndl                          2/2   Running            0             17m
rook-ceph-osd-2-7844589467-6v2j4                          1/2   CrashLoopBackOff   1 (20s ago)   17m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0mcv42-688bf   0/1   Completed          0             18m
rook-ceph-osd-prepare-ocs-deviceset-1-data-09c7f6-bgwl2   0/1   Completed          0             18m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0vjk7r-czl82   0/1   Completed          0             18m

3. Scale down the OSD deployment for the OSD to be replaced:
$ osd_id_to_remove=2
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-2 scaled

4. Delete the old ocs-osd-removal pod:
$ oc -n openshift-storage delete pod ocs-osd-removal-job-mwtl8
pod "ocs-osd-removal-job-mwtl8" deleted

5. Execute the ocs-osd-removal job:
$ oc project openshift-storage
Now using project "openshift-storage" on server "https://api.ikave-vm414.qe.rh-ocs.com:6443".
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

6. Check the ocs-osd-removal job status and see that it did not complete successfully:
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS    RESTARTS   AGE
ocs-osd-removal-job-z68ng   1/1     Running   0          26m

Additional info:
Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/31095/

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-11-05-194730
Kubernetes Version: v1.27.6+f67aeb3

OCS version:
ocs-operator.v4.14.0-rhodf   OpenShift Container Storage   4.14.0-rhodf   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-11-05-194730   True        False         85m     Cluster version is 4.14.0-0.nightly-2023-11-05-194730

Rook version:
rook: v4.14.0-0.103536c37b9f9063f4abe9db7c59150125b75908
go: go1.20.10

Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
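When the removal job sits in Running like this, one way to see what it is waiting on is to check its logs. A generic troubleshooting sketch (the pod name is the one from step 6 above):

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
$ oc logs -f ocs-osd-removal-job-z68ng -n openshift-storage      # or follow the pod directly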
Not a 4.14 blocker
I made a mistake when executing the OSD removal job - I should have passed the param "FORCE_OSD_REMOVAL=true". Sorry for the confusion.

I tested it again with the same steps as above:

1. Delete the disk from the vSphere side.

2. Check the OSDs:
$ oc get pods -n openshift-storage | grep osd
ocs-osd-removal-job-4stqs                                 0/1   Completed          0             2m8s
rook-ceph-osd-0-86b7d46f4f-4kbq5                          1/2   CrashLoopBackOff   2 (21s ago)   21m
rook-ceph-osd-1-6c449d8498-tndsp                          2/2   Running            0             21m
rook-ceph-osd-2-78cf9f4c6c-mqw54                          2/2   Running            0             21m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0lskk5-4248w   0/1   Completed          0             21m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0gbg9w-b7l78   0/1   Completed          0             21m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0kznvm-d4rvb   0/1   Completed          0             21m

3. Scale down the OSD deployment for the OSD to be replaced:
$ osd_id_to_remove=0
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-0 scaled

4. Delete the old ocs-osd-removal pod:
$ oc -n openshift-storage delete pod ocs-osd-removal-job-4stqs
pod "ocs-osd-removal-job-4stqs" deleted

5. Execute the ocs-osd-removal job with "FORCE_OSD_REMOVAL=true":
$ oc project openshift-storage
Now using project "openshift-storage" on server "https://api.ikave-vm414.qe.rh-ocs.com:6443".
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

6. Check the ocs-osd-removal job status and verify it completed successfully:
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job-z9kgv   0/1     Completed   0          13s
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
2023-11-07 10:47:48.390469 I | cephosd: completed removal of OSD 0

7. Also, check that a new OSD came up successfully:
$ oc get pods -n openshift-storage | grep osd
ocs-osd-removal-job-z9kgv                                 0/1   Completed   0   3m13s
rook-ceph-osd-0-f4448d86f-kbsbd                           2/2   Running     0   2m24s
rook-ceph-osd-1-6c449d8498-tndsp                          2/2   Running     0   32m
rook-ceph-osd-2-78cf9f4c6c-mqw54                          2/2   Running     0   32m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0lskk5-4248w   0/1   Completed   0   32m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0fbb56-b8njz   0/1   Completed   0   2m52s
rook-ceph-osd-prepare-ocs-deviceset-2-data-0kznvm-d4rvb   0/1   Completed   0   32m

Additional info:
Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/31120/

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-11-06-203803
Kubernetes Version: v1.27.6+f67aeb3

OCS version:
ocs-operator.v4.14.0-rhodf   OpenShift Container Storage   4.14.0-rhodf   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-11-06-203803   True        False         56m     Cluster version is 4.14.0-0.nightly-2023-11-06-203803

Rook version:
rook: v4.14.0-0.103536c37b9f9063f4abe9db7c59150125b75908
go: go1.20.10

Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
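On FORCE_OSD_REMOVAL: my understanding is that it overrides the safety check that otherwise keeps the job waiting until the OSD's data is recovered elsewhere. One way to check the OSD state from the Ceph side before forcing, assuming the rook-ceph-tools toolbox pod is deployed (the deployment name below is the usual toolbox name and may differ):

$ oc rsh -n openshift-storage deploy/rook-ceph-tools
# inside the toolbox pod:
$ ceph osd safe-to-destroy osd.0     # illustrative OSD ID
$ ceph osd tree                      # confirm the OSD layout afterwards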
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832
One thing I noticed: when I run the ocs-osd-removal job, I don't see the job created:

$ oc get jobs
NAME                                                COMPLETIONS   DURATION   AGE
rook-ceph-osd-prepare-ocs-deviceset-0-data-0xxp9b   1/1           36s        5h20m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0rmwdb   1/1           30s        112m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0nxxhh   1/1           34s        5h20m

Even though the process completed successfully, as I mentioned in my previous comment https://bugzilla.redhat.com/show_bug.cgi?id=1970939#c25. Let me know if it's a bug or not.
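For what it's worth, a couple of generic ways to catch the job while it still exists (a sketch, not tied to this cluster):

$ oc get jobs -n openshift-storage -w | grep ocs-osd-removal      # watch for the job as the removal runs
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage # the job's pod, while it is still around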
That was actually expected: by the time you tried to look at the job, it had already been deleted because it completed successfully. But FYI, we are reverting this BZ. It requires doc changes in too many places and doesn't really add any value to the customer's experience. More details here: https://bugzilla.redhat.com/show_bug.cgi?id=2248833
Okay, thanks for the clarification.