This bug was initially created as a copy of Bug #1970939.

We merged a fix for this BZ that deleted any completed OSD removal job after it succeeded, so that creating another OSD removal job would no longer require first checking for and deleting the earlier completed job. However, we did not account for the documented procedure in which customers using encryption depend on the logs of the successful job to get the PVC name of the replaced OSD; they use this PVC name to clean up the dmcrypt mapping on the node.

Updating the docs for so many platforms in so many places is risky, and the fix does not add much value while creating new issues. We should therefore revert the fix and ship the revert with 4.14.1. Since 4.14.1 is not far away, this is a very small risk. Only customers using encryption who are replacing their OSDs are affected. Even if a customer hits a problem and a case comes to support, we can easily find the orphan dmcrypt mapping by its PVC name manually and delete it. These unused mappings also do not break anything, so they are not concerning.
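For context, here is a rough sketch of the manual cleanup mentioned above, following the general shape of the documented replacement procedure for encrypted clusters: the PVC/device-set name of the replaced OSD is read from the completed removal job's logs and then used to close the stale dmcrypt mapping on the node. The device-set name and node name below are illustrative placeholders, not values from this bug:

# get the PVC / device-set name of the replaced OSD from the completed job's logs
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'pvc|deviceset'

# on the node that hosted the replaced OSD, list and close the leftover dmcrypt mapping
$ oc debug node/<node-name>
sh-4.4# chroot /host
sh-4.4# dmsetup ls
sh-4.4# cryptsetup luksClose --debug --verbose ocs-deviceset-xxx-x-data-xxxxxx-block-dmcrypt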
I tested it with a GCP 4.15 cluster. I did the following steps:

1. Execute the ocs-osd-removal job:

$ osd_id_to_remove=2
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

2. Switch to the openshift-storage project:

$ oc project openshift-storage
Now using project "openshift-storage" on server "https://api.ikave-aws414.qe.rh-ocs.com:6443".

3. Check the jobs and see that the ocs-osd-removal job completed:

$ oc get jobs
NAME                                                COMPLETIONS   DURATION   AGE
ocs-osd-removal-job                                 1/1           5s         24s
rook-ceph-osd-prepare-ocs-deviceset-0-data-0t75xz   1/1           23s        128m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0k7zm9   1/1           23s        128m
rook-ceph-osd-prepare-ocs-deviceset-2-data-095m2g   1/1           22s        128m

4. Check the pods and see the new ocs-osd-removal pod created:

$ oc get pods | grep osd
ocs-osd-removal-job-8lsc5                                 0/1   Completed   0   39s
rook-ceph-osd-0-fb8558fdb-4htxm                           2/2   Running     0   128m
rook-ceph-osd-1-567f8567c7-86s7t                          2/2   Running     0   128m
rook-ceph-osd-2-54f5974b4c-ms29x                          2/2   Running     0   128m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0t75xz-tk652   0/1   Completed   0   128m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0k7zm9-nkhx2   0/1   Completed   0   128m
rook-ceph-osd-prepare-ocs-deviceset-2-data-095m2g-8kww6   0/1   Completed   0   128m

5. Try to create another ocs-osd-removal job, which fails as expected:

$ osd_id_to_remove=1
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
Error from server (AlreadyExists): error when creating "STDIN": jobs.batch "ocs-osd-removal-job" already exists

6. Delete the old ocs-osd-removal job:

$ oc delete jobs.batch ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted

7. Check the pods again:

$ oc get pods | grep osd
rook-ceph-osd-0-fb8558fdb-4htxm                           2/2   Running     0   152m
rook-ceph-osd-1-567f8567c7-86s7t                          2/2   Running     0   152m
rook-ceph-osd-2-54f5974b4c-ms29x                          2/2   Running     0   152m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0t75xz-tk652   0/1   Completed   0   152m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0k7zm9-nkhx2   0/1   Completed   0   152m
rook-ceph-osd-prepare-ocs-deviceset-2-data-095m2g-8kww6   0/1   Completed   0   152m

8. Create a new ocs-osd-removal job, check the jobs and pods, and see that it was created successfully (a condensed sketch of this delete-then-recreate flow appears after the version details below):

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

$ oc get jobs
NAME                                                COMPLETIONS   DURATION   AGE
ocs-osd-removal-job                                 1/1           5s         9s
rook-ceph-osd-prepare-ocs-deviceset-0-data-0t75xz   1/1           23s        153m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0k7zm9   1/1           23s        153m
rook-ceph-osd-prepare-ocs-deviceset-2-data-095m2g   1/1           22s        153m

$ oc get pods | grep osd
ocs-osd-removal-job-glzts                                 0/1   Completed   0   22s
rook-ceph-osd-0-fb8558fdb-4htxm                           2/2   Running     0   153m
rook-ceph-osd-1-567f8567c7-86s7t                          2/2   Running     0   153m
rook-ceph-osd-2-54f5974b4c-ms29x                          2/2   Running     0   153m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0t75xz-tk652   0/1   Completed   0   153m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0k7zm9-nkhx2   0/1   Completed   0   153m
rook-ceph-osd-prepare-ocs-deviceset-2-data-095m2g-8kww6   0/1   Completed   0   153m

Additional info:
Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/32167/

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.15.0-0.nightly-2023-12-18-220750
Kubernetes Version: v1.28.4+7aa0a74

OCS version:
ocs-operator.v4.15.0-89.stable   OpenShift Container Storage   4.15.0-89.stable   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2023-12-18-220750   True        False         168m    Cluster version is 4.15.0-0.nightly-2023-12-18-220750

Rook version:
rook: v4.15.0-0.fcbc808fe7930fc7d937d06c983d5f0b96d952d3
go: go1.20.10

Ceph version:
ceph version 17.2.6-167.el9cp (5ef1496ea3e9daaa9788809a172bd5a1c3192cf7) quincy (stable)
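Condensed sketch of the delete-then-recreate flow exercised in the steps above (the FAILED_OSD_IDS values are only examples): with the automatic job deletion reverted, the previous completed ocs-osd-removal-job must be deleted manually before a new one can be created.

# first removal job completes and is left in place
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=2 -p FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -

# creating another job while the old one still exists fails with AlreadyExists
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=1 -p FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -

# delete the completed job, then re-create it for the next OSD
$ oc delete -n openshift-storage job ocs-osd-removal-job
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=1 -p FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -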
One more thing I checked is the recovery-from-volume-deletion test with vSphere. This test executes the ocs-osd-removal job after a volume is deleted from the vSphere side and checks that the job completes successfully. You can see here that the test finished successfully: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/17496/851123/851140/851143/log.
Based on the two comments above, I am moving the bug to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383