Description of problem (please be detailed as possible and provide log snippets):

In the job template we use the rook "master" tag for the init container image in the product. We should read the rook version from the environment and not hardcode "master" or any other specific version.
https://github.com/red-hat-storage/ocs-operator/blob/release-4.9/controllers/storagecluster/job_templates.go#L165
This has been present since 4.9, when the template was added.

Version of all relevant components (if applicable):
From 4.9 to the latest main

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
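For reference, a minimal sketch of the intended direction (not the actual ocs-operator code): take the Rook image for the init container from an environment variable instead of a hardcoded master-tagged image. The variable name ROOK_CEPH_IMAGE, the container name, and the helper names below are assumptions for illustration only.

package main

import (
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
)

// rookImage returns the Rook/Ceph image the job template should use.
// ROOK_CEPH_IMAGE is an assumed variable name; the real operator may
// wire the image in differently.
func rookImage() (string, error) {
	img := os.Getenv("ROOK_CEPH_IMAGE")
	if img == "" {
		return "", fmt.Errorf("ROOK_CEPH_IMAGE is not set")
	}
	return img, nil
}

// newJobInitContainer builds the init container with the image taken
// from the environment rather than a hardcoded "master" tag.
func newJobInitContainer() (corev1.Container, error) {
	img, err := rookImage()
	if err != nil {
		return corev1.Container{}, err
	}
	return corev1.Container{
		Name:  "config-init", // hypothetical name
		Image: img,
	}, nil
}

func main() {
	c, err := newJobInitContainer()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("init container image:", c.Image)
}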
*** Bug 2135736 has been marked as a duplicate of this bug. ***
@shubam - 1) what would be the steps to verify this bug? 2) Do we need to add any new tests due to this change?
I tested it with a vSphere OCP 4.12 and ODF 4.12 dynamic cluster. The steps I followed to reproduce the bug:

1. Deleted a disk from vSphere.

2. Checked the OSD status and observed the OSD that is down:

$ oc get pods -o wide | grep osd
rook-ceph-osd-0-76748c9b6-vpwz9    2/2   Running            0             69m   10.130.2.22   compute-1   <none>   <none>
rook-ceph-osd-1-54749698d7-2jp48   1/2   CrashLoopBackOff   3 (41s ago)   69m   10.129.2.20   compute-0   <none>   <none>
rook-ceph-osd-2-99f58954-k42nk     2/2   Running            0             68m   10.128.2.20   compute-2   <none>   <none>

3. Scaled down the osd-1 deployment and deleted the osd-1 pod, as mentioned in the doc.

4. Ran the "ocs-osd-removal" job and saw it completed successfully:

$ oc get jobs
NAME                                                COMPLETIONS   DURATION   AGE
ocs-osd-removal-job                                 1/1           11s        22s
rook-ceph-osd-prepare-ocs-deviceset-0-data-0vbd6h   1/1           72s        76m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0269xm   1/1           40s        76m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0w6njt   0/1           1s         1s

$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job-8b449   0/1     Completed   0          59s

5. Checked the logs of the "ocs-osd-removal" job:

$ oc logs ocs-osd-removal-job-8b449
2022-11-22 11:41:25.506021 I | rookcmd: starting Rook v4.12.0-0.e237b7ff0b9225db1a5f8a95dc50f9f8e2d55206 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=1 --force-osd-removal true'
2022-11-22 11:41:25.506069 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=DEBUG, --operator-image=, --osd-ids=1, --preserve-pvc=false, --service-account=

We can see in the first line that the rook version is v4.12.0-0, without the master tag.

Additional info:

Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/18259/

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.12.0-0.nightly-2022-11-22-012345
Kubernetes Version: v1.25.2+5533733

OCS version:
ocs-operator.v4.12.0-114.stable   OpenShift Container Storage   4.12.0-114.stable   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-11-22-012345   True        False         6h44m    Cluster version is 4.12.0-0.nightly-2022-11-22-012345

Rook version:
rook: v4.12.0-0.e237b7ff0b9225db1a5f8a95dc50f9f8e2d55206
go: go1.18.7

Ceph version:
ceph version 16.2.10-72.el8cp (3311949c2d1edf5cabcc20ba0f35b4bfccbf021e) pacific (stable)
According to the two comments above, I am moving the bug to Verified.