I think we should keep this as one bug, since the root cause appears to be the same for both.
Hi Alicia, I'm running a little busy due to the feature development cycle for 4.13, as just a couple of weeks are left. But I can assure you this bug is on my radar; I have already done some investigation into the root cause, and I expect to look at it more deeply after the feature freeze for 4.13, which is on Feb 28. If there is a customer dependency or someone waiting on this issue, please do let me know; in that case I can move things around to give it prioritized attention.
Hi Alicia, Basically, the template used to be created once and never updated afterwards, which was causing the problem. For example, if someone installs ODF 4.10, the template is created at that time with a rook-ceph image in the template's job spec. Later, the customer upgrades ODF from 4.10 to 4.11, 4.11 to 4.12, and so on. But because the template was not reconciled, the rook-ceph image in the template's job spec would remain the old one (the 4.10 image in this case), even after moving to a newer version of ODF such as 4.12. With this fix the template gets reconciled, so the rook-ceph image in the template's job spec stays correct.
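For illustration, the stale image can be spotted directly on the template object. A minimal check (a sketch, assuming the ocs-osd-removal template in the openshift-storage namespace, as used in the commands in this bug):

# Dump the template and look at the image referenced in its embedded job spec;
# before the fix it stays at whichever ODF version first created the template.
$ oc get template ocs-osd-removal -n openshift-storage -o yaml | grep 'image:'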
What should the updated verification steps be? Should we try to upgrade from 4.12 to 4.13, or just deploy a cluster with 4.13 and execute the command:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=true |oc create -n openshift-storage -f -
I found the earlier merged patch was incomplete, so I had moved the BZ back to POST to merge another fix for it. It's now merged. I will also be backporting the fix all the way to 4.9; I have created clone BZs:
4.12 - https://bugzilla.redhat.com/show_bug.cgi?id=2211592
4.11 - https://bugzilla.redhat.com/show_bug.cgi?id=2211594
4.10 - https://bugzilla.redhat.com/show_bug.cgi?id=2211595
4.9 - https://bugzilla.redhat.com/show_bug.cgi?id=2211598
Verification steps for the BZ:
1. Install ODF at a version prior to 4.9.11 and check the templates that get created: note the rook-ceph image in them and the parameters in them.
2. Upgrade ODF release by release: first to the latest version in ODF 4.9, then to 4.10, 4.11, and 4.12. With each version, try creating the job from the template:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=true |oc create -n openshift-storage -f -
The error "unknown parameter name "FORCE_OSD_REMOVAL"" will appear each time. Also check the template YAML each time; nothing will have changed in its objects section or its parameters section (a quick way to check is shown below).
3. Now upgrade to ODF 4.13 and run the same command:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=true |oc create -n openshift-storage -f -
It should succeed without any error. If you check the templates now, they should have been updated: the latest rook-ceph image will be there, and the new FORCE_OSD_REMOVAL parameter will now be in the parameters section.
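One quick way to do the template check at each step (a sketch; it assumes oc's standard jsonpath output and the template/namespace names used in this bug):

# List the template's parameter names; before the fix FORCE_OSD_REMOVAL is
# missing, after upgrading to 4.13 it should appear.
$ oc get template ocs-osd-removal -n openshift-storage -o jsonpath='{.parameters[*].name}{"\n"}'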
I followed the process described in the comment above.
1. Deployed a cluster with OCP 4.9 and ODF 4.9.10 (lower than 4.9.11).
2. Checked the ocs-osd-removal job command, which resulted in the expected error:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
error: unknown parameter name "FORCE_OSD_REMOVAL"
error: no objects passed to create
3. Upgraded OCP and ODF from 4.9 to 4.10.
4. Checked the ocs-osd-removal job command, which resulted in the expected error:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
error: unknown parameter name "FORCE_OSD_REMOVAL"
error: no objects passed to create
5. Upgraded OCP and ODF from 4.10 to 4.11.
6. Checked the ocs-osd-removal job command, which resulted in the expected error:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
error: unknown parameter name "FORCE_OSD_REMOVAL"
error: no objects passed to create
7. Upgraded OCP and ODF from 4.11 to 4.12.
8. Checked the ocs-osd-removal job command, which resulted in the expected error:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
error: unknown parameter name "FORCE_OSD_REMOVAL"
error: no objects passed to create
9. Upgraded OCP and ODF from 4.12 to 4.13.
10. Checked the ocs-osd-removal job command, which now succeeded:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
$ oc get jobs ocs-osd-removal-job
NAME                  COMPLETIONS   DURATION   AGE
ocs-osd-removal-job   1/1           8s         21m
Additional info:
Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25639/

Cluster versions after the last upgrade:

OC version:
Client Version: 4.10.24
Server Version: 4.13.0-0.nightly-2023-06-15-222927
Kubernetes Version: v1.26.5+7d22122

OCS version:
ocs-operator.v4.13.0-rhodf   OpenShift Container Storage   4.13.0-rhodf   ocs-operator.v4.12.4-rhodf   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-06-15-222927   True        False         43m     Cluster version is 4.13.0-0.nightly-2023-06-15-222927

Rook version:
rook: v4.13.0-0.b57f0c7db8116e754fc77b55825d7fd75c6f1aa3
go: go1.19.9

Ceph version:
ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)
According to the comments above, I am moving the BZ to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.