+++ This bug was initially created as a clone of Bug #2231074 +++

Description of problem (please be as detailed as possible and provide log snippets):

If ODF v4.12.z is installed but a StorageCluster has not yet been created, an upgrade to ODF v4.13.z does not succeed: the "rook-ceph-operator" pod is stuck in "CreateContainerConfigError".

➜ ~ oc get pod/rook-ceph-operator-799f4557f8-z76dn
NAME                                  READY   STATUS                       RESTARTS   AGE
rook-ceph-operator-799f4557f8-z76dn   0/1     CreateContainerConfigError   0          85s

---

➜ ~ oc describe pod/rook-ceph-operator-799f4557f8-z76dn
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  110s                default-scheduler  Successfully assigned openshift-storage/rook-ceph-operator-799f4557f8-z76dn to t3-585mv-worker-0-b5rl7
  Normal   Pulled     6s (x10 over 107s)  kubelet            Container image "icr.io/cpopen/rook-ceph-operator@sha256:70aebdc2b80283fc69f77acc7390667868939dea5839070673814b6351fda4d7" already present on machine
  Warning  Failed     6s (x10 over 107s)  kubelet            Error: couldn't find key CSI_ENABLE_READ_AFFINITY in ConfigMap openshift-storage/ocs-operator-config

---

➜ ~ oc get cm ocs-operator-config -oyaml
apiVersion: v1
data:
  CSI_CLUSTER_NAME: 8a514d5d-f345-42bd-8fa7-54c37e9c9fe2
kind: ConfigMap
metadata:
  creationTimestamp: "2023-08-10T07:16:14Z"
  name: ocs-operator-config
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: ocs.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: OCSInitialization
    name: ocsinit
    uid: 6cdfa990-37e1-4596-b0e5-69baedafc0f3
  resourceVersion: "17531216"
  uid: 22e4fa9c-a8ca-40fa-8e92-c2c4b4f5119d

Version of all relevant components (if applicable):
ODF v4.13.z

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
Yes. Delete the "ocs-operator-config" ConfigMap (an example command is given under Additional info below).

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
It is a regression, since it didn't happen in previous upgrades. But it's a corner case and a very minor issue which was never tested.

Steps to Reproduce:
1. Install ODF operator v4.12.z. Do not create a StorageCluster.
2. Upgrade to ODF operator v4.13.z.
3. Check the operator pod status.

Actual results:
rook-ceph-operator pod is in "CreateContainerConfigError", blocking the upgrade.

Expected results:
Upgrade should complete without issue.

Additional info:
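The workaround, spelled out as a command for convenience (the ConfigMap name and namespace are taken from the output above; per the root-cause analysis below, the OCSInitialization controller owns and creates this ConfigMap, so it is recreated after deletion):

$ oc delete cm ocs-operator-config -n openshift-storage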
--- Additional comment from RHEL Program Management on 2023-08-10 13:24:16 UTC ---

This bug, which previously had no release flag set, now has the release flag 'odf-4.14.0' set to '?', and is proposed to be fixed in the ODF 4.14.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from Malay Kumar Parida on 2023-08-25 08:55:19 UTC ---

The same issue can also be seen when upgrading from 4.13 to 4.14 without a StorageCluster. The root cause is that the ocsinitialization controller only creates the ocs-operator-config ConfigMap (from which the rook-ceph-operator pod takes env values for configuration); the task of keeping it updated falls to the storagecluster controller. When no StorageCluster exists and an upgrade happens, the ConfigMap is not updated, but the new Rook operator looks for the new key, so the rook-ceph-operator pod fails and the upgrade stalls with it (the generic Kubernetes mechanism is sketched below). The solution is to change how the ConfigMap is handled: when no StorageCluster is present, the ocsinitialization controller should own the ConfigMap and keep it updated; when a StorageCluster is present, it should do nothing and leave it to the storagecluster controller.
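For illustration only: the error above is the standard Kubernetes failure mode when a container env var references a ConfigMap key that does not exist. A hypothetical excerpt of such a container spec (the ConfigMap name and key are taken from the error message; the surrounding layout is assumed, not copied from the actual rook-ceph-operator Deployment):

env:
- name: CSI_ENABLE_READ_AFFINITY
  valueFrom:
    configMapKeyRef:
      name: ocs-operator-config       # ConfigMap in openshift-storage
      key: CSI_ENABLE_READ_AFFINITY   # key absent after the upgrade -> CreateContainerConfigError
      # optional: true                # would let the container start without the key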
Moving to 4.13.6 as we don't have any more bandwidth for testing.
*** Bug 2248958 has been marked as a duplicate of this bug. ***
Hello Team, do we have a target date for the backport? We already have customers hitting this issue on install of 4.13.
The proposed fix for this has been up for some time, but due to QE bandwidth constraints it has been pushed to 4.13.6.
This looks like a much bigger issue now that ODF 4.14 has been released: it is actually blocking our CI, and customers seem to be hitting it as well. As such, we really need to mark this as a blocker and nominate it for 4.13.5.
https://chat.google.com/room/AAAAREGEba8/q0sy9xRQ7e0

After the GA of 4.14 on 8th Nov, our CatalogSource now also has 4.14 bundles, which changes the behaviour of OLM. Now, if someone tries to install ODF 4.13, OLM first installs ODF 4.12 and then automatically upgrades to 4.13. This triggers exactly the situation here: an upgrade from 4.12 to 4.13 with no StorageCluster present (a representative Subscription is sketched below). It is now being hit by a customer trying to install ODF 4.13.4, and our CI is facing it during 4.13.5 builds. We need to take this in 4.13.5 anyhow.
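For context, a minimal sketch of the Subscription shape involved, assuming the usual odf-operator packaging (the names and channel are illustrative, not copied from an affected cluster):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: odf-operator
  namespace: openshift-storage
spec:
  channel: stable-4.13              # the user asks for 4.13...
  name: odf-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic    # ...but OLM may install 4.12 first and then
                                    # auto-upgrade along the replaces chain,
                                    # which is the 4.12 -> 4.13 path described above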
KCS created https://access.redhat.com/solutions/7044025
Bug Fixed Test Procedure:

1. Deploy OCP cluster 4.13.0-0.nightly-2023-11-17-172647 [vsphere]

2. Install ODF 4.12.10-1 [without storagecluster]

$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge
operatorhub.config.openshift.io/cluster patched

$ cat CatalogSource.yaml
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators
  namespace: openshift-marketplace
  labels:
    ocs-operator-internal: "true"
spec:
  displayName: Openshift Container Storage
  icon:
    base64data: ""
    mediatype: ""
  image: quay.io/rhceph-dev/ocs-registry:latest-stable-4.12
  publisher: Red Hat
  sourceType: grpc
  priority: 100
  # If the registry image still has the same tag (latest-stable-4.6, or for stage testing)
  # we need this updateStrategy, otherwise we will not see newly pushed content.
  updateStrategy:
    registryPoll:
      interval: 15m

$ oc create -f CatalogSource.yaml
catalogsource.operators.coreos.com/redhat-operators created

$ podman run --entrypoint cat quay.io/rhceph-dev/ocs-registry:latest-stable-4.12 /icsp.yaml | oc apply -f -
imagecontentsourcepolicy.operator.openshift.io/df-repo-v4.12.10-1 created

$ oc get csv -A
NAMESPACE                              NAME                                     DISPLAY                       VERSION         REPLACES                                PHASE
openshift-operator-lifecycle-manager   packageserver                            Package Server                0.19.0                                                  Succeeded
openshift-storage                      mcg-operator.v4.12.10-rhodf              NooBaa Operator               4.12.10-rhodf   mcg-operator.v4.12.9-rhodf              Succeeded
openshift-storage                      ocs-operator.v4.12.10-rhodf              OpenShift Container Storage   4.12.10-rhodf   ocs-operator.v4.12.9-rhodf              Succeeded
openshift-storage                      odf-csi-addons-operator.v4.12.10-rhodf   CSI Addons                    4.12.10-rhodf   odf-csi-addons-operator.v4.12.9-rhodf   Succeeded
openshift-storage                      odf-operator.v4.12.10-rhodf              OpenShift Data Foundation     4.12.10-rhodf   odf-operator.v4.12.9-rhodf              Succeeded

$ oc get pods -n openshift-storage
NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6f58b5f5d5-gs25w     2/2     Running   0          3m2s
noobaa-operator-b8c48b64b-lf6xd                    1/1     Running   0          9m54s
ocs-metrics-exporter-649bfbf4f9-swmfg              1/1     Running   0          10m
ocs-operator-654f56d565-j4hc8                      1/1     Running   0          10m
odf-console-5fbb8546bb-x2fr7                       1/1     Running   0          9m58s
odf-operator-controller-manager-5df847fc94-5hczs   2/2     Running   0          9m58s
rook-ceph-operator-5cc69f8967-rnpbm                1/1     Running   0          10m

3. Upgrade ODF 4.12.10-1 -> ODF 4.13.5-8

a. Change the channel in the odf-operator subscription [stable-4.12 -> stable-4.13] (a scriptable alternative is sketched right after this command):

$ oc edit subscription odf-operator -n openshift-storage
subscription.operators.coreos.com/odf-operator edited
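If scripting this step is preferred, an equivalent non-interactive change (a sketch using standard oc patch syntax against the same subscription):

$ oc patch subscription odf-operator -n openshift-storage --type merge -p '{"spec":{"channel":"stable-4.13"}}'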
b. Edit the CatalogSource:

$ oc edit catalogsource -n openshift-marketplace redhat-operators
[ image: quay.io/rhceph-dev/ocs-registry:4.13.5-8 ]

c. Apply icsp.yaml:

$ podman run --entrypoint cat quay.io/rhceph-dev/ocs-registry:4.13.5-8 /icsp.yaml | oc apply -f -
imagecontentsourcepolicy.operator.openshift.io/df-repo-v4.13.5-8 created

d. Check the CSVs:

$ oc get csv -A
NAMESPACE                              NAME                                    DISPLAY                       VERSION        REPLACES                                 PHASE
openshift-operator-lifecycle-manager   packageserver                           Package Server                0.19.0                                                  Succeeded
openshift-storage                      mcg-operator.v4.13.5-rhodf              NooBaa Operator               4.13.5-rhodf   mcg-operator.v4.12.10-rhodf              Succeeded
openshift-storage                      ocs-operator.v4.13.5-rhodf              OpenShift Container Storage   4.13.5-rhodf   ocs-operator.v4.12.10-rhodf              Succeeded
openshift-storage                      odf-csi-addons-operator.v4.13.5-rhodf   CSI Addons                    4.13.5-rhodf   odf-csi-addons-operator.v4.12.10-rhodf   Succeeded
openshift-storage                      odf-operator.v4.13.5-rhodf              OpenShift Data Foundation     4.13.5-rhodf   odf-operator.v4.12.10-rhodf              Succeeded

$ oc get pods
NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6f5b6bf87c-hkz5h     2/2     Running   0          2m40s
noobaa-operator-5f44fc7f8b-4f5xb                   1/1     Running   0          2m9s
ocs-metrics-exporter-5b6495bb8-rv7nn               1/1     Running   0          2m3s
ocs-operator-6b55544958-mbcb2                      1/1     Running   0          2m2s
odf-console-6c74987c64-9mlwl                       1/1     Running   0          15m
odf-operator-controller-manager-67d7b5797c-fcbkd   2/2     Running   0          15m
rook-ceph-operator-b66687fd7-78rds                 1/1     Running   0          77s

4. Install StorageCluster [4.13.5-8]

$ oc get storagecluster
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   5m24s   Ready              2023-11-19T13:32:02Z   4.13.5

sh-5.1$ ceph health
HEALTH_OK
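As an extra spot-check (not part of the procedure above), one can confirm the ConfigMap now carries the key the Rook operator was failing on, assuming the fixed operator populates it:

$ oc get cm ocs-operator-config -n openshift-storage -o jsonpath='{.data.CSI_ENABLE_READ_AFFINITY}'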
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.5 Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:7775