Description of problem (please be as detailed as possible and provide log snippets):

If ODF v4.12.z is installed but a StorageCluster has not yet been created, an upgrade to ODF v4.13.z does not succeed: the "rook-ceph-operator" pod is stuck in "CreateContainerConfigError".

➜ ~ oc get pod/rook-ceph-operator-799f4557f8-z76dn
NAME                                  READY   STATUS                       RESTARTS   AGE
rook-ceph-operator-799f4557f8-z76dn   0/1     CreateContainerConfigError   0          85s

➜ ~ oc describe pod/rook-ceph-operator-799f4557f8-z76dn
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  110s                default-scheduler  Successfully assigned openshift-storage/rook-ceph-operator-799f4557f8-z76dn to t3-585mv-worker-0-b5rl7
  Normal   Pulled     6s (x10 over 107s)  kubelet            Container image "icr.io/cpopen/rook-ceph-operator@sha256:70aebdc2b80283fc69f77acc7390667868939dea5839070673814b6351fda4d7" already present on machine
  Warning  Failed     6s (x10 over 107s)  kubelet            Error: couldn't find key CSI_ENABLE_READ_AFFINITY in ConfigMap openshift-storage/ocs-operator-config

➜ ~ oc get cm ocs-operator-config -oyaml
apiVersion: v1
data:
  CSI_CLUSTER_NAME: 8a514d5d-f345-42bd-8fa7-54c37e9c9fe2
kind: ConfigMap
metadata:
  creationTimestamp: "2023-08-10T07:16:14Z"
  name: ocs-operator-config
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: ocs.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: OCSInitialization
    name: ocsinit
    uid: 6cdfa990-37e1-4596-b0e5-69baedafc0f3
  resourceVersion: "17531216"
  uid: 22e4fa9c-a8ca-40fa-8e92-c2c4b4f5119d

Version of all relevant components (if applicable):
ODF v4.13.z

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
Yes

Is there any workaround available to the best of your knowledge?
Yes. Delete the "ocs-operator-config" ConfigMap (see the command sketch after this report).

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex):
1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
It is a regression, since it did not happen in previous upgrades. However, it is a corner case and a very minor issue which was never tested.

Steps to Reproduce:
1. Install the ODF operator v4.12.z. Do not create a StorageCluster.
2. Upgrade to the ODF operator v4.13.z.
3. Check the operator pod status.

Actual results:
The rook-ceph-operator pod is in "CreateContainerConfigError", blocking the upgrade.

Expected results:
The upgrade should complete without issue.

Additional info:
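For reference, a minimal sketch of the workaround mentioned above (deleting the ConfigMap), assuming the default openshift-storage namespace; the expectation, not verified here, is that the operator recreates the ConfigMap with the keys the new rook-ceph-operator looks for:

$ oc delete configmap ocs-operator-config -n openshift-storage
$ oc get pod -n openshift-storage | grep rook-ceph-operator
# the pod should leave CreateContainerConfigError once the ConfigMap is recreated with the expected keys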
The same issue can also be seen when upgrading from 4.13 to 4.14 without a StorageCluster. The root cause is that the ocsinitialization controller only creates the ocs-operator-config ConfigMap (from which the rook-ceph-operator pod takes its environment values for configuration); the task of keeping it updated belongs to the storagecluster controller. When no StorageCluster exists and an upgrade happens, the ConfigMap is never updated, but the new rook-ceph-operator looks for the new configuration key. So the rook-ceph-operator pod fails, and so does the upgrade. The solution is to change how the ConfigMap is handled: when no StorageCluster is present, the ocsinitialization controller should own the ConfigMap and keep it updated; when a StorageCluster is present, it should do nothing and leave that to the storagecluster controller.
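To make the failure mode concrete, a hedged illustration of what an updated ocs-operator-config ConfigMap would need to carry for the 4.13 rook-ceph-operator to start. Only CSI_CLUSTER_NAME and the missing CSI_ENABLE_READ_AFFINITY key are confirmed by the output in the report above; the value shown for the new key, and the absence of any other keys, are assumptions for illustration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ocs-operator-config
  namespace: openshift-storage
data:
  CSI_CLUSTER_NAME: 8a514d5d-f345-42bd-8fa7-54c37e9c9fe2
  # Key the 4.13 rook-ceph-operator fails on when it is missing; the default value is assumed.
  CSI_ENABLE_READ_AFFINITY: "false"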
Why should the ODF operator be upgraded if the StorageCluster is not installed? We can delete ODF and re-install the new operator.
Hi Oded, although it is true that users can just delete the whole thing and re-install the new version, this was a genuine bug in our code that should not have been there. The failure to upgrade might cause confusion and unnecessary support cases, so it is better just to fix it.
Hi Malay, can you check the test procedure?
1. Deploy an OCP 4.14 cluster without ODF.
2. Install ODF 4.13 without installing a StorageCluster [ quay.io/rhceph-dev/ocs-registry:4.13.3-5 ].
3. Upgrade to ODF 4.14 [ quay.io/rhceph-dev/ocs-registry:4.14.0-135 ].
Yes Oded, this is the way to test it.
Bug Fixed

1. Install OCP 4.14
   4.14.0-0.nightly-2023-09-15-233408

2. Install the ODF 4.13 [4.13.2-3] operator via the UI

$ oc get csv -A
NAMESPACE                              NAME                                    DISPLAY                       VERSION          REPLACES                                PHASE
openshift-operator-lifecycle-manager   packageserver                           Package Server                0.0.1-snapshot                                           Succeeded
openshift-storage                      mcg-operator.v4.13.2-rhodf              NooBaa Operator               4.13.2-rhodf     mcg-operator.v4.13.1-rhodf              Succeeded
openshift-storage                      ocs-operator.v4.13.2-rhodf              OpenShift Container Storage   4.13.2-rhodf     ocs-operator.v4.13.1-rhodf              Succeeded
openshift-storage                      odf-csi-addons-operator.v4.13.2-rhodf   CSI Addons                    4.13.2-rhodf     odf-csi-addons-operator.v4.13.1-rhodf   Succeeded
openshift-storage                      odf-operator.v4.13.2-rhodf              OpenShift Data Foundation     4.13.2-rhodf     odf-operator.v4.13.1-rhodf              Succeeded

3. Upgrade to ODF 4.14

a. Disable the default source redhat-operators:
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge
operatorhub.config.openshift.io/cluster patched

b. Change the channel in the odf-operator subscription [stable-4.13 -> stable-4.14]:
$ oc edit subscription odf-operator -n openshift-storage

c. Create a CatalogSource with the "quay.io/rhceph-dev/ocs-registry:latest-stable-4.14" image:
$ oc create -f CatalogSource.yaml
catalogsource.operators.coreos.com/redhat-operators created

$ cat CatalogSource.yaml
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators
  namespace: openshift-marketplace
  labels:
    ocs-operator-internal: "true"
spec:
  displayName: Openshift Container Storage
  icon:
    base64data: ""
    mediatype: ""
  image: quay.io/rhceph-dev/ocs-registry:latest-stable-4.14
  publisher: Red Hat
  sourceType: grpc
  priority: 100
  # If the registry image still has the same tag (latest-stable-4.6, or for stage testing)
  # we need to have this updateStrategy, otherwise we will not see newly pushed content.
  updateStrategy:
    registryPoll:
      interval: 15m

$ oc get CatalogSource redhat-operators -n openshift-marketplace
NAME               DISPLAY                       TYPE   PUBLISHER   AGE
redhat-operators   Openshift Container Storage   grpc   Red Hat     3m15s

d. Apply icsp.yaml:
$ podman run --entrypoint cat quay.io/rhceph-dev/ocs-registry:latest-stable-4.14 /icsp.yaml | oc apply -f -
imagecontentsourcepolicy.operator.openshift.io/df-repo created

4. Check the CSVs:
$ oc get csv -A
NAMESPACE                              NAME                                         DISPLAY                       VERSION             REPLACES                                PHASE
openshift-operator-lifecycle-manager   packageserver                                Package Server                0.0.1-snapshot                                              Succeeded
openshift-storage                      mcg-operator.v4.14.0-135.stable              NooBaa Operator               4.14.0-135.stable   mcg-operator.v4.13.2-rhodf              Succeeded
openshift-storage                      ocs-operator.v4.14.0-135.stable              OpenShift Container Storage   4.14.0-135.stable   ocs-operator.v4.13.2-rhodf              Succeeded
openshift-storage                      odf-csi-addons-operator.v4.14.0-135.stable   CSI Addons                    4.14.0-135.stable   odf-csi-addons-operator.v4.13.2-rhodf   Succeeded
openshift-storage                      odf-operator.v4.14.0-135.stable              OpenShift Data Foundation     4.14.0-135.stable   odf-operator.v4.13.2-rhodf              Succeeded

5. Check the pods in openshift-storage:
$ oc get pods -n openshift-storage
NAME                                               READY   STATUS    RESTARTS      AGE
csi-addons-controller-manager-5f9c677f6-b2kln      2/2     Running   0             4m1s
noobaa-operator-d9ddd977f-rcf6r                    2/2     Running   0             4m24s
ocs-metrics-exporter-7b5bf57957-ht66f              1/1     Running   0             4m44s
ocs-operator-568fbbb9c4-j86hk                      1/1     Running   0             4m44s
odf-console-d656466b5-vxszp                        1/1     Running   0             22m
odf-operator-controller-manager-66c899649b-g6rkb   2/2     Running   1 (10m ago)   22m
rook-ceph-operator-587cc6f966-sr8qt                1/1     Running   0             4m19s

For more info: https://docs.google.com/document/d/1lf4Enu-efynTMl0N77xxHQeVTfnSis0mX1L3i6vNa_4/edit
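As an additional check beyond the verification steps above (hedged, since the exact key set written by the fixed operator is not shown here), one could confirm that ocs-operator-config now contains the key the rook-ceph-operator previously failed on:

$ oc get configmap ocs-operator-config -n openshift-storage -o yaml | grep CSI_ENABLE_READ_AFFINITY
# expected (assumption): the key is listed, so the rook-ceph-operator container can be created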
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832