Description of problem (please be as detailed as possible and provide log snippets):

Upgrade from 4.12 to 4.13 in an external mode deployment is failing because rook-ceph-operator does not reach a clean state after the upgrade.

Version of all relevant components (if applicable):
quay.io/rhceph-dev/ocs-registry:4.13.0-124

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, not able to proceed with upgrades from 4.12 to 4.13 in external mode deployments.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Not sure

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy 4.12 in external mode
2. Run the upgrade to 4.13 with the build quay.io/rhceph-dev/ocs-registry:4.13.0-124

Actual results:
Upgrade fails

Expected results:
Successful upgrade

Additional info:

From the rook-ceph-operator logs:
===========================================
name: rook-ceph-operator
ready: false
restartCount: 0
started: false
state:
  waiting:
    message: couldn't find key CSI_CEPHFS_KERNEL_MOUNT_OPTIONS in ConfigMap openshift-storage/ocs-operator-config
    reason: CreateContainerConfigError
===============================================

From the odf-operator-controller-manager logs:
========================================
2023-04-20T14:23:06.799892316Z 2023-04-20T14:23:06Z INFO controllers.StorageSystem vendor CSV is installed and ready {"instance": "openshift-storage/ocs-external-storagecluster-storagesystem", "ClusterServiceVersion": "odf-csi-addons-operator.v4.13.0-124.stable"}
2023-04-20T14:23:06.809406503Z 2023-04-20T14:23:06Z ERROR Reconciler error {"controller": "storagesystem", "controllerGroup": "odf.openshift.io", "controllerKind": "StorageSystem", "StorageSystem": {"name":"ocs-external-storagecluster-storagesystem","namespace":"openshift-storage"}, "namespace": "openshift-storage", "name": "ocs-external-storagecluster-storagesystem", "reconcileID": "47fcd510-9671-4993-9a89-480bd619a2fe", "error": "CSV is not successfully installed"}
2023-04-20T14:23:06.809406503Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
=========================================

All the must-gather data is located at: https://url.corp.redhat.com/mustgather
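For context, CreateContainerConfigError is the kubelet's response when a container env var points at a ConfigMap key that does not exist and the reference is not marked optional. The sketch below is an assumption about the kind of reference the rook-ceph-operator Deployment carries (it is not taken from the actual Deployment manifest); the ConfigMap and key names come from the log message above.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical env-var reference: if CSI_CEPHFS_KERNEL_MOUNT_OPTIONS is
	// missing from the ocs-operator-config ConfigMap, the kubelet cannot
	// resolve this reference and the container stays in
	// CreateContainerConfigError, which matches the pod status above.
	env := corev1.EnvVar{
		Name: "CSI_CEPHFS_KERNEL_MOUNT_OPTIONS",
		ValueFrom: &corev1.EnvVarSource{
			ConfigMapKeyRef: &corev1.ConfigMapKeySelector{
				LocalObjectReference: corev1.LocalObjectReference{
					Name: "ocs-operator-config",
				},
				Key: "CSI_CEPHFS_KERNEL_MOUNT_OPTIONS",
				// Optional defaults to false, so a missing key blocks
				// container creation instead of being silently skipped.
			},
		},
	}
	fmt.Printf("%+v\n", env)
}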
The ocs-operator-config ConfigMap is only updated when the desired value of a key differs from its current value. During an external mode upgrade from 4.12 to 4.13, the operator computes the desired value by calling getCephFSKernelMountOptions, which returns a blank value, and compares it with the current value of the key. Since the key does not exist in the ConfigMap, the lookup coincidentally also returns a blank value, so the two match, the ConfigMap is considered to already be in the correct state, and the key is never added. https://github.com/red-hat-storage/ocs-operator/blob/1c6879a5455aa669d8b984973cdcffd626c6eb32/controllers/storagecluster/ocs_operator_config.go#L58

A solution is to have getCephFSKernelMountOptions return ms_mode=legacy instead of a blank value, which is functionally equivalent but avoids this issue. https://github.com/red-hat-storage/ocs-operator/blob/1c6879a5455aa669d8b984973cdcffd626c6eb32/controllers/storagecluster/ocs_operator_config.go#L125. Will raise a fix accordingly.
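A minimal, self-contained sketch of the faulty comparison, assuming simplified names rather than the actual ocs-operator code:

package main

import "fmt"

// getCephFSKernelMountOptions models the pre-fix behaviour described above:
// for the external mode upgrade path it returns a blank value. The proposed
// fix returns "ms_mode=legacy" instead, which behaves the same for the mount
// but can never be confused with a missing key.
func getCephFSKernelMountOptions() string {
	return "" // pre-fix: blank; post-fix: "ms_mode=legacy"
}

func main() {
	// Data of the ocs-operator-config ConfigMap as carried over from 4.12:
	// the CSI_CEPHFS_KERNEL_MOUNT_OPTIONS key was never created.
	cmData := map[string]string{}

	desired := getCephFSKernelMountOptions()
	// Indexing a Go map with a missing key returns the zero value "",
	// which is the coincidental blank value mentioned above.
	current := cmData["CSI_CEPHFS_KERNEL_MOUNT_OPTIONS"]

	if desired != current {
		// Never reached pre-fix: "" == "", so the reconcile skips the
		// update and the key is never added to the ConfigMap.
		cmData["CSI_CEPHFS_KERNEL_MOUNT_OPTIONS"] = desired
	}

	_, present := cmData["CSI_CEPHFS_KERNEL_MOUNT_OPTIONS"]
	fmt.Println("key present after reconcile:", present) // false
}

With the fix, desired becomes "ms_mode=legacy", the comparison fails, and the key is written on the next reconcile, so the rook-ceph-operator container can resolve its configMapKeyRef.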
Hi @pbalogh, the bug was initially reported because the rook-ceph-operator pod was not coming up to Running after the upgrade. I checked the must gather and the earlier issue of the rook-ceph-operator pod not starting is no longer present; in fact, all the pods are clean and running. I checked the CSVs and all of them have succeeded. I also checked the StorageCluster, which is Ready, and the CephCluster, which is Connected. Looking at the job details, I see a few other problems, such as AssertionError: Job mcg-workload doesn't have any running pod, and a few other errors. Can you try rerunning the job?
This failure looks like something other than the original issue. Can you rerun the job with a pause so that I can take a look inside the cluster and see what's wrong?
Any updates?
Moving the discussion to Google chat for faster resolution. Thread Link- https://chat.google.com/room/AAAAREGEba8/vgRySkM1nXE
Upgrade has passed ok again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742