Bug 2188427 - [External mode upgrade]: Upgrade from 4.12 -> 4.13 external mode is failing because rook-ceph-operator is not reaching clean state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.13
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Malay Kumar parida
QA Contact: Petr Balogh
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-04-20 17:39 UTC by shylesh
Modified: 2023-08-09 17:00 UTC
CC List: 8 users

Fixed In Version: 4.13.0-181
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-21 15:25:08 UTC
Embargoed:




Links:
- GitHub red-hat-storage/ocs-operator pull 2024 (open): Return ms_mode=legacy instead of blank value for External/Provider mode (last updated 2023-04-20 18:40:56 UTC)
- GitHub red-hat-storage/ocs-operator pull 2026 (open): Bug 2188427: [release-4.13] Use ms_mode=legacy instead of blank value, add network spec to external mode ceph cluster, ad... (last updated 2023-04-24 04:57:10 UTC)
- Red Hat Product Errata RHBA-2023:3742 (last updated 2023-06-21 15:25:52 UTC)

Description shylesh 2023-04-20 17:39:55 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Upgrade from 4.12 to 4.13 in an external mode deployment is failing because rook-ceph-operator is not reaching a clean state after the upgrade.

Version of all relevant components (if applicable):
quay.io/rhceph-dev/ocs-registry:4.13.0-124

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, we are not able to proceed with upgrades from 4.12 to 4.13 in external mode deployments.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Not sure

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy 4.12 in external mode.
2. Run the upgrade to 4.13 with the build quay.io/rhceph-dev/ocs-registry:4.13.0-124.



Actual results:
Upgrade fails

Expected results:
Successful upgrade

Additional info:
From the rook-ceph-operator pod's container status:

===========================================
    name: rook-ceph-operator
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: couldn't find key CSI_CEPHFS_KERNEL_MOUNT_OPTIONS in ConfigMap openshift-storage/ocs-operator-config
        reason: CreateContainerConfigError
===============================================
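
For context: kubelet reports CreateContainerConfigError with a "couldn't find key" message when a container sources an env var from a ConfigMap key that does not exist and the reference is not marked optional. Below is a minimal sketch of that wiring using the Kubernetes core/v1 Go types; the exact rook-ceph-operator deployment spec may differ, so treat the names here as illustrative.

===========================================
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// An env var sourced from a ConfigMap key. If the key is absent from
	// openshift-storage/ocs-operator-config and Optional is not set to true,
	// kubelet cannot assemble the container config and the pod sits in
	// CreateContainerConfigError, matching the status above.
	env := corev1.EnvVar{
		Name: "CSI_CEPHFS_KERNEL_MOUNT_OPTIONS",
		ValueFrom: &corev1.EnvVarSource{
			ConfigMapKeyRef: &corev1.ConfigMapKeySelector{
				LocalObjectReference: corev1.LocalObjectReference{
					Name: "ocs-operator-config",
				},
				Key: "CSI_CEPHFS_KERNEL_MOUNT_OPTIONS",
				// Optional is nil (false) by default: a missing key is a
				// hard error at container creation time.
			},
		},
	}
	fmt.Printf("%+v\n", env)
}
===============================================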


From odf-operator-controller-manager logs we see

========================================

2023-04-20T14:23:06.799892316Z 2023-04-20T14:23:06Z	INFO	controllers.StorageSystem	vendor CSV is installed and ready	{"instance": "openshift-storage/ocs-external-storagecluster-storagesystem", "ClusterServiceVersion": "odf-csi-addons-operator.v4.13.0-124.stable"}
2023-04-20T14:23:06.809406503Z 2023-04-20T14:23:06Z	ERROR	Reconciler error	{"controller": "storagesystem", "controllerGroup": "odf.openshift.io", "controllerKind": "StorageSystem", "StorageSystem": {"name":"ocs-external-storagecluster-storagesystem","namespace":"openshift-storage"}, "namespace": "openshift-storage", "name": "ocs-external-storagecluster-storagesystem", "reconcileID": "47fcd510-9671-4993-9a89-480bd619a2fe", "error": "CSV is not successfully installed"}
2023-04-20T14:23:06.809406503Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler

=========================================

All the must-gather logs are located at: https://url.corp.redhat.com/mustgather

Comment 2 Malay Kumar parida 2023-04-20 18:11:26 UTC
The ocs-operator-config ConfigMap is only updated when a key's desired value differs from its current value. During an external mode upgrade from 4.12 to 4.13, the operator computes the desired value by calling the function getCephFSKernelMountOptions, which returns a blank value, and compares it with the current value of the key. Coincidentally, looking up the non-existent key in the ConfigMap also yields a blank value, so the operator concludes the ConfigMap is already in the correct state and skips adding the key.
https://github.com/red-hat-storage/ocs-operator/blob/1c6879a5455aa669d8b984973cdcffd626c6eb32/controllers/storagecluster/ocs_operator_config.go#L58
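
A minimal, self-contained Go sketch of this failure mode (the function and key names mirror the real code, but the bodies are simplified stand-ins):

===========================================
package main

import "fmt"

// Simplified stand-in for the real helper: in 4.12-era external mode it
// returned a blank string.
func getCephFSKernelMountOptions(isExternal bool) string {
	if isExternal {
		return ""
	}
	return "ms_mode=secure" // placeholder for the internal-mode logic
}

func main() {
	// cm.Data of ocs-operator-config after the upgrade: the key was never
	// written, so it is absent from the map.
	data := map[string]string{}

	desired := getCephFSKernelMountOptions(true)

	// Indexing a Go map with a missing key yields the zero value (""), so
	// the equality check passes even though the key does not exist, and the
	// operator skips the update it actually needs to make.
	if data["CSI_CEPHFS_KERNEL_MOUNT_OPTIONS"] == desired {
		fmt.Println("skipping update: ConfigMap looks correct, but the key is missing")
	}
}
===============================================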

A solution is to have getCephFSKernelMountOptions return ms_mode=legacy instead of a blank value, which is functionally the same but avoids the collision between a blank desired value and a missing key. https://github.com/red-hat-storage/ocs-operator/blob/1c6879a5455aa669d8b984973cdcffd626c6eb32/controllers/storagecluster/ocs_operator_config.go#L125.
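
Sketched against the same simplification, the fix from the linked PR becomes (not the exact operator code):

===========================================
package storagecluster

// Return an explicit default instead of a blank string so the desired value
// can never collide with the zero value of a missing ConfigMap key. Per the
// comment above, ms_mode=legacy behaves the same as leaving the option unset.
func getCephFSKernelMountOptions(isExternal bool) string {
	if isExternal {
		return "ms_mode=legacy"
	}
	// Internal-mode branches elided; see the linked source for the real logic.
	return "ms_mode=legacy"
}
===============================================

An alternative would be to fix the comparison itself with a comma-ok map lookup (if _, ok := cm.Data[key]; !ok, then update), which distinguishes a missing key from one explicitly set to an empty string; returning a non-blank default is the simpler route.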

Will raise a fix accordingly.

Comment 9 Malay Kumar parida 2023-05-08 05:08:07 UTC
Hi @pbalogh,
Initially, the bug was reported because after the upgrade the rook-ceph-operator pod was not coming up to Running.

I checked the must gather, and the earlier issue of the rook-ceph-operator pod not starting up is no longer there. In fact, all the pods are clean and running.
I checked the CSVs, and all of them have succeeded.
I also checked the StorageCluster, which is ready, and the Ceph cluster, which is connected.

I checked the job details and I see a few other problems, such as "AssertionError: Job mcg-workload doesn't have any running pod", and a few other errors.

Can you try rerunning the job?

Comment 11 Malay Kumar parida 2023-05-12 10:03:30 UTC
This failure looks like something other than the original issue. Can you rerun the job with a pause so that I can take a look inside the cluster to see what's wrong?

Comment 12 Mudit Agarwal 2023-05-15 17:58:01 UTC
Any updates?

Comment 13 Malay Kumar parida 2023-05-15 18:34:02 UTC
Moving the discussion to Google Chat for faster resolution.
Thread Link- https://chat.google.com/room/AAAAREGEba8/vgRySkM1nXE

Comment 17 Petr Balogh 2023-05-17 06:45:18 UTC
The upgrade has passed OK again.

Comment 19 errata-xmlrpc 2023-06-21 15:25:08 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

