Description of problem (please be as detailed as possible and provide log snippets):

Upgrade from 4.12 to 4.13 in an external mode deployment is failing because rook-ceph-operator does not reach a clean state after the upgrade.

Version of all relevant components (if applicable):
quay.io/rhceph-dev/ocs-registry:4.13.0-124

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, not able to proceed with upgrades from 4.12 to 4.13 in external mode deployments.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Not sure

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy 4.12 in external mode
2. Run the upgrade to 4.13 with the build quay.io/rhceph-dev/ocs-registry:4.13.0-124

Actual results:
Upgrade fails

Expected results:
Successful upgrade

Additional info:

From the rook-ceph-operator logs:
===========================================
name: rook-ceph-operator
ready: false
restartCount: 0
started: false
state:
  waiting:
    message: couldn't find key CSI_CEPHFS_KERNEL_MOUNT_OPTIONS in ConfigMap openshift-storage/ocs-operator-config
    reason: CreateContainerConfigError
===============================================

From the odf-operator-controller-manager logs:
========================================
2023-04-20T14:23:06.799892316Z 2023-04-20T14:23:06Z INFO controllers.StorageSystem vendor CSV is installed and ready {"instance": "openshift-storage/ocs-external-storagecluster-storagesystem", "ClusterServiceVersion": "odf-csi-addons-operator.v4.13.0-124.stable"}
2023-04-20T14:23:06.809406503Z 2023-04-20T14:23:06Z ERROR Reconciler error {"controller": "storagesystem", "controllerGroup": "odf.openshift.io", "controllerKind": "StorageSystem", "StorageSystem": {"name":"ocs-external-storagecluster-storagesystem","namespace":"openshift-storage"}, "namespace": "openshift-storage", "name": "ocs-external-storagecluster-storagesystem", "reconcileID": "47fcd510-9671-4993-9a89-480bd619a2fe", "error": "CSV is not successfully installed"}
2023-04-20T14:23:06.809406503Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
=========================================

All the must-gather data is located at: https://url.corp.redhat.com/mustgather
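For context, CreateContainerConfigError is the kubelet's response when a container env var points at a ConfigMap key that does not exist and the reference is not marked optional. The sketch below is an assumption about the kind of reference the rook-ceph-operator Deployment carries (it is not taken from the actual Deployment manifest); the ConfigMap and key names come from the log message above.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical env-var reference: if CSI_CEPHFS_KERNEL_MOUNT_OPTIONS is
	// missing from the ocs-operator-config ConfigMap, the kubelet cannot
	// resolve this reference and the container stays in
	// CreateContainerConfigError, which matches the pod status above.
	env := corev1.EnvVar{
		Name: "CSI_CEPHFS_KERNEL_MOUNT_OPTIONS",
		ValueFrom: &corev1.EnvVarSource{
			ConfigMapKeyRef: &corev1.ConfigMapKeySelector{
				LocalObjectReference: corev1.LocalObjectReference{
					Name: "ocs-operator-config",
				},
				Key: "CSI_CEPHFS_KERNEL_MOUNT_OPTIONS",
				// Optional defaults to false, so a missing key blocks
				// container creation instead of being silently skipped.
			},
		},
	}
	fmt.Printf("%+v\n", env)
}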
The ocs-operator-config ConfigMap is only updated when the desired value of a key differs from its current value. During an external mode upgrade from 4.12 to 4.13, the operator computes the desired value by calling getCephFSKernelMountOptions, which returns a blank value, and compares it with the current value of the key. Since the key does not exist in the ConfigMap, the lookup coincidentally also returns a blank value, so the two match, the ConfigMap is considered to already be in the correct state, and the key is never added. https://github.com/red-hat-storage/ocs-operator/blob/1c6879a5455aa669d8b984973cdcffd626c6eb32/controllers/storagecluster/ocs_operator_config.go#L58

A solution is to have getCephFSKernelMountOptions return ms_mode=legacy instead of a blank value, which is functionally equivalent but avoids this issue. https://github.com/red-hat-storage/ocs-operator/blob/1c6879a5455aa669d8b984973cdcffd626c6eb32/controllers/storagecluster/ocs_operator_config.go#L125. Will raise a fix accordingly.
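A minimal, self-contained sketch of the faulty comparison, assuming simplified names rather than the actual ocs-operator code:

package main

import "fmt"

// getCephFSKernelMountOptions models the pre-fix behaviour described above:
// for the external mode upgrade path it returns a blank value. The proposed
// fix returns "ms_mode=legacy" instead, which behaves the same for the mount
// but can never be confused with a missing key.
func getCephFSKernelMountOptions() string {
	return "" // pre-fix: blank; post-fix: "ms_mode=legacy"
}

func main() {
	// Data of the ocs-operator-config ConfigMap as carried over from 4.12:
	// the CSI_CEPHFS_KERNEL_MOUNT_OPTIONS key was never created.
	cmData := map[string]string{}

	desired := getCephFSKernelMountOptions()
	// Indexing a Go map with a missing key returns the zero value "",
	// which is the coincidental blank value mentioned above.
	current := cmData["CSI_CEPHFS_KERNEL_MOUNT_OPTIONS"]

	if desired != current {
		// Never reached pre-fix: "" == "", so the reconcile skips the
		// update and the key is never added to the ConfigMap.
		cmData["CSI_CEPHFS_KERNEL_MOUNT_OPTIONS"] = desired
	}

	_, present := cmData["CSI_CEPHFS_KERNEL_MOUNT_OPTIONS"]
	fmt.Println("key present after reconcile:", present) // false
}

With the fix, desired becomes "ms_mode=legacy", the comparison fails, and the key is written on the next reconcile, so the rook-ceph-operator container can resolve its configMapKeyRef.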
Hi @pbalogh, the bug was initially reported because the rook-ceph-operator pod was not coming up to Running after the upgrade. I checked the must gather and the earlier issue of the rook-ceph-operator pod not starting is no longer present; in fact, all the pods are clean and running. I checked the CSVs and all of them have succeeded. I also checked the StorageCluster, which is Ready, and the CephCluster, which is Connected. Looking at the job details, I see a few other problems, such as AssertionError: Job mcg-workload doesn't have any running pod, and a few other errors. Can you try rerunning the job?
This failure looks like something other than the original issue. Can you rerun the job with a pause so that I can take a look inside the cluster and see what's wrong?
Any updates?
Moving the discussion to Google chat for faster resolution. Thread Link- https://chat.google.com/room/AAAAREGEba8/vgRySkM1nXE
Upgrade has passed ok again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742