Bug 2188427
Summary: | [External mode upgrade]: Upgrade from 4.12 -> 4.13 external mode is failing because rook-ceph-operator is not reaching clean state | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | shylesh <shmohan>
Component: | ocs-operator | Assignee: | Malay Kumar parida <mparida>
Status: | CLOSED ERRATA | QA Contact: | Petr Balogh <pbalogh>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.13 | CC: | mparida, muagarwa, nberry, nigoyal, ocs-bugs, odf-bz-bot, pbalogh, vavuthu
Target Milestone: | --- | |
Target Release: | ODF 4.13.0 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | 4.13.0-181 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-06-21 15:25:08 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
shylesh
2023-04-20 17:39:55 UTC
The ocs-operator-config ConfigMap is only updated when the desired value of a key differs from its current value. During an external mode upgrade from 4.12 to 4.13, the operator computes the desired value by calling the function getCephFSKernelMountOptions, which returns a blank value, and compares it with the current value of the key. Because the key does not yet exist in the ConfigMap, the lookup also returns a blank value, so the operator concludes the ConfigMap is already in the correct state and skips adding the key. https://github.com/red-hat-storage/ocs-operator/blob/1c6879a5455aa669d8b984973cdcffd626c6eb32/controllers/storagecluster/ocs_operator_config.go#L58

A solution is to have getCephFSKernelMountOptions return ms_mode=legacy instead of a blank value, which is functionally equivalent but avoids this issue. https://github.com/red-hat-storage/ocs-operator/blob/1c6879a5455aa669d8b984973cdcffd626c6eb32/controllers/storagecluster/ocs_operator_config.go#L125 Will raise a fix accordingly. (A minimal sketch of this compare-and-update behaviour follows the comment history below.)

Hi @pbalogh, initially the bug was reported because the rook-ceph-operator pod was not reaching the Running state after the upgrade. I checked the must-gather, and the earlier issue of the rook-ceph-operator pod not starting is no longer present; in fact, all the pods are clean and running. I checked the CSVs, and all of them have succeeded. I also checked the StorageCluster, which is Ready, and the CephCluster, which is Connected. Looking at the job details, I see a few other problems, such as "AssertionError: Job mcg-workload doesn't have any running pod" and a few other errors. Can you try rerunning the job? This failure looks like something other than the original issue. Can you rerun the job with a pause so that I can take a look inside the cluster to see what is wrong?

Any updates?

Moving the discussion to Google Chat for faster resolution. Thread link: https://chat.google.com/room/AAAAREGEba8/vgRySkM1nXE

Upgrade has passed OK again.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742
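As a reference for the discussion above, here is a minimal, self-contained Go sketch of the compare-and-update pattern that masks the missing key. Only the function name getCephFSKernelMountOptions and the value ms_mode=legacy come from the report; the ConfigMap key name and the rest of the scaffolding are illustrative assumptions, not the actual ocs-operator code.

```go
package main

import "fmt"

// mountOptionsKey is an assumed key name for illustration only; it is not
// necessarily the key used by the real ocs-operator-config ConfigMap.
const mountOptionsKey = "CSI_CEPHFS_KERNEL_MOUNT_OPTIONS"

// Before the fix: an external mode cluster yields a blank desired value.
func getCephFSKernelMountOptionsBefore() string { return "" }

// After the fix: return an explicit, functionally equivalent default so the
// desired value can never match the blank value read from a missing key.
func getCephFSKernelMountOptionsAfter() string { return "ms_mode=legacy" }

// updateIfChanged mimics the "only update when desired != current" logic;
// a missing map key reads back as "", which is what hides the bug.
func updateIfChanged(data map[string]string, desired string) bool {
	if data[mountOptionsKey] == desired {
		return false // looks already correct, so the key is never added
	}
	data[mountOptionsKey] = desired
	return true
}

func main() {
	cm := map[string]string{} // key absent, as right after a 4.12 -> 4.13 upgrade

	fmt.Println("before fix, updated:", updateIfChanged(cm, getCephFSKernelMountOptionsBefore())) // false
	fmt.Println("after fix, updated:", updateIfChanged(cm, getCephFSKernelMountOptionsAfter()))   // true
}
```

With the blank return value the reconcile reports nothing to do and the key is never written; returning an explicit default makes the comparison fail on the first pass, so the key gets added to the ConfigMap.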