Bug 2235571 - [ODF 4.13] Upgrade from 4.12.z to 4.13.z fails if StorageCluster is not created
Summary: [ODF 4.13] Upgrade from 4.12.z to 4.13.z fails if StorageCluster is not created
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ODF 4.13.5
Assignee: Malay Kumar parida
QA Contact: Oded
URL:
Whiteboard: block
Duplicates: 2248958
Depends On: 2231074
Blocks:
 
Reported: 2023-08-29 06:03 UTC by Malay Kumar parida
Modified: 2023-12-13 12:58 UTC
CC List: 9 users

Fixed In Version: 4.13.5-8
Doc Type: No Doc Update
Doc Text:
Clone Of: 2231074
Environment:
Last Closed: 2023-12-13 12:58:12 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-operator pull 2181 0 None open Bug 2235571:[release-4.13] Fix upgrade issue when storagecluster is not present 2023-09-28 10:19:57 UTC
Red Hat Product Errata RHBA-2023:7775 0 None None None 2023-12-13 12:58:16 UTC

Description Malay Kumar parida 2023-08-29 06:03:23 UTC
+++ This bug was initially created as a clone of Bug #2231074 +++

Description of problem (please be as detailed as possible and provide log
snippets):

If ODF v4.12.z is installed but a StorageCluster is not yet created, and we try to upgrade to ODF v4.13.z, the upgrade does not succeed because the "rook-ceph-operator" pod is stuck in "CreateContainerConfigError".

➜  ~ oc get pod/rook-ceph-operator-799f4557f8-z76dn
NAME                                  READY   STATUS                       RESTARTS   AGE
rook-ceph-operator-799f4557f8-z76dn   0/1     CreateContainerConfigError   0          85s
---

➜  ~ oc describe pod/rook-ceph-operator-799f4557f8-z76dn
Events:
  Type     Reason                 Age                 From               Message
  ----     ------                 ----                ----               -------
  Normal   Scheduled              110s                default-scheduler  Successfully assigned openshift-storage/rook-ceph-operator-799f4557f8-z76dn to t3-585mv-worker-0-b5rl7
  Normal   Pulled                 6s (x10 over 107s)  kubelet            Container image "icr.io/cpopen/rook-ceph-operator@sha256:70aebdc2b80283fc69f77acc7390667868939dea5839070673814b6351fda4d7" already present on machine
  Warning  Failed                 6s (x10 over 107s)  kubelet            Error: couldn't find key CSI_ENABLE_READ_AFFINITY in ConfigMap openshift-storage/ocs-operator-config
---

➜  ~ oc get cm ocs-operator-config -oyaml
apiVersion: v1
data:
  CSI_CLUSTER_NAME: 8a514d5d-f345-42bd-8fa7-54c37e9c9fe2
kind: ConfigMap
metadata:
  creationTimestamp: "2023-08-10T07:16:14Z"
  name: ocs-operator-config
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: ocs.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: OCSInitialization
    name: ocsinit
    uid: 6cdfa990-37e1-4596-b0e5-69baedafc0f3
  resourceVersion: "17531216"
  uid: 22e4fa9c-a8ca-40fa-8e92-c2c4b4f5119d


Version of all relevant components (if applicable):
ODF v4.13.z

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes

Is there any workaround available to the best of your knowledge?
Yes. Delete the "ocs-operator-config" ConfigMap.
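A minimal sketch of that workaround, assuming the default openshift-storage namespace (the operator is expected to recreate the ConfigMap on its next reconcile):

# Workaround sketch: remove the stale ConfigMap so it gets recreated with the expected keys
$ oc delete configmap ocs-operator-config -n openshift-storage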

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

Yes

If this is a regression, please provide more details to justify this:

It is a regression, since it didn't happen in previous upgrades. However, it's a corner case and a very minor issue that was never tested.

Steps to Reproduce:
1. Install ODF operator v4.12.z. Do not create StorageCluster.
2. Upgrade to ODF operator v4.13.z.
3. Check the operator pod status.
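For reference, steps 2 and 3 can also be driven from the CLI; a minimal sketch, assuming the subscription is named odf-operator and the operator pod carries the standard app=rook-ceph-operator label:

# Step 2: switch the subscription channel to trigger the upgrade
$ oc patch subscription odf-operator -n openshift-storage --type merge -p '{"spec":{"channel":"stable-4.13"}}'

# Step 3: check the operator pod status
$ oc get pods -n openshift-storage -l app=rook-ceph-operator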


Actual results:
The rook-ceph-operator pod is in "CreateContainerConfigError", blocking the upgrade.

Expected results:
Upgrade should complete without issue.

Additional info:

--- Additional comment from RHEL Program Management on 2023-08-10 13:24:16 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.14.0' is now set to '?', so the bug is proposed to be fixed in the ODF 4.14.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks must be set against a release flag.

--- Additional comment from Malay Kumar parida on 2023-08-25 08:55:19 UTC ---

The same issue can also be seen when upgrading from 4.13 to 4.14 without a storagecluster.
The root cause is that the ocsinitialization controller only creates the ocs-operator-config cm (from which the rook-ceph-operator pod takes env values for configuration); the task of keeping it updated is on the storagecluster controller. But when a storagecluster is not there and an upgrade happens, the configmap is not updated, while the new rook operator looks for the key for its configuration. So the rook-ceph-operator pod fails, and so does the upgrade.

The solution is to change the handling of the configmap so that when a storagecluster is not present, the ocsinitialization controller owns the configmap and keeps it updated, but when the storagecluster is present, it does nothing and leaves the configmap to the storagecluster controller.
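A hedged way to verify the intended behaviour, using the key name from the error above and assuming the default namespace (once the configmap is kept up to date, the command should print a value rather than an empty string):

# Check that the key the 4.13 rook-ceph-operator expects is now present in the ConfigMap
$ oc get configmap ocs-operator-config -n openshift-storage -o jsonpath='{.data.CSI_ENABLE_READ_AFFINITY}'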

Comment 3 krishnaram Karthick 2023-10-27 10:16:41 UTC
Moving to 4.13.6 as we don't have any more bandwidth for testing.

Comment 6 Santosh Pillai 2023-11-10 05:22:48 UTC
*** Bug 2248958 has been marked as a duplicate of this bug. ***

Comment 7 khover 2023-11-10 12:16:45 UTC
Hello Team,

Do we have a target date for the backport?

We already have customers hitting this issue on install of 4.13.

Comment 8 Malay Kumar parida 2023-11-13 05:02:43 UTC
The proposed fix for this has been up for some time, but due to QE bandwidth issues it has been pushed to 4.13.6.

Comment 9 Boris Ranto 2023-11-13 09:35:24 UTC
This looks like a much bigger issue now that ODF 4.14 has been released: it is actually blocking our CI, and customers seem to be hitting it as well. As such, we really need to mark this as a blocker and nominate it for 4.13.5.

Comment 11 Malay Kumar parida 2023-11-13 09:39:22 UTC
https://chat.google.com/room/AAAAREGEba8/q0sy9xRQ7e0

After the GA of 4.14 on 8th Nov, our CatalogSource now also has 4.14 bundles, which changes the behaviour of OLM.
Now, if someone tries to install ODF 4.13, OLM first installs ODF 4.12 and then automatically upgrades to 4.13.
So this issue gets triggered here, because there is an upgrade from 4.12 to 4.13 while no storagecluster is present.
This is now being hit by a customer trying to install ODF 4.13.4, and our CI is facing it during the 4.13.5 builds.
We need to take this in 4.13.5.
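A hedged way to observe this OLM behaviour from the subscription status, assuming the subscription is named odf-operator (during the automatic upgrade, installedCSV stays at the 4.12 CSV while currentCSV already points at the 4.13 CSV):

$ oc get subscription odf-operator -n openshift-storage -o jsonpath='{.status.installedCSV}{"\n"}{.status.currentCSV}{"\n"}'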

Comment 13 khover 2023-11-16 18:00:30 UTC
KCS created


https://access.redhat.com/solutions/7044025

Comment 15 Oded 2023-11-19 13:39:36 UTC
Bug Fixed


Test Procedure:
1. Deploy OCP cluster 4.13.0-0.nightly-2023-11-17-172647 [vsphere]
2. Install ODF 4.12.10-1 [without storagecluster]
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge
operatorhub.config.openshift.io/cluster patched

$ cat CatalogSource.yaml 
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators
  namespace: openshift-marketplace
  labels:
      ocs-operator-internal: "true"
spec:
  displayName: Openshift Container Storage
  icon:
    base64data: ""
    mediatype: ""
  image: quay.io/rhceph-dev/ocs-registry:latest-stable-4.12
  publisher: Red Hat
  sourceType: grpc
  priority: 100
  # If the registry image still has the same tag (latest-stable-4.6, or for stage testing)
  # we need to have this updateStrategy, otherwise we will not see new pushed content.
  updateStrategy:
    registryPoll:
        interval: 15m

$ oc create -f CatalogSource.yaml 
catalogsource.operators.coreos.com/redhat-operators created

$ podman run --entrypoint cat quay.io/rhceph-dev/ocs-registry:latest-stable-4.12  /icsp.yaml | oc apply -f -
imagecontentsourcepolicy.operator.openshift.io/df-repo-v4.12.10-1 created

$ oc get csv -A 
NAMESPACE                              NAME                                     DISPLAY                       VERSION         REPLACES                                PHASE
openshift-operator-lifecycle-manager   packageserver                            Package Server                0.19.0                                                  Succeeded
openshift-storage                      mcg-operator.v4.12.10-rhodf              NooBaa Operator               4.12.10-rhodf   mcg-operator.v4.12.9-rhodf              Succeeded
openshift-storage                      ocs-operator.v4.12.10-rhodf              OpenShift Container Storage   4.12.10-rhodf   ocs-operator.v4.12.9-rhodf              Succeeded
openshift-storage                      odf-csi-addons-operator.v4.12.10-rhodf   CSI Addons                    4.12.10-rhodf   odf-csi-addons-operator.v4.12.9-rhodf   Succeeded
openshift-storage                      odf-operator.v4.12.10-rhodf              OpenShift Data Foundation     4.12.10-rhodf   odf-operator.v4.12.9-rhodf              Succeeded

$ oc get pods -n openshift-storage
NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6f58b5f5d5-gs25w     2/2     Running   0          3m2s
noobaa-operator-b8c48b64b-lf6xd                    1/1     Running   0          9m54s
ocs-metrics-exporter-649bfbf4f9-swmfg              1/1     Running   0          10m
ocs-operator-654f56d565-j4hc8                      1/1     Running   0          10m
odf-console-5fbb8546bb-x2fr7                       1/1     Running   0          9m58s
odf-operator-controller-manager-5df847fc94-5hczs   2/2     Running   0          9m58s
rook-ceph-operator-5cc69f8967-rnpbm                1/1     Running   0          10m


3. Upgrade ODF 4.12.10-1 -> ODF 4.13.5-8
a. Change channel in subscription odf-operator [stable-4.12 -> stable-4.13]
$ oc edit subscription odf-operator -n openshift-storage
subscription.operators.coreos.com/odf-operator edited

b. Edit catalogsource
oc edit catalogsource -n openshift-marketplace redhat-operators  [  image: quay.io/rhceph-dev/ocs-registry:4.13.5-8 ]

c. Apply icsp.yaml
podman run --entrypoint cat quay.io/rhceph-dev/ocs-registry:4.13.5-8   /icsp.yaml | oc apply -f -
imagecontentsourcepolicy.operator.openshift.io/df-repo-v4.13.5-8 created

d. Check CSV:
$ oc get csv -A   
NAMESPACE                              NAME                                    DISPLAY                       VERSION        REPLACES                                 PHASE
openshift-operator-lifecycle-manager   packageserver                           Package Server                0.19.0                                                  Succeeded
openshift-storage                      mcg-operator.v4.13.5-rhodf              NooBaa Operator               4.13.5-rhodf   mcg-operator.v4.12.10-rhodf              Succeeded
openshift-storage                      ocs-operator.v4.13.5-rhodf              OpenShift Container Storage   4.13.5-rhodf   ocs-operator.v4.12.10-rhodf              Succeeded
openshift-storage                      odf-csi-addons-operator.v4.13.5-rhodf   CSI Addons                    4.13.5-rhodf   odf-csi-addons-operator.v4.12.10-rhodf   Succeeded
openshift-storage                      odf-operator.v4.13.5-rhodf              OpenShift Data Foundation     4.13.5-rhodf   odf-operator.v4.12.10-rhodf              Succeeded

$ oc get pods
NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6f5b6bf87c-hkz5h     2/2     Running   0          2m40s
noobaa-operator-5f44fc7f8b-4f5xb                   1/1     Running   0          2m9s
ocs-metrics-exporter-5b6495bb8-rv7nn               1/1     Running   0          2m3s
ocs-operator-6b55544958-mbcb2                      1/1     Running   0          2m2s
odf-console-6c74987c64-9mlwl                       1/1     Running   0          15m
odf-operator-controller-manager-67d7b5797c-fcbkd   2/2     Running   0          15m
rook-ceph-operator-b66687fd7-78rds                 1/1     Running   0          77s

4. Install StorageCluster [4.13.5-8]
$ oc get storagecluster 
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   5m24s   Ready              2023-11-19T13:32:02Z   4.13.5

sh-5.1$ ceph health
HEALTH_OK
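(The ceph health check above is presumably run from the rook-ceph toolbox; a hedged sketch of enabling and using it, assuming the documented enableCephTools knob and the rook-ceph-tools deployment name:)

# Enable the toolbox via OCSInitialization, then run ceph health inside it
$ oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch '[{"op": "replace", "path": "/spec/enableCephTools", "value": true}]'
$ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph health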

Comment 20 errata-xmlrpc 2023-12-13 12:58:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.5 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7775

