Description of problem (please be as detailed as possible and provide log snippets):

If ODF v4.12.z is installed but a StorageCluster has not yet been created, an upgrade to ODF v4.13.z does not succeed: the "rook-ceph-operator" pod is stuck in "CreateContainerConfigError".

➜ ~ oc get pod/rook-ceph-operator-799f4557f8-z76dn
NAME                                  READY   STATUS                       RESTARTS   AGE
rook-ceph-operator-799f4557f8-z76dn   0/1     CreateContainerConfigError   0          85s

➜ ~ oc describe pod/rook-ceph-operator-799f4557f8-z76dn
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  110s                default-scheduler  Successfully assigned openshift-storage/rook-ceph-operator-799f4557f8-z76dn to t3-585mv-worker-0-b5rl7
  Normal   Pulled     6s (x10 over 107s)  kubelet            Container image "icr.io/cpopen/rook-ceph-operator@sha256:70aebdc2b80283fc69f77acc7390667868939dea5839070673814b6351fda4d7" already present on machine
  Warning  Failed     6s (x10 over 107s)  kubelet            Error: couldn't find key CSI_ENABLE_READ_AFFINITY in ConfigMap openshift-storage/ocs-operator-config

➜ ~ oc get cm ocs-operator-config -oyaml
apiVersion: v1
data:
  CSI_CLUSTER_NAME: 8a514d5d-f345-42bd-8fa7-54c37e9c9fe2
kind: ConfigMap
metadata:
  creationTimestamp: "2023-08-10T07:16:14Z"
  name: ocs-operator-config
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: ocs.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: OCSInitialization
    name: ocsinit
    uid: 6cdfa990-37e1-4596-b0e5-69baedafc0f3
  resourceVersion: "17531216"
  uid: 22e4fa9c-a8ca-40fa-8e92-c2c4b4f5119d

Version of all relevant components (if applicable):
ODF v4.13.z

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
Yes

Is there any workaround available to the best of your knowledge?
Yes. Delete the "ocs-operator-config" ConfigMap (see the command sketch after this report).

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex):
1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
It is a regression, since it did not happen in previous upgrades. However, it is a corner case and a very minor issue which was never tested.

Steps to Reproduce:
1. Install the ODF operator v4.12.z. Do not create a StorageCluster.
2. Upgrade to the ODF operator v4.13.z.
3. Check the operator pod status.

Actual results:
The rook-ceph-operator pod is in "CreateContainerConfigError", blocking the upgrade.

Expected results:
The upgrade should complete without issue.

Additional info:
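For reference, a minimal sketch of the workaround mentioned above (deleting the ConfigMap), assuming the default openshift-storage namespace; the expectation, not verified here, is that the operator recreates the ConfigMap with the keys the new rook-ceph-operator looks for:

$ oc delete configmap ocs-operator-config -n openshift-storage
$ oc get pod -n openshift-storage | grep rook-ceph-operator
# the pod should leave CreateContainerConfigError once the ConfigMap is recreated with the expected keys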
The same issue can also be seen when upgrading from 4.13 to 4.14 without a StorageCluster. The root cause is that the ocsinitialization controller only creates the ocs-operator-config ConfigMap (from which the rook-ceph-operator pod takes its environment values for configuration); the task of keeping it updated belongs to the storagecluster controller. When no StorageCluster exists and an upgrade happens, the ConfigMap is never updated, but the new rook-ceph-operator looks for the new configuration key. So the rook-ceph-operator pod fails, and so does the upgrade. The solution is to change how the ConfigMap is handled: when no StorageCluster is present, the ocsinitialization controller should own the ConfigMap and keep it updated; when a StorageCluster is present, it should do nothing and leave that to the storagecluster controller.
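To make the failure mode concrete, a hedged illustration of what an updated ocs-operator-config ConfigMap would need to carry for the 4.13 rook-ceph-operator to start. Only CSI_CLUSTER_NAME and the missing CSI_ENABLE_READ_AFFINITY key are confirmed by the output in the report above; the value shown for the new key, and the absence of any other keys, are assumptions for illustration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ocs-operator-config
  namespace: openshift-storage
data:
  CSI_CLUSTER_NAME: 8a514d5d-f345-42bd-8fa7-54c37e9c9fe2
  # Key the 4.13 rook-ceph-operator fails on when it is missing; the default value is assumed.
  CSI_ENABLE_READ_AFFINITY: "false"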
Why should the ODF operator be upgraded if the StorageCluster is not installed? We can delete ODF and re-install the new operator.
Hi Oded, although it is true that users can just delete the whole thing and re-install the new version, this was a genuine bug in our code that should not have been there. The failure to upgrade might cause confusion and unnecessary support cases, so it is better just to fix it.
Hi Malay, can you check the test procedure?
1. Deploy an OCP 4.14 cluster without ODF.
2. Install ODF 4.13 without installing a StorageCluster [ quay.io/rhceph-dev/ocs-registry:4.13.3-5 ].
3. Upgrade to ODF 4.14 [ quay.io/rhceph-dev/ocs-registry:4.14.0-135 ].
Yes Oded, this is the way to test it.
Bug Fixed

1. Install OCP 4.14
   4.14.0-0.nightly-2023-09-15-233408

2. Install the ODF 4.13 [4.13.2-3] operator via the UI

$ oc get csv -A
NAMESPACE                              NAME                                    DISPLAY                       VERSION          REPLACES                                PHASE
openshift-operator-lifecycle-manager   packageserver                           Package Server                0.0.1-snapshot                                           Succeeded
openshift-storage                      mcg-operator.v4.13.2-rhodf              NooBaa Operator               4.13.2-rhodf     mcg-operator.v4.13.1-rhodf              Succeeded
openshift-storage                      ocs-operator.v4.13.2-rhodf              OpenShift Container Storage   4.13.2-rhodf     ocs-operator.v4.13.1-rhodf              Succeeded
openshift-storage                      odf-csi-addons-operator.v4.13.2-rhodf   CSI Addons                    4.13.2-rhodf     odf-csi-addons-operator.v4.13.1-rhodf   Succeeded
openshift-storage                      odf-operator.v4.13.2-rhodf              OpenShift Data Foundation     4.13.2-rhodf     odf-operator.v4.13.1-rhodf              Succeeded

3. Upgrade to ODF 4.14

a. Disable the default source redhat-operators:
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge
operatorhub.config.openshift.io/cluster patched

b. Change the channel in the odf-operator subscription [stable-4.13 -> stable-4.14]:
$ oc edit subscription odf-operator -n openshift-storage

c. Create a CatalogSource with the "quay.io/rhceph-dev/ocs-registry:latest-stable-4.14" image:
$ oc create -f CatalogSource.yaml
catalogsource.operators.coreos.com/redhat-operators created

$ cat CatalogSource.yaml
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators
  namespace: openshift-marketplace
  labels:
    ocs-operator-internal: "true"
spec:
  displayName: Openshift Container Storage
  icon:
    base64data: ""
    mediatype: ""
  image: quay.io/rhceph-dev/ocs-registry:latest-stable-4.14
  publisher: Red Hat
  sourceType: grpc
  priority: 100
  # If the registry image still has the same tag (latest-stable-4.6, or for stage testing)
  # we need to have this updateStrategy, otherwise we will not see newly pushed content.
  updateStrategy:
    registryPoll:
      interval: 15m

$ oc get CatalogSource redhat-operators -n openshift-marketplace
NAME               DISPLAY                       TYPE   PUBLISHER   AGE
redhat-operators   Openshift Container Storage   grpc   Red Hat     3m15s

d. Apply icsp.yaml:
$ podman run --entrypoint cat quay.io/rhceph-dev/ocs-registry:latest-stable-4.14 /icsp.yaml | oc apply -f -
imagecontentsourcepolicy.operator.openshift.io/df-repo created

4. Check the CSVs:
$ oc get csv -A
NAMESPACE                              NAME                                         DISPLAY                       VERSION             REPLACES                                PHASE
openshift-operator-lifecycle-manager   packageserver                                Package Server                0.0.1-snapshot                                              Succeeded
openshift-storage                      mcg-operator.v4.14.0-135.stable              NooBaa Operator               4.14.0-135.stable   mcg-operator.v4.13.2-rhodf              Succeeded
openshift-storage                      ocs-operator.v4.14.0-135.stable              OpenShift Container Storage   4.14.0-135.stable   ocs-operator.v4.13.2-rhodf              Succeeded
openshift-storage                      odf-csi-addons-operator.v4.14.0-135.stable   CSI Addons                    4.14.0-135.stable   odf-csi-addons-operator.v4.13.2-rhodf   Succeeded
openshift-storage                      odf-operator.v4.14.0-135.stable              OpenShift Data Foundation     4.14.0-135.stable   odf-operator.v4.13.2-rhodf              Succeeded

5. Check the pods in openshift-storage:
$ oc get pods -n openshift-storage
NAME                                               READY   STATUS    RESTARTS      AGE
csi-addons-controller-manager-5f9c677f6-b2kln      2/2     Running   0             4m1s
noobaa-operator-d9ddd977f-rcf6r                    2/2     Running   0             4m24s
ocs-metrics-exporter-7b5bf57957-ht66f              1/1     Running   0             4m44s
ocs-operator-568fbbb9c4-j86hk                      1/1     Running   0             4m44s
odf-console-d656466b5-vxszp                        1/1     Running   0             22m
odf-operator-controller-manager-66c899649b-g6rkb   2/2     Running   1 (10m ago)   22m
rook-ceph-operator-587cc6f966-sr8qt                1/1     Running   0             4m19s

For more info: https://docs.google.com/document/d/1lf4Enu-efynTMl0N77xxHQeVTfnSis0mX1L3i6vNa_4/edit
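As an additional check beyond the verification steps above (hedged, since the exact key set written by the fixed operator is not shown here), one could confirm that ocs-operator-config now contains the key the rook-ceph-operator previously failed on:

$ oc get configmap ocs-operator-config -n openshift-storage -o yaml | grep CSI_ENABLE_READ_AFFINITY
# expected (assumption): the key is listed, so the rook-ceph-operator container can be created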
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832