Description of problem (please be as detailed as possible and provide log snippets):

I see these errors in the rook-ceph-operator log:

2024-04-17T20:07:37.767093153Z 2024-04-17 20:07:37.767052 E | ceph-csi: failed to reconcile failed to update CSI driver options for cluster "ocs-storagecluster-cephcluster": failed to fetch current csi config map: configmaps "rook-ceph-csi-config" not found
2024-04-17T20:07:37.909745677Z 2024-04-17 20:07:37.909698 E | ceph-nodedaemon-controller: ceph version not found for image "registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:cda4d8682b12f13ce90211cad773100c32584b6bcea33a6cb69a66d9aece86f5" used by cluster "ocs-storagecluster-cephcluster" in namespace "openshift-storage". attempt to determine ceph version for the current cluster image timed out
[the identical ceph-nodedaemon-controller error repeats eight more times, timestamps 20:07:37.910 through 20:07:37.917]
2024-04-17T20:07:38.019554664Z 2024-04-17 20:07:38.019513 I | clusterdisruption-controller: deleted all legacy node drain canary pods
2024-04-17T20:07:38.565867552Z 2024-04-17 20:07:38.565819 I | ceph-spec: parsing mon endpoints: a=172.30.35.166:3300,b=172.30.26.128:3300,c=172.30.125.176:3300
2024-04-17T20:07:38.943432810Z 2024-04-17 20:07:38.943366 E | ceph-block-pool-controller: failed to reconcile CephBlockPool "openshift-storage/ocs-storagecluster-cephblockpool". failed to fetch ceph version from cephcluster "ocs-storagecluster-cephcluster": attempt to determine ceph version for the current cluster image timed out
2024-04-17T20:07:38.965356091Z 2024-04-17 20:07:38.965320 I | ceph-spec: parsing mon endpoints: a=172.30.35.166:3300,b=172.30.26.128:3300,c=172.30.125.176:3300
2024-04-17T20:07:38.965435379Z 2024-04-17 20:07:38.965419 I | ceph-fs-subvolumegroup-controller: creating ceph filesystem subvolume group ocs-storagecluster-cephfilesystem-csi in namespace openshift-storage
2024-04-17T20:07:38.965435379Z 2024-04-17 20:07:38.965429 I | cephclient: creating cephfs "ocs-storagecluster-cephfilesystem" subvolume group "csi"
2024-04-17T20:07:39.150090567Z 2024-04-17 20:07:39.150044 I | op-k8sutil: batch job ceph-file-controller-detect-version deleted
2024-04-17T20:07:39.164704664Z 2024-04-17 20:07:39.164675 I | ceph-spec: parsing mon endpoints: a=172.30.35.166:3300,b=172.30.26.128:3300,c=172.30.125.176:3300
2024-04-17T20:07:39.751737909Z 2024-04-17 20:07:39.751694 I | cephclient: successfully created subvolume group "csi" in filesystem "ocs-storagecluster-cephfilesystem"
2024-04-17T20:07:39.765506554Z 2024-04-17 20:07:39.765477 E | ceph-csi: failed to reconcile failed to update CSI driver options for cluster "ocs-storagecluster-cephcluster": failed to fetch current csi config map: configmaps "rook-ceph-csi-config" not found

Version of all relevant components (if applicable):
ODF: 4.16.0-78
OCP: 4.16.0-0.nightly-2024-04-16-195622

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Not sure

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install ODF and OCP 4.15
2. Upgrade to 4.16

Actual results:
The reconcile and version-detection errors above; the CSI configuration is never updated after the upgrade.

Expected results:
Everything up and running after the upgrade.

Additional info:
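For anyone triaging the same symptoms, a minimal check sequence (a sketch using standard oc commands; the configmap and deployment names are taken from the logs above, the namespace from the error messages):

   # Confirm whether the configmap the operator is looking for exists
   $ oc get configmap rook-ceph-csi-config -n openshift-storage

   # Tail the operator log for the reconcile/version-detection errors quoted above
   $ oc logs deploy/rook-ceph-operator -n openshift-storage --tail=100 | grep -E 'ceph-csi|ceph version not found'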
Bug reproduced manually:

1. Deploy OCP 4.16
2. Install the ODF 4.15.1 operator [GA'ed]
3. Check ceph status [HEALTH_OK]
4. Upgrade ODF:

   a. Disable the default source redhat-operators:
      $ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge
      operatorhub.config.openshift.io/cluster patched

   b. Change the channel in the odf-operator subscription [stable-4.15 -> stable-4.16]:
      $ oc edit subscription odf-operator -n openshift-storage

   c. Create the catalog source:
      $ oc create -f CatalogSource.yaml
      catalogsource.operators.coreos.com/redhat-operators created
      $ oc edit CatalogSource -n openshift-marketplace redhat-operators

      oviner~/multus$ cat ~/CatalogSource.yaml
      ---
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      metadata:
        name: redhat-operators
        namespace: openshift-marketplace
        labels:
          ocs-operator-internal: "true"
      spec:
        displayName: Openshift Container Storage
        icon:
          base64data: ""
          mediatype: ""
        image: quay.io/rhceph-dev/ocs-registry:latest-stable-4.16
        publisher: Red Hat
        sourceType: grpc
        priority: 100
        # If the registry image still has the same tag (latest-stable-4.6, or for stage testing)
        # we need this updateStrategy, otherwise we will not see newly pushed content.
        updateStrategy:
          registryPoll:
            interval: 15m

   d. Enable ICSP:
      $ podman run --entrypoint cat quay.io/rhceph-dev/ocs-registry:latest-stable-4.16 /icsp.yaml | oc apply -f -

   e. Check the rook-ceph operator CSV [stuck in Installing state]:
      $ oc get csv -A
      NAMESPACE           NAME                                   DISPLAY     VERSION            REPLACES   PHASE
      openshift-storage   rook-ceph-operator.v4.16.0-77.stable   Rook-Ceph   4.16.0-77.stable              Installing

   f. The rook-ceph operator pod is in CrashLoopBackOff (see the triage sketch below):
      $ oc get pods rook-ceph-operator-6d548fdc94-sbpwp
      NAME                                  READY   STATUS             RESTARTS       AGE
      rook-ceph-operator-6d548fdc94-sbpwp   0/1     CrashLoopBackOff   23 (44s ago)   98m

For more info: https://docs.google.com/document/d/13cUS2b6TUl-_2iCeM9iMoR57W1NXWuPEXPFmHPkijpo/edit
Must-gather link: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2275886/
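When the operator pod sits in CrashLoopBackOff as in step f, the crash reason is usually in the previous container's log; a minimal triage sketch using standard oc commands (pod name copied from the output above):

   # Log of the last crashed run of the operator container
   $ oc logs rook-ceph-operator-6d548fdc94-sbpwp -n openshift-storage --previous

   # Restart count and last termination reason/exit code
   $ oc describe pod rook-ceph-operator-6d548fdc94-sbpwp -n openshift-storage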
After working with Madhu's private image, the rook-ceph-operator pod is running; however, some CSI pods moved to CrashLoopBackOff because the private image was built with an old cephcsi. The operator image was swapped in by editing the CSV:

$ oc edit csv rook-ceph-operator.v4.16.0-79.stable -n openshift-storage
  image: quay.io/madhupr001/rook:v1
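To see which CSI pods regressed, a label filter narrows the pod list (a sketch; the app labels are the ones Rook conventionally sets on its CSI plugin and provisioner pods, assumed unchanged here):

   # List only the cephfs/rbd CSI pods and look for CrashLoopBackOff
   $ oc get pods -n openshift-storage -l 'app in (csi-cephfsplugin,csi-rbdplugin,csi-cephfsplugin-provisioner,csi-rbdplugin-provisioner)'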
Bug fixed [tested on quay.io/rhceph-dev/ocs-registry:4.16.0-82]:

1. Deploy OCP 4.16 [4.16.0-0.nightly-2024-04-18-141003]

2. Install GA'ed ODF 4.15.1:
   $ oc get csv -A
   NAMESPACE                              NAME                                    DISPLAY                       VERSION          REPLACES                                PHASE
   openshift-operator-lifecycle-manager   packageserver                           Package Server                0.0.1-snapshot                                           Succeeded
   openshift-storage                      mcg-operator.v4.15.1-rhodf              NooBaa Operator               4.15.1-rhodf     mcg-operator.v4.15.0-rhodf              Succeeded
   openshift-storage                      ocs-operator.v4.15.1-rhodf              OpenShift Container Storage   4.15.1-rhodf     ocs-operator.v4.15.0-rhodf              Succeeded
   openshift-storage                      odf-csi-addons-operator.v4.15.1-rhodf   CSI Addons                    4.15.1-rhodf     odf-csi-addons-operator.v4.15.0-rhodf   Succeeded
   openshift-storage                      odf-operator.v4.15.1-rhodf              OpenShift Data Foundation     4.15.1-rhodf     odf-operator.v4.15.0-rhodf              Succeeded

3. Create storagecluster

4. Check storagecluster status, pod status, and ceph status:
   $ oc get storagecluster
   NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
   ocs-storagecluster   8m18s   Ready              2024-04-21T12:11:47Z   4.15.1

   sh-5.1$ ceph -s
     cluster:
       id:     a2271230-7eb4-4459-91aa-911aa8a41dca
       health: HEALTH_OK

5. Upgrade ODF 4.15.1 -> ODF 4.16.0:

   a. Disable the default source redhat-operators:
      $ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge
      operatorhub.config.openshift.io/cluster patched

   b. Change the channel in the odf-operator subscription [stable-4.15 -> stable-4.16] (a non-interactive alternative is sketched at the end of this comment):
      $ oc edit subscription odf-operator -n openshift-storage

   c. Create the catalog source:
      oviner~$ cat CatalogSource.yaml
      ---
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      metadata:
        name: redhat-operators
        namespace: openshift-marketplace
        labels:
          ocs-operator-internal: "true"
      spec:
        displayName: Openshift Container Storage
        icon:
          base64data: ""
          mediatype: ""
        image: quay.io/rhceph-dev/ocs-registry:4.16.0-82
        publisher: Red Hat
        sourceType: grpc
        priority: 100
        # If the registry image still has the same tag (latest-stable-4.6, or for stage testing)
        # we need this updateStrategy, otherwise we will not see newly pushed content.
        updateStrategy:
          registryPoll:
            interval: 15m

      oviner~$ oc create -f CatalogSource.yaml
      catalogsource.operators.coreos.com/redhat-operators created

   d. Enable ICSP:
      oviner~$ podman run --entrypoint cat quay.io/rhceph-dev/ocs-registry:4.16.0-82 /icsp.yaml | oc apply -f -
      Trying to pull quay.io/rhceph-dev/ocs-registry:4.16.0-82...
      Getting image source signatures
      Copying blob 34dd843a0b94 done   |
      imagecontentsourcepolicy.operator.openshift.io/df-repo-v4.16.0-82 created

6. Check csv:
   oviner~$ oc get csv -A
   NAMESPACE                              NAME                                        DISPLAY                            VERSION            REPLACES                                PHASE
   openshift-operator-lifecycle-manager   packageserver                               Package Server                     0.0.1-snapshot                                             Succeeded
   openshift-storage                      mcg-operator.v4.16.0-82.stable              NooBaa Operator                    4.16.0-82.stable   mcg-operator.v4.15.2-rhodf              Succeeded
   openshift-storage                      ocs-client-operator.v4.16.0-82.stable       OpenShift Data Foundation Client   4.16.0-82.stable                                           Succeeded
   openshift-storage                      ocs-operator.v4.16.0-82.stable              OpenShift Container Storage        4.16.0-82.stable   ocs-operator.v4.15.2-rhodf              Succeeded
   openshift-storage                      odf-csi-addons-operator.v4.16.0-82.stable   CSI Addons                         4.16.0-82.stable   odf-csi-addons-operator.v4.15.2-rhodf   Succeeded
   openshift-storage                      odf-operator.v4.16.0-82.stable              OpenShift Data Foundation          4.16.0-82.stable   odf-operator.v4.15.1-rhodf              Succeeded
   openshift-storage                      odf-prometheus-operator.v4.16.0-82.stable   Prometheus Operator                4.16.0-82.stable                                           Succeeded
   openshift-storage                      rook-ceph-operator.v4.16.0-82.stable        Rook-Ceph                          4.16.0-82.stable                                           Succeeded

7. Check pod status:
   $ oc get pods
   NAME                                                              READY   STATUS      RESTARTS      AGE
   console-7c4f6fbf7b-2wnf7                                          1/1     Running     0             14m
   csi-addons-controller-manager-5645b9d78d-rxl7p                    2/2     Running     0             10m
   csi-cephfsplugin-9szq7                                            2/2     Running     0             9m30s
   csi-cephfsplugin-lv7ks                                            2/2     Running     0             9m30s
   csi-cephfsplugin-provisioner-587f9758d5-9l9bh                     6/6     Running     0             9m30s
   csi-cephfsplugin-provisioner-587f9758d5-cw2cz                     6/6     Running     0             9m30s
   csi-cephfsplugin-smfbd                                            2/2     Running     0             9m30s
   csi-rbdplugin-725d5                                               3/3     Running     0             9m30s
   csi-rbdplugin-c7snt                                               3/3     Running     0             9m30s
   csi-rbdplugin-provisioner-5b6d758598-s4h7x                        6/6     Running     0             9m30s
   csi-rbdplugin-provisioner-5b6d758598-t8d64                        6/6     Running     0             9m30s
   csi-rbdplugin-z46fn                                               3/3     Running     0             9m30s
   noobaa-core-0                                                     1/1     Running     0             9m35s
   noobaa-db-pg-0                                                    1/1     Running     0             7m57s
   noobaa-endpoint-6d85d6c867-hsp9c                                  1/1     Running     0             10m
   noobaa-operator-68cf54d6bd-fkjk9                                  1/1     Running     0             10m
   ocs-client-operator-console-7c4f6fbf7b-ltl9h                      1/1     Running     0             14m
   ocs-client-operator-controller-manager-77cc58d696-cngkc           2/2     Running     0             14m
   ocs-metrics-exporter-759867f995-mg8cd                             1/1     Running     0             10m
   ocs-operator-65c8959bd6-dgckd                                     1/1     Running     0             10m
   odf-console-89c85549-6qzc7                                        1/1     Running     0             14m
   odf-operator-controller-manager-7457bc49b4-mjgj2                  2/2     Running     1 (11m ago)   14m
   rook-ceph-crashcollector-compute-0-5c9796dc6b-n9r4x               1/1     Running     0             6m41s
   rook-ceph-crashcollector-compute-1-7f84b499cf-g6c7r               1/1     Running     0             8m59s
   rook-ceph-crashcollector-compute-2-67f57795b8-4zttb               1/1     Running     0             7m28s
   rook-ceph-exporter-compute-0-6977c68fdb-54m5x                     1/1     Running     0             6m38s
   rook-ceph-exporter-compute-1-b884fd99f-46vz2                      1/1     Running     0             8m56s
   rook-ceph-exporter-compute-2-69c66bfbd4-rjbnn                     1/1     Running     0             7m25s
   rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5cb6bc65b8w7l   2/2     Running     0             4m13s
   rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d45659c62qvd   2/2     Running     0             3m50s
   rook-ceph-mgr-a-5ddf85b57c-vw7p9                                  3/3     Running     0             5m11s
   rook-ceph-mgr-b-6d4ff8c4d6-xqlq9                                  3/3     Running     0             4m46s
   rook-ceph-mon-a-84c996b649-w9pk5                                  2/2     Running     0             8m59s
   rook-ceph-mon-b-5fc77b4bb9-dfdfd                                  2/2     Running     0             7m28s
   rook-ceph-mon-c-7758b7f549-vv2mk                                  2/2     Running     0             5m53s
   rook-ceph-operator-5d9c549687-klrvt                               1/1     Running     0             10m
   rook-ceph-osd-0-6b44f59964-2zdsp                                  2/2     Running     0             4m23s
   rook-ceph-osd-1-655896987-d9x94                                   2/2     Running     0             3m59s
   rook-ceph-osd-2-68f9d46844-d4prl                                  2/2     Running     0             3m35s
   rook-ceph-osd-prepare-0baa69d3403d1c2216d095eea4c2adcd-jvdvn      0/1     Completed   0             36m
   rook-ceph-osd-prepare-97aaf5506140f91891a468f818d3e57c-gvhtd      0/1     Completed   0             36m
   rook-ceph-osd-prepare-bb0571daa76f557be646c66568031c71-k7kvj      0/1     Completed   0             36m
   rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7657bf7vp9kj   2/2     Running     0             6m41s
   rook-ceph-tools-59d6dcbd66-mvbqf                                  1/1     Running     0             10m
   ux-backend-server-5c67fc645-k9vm8                                 2/2     Running     0             10m

8. Check storagecluster:
   oviner~$ oc get storageclusters.ocs.openshift.io
   NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
   ocs-storagecluster   40m   Ready              2024-04-21T12:11:47Z   4.16.0

9. Check ceph status:
   sh-5.1$ ceph -s
     cluster:
       id:     a2271230-7eb4-4459-91aa-911aa8a41dca
       health: HEALTH_OK

For more info: https://docs.google.com/document/d/1HRmQwJ9Hz-lvKogNUFXKFtizXwM2dR5z8B2DRtOOopo/edit
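A note on step 5b: the channel change can also be applied non-interactively with a merge patch instead of oc edit (a sketch; same subscription name and namespace as shown above):

   $ oc patch subscription odf-operator -n openshift-storage --type merge -p '{"spec":{"channel":"stable-4.16"}}'

   # Then watch the CSVs until every PHASE reports Succeeded, as in step 6
   $ watch -n 10 'oc get csv -n openshift-storage'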
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591