Description of problem (please be as detailed as possible and provide log snippets):
The StorageCluster is stuck in the Progressing state with the latest build, v4.15.0-99. Previous builds worked fine. Describing the StorageCluster shows the following error:
"CephCluster error: failed to create cluster: failed to start ceph osds: failed to update/create OSDs: context canceled"

Version of all relevant components (if applicable):
[root@nara4-2edb-bastion-0 ~]# oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-ec.3   True        False         23h     Cluster version is 4.15.0-ec.3
[root@nara4-2edb-bastion-0 ~]# oc get storagecluster
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   33m   Progressing              2024-01-02T06:00:15Z   4.15.0
[root@nara4-2edb-bastion-0 ~]# oc get csv
NAME                                        DISPLAY                       VERSION            REPLACES   PHASE
mcg-operator.v4.15.0-99.stable              NooBaa Operator               4.15.0-99.stable              Succeeded
ocs-operator.v4.15.0-99.stable              OpenShift Container Storage   4.15.0-99.stable              Succeeded
odf-csi-addons-operator.v4.15.0-99.stable   CSI Addons                    4.15.0-99.stable              Succeeded
odf-operator.v4.15.0-99.stable              OpenShift Data Foundation     4.15.0-99.stable              Succeeded

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. We are not able to continue feature testing on new builds.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Create an OCP 4.15 cluster.
2. Install the LSO and ODF operators and create a LocalVolume.
3. Create the StorageSystem; the StorageCluster gets stuck in the Progressing state.

Actual results:
The StorageCluster is stuck in the Progressing state.

Expected results:
The StorageCluster reaches the Ready state.

Additional info:
[root@nara4-2edb-bastion-0 ~]# oc get pods
NAME                                                             READY   STATUS      RESTARTS        AGE
csi-addons-controller-manager-5f59bb6fc4-nmdmq                   2/2     Running     0               10m
csi-cephfsplugin-5gzhx                                           2/2     Running     0               8m37s
csi-cephfsplugin-6xzzr                                           2/2     Running     1 (7m59s ago)   8m37s
csi-cephfsplugin-provisioner-ff8bb6b44-ptqgs                     6/6     Running     2 (7m58s ago)   8m37s
csi-cephfsplugin-provisioner-ff8bb6b44-xrtrj                     6/6     Running     4 (7m54s ago)   8m37s
csi-cephfsplugin-tjqkq                                           2/2     Running     1 (8m ago)      8m37s
csi-rbdplugin-6s9h5                                              3/3     Running     0               8m37s
csi-rbdplugin-provisioner-567b58b8ff-6vbxx                       6/6     Running     4 (7m54s ago)   8m37s
csi-rbdplugin-provisioner-567b58b8ff-rp7tv                       6/6     Running     0               8m37s
csi-rbdplugin-tckq8                                              3/3     Running     1 (8m ago)      8m37s
csi-rbdplugin-vk458                                              3/3     Running     1 (7m59s ago)   8m37s
noobaa-core-0                                                    1/1     Running     0               4m47s
noobaa-db-pg-0                                                   1/1     Running     0               5m45s
noobaa-operator-5b5bd9b87c-n6npf                                 2/2     Running     0               10m
ocs-metrics-exporter-65d789b85f-9mb6d                            1/1     Running     0               5m55s
ocs-operator-67dfc4b997-75876                                    1/1     Running     0               10m
odf-console-bb57b6f6-jkwrj                                       1/1     Running     0               11m
odf-operator-controller-manager-7bdbc5c7fd-76tkc                 2/2     Running     0               11m
rook-ceph-crashcollector-worker-0-7447cfc595-rfxbl               1/1     Running     0               6m14s
rook-ceph-crashcollector-worker-1-7946896c88-k5vzn               1/1     Running     0               5m59s
rook-ceph-crashcollector-worker-2-6d7b7d78f7-kdx75               1/1     Running     0               6m18s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6698c6f4ncvdr  2/2     Running     0               6m18s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-59fd874dfh64v  2/2     Running     0               6m15s
rook-ceph-mgr-a-969f78995-zmg4l                                  3/3     Running     0               7m24s
rook-ceph-mgr-b-54c8b967f6-cfc95                                 3/3     Running     0               7m23s
rook-ceph-mon-a-f7dbd6cc-xnrwd                                   2/2     Running     0               8m12s
rook-ceph-mon-b-79b77d47bf-hck2f                                 2/2     Running     0               7m47s
rook-ceph-mon-c-6ccbfcbf7-dk4z4                                  2/2     Running     0               7m36s
rook-ceph-operator-7c45cd9474-p5mfp                              1/1     Running     0               8m37s
rook-ceph-osd-0-5c497554c4-tb9ks                                 2/2     Running     0               6m49s
rook-ceph-osd-1-6857ccd444-ln4vb                                 2/2     Running     0               6m49s
rook-ceph-osd-2-5cc5c6779c-l27vs                                 2/2     Running     0               6m47s
rook-ceph-osd-prepare-4ae47a7430335c087c9140b4de7e3ba9-5hfs8     0/1     Completed   0               7m
rook-ceph-osd-prepare-e7fded9a680cffaca41872ffa7197819-xnd5x     0/1     Completed   0               7m1s
rook-ceph-osd-prepare-f3bb1584f0bc543cb4524d67ded2ec19-ls9pt     0/1     Completed   0               7m1s
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-f96ccf8fzlv7  2/2     Running     0               5m59s
[root@nara4-2edb-bastion-0 ~]#
After going through the must-gather and a live call with Naraynswami, I found that the NooBaa CR is stuck in the Configuring phase. The CephCluster is Ready and all related pods are up and running, but because the NooBaa CR never leaves Configuring, the StorageCluster never becomes Ready. The condition message on the NooBaa CR says "cannot read admin account info, error: not anonymous method read_account".
[root@nara4-2edb-bastion-0 ~]# oc get storagecluster
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   2d    Progressing              2024-01-02T06:00:15Z   4.15.0
[root@nara4-2edb-bastion-0 ~]# oc get storagecluster -o yaml
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      cluster.ocs.openshift.io/local-devices: "true"
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2024-01-02T06:00:15Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 2
    name: ocs-storagecluster
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: odf.openshift.io/v1alpha1
      kind: StorageSystem
      name: ocs-storagecluster-storagesystem
      uid: b23143df-26bb-40b1-a51c-f4d21a336014
    resourceVersion: "2355179"
    uid: c14f50a4-8f42-4c0a-b2a6-bbe8dbc95ea6
  spec:
    arbiter: {}
    encryption:
      kms: {}
    externalStorage: {}
    flexibleScaling: true
    managedResources:
      cephBlockPools:
        defaultStorageClass: true
      cephCluster: {}
      cephConfig: {}
      cephDashboard: {}
      cephFilesystems: {}
      cephNonResilientPools: {}
      cephObjectStoreUsers: {}
      cephObjectStores: {}
      cephRBDMirror:
        daemonCount: 1
      cephToolbox: {}
    mirroring: {}
    monDataDirHostPath: /var/lib/rook
    network:
      connections:
        encryption: {}
      multiClusterService: {}
    nodeTopologies: {}
    resourceProfile: balanced
    storageDeviceSets:
    - config: {}
      count: 3
      dataPVCTemplate:
        metadata: {}
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: "1"
          storageClassName: localblock
          volumeMode: Block
        status: {}
      name: ocs-deviceset-localblock
      placement: {}
      preparePlacement: {}
      replica: 1
      resources: {}
  status:
    conditions:
    - lastHeartbeatTime: "2024-01-02T06:00:16Z"
      lastTransitionTime: "2024-01-02T06:00:16Z"
      message: Version check successful
      reason: VersionMatched
      status: "False"
      type: VersionMismatch
    - lastHeartbeatTime: "2024-01-04T06:22:38Z"
      lastTransitionTime: "2024-01-04T05:22:32Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: ReconcileComplete
    - lastHeartbeatTime: "2024-01-02T06:02:59Z"
      lastTransitionTime: "2024-01-02T06:00:16Z"
      message: 'CephCluster error: failed to create cluster: failed to start ceph osds: failed to update/create OSDs: context canceled'
      reason: ClusterStateError
      status: "False"
      type: Available
    - lastHeartbeatTime: "2024-01-04T06:22:38Z"
      lastTransitionTime: "2024-01-02T06:00:16Z"
      message: Waiting on Nooba instance to finish initialization
      reason: NoobaaInitializing
      status: "True"
      type: Progressing
    - lastHeartbeatTime: "2024-01-02T06:02:59Z"
      lastTransitionTime: "2024-01-02T06:02:59Z"
      message: 'CephCluster error: failed to create cluster: failed to start ceph osds: failed to update/create OSDs: context canceled'
      reason: ClusterStateError
      status: "True"
      type: Degraded
    - lastHeartbeatTime: "2024-01-02T06:03:54Z"
      lastTransitionTime: "2024-01-02T06:02:57Z"
      message: 'CephCluster is creating: Processing OSD 2 on PVC "ocs-deviceset-localblock-0-data-09kwfn"'
      reason: ClusterStateCreating
      status: "False"
      type: Upgradeable
    failureDomain: host
    failureDomainKey: kubernetes.io/hostname
    failureDomainValues:
    - worker-0
    - worker-1
    - worker-2
    images:
      ceph:
        actualImage: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:1b1870ca13fc52d3c1a6c603e471e230a90cba94baaef9cf56c02b6c7dac35ca
        desiredImage: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:1b1870ca13fc52d3c1a6c603e471e230a90cba94baaef9cf56c02b6c7dac35ca
      noobaaCore:
        actualImage: registry.redhat.io/odf4/mcg-core-rhel9@sha256:e84250acc66b169d54f64872df683033a74a71c5757808aff1a98e3fffc18a54
        desiredImage: registry.redhat.io/odf4/mcg-core-rhel9@sha256:e84250acc66b169d54f64872df683033a74a71c5757808aff1a98e3fffc18a54
      noobaaDB:
        actualImage: registry.redhat.io/rhel8/postgresql-12@sha256:cd5b8cb243a0b233a08bdf807df7bc6192a18e1dc322789d6d2e064e9721d8f0
        desiredImage: registry.redhat.io/rhel8/postgresql-12@sha256:cd5b8cb243a0b233a08bdf807df7bc6192a18e1dc322789d6d2e064e9721d8f0
    kmsServerConnection: {}
    lastAppliedResourceProfile: balanced
    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - worker-0
        - worker-1
        - worker-2
    phase: Progressing
    relatedObjects:
    - apiVersion: ceph.rook.io/v1
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      namespace: openshift-storage
      resourceVersion: "2354536"
      uid: c03b3258-84d5-4ff8-a7ba-459aef7ce42b
    - apiVersion: noobaa.io/v1alpha1
      kind: NooBaa
      name: noobaa
      namespace: openshift-storage
      resourceVersion: "2355174"
      uid: 815db159-46dd-4136-8636-f035e0139ed8
    version: 4.15.0
kind: List
metadata:
  resourceVersion: ""
[root@nara4-2edb-bastion-0 ~]# oc get pods
NAME                                                             READY   STATUS      RESTARTS        AGE
csi-addons-controller-manager-7865c8f5f4-nvzcf                   2/2     Running     0               44h
csi-cephfsplugin-5gzhx                                           2/2     Running     0               2d
csi-cephfsplugin-6xzzr                                           2/2     Running     1 (2d ago)      2d
csi-cephfsplugin-provisioner-ff8bb6b44-ptqgs                     6/6     Running     2 (2d ago)      2d
csi-cephfsplugin-provisioner-ff8bb6b44-xrtrj                     6/6     Running     4 (2d ago)      2d
csi-cephfsplugin-tjqkq                                           2/2     Running     1 (2d ago)      2d
csi-rbdplugin-6s9h5                                              3/3     Running     0               2d
csi-rbdplugin-provisioner-567b58b8ff-6vbxx                       6/6     Running     4 (2d ago)      2d
csi-rbdplugin-provisioner-567b58b8ff-rp7tv                       6/6     Running     0               2d
csi-rbdplugin-tckq8                                              3/3     Running     1 (2d ago)      2d
csi-rbdplugin-vk458                                              3/3     Running     1 (2d ago)      2d
noobaa-core-0                                                    1/1     Running     0               2d
noobaa-db-pg-0                                                   1/1     Running     0               2d
noobaa-operator-65b7c5fcbd-qx6nt                                 2/2     Running     0               44h
ocs-metrics-exporter-65d789b85f-9mb6d                            1/1     Running     0               2d
ocs-operator-67dfc4b997-75876                                    1/1     Running     0               2d
odf-console-bb57b6f6-jkwrj                                       1/1     Running     0               2d
odf-operator-controller-manager-7bdbc5c7fd-76tkc                 2/2     Running     0               2d
rook-ceph-crashcollector-worker-0-7447cfc595-rfxbl               1/1     Running     0               2d
rook-ceph-crashcollector-worker-1-7946896c88-k5vzn               1/1     Running     0               2d
rook-ceph-crashcollector-worker-2-6d7b7d78f7-kdx75               1/1     Running     0               2d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6698c6f4ncvdr  2/2     Running     0               2d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-59fd874dfh64v  2/2     Running     0               2d
rook-ceph-mgr-a-969f78995-zmg4l                                  3/3     Running     0               2d
rook-ceph-mgr-b-54c8b967f6-cfc95                                 3/3     Running     0               2d
rook-ceph-mon-a-f7dbd6cc-xnrwd                                   2/2     Running     0               2d
rook-ceph-mon-b-79b77d47bf-hck2f                                 2/2     Running     0               2d
rook-ceph-mon-c-6ccbfcbf7-dk4z4                                  2/2     Running     0               2d
rook-ceph-operator-7c45cd9474-p5mfp                              1/1     Running     0               2d
rook-ceph-osd-0-5c497554c4-tb9ks                                 2/2     Running     0               2d
rook-ceph-osd-1-6857ccd444-ln4vb                                 2/2     Running     0               2d
rook-ceph-osd-2-5cc5c6779c-l27vs                                 2/2     Running     0               2d
rook-ceph-osd-prepare-4ae47a7430335c087c9140b4de7e3ba9-5hfs8     0/1     Completed   0               2d
rook-ceph-osd-prepare-e7fded9a680cffaca41872ffa7197819-xnd5x     0/1     Completed   0               2d
rook-ceph-osd-prepare-f3bb1584f0bc543cb4524d67ded2ec19-ls9pt     0/1     Completed   0               2d
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-f96ccf8fzlv7  2/2     Running     0               2d
[root@nara4-2edb-bastion-0 ~]# oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH      EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          2d    Ready   Cluster created successfully   HEALTH_OK              8dab246a-923c-47b9-88c9-4b6c7935da57
[root@nara4-2edb-bastion-0 ~]# oc get noobaa
NAME     S3-ENDPOINTS   STS-ENDPOINTS   IMAGE                                                                                                            PHASE         AGE
noobaa                                  registry.redhat.io/odf4/mcg-core-rhel9@sha256:e84250acc66b169d54f64872df683033a74a71c5757808aff1a98e3fffc18a54   Configuring   2d
[root@nara4-2edb-bastion-0 ~]# oc get noobaa -o yaml
apiVersion: v1
items:
- apiVersion: noobaa.io/v1alpha1
  kind: NooBaa
  metadata:
    creationTimestamp: "2024-01-02T06:03:06Z"
    finalizers:
    - noobaa.io/graceful_finalizer
    generation: 1
    labels:
      app: noobaa
    name: noobaa
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ocs.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: StorageCluster
      name: ocs-storagecluster
      uid: c14f50a4-8f42-4c0a-b2a6-bbe8dbc95ea6
    resourceVersion: "2355451"
    uid: 815db159-46dd-4136-8636-f035e0139ed8
  spec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: cluster.ocs.openshift.io/openshift-storage
              operator: Exists
    autoscaler:
      autoscalerType: hpav2
      prometheusNamespace: openshift-monitoring
    cleanupPolicy: {}
    coreResources:
      limits:
        cpu: 999m
        memory: 4Gi
      requests:
        cpu: 999m
        memory: 4Gi
    dbImage: registry.redhat.io/rhel8/postgresql-12@sha256:cd5b8cb243a0b233a08bdf807df7bc6192a18e1dc322789d6d2e064e9721d8f0
    dbResources:
      limits:
        cpu: 500m
        memory: 4Gi
      requests:
        cpu: 500m
        memory: 4Gi
    dbStorageClass: ocs-storagecluster-ceph-rbd
    dbType: postgres
    dbVolumeResources:
      requests:
        storage: 50Gi
    endpoints:
      maxCount: 2
      minCount: 1
      resources:
        limits:
          cpu: 999m
          memory: 2Gi
        requests:
          cpu: 999m
          memory: 2Gi
    image: registry.redhat.io/odf4/mcg-core-rhel9@sha256:e84250acc66b169d54f64872df683033a74a71c5757808aff1a98e3fffc18a54
    labels:
      monitoring: {}
    loadBalancerSourceSubnets: {}
    pvPoolDefaultStorageClass: ocs-storagecluster-ceph-rbd
    security:
      kms: {}
    tolerations:
    - effect: NoSchedule
      key: node.ocs.openshift.io/storage
      operator: Equal
      value: "true"
  status:
    accounts:
      admin:
        secretRef:
          name: noobaa-admin
          namespace: openshift-storage
    actualImage: registry.redhat.io/odf4/mcg-core-rhel9@sha256:e84250acc66b169d54f64872df683033a74a71c5757808aff1a98e3fffc18a54
    conditions:
    - lastHeartbeatTime: "2024-01-04T06:23:04Z"
      lastTransitionTime: "2024-01-02T06:03:06Z"
      message: 'cannot read admin account info, error: not anonymous method read_account'
      reason: TemporaryError
      status: "False"
      type: Available
    - lastHeartbeatTime: "2024-01-04T06:23:04Z"
      lastTransitionTime: "2024-01-02T06:03:06Z"
      message: 'cannot read admin account info, error: not anonymous method read_account'
      reason: TemporaryError
      status: "True"
      type: Progressing
    - lastHeartbeatTime: "2024-01-04T06:23:04Z"
      lastTransitionTime: "2024-01-02T06:03:06Z"
      message: 'cannot read admin account info, error: not anonymous method read_account'
      reason: TemporaryError
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2024-01-04T06:23:04Z"
      lastTransitionTime: "2024-01-02T06:03:06Z"
      message: 'cannot read admin account info, error: not anonymous method read_account'
      reason: TemporaryError
      status: "False"
      type: Upgradeable
    - lastHeartbeatTime: "2024-01-04T06:23:04Z"
      lastTransitionTime: "2024-01-02T06:03:07Z"
      status: k8s
      type: KMS-Type
    - lastHeartbeatTime: "2024-01-04T06:23:04Z"
      lastTransitionTime: "2024-01-02T06:03:08Z"
      status: Sync
      type: KMS-Status
    observedGeneration: 1
    phase: Configuring
    readme: "\n\n\tNooBaa operator is still working to reconcile this system.\n\tCheck out the system status.phase, status.conditions, and events with:\n\n\t\tkubectl -n openshift-storage describe noobaa\n\t\tkubectl -n openshift-storage get noobaa -o yaml\n\t\tkubectl -n openshift-storage get events --sort-by=metadata.creationTimestamp\n\n\tYou can wait for a specific condition with:\n\n\t\tkubectl -n openshift-storage wait noobaa/noobaa --for condition=available --timeout -1s\n\n\tNooBaa Core Version: master-20230920\n\tNooBaa Operator Version: 5.15.0\n"
    services:
      serviceMgmt:
        externalDNS:
        - https://noobaa-mgmt-openshift-storage.apps.nara4-2edb.redhat.com:443
        internalDNS:
        - https://noobaa-mgmt.openshift-storage.svc:443
        internalIP:
        - https://172.30.220.62:443
        nodePorts:
        - https://10.20.187.252:0
        podPorts:
        - https://10.131.0.41:8443
      serviceS3:
        externalDNS:
        - https://s3-openshift-storage.apps.nara4-2edb.redhat.com:443
        internalDNS:
        - https://s3.openshift-storage.svc:443
        internalIP:
        - https://172.30.58.186:443
      serviceSts:
        externalDNS:
        - https://sts-openshift-storage.apps.nara4-2edb.redhat.com:443
        internalDNS:
        - https://sts.openshift-storage.svc:443
        internalIP:
        - https://172.30.142.245:443
    upgradePhase: NoUpgrade
kind: List
metadata:
  resourceVersion: ""
[root@nara4-2edb-bastion-0 ~]#
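The StorageCluster status dumped above mixes conditions that were refreshed on 2024-01-04 with conditions last written on 2024-01-02. A minimal Python sketch, using values copied from that dump, separates the two groups so the condition that is still being reported stands out. The function names and the one-hour cutoff are hypothetical, and reading an old lastHeartbeatTime as "stale" is a triage assumption, not documented operator behaviour.

```python
from datetime import datetime, timedelta

# Subset of status.conditions copied from the StorageCluster dump above.
CONDITIONS = [
    {"type": "ReconcileComplete", "status": "True",
     "reason": "ReconcileCompleted", "lastHeartbeatTime": "2024-01-04T06:22:38Z"},
    {"type": "Available", "status": "False",
     "reason": "ClusterStateError", "lastHeartbeatTime": "2024-01-02T06:02:59Z"},
    {"type": "Progressing", "status": "True",
     "reason": "NoobaaInitializing", "lastHeartbeatTime": "2024-01-04T06:22:38Z"},
    {"type": "Degraded", "status": "True",
     "reason": "ClusterStateError", "lastHeartbeatTime": "2024-01-02T06:02:59Z"},
]


def parse_ts(value):
    """Parse the UTC timestamp format used in the conditions."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def split_by_freshness(conditions, max_age=timedelta(hours=1)):
    """Split conditions into (fresh, stale) relative to the newest heartbeat.

    max_age is an arbitrary cutoff chosen for this sketch (assumption).
    """
    newest = max(parse_ts(c["lastHeartbeatTime"]) for c in conditions)
    fresh, stale = [], []
    for cond in conditions:
        age = newest - parse_ts(cond["lastHeartbeatTime"])
        (fresh if age <= max_age else stale).append(cond)
    return fresh, stale


fresh, stale = split_by_freshness(CONDITIONS)
for label, group in (("fresh", fresh), ("stale", stale)):
    for cond in group:
        print(label, cond["type"], cond["status"], cond["reason"])
```

With these values, ReconcileComplete and Progressing (NoobaaInitializing) land in the fresh group, while both ClusterStateError conditions land in the stale group, which matches the finding that the NooBaa CR, not the CephCluster, is what keeps the StorageCluster in Progressing.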
As per the discussion with Naranyanaswami and the comment above, closing the BZ.
Reopening due to a possible reproduction on ODF build 104. Detailing the findings in the follow-up comment.
Initially, the ocs-operator pod was in the Error state on the live cluster mentioned above.

➜ clust2 oc get pods -n openshift-storage
NAME                                              READY   STATUS            RESTARTS        AGE
csi-addons-controller-manager-855544975d-pbc84    2/2     Running           88 (65m ago)    3d18h
csi-cephfsplugin-9fvdw                            2/2     Running           1 (3d19h ago)   3d19h
csi-cephfsplugin-provisioner-f486cc4c8-6gnwp      6/6     Running           2 (3d19h ago)   3d19h
csi-cephfsplugin-provisioner-f486cc4c8-q845x      6/6     Running           4 (3d19h ago)   3d19h
csi-cephfsplugin-sr4wm                            2/2     Running           1 (3d19h ago)   3d19h
csi-cephfsplugin-wf5dl                            2/2     Running           0               3d19h
csi-rbdplugin-7mmq4                               3/3     Running           1 (3d19h ago)   3d19h
csi-rbdplugin-kfmf7                               3/3     Running           0               3d19h
csi-rbdplugin-provisioner-84cd9d7bb7-556p2        6/6     Running           2 (3d19h ago)   3d19h
csi-rbdplugin-provisioner-84cd9d7bb7-sx7z5        6/6     Running           5 (3d19h ago)   3d19h
csi-rbdplugin-w5xfl                               3/3     Running           1 (3d19h ago)   3d19h
maintenance-agent-755ccdbb47-gsbvt                0/1     CrashLoopBackOff  732 (61s ago)   3d15h
noobaa-core-0                                     1/1     Running           0               3d19h
noobaa-db-pg-0                                    1/1     Running           0               3d19h
noobaa-operator-568c8d7bdc-kwcxr                  2/2     Running           17 (83m ago)    3d19h
ocs-metrics-exporter-67846dc54b-qzgww             1/1     Running           0               3d19h
ocs-operator-bd767766f-svl5j                      0/1     Error             101 (47m ago)   3d19h
odf-console-5fdb76657d-h46t8                      1/1     Running           0               3d19h
odf-operator-controller-manager-689c57969b-62284  2/2     Running           78 (47m ago)    3d19h
rook-ceph-operator-55c564df6b-4xjbc               1/1     Running           0               3d19h
token-exchange-agent-6c4f658fcb-8zltf             1/1     Running           0
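In a long pod listing like the one above, the pods that need attention are easy to miss. The following Python sketch filters a pasted `oc get pods` listing down to pods that are neither Running with all containers ready nor Completed. The sample rows are copied from the listing above; the function name and the whitespace-splitting approach are assumptions made for illustration and rely on the NAME, READY and STATUS columns containing no spaces.

```python
# A few rows copied from the pod listing above, one pod per line.
SAMPLE = """\
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-855544975d-pbc84 2/2 Running 88 (65m ago) 3d18h
maintenance-agent-755ccdbb47-gsbvt 0/1 CrashLoopBackOff 732 (61s ago) 3d15h
noobaa-core-0 1/1 Running 0 3d19h
ocs-operator-bd767766f-svl5j 0/1 Error 101 (47m ago) 3d19h
"""


def unhealthy_pods(listing):
    """Return (name, status) for pods that are not healthy.

    A pod counts as healthy here if it is Completed, or Running with
    every container ready (hypothetical rule for this sketch).
    """
    bad = []
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore truncated or empty rows
        name, ready, status = fields[:3]
        if status == "Completed":
            continue
        ready_now, ready_want = ready.split("/")
        if status != "Running" or ready_now != ready_want:
            bad.append((name, status))
    return bad


for name, status in unhealthy_pods(SAMPLE):
    print(f"{status:<18}{name}")
```

For the sample rows, this reports only maintenance-agent (CrashLoopBackOff) and ocs-operator (Error). High restart counts on otherwise Running pods, such as csi-addons-controller-manager, are not flagged by this rule and would need a separate check.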
I tried to check whether the cluster is still there, but it's gone now. A few questions while I look at the must-gather:
1. Is this always reproducible with the said build?
2. Which platform is this on? Initially, the bug was reported from an IBM Power cluster.
3. Naraynaswami reported that they hit the issue on #104, but it did not happen with the latest builds. Can you try once with the latest build and let me know if it still happens?
I see the build on the above link has succeeded. I assume an intermittent problem in the builds between 99 and 104 was causing the issue.