Bug 2001482 - StorageCluster stuck in Progressing state for MCG only deployment
Summary: StorageCluster stuck in Progressing state for MCG only deployment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ODF 4.9.0
Assignee: Nimrod Becker
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-06 08:29 UTC by Aman Agrawal
Modified: 2023-08-09 16:49 UTC
CC List: 16 users

Fixed In Version: v4.9.0-158.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-13 17:46:04 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-operator pull 1337 0 None open Bug 2001482: [release-4.9] storagecluster: move topology map reconciliation into its own resourceManager 2021-09-14 08:25:42 UTC
Github red-hat-storage ocs-operator pull 1341 0 None open noobaa: don't set NodeAffinity for standalone 2021-09-20 13:41:01 UTC
Github red-hat-storage ocs-operator pull 1342 0 None open Bug 2001482: [release-4.9] noobaa: don't set NodeAffinity for standalone 2021-09-20 16:24:39 UTC
Red Hat Product Errata RHSA-2021:5086 0 None None None 2021-12-13 17:46:23 UTC

Comment 2 umanga 2021-09-08 06:22:58 UTC
Can't find the relevant must-gather logs. Can you attach them here?

Comment 4 Mudit Agarwal 2021-09-08 12:00:25 UTC
Aman, can you reproduce the issue and provide the cluster to Umanga/Jose?

Comment 6 umanga 2021-09-08 12:51:44 UTC
```
 {"level":"error", "msg":"Failed to set node Topology Map for StorageCluster.", "error":"Not enough nodes found: Expected 3, found 0"}
```

Looks like Nodes are not labelled.
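
For reference, a quick way to check whether the nodes carry the OCS storage label that the topology map reconciliation looks for (label key taken from the ODF documentation; the node name below is a placeholder):

```
# List nodes carrying the OCS/ODF storage label
oc get nodes -l cluster.ocs.openshift.io/openshift-storage=

# If none are returned, label the intended nodes, e.g.:
oc label node <node-name> cluster.ocs.openshift.io/openshift-storage=""
```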

Comment 7 Jose A. Rivera 2021-09-13 15:02:37 UTC
I think I figured out the problem, but I'm going to try and actually verify it on my own cluster first before pushing it into upstream CI. ;)

Comment 8 Jose A. Rivera 2021-09-13 16:18:31 UTC
Upstream PR is live: https://github.com/red-hat-storage/ocs-operator/pull/1335

I have verified that the StorageCluster comes up and is running:

```
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2021-09-13T16:15:20Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 2
    name: test-storagecluster
    namespace: openshift-storage
    resourceVersion: "102267"
    uid: 95ef2784-46de-4429-9eee-1480878a059a
  spec:
    arbiter: {}
    encryption:
      kms: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephCluster: {}
      cephConfig: {}
      cephDashboard: {}
      cephFilesystems: {}
      cephObjectStoreUsers: {}
      cephObjectStores: {}
    mirroring: {}
    multiCloudGateway:
      reconcileStrategy: standalone  <-- this is the only *required* spec
    resources:                       <-- these are for testing on undersized clusters only
      noobaa-core: {}
      noobaa-db: {}
      noobaa-endpoint: {}
    version: 4.9.0
  status:
    conditions:
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:15:20Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: ReconcileComplete
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:16:27Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: Available
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:16:27Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "False"
      type: Progressing
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:15:20Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:16:27Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: Upgradeable
    images:
      ceph:
        desiredImage: ceph/daemon-base:latest-pacific
      noobaaCore:
        actualImage: noobaa/noobaa-core:master-20210609
        desiredImage: noobaa/noobaa-core:master-20210609
      noobaaDB:
        actualImage: centos/postgresql-12-centos7
        desiredImage: centos/postgresql-12-centos7
    phase: Ready
    relatedObjects:
    - apiVersion: noobaa.io/v1alpha1
      kind: NooBaa
      name: noobaa
      namespace: openshift-storage
      resourceVersion: "102265"
      uid: 907594b1-1f38-4c84-9333-0bff5a10d4d3
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```
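
For anyone reproducing this, a minimal sketch derived from the spec above; only multiCloudGateway.reconcileStrategy is required for a standalone (MCG-only) deployment, and the name/namespace simply match the example:

```
cat <<EOF | oc apply -f -
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: test-storagecluster
  namespace: openshift-storage
spec:
  multiCloudGateway:
    reconcileStrategy: standalone
EOF
```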

Comment 11 Michael Adam 2021-09-14 14:48:15 UTC
fix is contained in ocs-registry:4.9.0-138.ci

Comment 13 Michael Adam 2021-09-17 08:46:52 UTC
Aman, when a crucial BZ like this fails QA, it would be great to set NEEDINFO on the assignee. Otherwise it might go unnoticed for days.

This time it looks like a different issue, something about NooBaa initialization mentioned in the StorageCluster CR status.

Nitin, can you take a look?

Comment 14 Michael Adam 2021-09-17 08:50:34 UTC
Note that when looking for root causes, build 139 was the first build where we switched from main branches to release-4.9 branches for ocs-operator and odf-operator, so reviewing the diff between main and release-4.9 should also give a clue. (Note that Jose had verified the fix from the main branch PR.)
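
For reference, one way to review that diff locally (branch names taken from the PR links above; a sketch, not the exact commands used):

```
git clone https://github.com/red-hat-storage/ocs-operator
cd ocs-operator
# Commits and file changes on release-4.9 relative to main
git log --oneline origin/main..origin/release-4.9
git diff --stat origin/main...origin/release-4.9
```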

Comment 16 Nitin Goyal 2021-09-17 09:32:51 UTC
Hi, I looked at the code and the NooBaa CR. The NooBaa CR is in the Connecting phase, which is why ocs-operator is still reporting Progressing, as can be seen in the StorageCluster status output pasted below (a quick check of the NooBaa phase is sketched after that output). I would suggest filing a new bug against MCG for this.

```
Status:
  Conditions:
    Last Heartbeat Time:   2021-09-15T07:55:24Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2021-09-15T07:19:23Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2021-09-15T07:55:24Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Waiting on Nooba instance to finish initialization
    Reason:                NoobaaInitializing
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2021-09-15T07:19:23Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2021-09-15T07:19:23Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
```
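
A quick way to confirm the NooBaa phase described above (CR name "noobaa" taken from the relatedObjects in the StorageCluster status earlier in this bug):

```
# Phase column of the NooBaa CR ("Connecting" here, "Ready" when healthy)
oc get noobaa noobaa -n openshift-storage
oc get noobaa noobaa -n openshift-storage -o jsonpath='{.status.phase}{"\n"}'
```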

Comment 17 Jose A. Rivera 2021-09-17 13:05:47 UTC
Everyone, please, at least try to look at the Pod logs before replying to a BZ. Not doing so creates needless turnaround time, which for this team can be tremendous given our time zone distribution (and me in particular being constantly behind on BZ notifications). Also, for what it's worth, it's way easier for me to work with must-gather output that I can browse in a web browser than having to download a zip file.

Clearly the Conditions state that something is wrong with the NooBaa initialization:

```
      - lastHeartbeatTime: "2021-09-15T07:35:07Z"
        lastTransitionTime: "2021-09-15T07:19:23Z"
        message: Waiting on Nooba instance to finish initialization
        reason: NoobaaInitializing
        status: "True"
        type: Progressing
```

I found nothing in the ocs-operator logs to indicate a problem, but then I found this in the noobaa-operator logs:

time="2021-09-15T07:37:22Z" level=info msg="❌ Not Found: CephObjectStoreUser \"noobaa-ceph-objectstore-user\"\n"

And looking at the NooBaa YAML:

```
    - lastHeartbeatTime: "2021-09-15T07:19:23Z"
      lastTransitionTime: "2021-09-15T07:19:23Z"
      message: 'RPC: connection (0xc001073180) already closed &{RPC:0xc000433450 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
      reason: TemporaryError
      status: "False"
      type: Available
    - lastHeartbeatTime: "2021-09-15T07:19:23Z"
      lastTransitionTime: "2021-09-15T07:19:23Z"
      message: 'RPC: connection (0xc001073180) already closed &{RPC:0xc000433450 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
      reason: TemporaryError
      status: "True"
      type: Progressing
    - lastHeartbeatTime: "2021-09-15T07:19:23Z"
      lastTransitionTime: "2021-09-15T07:19:23Z"
      message: 'RPC: connection (0xc001073180) already closed &{RPC:0xc000433450 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
      reason: TemporaryError
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2021-09-15T07:19:23Z"
      lastTransitionTime: "2021-09-15T07:19:23Z"
      message: 'RPC: connection (0xc001073180) already closed &{RPC:0xc000433450 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
      reason: TemporaryError
      status: "False"
      type: Upgradeable
```

So, offhand, it looks like NooBaa is expecting a CephObjectStoreUser, which is of course invalid for an MCG-only StorageCluster.

At *this* point I can say that we don't handle that particular check; that's in noobaa-operator. As such, reassigning appropriately. @nbecker PTAL when you can.
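
For anyone triaging follow-ups like this, the noobaa-operator log line quoted above can be pulled straight from the pod logs (deployment name assumed to be noobaa-operator; adjust if it differs in your build):

```
# Scan the noobaa-operator logs for the CephObjectStoreUser lookup
oc logs -n openshift-storage deploy/noobaa-operator | grep -i cephobjectstoreuser
```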

Comment 28 errata-xmlrpc 2021-12-13 17:46:04 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086

