Bug 2001482

Summary: StorageCluster stuck in Progressing state for MCG-only deployment
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: Multi-Cloud Object Gateway
Version: 4.9
Target Release: ODF 4.9.0
Fixed In Version: v4.9.0-158.ci
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: TestBlocker
Reporter: Aman Agrawal <amagrawa>
Assignee: Nimrod Becker <nbecker>
QA Contact: Aman Agrawal <amagrawa>
CC: aos-bugs, ebenahar, etamir, jalbo, jijoy, madam, muagarwa, nbecker, nberry, nigoyal, nthomas, ocs-bugs, odf-bz-bot, rperiyas, sostapov, uchapaga
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-12-13 17:46:04 UTC

Comment 2 umanga 2021-09-08 06:22:58 UTC
Can't find relevant must-gather logs. Can you attach them here?
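
For reference, the standard collection command is below; the image tag is my assumption for an ODF 4.9-era cluster, so adjust it to match your release.

```
# Collect an ODF must-gather (image tag assumed for 4.9; verify against your release)
oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.9
```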

Comment 4 Mudit Agarwal 2021-09-08 12:00:25 UTC
Aman, can you reproduce the issue and provide the cluster to Umanga/Jose?

Comment 6 umanga 2021-09-08 12:51:44 UTC
```
 {"level":"error", "msg":"Failed to set node Topology Map for StorageCluster.", "error":"Not enough nodes found: Expected 3, found 0"}
```

Looks like the nodes are not labelled.
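
A quick way to check (a sketch; `cluster.ocs.openshift.io/openshift-storage` is the storage label ocs-operator keys on for its topology map, as far as I know):

```
# List nodes carrying the OCS storage label; the topology map counts these
oc get nodes -l cluster.ocs.openshift.io/openshift-storage=

# If none are found, label the intended nodes (node name is a placeholder)
oc label node <node-name> cluster.ocs.openshift.io/openshift-storage=""
```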

Comment 7 Jose A. Rivera 2021-09-13 15:02:37 UTC
I think I figured out the problem, but I'm going to try and actually verify it on my own cluster first before pushing it into upstream CI. ;)

Comment 8 Jose A. Rivera 2021-09-13 16:18:31 UTC
Upstream PR is live: https://github.com/red-hat-storage/ocs-operator/pull/1335

I have verified that the StorageCluster comes up and reaches the Ready phase:

```
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2021-09-13T16:15:20Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 2
    name: test-storagecluster
    namespace: openshift-storage
    resourceVersion: "102267"
    uid: 95ef2784-46de-4429-9eee-1480878a059a
  spec:
    arbiter: {}
    encryption:
      kms: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephCluster: {}
      cephConfig: {}
      cephDashboard: {}
      cephFilesystems: {}
      cephObjectStoreUsers: {}
      cephObjectStores: {}
    mirroring: {}
    multiCloudGateway:
      reconcileStrategy: standalone  <-- this is the only *required* spec
    resources:                       <-- these are for testing on undersized clusters only
      noobaa-core: {}
      noobaa-db: {}
      noobaa-endpoint: {}
    version: 4.9.0
  status:
    conditions:
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:15:20Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: ReconcileComplete
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:16:27Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: Available
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:16:27Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "False"
      type: Progressing
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:15:20Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:16:27Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: Upgradeable
    images:
      ceph:
        desiredImage: ceph/daemon-base:latest-pacific
      noobaaCore:
        actualImage: noobaa/noobaa-core:master-20210609
        desiredImage: noobaa/noobaa-core:master-20210609
      noobaaDB:
        actualImage: centos/postgresql-12-centos7
        desiredImage: centos/postgresql-12-centos7
    phase: Ready
    relatedObjects:
    - apiVersion: noobaa.io/v1alpha1
      kind: NooBaa
      name: noobaa
      namespace: openshift-storage
      resourceVersion: "102265"
      uid: 907594b1-1f38-4c84-9333-0bff5a10d4d3
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```
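
For reference: per the annotations above, a minimal MCG-only StorageCluster should boil down to something like this (a sketch, reusing the name/namespace from the output above):

```
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: test-storagecluster
  namespace: openshift-storage
spec:
  multiCloudGateway:
    reconcileStrategy: standalone   # the only required spec for MCG-only
```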

Comment 11 Michael Adam 2021-09-14 14:48:15 UTC
fix is contained in ocs-registry:4.9.0-138.ci

Comment 13 Michael Adam 2021-09-17 08:46:52 UTC
Aman, when a crucial BZ like this fails QA, it would be great to set NEEDINFO on the assignee. Otherwise it might go unnoticed for days.

This time it looks like a different issue: something about NooBaa initialization mentioned in the StorageCluster CR status.

Nitin, can you take a look?

Comment 14 Michael Adam 2021-09-17 08:50:34 UTC
Note that when looking for root causes, build 139 was the first build where we switched from main branches to release-4.9 branches for ocs-operator and odf-operator, so reviewing the diff between main and release-4.9 should also give a clue. (Note that Jose had verified the fix from the main branch PR.)
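
A sketch of how to review that diff, assuming the red-hat-storage/ocs-operator repo linked above:

```
git clone https://github.com/red-hat-storage/ocs-operator
cd ocs-operator
# Commits on main that never made it to release-4.9, and the full diff
git log --oneline origin/release-4.9..origin/main
git diff origin/release-4.9...origin/main
```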

Comment 16 Nitin Goyal 2021-09-17 09:32:51 UTC
Hi, I looked at the code and the NooBaa CR. The NooBaa CR is in the Connecting phase, which is why ocs-operator is still in the Progressing state, as we can see in the StorageCluster output above. I would suggest filing a new bug against MCG for this.

```
Status:
  Conditions:
    Last Heartbeat Time:   2021-09-15T07:55:24Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2021-09-15T07:19:23Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2021-09-15T07:55:24Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Waiting on Nooba instance to finish initialization
    Reason:                NoobaaInitializing
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2021-09-15T07:19:23Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2021-09-15T07:19:23Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
```
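
A quick way to confirm the NooBaa phase directly (a sketch; the CR is named `noobaa` per the relatedObjects above, and `.status.phase` is where Connecting shows up):

```
# Print the phase reported by noobaa-operator on its CR
oc get noobaa noobaa -n openshift-storage -o jsonpath='{.status.phase}{"\n"}'
```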

Comment 17 Jose A. Rivera 2021-09-17 13:05:47 UTC
Everyone, please, at least try to look at the Pod logs before replying to a BZ. Not doing so creates needless turnaround time, which for this team can be tremendous given our time zone distribution (and me in particular being constantly behind on BZ notifications). Also, for what it's worth, it's way easier for me to work with must-gather output that I can browse in a web browser than having to download a zip file.

Clearly the Conditions state that something is wrong with the NooBaa initialization:

```
      - lastHeartbeatTime: "2021-09-15T07:35:07Z"
        lastTransitionTime: "2021-09-15T07:19:23Z"
        message: Waiting on Nooba instance to finish initialization
        reason: NoobaaInitializing
        status: "True"
        type: Progressing
```

I found nothing in the ocs-operator logs to indicate a problem, but then I found this in the noobaa-operator logs:

```
time="2021-09-15T07:37:22Z" level=info msg="❌ Not Found: CephObjectStoreUser \"noobaa-ceph-objectstore-user\"\n"
```

And looking at the NooBaa YAML:

```
    - lastHeartbeatTime: "2021-09-15T07:19:23Z"
      lastTransitionTime: "2021-09-15T07:19:23Z"
      message: 'RPC: connection (0xc001073180) already closed &{RPC:0xc000433450 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
      reason: TemporaryError
      status: "False"
      type: Available
    - lastHeartbeatTime: "2021-09-15T07:19:23Z"
      lastTransitionTime: "2021-09-15T07:19:23Z"
      message: 'RPC: connection (0xc001073180) already closed &{RPC:0xc000433450 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
      reason: TemporaryError
      status: "True"
      type: Progressing
    - lastHeartbeatTime: "2021-09-15T07:19:23Z"
      lastTransitionTime: "2021-09-15T07:19:23Z"
      message: 'RPC: connection (0xc001073180) already closed &{RPC:0xc000433450 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
      reason: TemporaryError
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2021-09-15T07:19:23Z"
      lastTransitionTime: "2021-09-15T07:19:23Z"
      message: 'RPC: connection (0xc001073180) already closed &{RPC:0xc000433450 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
      reason: TemporaryError
      status: "False"
      type: Upgradeable
```

So, offhand, it looks like NooBaa is expecting a CephObjectStoreUser, which is of course invalid for an MCG-only StorageCluster.

At *this* point I can say that we don't handle that particular check in ocs-operator; it lives in noobaa-operator. As such, reassigning appropriately. @nbecker PTAL when you can.
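
To confirm on an affected cluster, a sketch of the check (the resource name comes from the log line above):

```
# In an MCG-only deployment there is no CephObjectStore, so this should
# legitimately return nothing; noobaa-operator should not block on it
oc get cephobjectstoreusers -n openshift-storage
```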

Comment 28 errata-xmlrpc 2021-12-13 17:46:04 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086