Can't find relevant must-gather logs. Can you attach them here?
Aman, can you reproduce the issue and provide the cluster to Umanga/Jose?
``` {"level":"error", "msg":"Failed to set node Topology Map for StorageCluster.", "error":"Not enough nodes found: Expected 3, found 0"} ``` Looks like Nodes are not labelled.
I think I figured out the problem, but I'm going to try and actually verify it on my own cluster first before pushing it into upstream CI. ;)
Upstream PR is live: https://github.com/red-hat-storage/ocs-operator/pull/1335. I have verified that the StorageCluster comes up and is running:
```
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2021-09-13T16:15:20Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 2
    name: test-storagecluster
    namespace: openshift-storage
    resourceVersion: "102267"
    uid: 95ef2784-46de-4429-9eee-1480878a059a
  spec:
    arbiter: {}
    encryption:
      kms: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephCluster: {}
      cephConfig: {}
      cephDashboard: {}
      cephFilesystems: {}
      cephObjectStoreUsers: {}
      cephObjectStores: {}
    mirroring: {}
    multiCloudGateway:
      reconcileStrategy: standalone   # <-- this is the only *required* spec
    resources:                        # <-- these are for testing on undersized clusters only
      noobaa-core: {}
      noobaa-db: {}
      noobaa-endpoint: {}
    version: 4.9.0
  status:
    conditions:
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:15:20Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: ReconcileComplete
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:16:27Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: Available
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:16:27Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "False"
      type: Progressing
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:15:20Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2021-09-13T16:16:55Z"
      lastTransitionTime: "2021-09-13T16:16:27Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: Upgradeable
    images:
      ceph:
        desiredImage: ceph/daemon-base:latest-pacific
      noobaaCore:
        actualImage: noobaa/noobaa-core:master-20210609
        desiredImage: noobaa/noobaa-core:master-20210609
      noobaaDB:
        actualImage: centos/postgresql-12-centos7
        desiredImage: centos/postgresql-12-centos7
    phase: Ready
    relatedObjects:
    - apiVersion: noobaa.io/v1alpha1
      kind: NooBaa
      name: noobaa
      namespace: openshift-storage
      resourceVersion: "102265"
      uid: 907594b1-1f38-4c84-9333-0bff5a10d4d3
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```
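For anyone reproducing this, a stripped-down sketch of the CR above with the defaulted/empty fields removed; per the note in the dump, `multiCloudGateway.reconcileStrategy` is the only required piece of the spec:
```
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: test-storagecluster
  namespace: openshift-storage
spec:
  multiCloudGateway:
    reconcileStrategy: standalone   # MCG-only (standalone NooBaa) deployment
```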
The fix is contained in ocs-registry:4.9.0-138.ci.
Aman, when a crucial BZ like this fails QA, it would be great to set NEEDINFO on the assignee. Otherwise it might go unnoticed for days. This time it looks like a different issue, something about NooBaa initialization mentioned in the StorageCluster CR status. Nitin, can you take a look?
When looking for root causes, note that build 139 was the first build where we switched from the main branches to the release-4.9 branches for ocs-operator and odf-operator, so reviewing the diff between main and release-4.9 should also give a clue. (Jose had verified the fix from the main-branch PR.)
Hi, I looked at the code and the NooBaa CR. The NooBaa CR is in the Connecting phase, which is why ocs-operator is still in the Progressing state, as we can see in the StorageCluster output below. I would suggest filing a new bug against MCG for this.
```
Status:
  Conditions:
    Last Heartbeat Time:   2021-09-15T07:55:24Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2021-09-15T07:19:23Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2021-09-15T07:55:24Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Waiting on Nooba instance to finish initialization
    Reason:                NoobaaInitializing
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2021-09-15T07:19:23Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2021-09-15T07:19:23Z
    Last Transition Time:  2021-09-15T07:19:23Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
```
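For completeness, the same state should be visible on the NooBaa CR itself; a rough sketch of what that looks like (the `status.phase` field is assumed from the noobaa.io/v1alpha1 API; the object name and namespace are taken from the StorageCluster's relatedObjects above):
```
apiVersion: noobaa.io/v1alpha1
kind: NooBaa
metadata:
  name: noobaa
  namespace: openshift-storage
status:
  phase: Connecting   # ocs-operator keeps Progressing=True until this reaches Ready
```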
Everyone, please, at least try to look at the Pod logs before replying to a BZ. Not doing so creates needless turnaround time, which for this team can be tremendous given our time zone distribution (and me in particular being constantly behind on BZ notifications). Also, for what it's worth, it's way easier for me to work with must-gather output that I can browse in a web browser than having to download a zip file.

The Conditions clearly state that something is wrong with the NooBaa initialization:
```
- lastHeartbeatTime: "2021-09-15T07:35:07Z"
  lastTransitionTime: "2021-09-15T07:19:23Z"
  message: Waiting on Nooba instance to finish initialization
  reason: NoobaaInitializing
  status: "True"
  type: Progressing
```
I found nothing in the ocs-operator logs to indicate a problem, but then I found this in the noobaa-operator logs:
```
time="2021-09-15T07:37:22Z" level=info msg="❌ Not Found: CephObjectStoreUser \"noobaa-ceph-objectstore-user\"\n"
```
And looking at the NooBaa YAML:
```
- lastHeartbeatTime: "2021-09-15T07:19:23Z"
  lastTransitionTime: "2021-09-15T07:19:23Z"
  message: 'RPC: connection (0xc001073180) already closed &{RPC:0xc000433450 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/
    State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
  reason: TemporaryError
  status: "False"
  type: Available
- lastHeartbeatTime: "2021-09-15T07:19:23Z"
  lastTransitionTime: "2021-09-15T07:19:23Z"
  message: 'RPC: connection (0xc001073180) already closed &{RPC:0xc000433450 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/
    State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
  reason: TemporaryError
  status: "True"
  type: Progressing
- lastHeartbeatTime: "2021-09-15T07:19:23Z"
  lastTransitionTime: "2021-09-15T07:19:23Z"
  message: 'RPC: connection (0xc001073180) already closed &{RPC:0xc000433450 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/
    State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
  reason: TemporaryError
  status: "False"
  type: Degraded
- lastHeartbeatTime: "2021-09-15T07:19:23Z"
  lastTransitionTime: "2021-09-15T07:19:23Z"
  message: 'RPC: connection (0xc001073180) already closed &{RPC:0xc000433450 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/
    State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
  reason: TemporaryError
  status: "False"
  type: Upgradeable
```
So, offhand, it looks like NooBaa is expecting a CephObjectStoreUser, which is of course invalid for an MCG-only StorageCluster. At *this* point I could come in and say that we don't handle that particular check; that's in noobaa-operator. As such, reassigning appropriately. @nbecker PTAL when you can.
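For context on the noobaa-operator log line above, a hypothetical sketch of the Rook CR it is waiting for; in an MCG-only deployment neither this user nor the backing CephObjectStore is ever created (the store name below is only illustrative of what a full deployment would have):
```
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: noobaa-ceph-objectstore-user            # name taken from the log message
  namespace: openshift-storage
spec:
  store: ocs-storagecluster-cephobjectstore     # illustrative store name, assumed
  displayName: noobaa-ceph-objectstore-user
```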
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086