Bug 2264039

Summary: [4.14 Clone] Noobaa stuck configuring when deploying on IBM Cloud (IPI) with COS-backed backingstore
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Danny <dzaken>
Component: Multi-Cloud Object Gateway
Assignee: Nimrod Becker <nbecker>
Status: CLOSED ERRATA
QA Contact: Sagi Hirshfeld <shirshfe>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.15
CC: asriram, belimele, clacroix, dosypenk, dzaken, ebenahar, etamir, kbg, kramdoss, muagarwa, nbecker, odf-bz-bot, pbalogh
Target Milestone: ---
Target Release: ODF 4.14.7
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.14.7-1
Doc Type: Bug Fix
Doc Text:
Previously, when deploying OpenShift Data Foundation on IBM installer-provisioned infrastructure, a cluster using a COS-backed Multicloud Object Gateway backingstore would get stuck in the `Configuring` state. This was because the operator attempted to read the IBM region label from an arbitrary node, and the lookup failed because only the worker nodes carry the label. As a result, the default backingstore could not be created, and the Multicloud Object Gateway and OpenShift Data Foundation deployments subsequently failed. With this fix, the label handling in the NooBaa operator has been corrected so that Multicloud Object Gateway and OpenShift Data Foundation deploy successfully.
Story Points: ---
Clone Of: 2255557
Environment:
Last Closed: 2024-05-28 15:39:00 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2255557
Bug Blocks:

Description Danny 2024-02-13 14:19:03 UTC
+++ This bug was initially created as a clone of Bug #2255557 +++

Description of problem (please be as detailed as possible and provide log
snippets):

When trying to deploy ODF to an IBM IPI cluster using a COS-backed backingstore, Noobaa becomes stuck in the Configuring state. 

-----------------
  phase: Configuring
  readme: "\n\n\tNooBaa operator is still working to reconcile this system.\n\tCheck
    out the system status.phase, status.conditions, and events with:\n\n\t\tkubectl
    -n openshift-storage describe noobaa\n\t\tkubectl -n openshift-storage get noobaa
    -o yaml\n\t\tkubectl -n openshift-storage get events --sort-by=metadata.creationTimestamp\n\n\tYou
    can wait for a specific condition with:\n\n\t\tkubectl -n openshift-storage wait
    noobaa/noobaa --for condition=available --timeout -1s\n\n\tNooBaa Core Version:
    \    master-20230920\n\tNooBaa Operator Version: 5.15.0\n"
  services:
-----------------
  - lastHeartbeatTime: "2023-12-18T10:39:32Z"
    lastTransitionTime: "2023-12-18T10:30:28Z"
    message: |-
      RequestError: send request failed
      caused by: Put "https://s3.direct..cloud-object-storage.appdomain.cloud/nb.1702895972648.apps.jnk-pr9072b6235.ibmcloud2.qe.rh-ocs.com": dial tcp: lookup s3.direct..cloud-object-storage.appdomain.cloud: no such host
    reason: TemporaryError
-----------------
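The double dot in the failing hostname shows that the region segment came back empty: the operator read the ibm-cloud.kubernetes.io/region label from a node that does not carry it. A hypothetical check, based on the endpoint pattern visible in the error above (s3.direct.<region>.cloud-object-storage.appdomain.cloud):

# The endpoint the operator should have built vs. the one it built with an
# empty region value:
#   s3.direct.us-south.cloud-object-storage.appdomain.cloud   (resolves)
#   s3.direct..cloud-object-storage.appdomain.cloud           (no such host)
# Listing the label as a column shows that only the worker rows have a value:
$ oc get nodes -L ibm-cloud.kubernetes.io/region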


Version of all relevant components (if applicable):
Server Version: 4.15.0-0.nightly-2023-12-19-033450


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, Noobaa cannot be installed in this scenario (IBM IPI with a COS-backed backingstore).


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2


Is this issue reproducible?
yes


Can this issue be reproduced from the UI?
unknown


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCP to IBM Cloud (IPI)
2. Label the worker nodes with region information:
'oc label node <worker-name> ibm-cloud.kubernetes.io/region=<REGION>'; in our case <REGION> is us-south.
3. Deploy ODF, creating the Secret via YAML as described here (see the sketch after this list):
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.9/html-single/deploying_and_managing_openshift_data_foundation_using_google_cloud/index#creating-an-IBM-COS-backed-backingstore_rhodf
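A hedged sketch of steps 2 and 3; the secret name ibm-cloud-cos-creds and its two key names are assumptions taken from the IBM and ODF docs linked in step 3 and in the Additional info section, not copied from this report:

# Step 2: label each worker node with its region (us-south here).
$ oc label node <worker-name> ibm-cloud.kubernetes.io/region=us-south

# Step 3: create the COS credentials Secret that the NooBaa operator picks up
# when it creates the default COS-backed backingstore.
$ oc -n openshift-storage create secret generic ibm-cloud-cos-creds \
    --from-literal=IBM_COS_ACCESS_KEY_ID=<access-key-id> \
    --from-literal=IBM_COS_SECRET_ACCESS_KEY=<secret-access-key>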


Actual results:
Noobaa is stuck in the Configuring phase.


Expected results:
Noobaa creation is successful and the deployment succeeds.


Additional info:

https://cloud.ibm.com/docs/containers?topic=containers-storage_cos_install

https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.9/html-single/deploying_and_managing_openshift_data_foundation_using_google_cloud/index#creating-an-IBM-COS-backed-backingstore_rhodf

noobaa operator logs: https://url.corp.redhat.com/d7c188f

noobaa.yaml: https://url.corp.redhat.com/a9c0127

full ocs must gather: https://url.corp.redhat.com/b9c7f09

--- Additional comment from RHEL Program Management on 2023-12-21 22:49:06 IST ---

This bug previously had no release flag set; the release flag 'odf-4.15.0' has now been set to '?', so the bug is proposed to be fixed in the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have now been reset, since the Acks must be set against a release flag.

--- Additional comment from Ben Eli on 2024-01-02 16:19:06 IST ---

The node labels do not seem to be present - 
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr9072b6235/jnk-pr9072b6235_20231218T081756/logs/deployment_1702894821/jnk-pr9072b6235/ocs_must_gather/c2f9bca16b1fcd4caaeb21cefc9c0f5835b8c6d068b152724d7c54e451b83af3/cluster-scoped-resources/oc_output/desc_nodes

--- Additional comment from Petr Balogh on 2024-01-02 16:47:57 IST ---

I am not sure how Coady's PR validated this, but when I tried manually before the shutdown, I collected a must-gather here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-cos/pbalogh-cos_20231221T102303/logs/deployment_1703160725/

Validation job was triggered here:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/32237/console

I see labels on all 3 worker nodes:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-cos/pbalogh-cos_20231221T102303/logs/deployment_1703160725/pbalogh-cos/ocs_must_gather/21ea0203971406f15ee064be07fc5878d185202baf8e79078a0893b8db582ef5/cluster-scoped-resources/oc_output/get_nodes_-o_wide_--show-labels

ibm-cloud.kubernetes.io/region=us-south

I labeled the nodes manually right after OCP deployment, like:
$ oc label node worker-X-node-name ibm-cloud.kubernetes.io/region=us-south

I will let Coady check why the job in his verification link doesn't have the labels, but when I tried manually I also did not succeed in getting it working.

--- Additional comment from Coady LaCroix on 2024-01-03 00:25:11 IST ---

We were able to get a successful deployment by adding the region label to all of the cluster nodes. Applying to the workers only was insufficient.

https://url.corp.redhat.com/8b2fc5d

Logs are linked to in the description of the job if necessary.
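For reference, a minimal sketch of the workaround described above; using --all here (to label every node, masters included) is my assumption about how it was applied:

# Label every node so the operator finds the region no matter which node it
# happens to inspect. This is only needed on builds without the operator fix.
$ oc label nodes --all ibm-cloud.kubernetes.io/region=us-south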

--- Additional comment from RHEL Program Management on 2024-01-09 11:37:35 IST ---

This BZ is being approved for the ODF 4.15.0 release, upon receipt of the 3 ACKs (PM, Devel, QA) for the release flag 'odf-4.15.0'.

--- Additional comment from RHEL Program Management on 2024-01-09 11:37:35 IST ---

Since this bug has been approved for the ODF 4.15.0 release through release flag 'odf-4.15.0+', the Target Release is being set to 'ODF 4.15.0'.

--- Additional comment from Coady LaCroix on 2024-01-11 23:11:03 IST ---

Verified the deployment was successful after only labeling the worker nodes.

Jenkins: https://url.corp.redhat.com/7db156b

Verified on Server Version: 4.15.0-0.nightly-2024-01-10-101042
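A hedged sketch of re-checking the same result from the CLI; the wait command is taken from the NooBaa readme quoted in the description, and the 600s timeout is an arbitrary choice:

# Confirm that only the workers carry the region label, then wait for NooBaa
# to report itself available and check the default backingstore.
$ oc get nodes -L ibm-cloud.kubernetes.io/region
$ oc -n openshift-storage wait noobaa/noobaa --for condition=available --timeout=600s
$ oc -n openshift-storage get backingstore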

--- Additional comment from errata-xmlrpc on 2024-02-06 16:49:43 IST ---

This bug has been added to advisory RHBA-2023:118688 by Boris Ranto (branto)

--- Additional comment from Eran Tamir on 2024-02-11 09:48:46 IST ---

please backport to 4.14

Comment 9 errata-xmlrpc 2024-05-28 15:39:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.14.7 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:3443