Bug 2262087

Summary: MCG fails to initialize the default IBM COS backingstore due to problematic node lookup logic
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Daniel Osypenko <dosypenk>
Component: Multi-Cloud Object Gateway
Assignee: Nimrod Becker <nbecker>
Status: CLOSED DUPLICATE
QA Contact: krishnaram Karthick <kramdoss>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.13
CC: belimele, dzaken, etamir, muagarwa, nbecker, nberry, odf-bz-bot, vavuthu
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-02-07 09:11:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Daniel Osypenko 2024-01-31 12:50:11 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

When deploying a new cluster on multiple cloud platforms (confirmed on ODF 4.13, 4.14, and 4.15), ocs-storagecluster stays in the Progressing phase and never reaches Ready (observed for over 4h).

The ocs-operator logs show repeated "Reconciler error" entries:

```
oc get storageclusters.ocs.openshift.io -A
NAMESPACE           NAME                 AGE    PHASE         EXTERNAL   CREATED AT             VERSION
openshift-storage   ocs-storagecluster   4h1m   Progressing              2024-01-31T08:41:24Z   4.13.7
```
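For scripted checks, the same phase can be read from the JSON form of the command above instead of the table. The following is only a minimal sketch: it assumes the output of `oc get storageclusters.ocs.openshift.io -A -o json` is passed in as a string and that each item exposes its phase under `status.phase`; the function name `not_ready` is hypothetical.

```python
import json


def not_ready(oc_json, want="Ready"):
    """Return (namespace, name, phase) for each item whose phase differs from `want`.

    Assumption: `oc_json` is the text printed by `oc get <kind> -A -o json`,
    i.e. either a List object with an `items` array or a single object.
    """
    doc = json.loads(oc_json)
    # A List wraps the resources in `items`; a single resource has no such key.
    items = doc.get("items", [doc])
    result = []
    for item in items:
        meta = item.get("metadata") or {}
        # Assumption: the phase shown in the PHASE column lives at status.phase.
        phase = (item.get("status") or {}).get("phase", "")
        if phase != want:
            result.append((meta.get("namespace", ""), meta.get("name", ""), phase))
    return result
```

An empty list means every listed resource reports the wanted phase; for the output above the sketch would report ocs-storagecluster as Progressing.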

```
oc logs ocs-operator-868455d4bb-kpl8w -n openshift-storage | grep error

{"level":"error","ts":"2024-01-31T09:21:19Z","msg":"Reconciler error","controller":"storagecluster","controllerGroup":"ocs.openshift.io","controllerKind":"StorageCluster","StorageCluster":{"name":"ocs-storagecluster","namespace":"openshift-storage"},"namespace":"openshift-storage","name":"ocs-storagecluster","reconcileID":"da39bff0-b0fb-4acd-96c1-e439c64fff80","error":"Operation cannot be fulfilled on storageclusters.ocs.openshift.io \"ocs-storagecluster\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235"}
{"level":"error","ts":"2024-01-31T09:30:26Z","msg":"Reconciler error","controller":"storagecluster","controllerGroup":"ocs.openshift.io","controllerKind":"StorageCluster","StorageCluster":{"name":"ocs-storagecluster","namespace":"openshift-storage"},"namespace":"openshift-storage","name":"ocs-storagecluster","reconcileID":"e53ad2b1-c132-438d-b796-86298c180a2e","error":"Operation cannot be fulfilled on storageclusters.ocs.openshift.io \"ocs-storagecluster\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller)

```
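Since the operator log is one JSON object per line, the error entries can be grouped rather than grepped, which makes it easier to see whether a single error repeats. This is a rough sketch, not part of the original triage: the function name `summarize_errors` is hypothetical, and it relies only on the `level`, `msg`, and `error` keys visible in the lines above. Truncated or non-JSON lines are skipped.

```python
import json
from collections import Counter


def summarize_errors(log_lines):
    """Count error-level entries per (msg, error) pair in JSON-formatted log lines."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # blank lines and plain-text output (e.g. klog lines)
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or otherwise malformed JSON line
        if entry.get("level") != "error":
            continue
        counts[(entry.get("msg", ""), entry.get("error", ""))] += 1
    return counts
```

The function accepts any iterable of lines, for example `sys.stdin` when the `oc logs` output is piped into a script; `counts.most_common()` then lists the most frequent errors first.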

```
oc logs odf-operator-controller-manager-f56b885d5-sh287 -n openshift-storage
Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, manager
Flag --logtostderr has been deprecated, will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components
W0131 08:39:37.215604       1 kube-rbac-proxy.go:156] 
==== Deprecation Warning ======================

Insecure listen address will be removed.
Using --insecure-listen-address won't be possible!

The ability to run kube-rbac-proxy without TLS certificates will be removed.
Not using --tls-cert-file and --tls-private-key-file won't be possible!

For more information, please go to https://github.com/brancz/kube-rbac-proxy/issues/187

===============================================

		
I0131 08:39:37.215763       1 kube-rbac-proxy.go:285] Valid token audiences: 
I0131 08:39:37.215834       1 kube-rbac-proxy.go:383] Generating self signed cert as no cert is provided
I0131 08:39:37.729261       1 kube-rbac-proxy.go:447] Starting TCP socket on 0.0.0.0:8443
I0131 08:39:37.729650       1 kube-rbac-proxy.go:454] Listening securely on 0.0.0.0:8443
```
The storage system is OK, with no warnings or errors; the cluster is in a working state.

Version of all relevant components (if applicable):
Problem has been seen on multiple versions


OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.13.0-0.nightly-2024-01-30-181028
Kubernetes Version: v1.26.13+77e61a2
OCS version:
ocs-operator.v4.13.7-rhodf              OpenShift Container Storage   4.13.7-rhodf   ocs-operator.v4.13.6-rhodf              Succeeded
Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2024-01-30-181028   True        False         4h14m   Cluster version is 4.13.0-0.nightly-2024-01-30-181028
Rook version:
rook: v4.13.7-0.42f43768ad57d91be47327f83653c05eeb721977
go: go1.19.13
Ceph version:
ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1 

Is this issue reproducible?
yes

Can this issue reproduce from the UI?
no

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy a cluster on one of the supported cloud providers


Actual results:
ocs-storagecluster never becomes Ready

Expected results:
ocs-storagecluster becomes Ready within a reasonable time after deployment
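The expected result can be turned into an explicit check by polling the phase against a deadline. The sketch below is illustrative only: `wait_for_phase` and its parameters are hypothetical names, the 30-minute default is an assumed value for "reasonable time" (the report does not define one), and `get_phase` is any caller-supplied function that returns the current phase string.

```python
import time


def wait_for_phase(get_phase, want="Ready", timeout_s=1800.0, interval_s=30.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll get_phase() until it returns `want` or `timeout_s` has elapsed.

    Returns (reached, last_phase, elapsed_seconds).
    `clock` and `sleep` are injectable so the loop can be exercised without waiting.
    """
    start = clock()
    while True:
        phase = get_phase()
        elapsed = clock() - start
        if phase == want:
            return True, phase, elapsed
        if elapsed >= timeout_s:
            # Deadline passed without reaching the wanted phase.
            return False, phase, elapsed
        sleep(interval_s)
```

With the behaviour described in this report, such a check would time out with the last observed phase still Progressing.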

Additional info:

must-gather logs: https://drive.google.com/file/d/1yPNggohjwcb2Ndg_cnzKmgIkROFc7KcY/view?usp=sharing

Comment 7 Daniel Osypenko 2024-02-06 12:02:30 UTC
Retested with IBM Cloud and odf-operator.v4.15.0-134.stable; the issue did not reproduce.

oc get storageclusters.ocs.openshift.io -A
NAMESPACE           NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
openshift-storage   ocs-storagecluster   67m   Ready              2024-02-06T10:49:35Z   4.15.0

oc get noobaa -A
NAMESPACE           NAME     S3-ENDPOINTS                   STS-ENDPOINTS                  IMAGE                                                                                                            PHASE   AGE
openshift-storage   noobaa   ["https://10.240.0.4:30157"]   ["https://10.240.0.4:30181"]   registry.redhat.io/odf4/mcg-core-rhel9@sha256:1d79a2ac176ca6e69c3198d0e35537aaf29373440d214d324d0d433d1473d9a1   Ready   67m

oc get backingstores.noobaa.io -A
NAMESPACE           NAME                           TYPE      PHASE   AGE
openshift-storage   noobaa-default-backing-store   ibm-cos   Ready   67m

Comment 8 Nimrod Becker 2024-02-07 07:58:14 UTC
I agree with comment 6

Comment 10 Mudit Agarwal 2024-02-07 09:11:14 UTC
Please refer to #comment6

Comment 11 Mudit Agarwal 2024-02-07 09:13:02 UTC
Sorry, pressed enter too early. Please refer to #comment5 and #comment6

*** This bug has been marked as a duplicate of bug 2255557 ***