noobaa-core validates the endpoint details of the backingstore by performing listBuckets on the target. In this case it gets an UnknownEndpoint error. Is it possible that this cluster has limited access to the internet and can't reach amazonaws?
Other AWS backing stores are working properly, so it is not a persistent connectivity issue. It looks like there was a momentary networking issue when trying to validate the connection.
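For illustration, a momentary failure in the validation step described above could be tolerated by retrying the call with a short backoff. This is only a minimal sketch, not noobaa-core's actual code: `validate_endpoint`, `UnknownEndpointError`, and the `list_buckets` callable are hypothetical names, and the attempt count and delays are arbitrary assumptions.

```python
import time


class UnknownEndpointError(Exception):
    """Hypothetical stand-in for the UnknownEndpoint error seen in this bug."""


def validate_endpoint(list_buckets, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call list_buckets(), retrying on UnknownEndpointError with exponential backoff.

    list_buckets is any zero-argument callable that raises UnknownEndpointError
    when the target cannot be reached (an assumption made for this sketch).
    Returns the number of attempts used on success and re-raises the last
    error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            list_buckets()
            return attempt
        except UnknownEndpointError:
            if attempt == attempts:
                raise
            # Back off before the next try: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With a callable that fails once and then succeeds, `validate_endpoint` returns on the second attempt instead of failing the backingstore outright. A target that is genuinely unreachable still surfaces the error after the last attempt.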
I tried to reproduce the bug twice on a similar environment, but failed to do so. I marked the bug as a regression because this is the first time we're seeing it; we did not run into it in the past. It may just be a rare bug, but I went with the dry definition of 'it used to work, and now it doesn't'.
Having the same issue on RHOCS baremetal:

...
noobaa-core-0 Running
noobaa-db-0 Init:0/1
noobaa-operator-5ff7c8d94-d8fkv Running
...
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-85f9c95dbwx5 Running
...

Other notes: OpenShift Container Storage 4.4.2 Installing

oc logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-85f9c95dbwx5 produces:
debug 2020-09-03 15:52:21.961 7f95bbad4700 1 ====== starting new request req=0x7f96ca4a3890 =====
debug 2020-09-03 15:52:21.961 7f95bbad4700 1 ====== req done req=0x7f96ca4a3890 op status=0 http_status=200 latency=0s ======

oc logs noobaa-operator-5ff7c8d94-d8fkv produces:
time="2020-09-03T15:53:39Z" level=info msg="✈️ RPC: auth.read_auth() Request: <nil>"

oc logs noobaa-core-0 produces:
Sep-3 15:55:43.598 [/16] [ERROR] core.util.mongo_client:: _connect: initial connect failed, will retry failed to connect to server [noobaa-db-0.noobaa-db:27017] on first connect [Error: getaddrinfo ENOTFOUND noobaa-db-0.noobaa-db at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:66:26) { name: 'MongoNetworkError', errorLabels: [Array], [Symbol(mongoErrorContextSymbol)]: {} }]

oc get noobaa noobaa -n openshift-storage -o yaml produces:
message: 'RPC: connection (0xc001b40730) already closed &{RPC:0xc00016f0e0 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
reason: TemporaryError
This doesn't seem to be the same issue; please file a new BZ.
Thanks. Rebooting the entire cluster resolved the problem. It must have been some IP assignment issue with the pods.
Since we have no direct way of reproducing the issue, we have to rely on regression testing to verify that it has not happened again. In a run with 4.6.0-98.ci using the same setup parameters, the test no longer fails: http://post-office.corp.redhat.com/archives/ocs-ci/2020-September/msg00505.html Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605