Bug 1873864 - Noobaa: On a baremetal RHCOS cluster, some backingstores are stuck in PROGRESSING state with INVALID_ENDPOINT TemporaryError
Summary: Noobaa: On a baremetal RHCOS cluster, some backingstores are stuck in PROGRESSING state with INVALID_ENDPOINT TemporaryError
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Danny
QA Contact: Ben Eli
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-30 15:45 UTC by Ben Eli
Modified: 2020-12-17 06:24 UTC
CC: 9 users

Fixed In Version: v4.6.0-75.ci
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-17 06:24:01 UTC
Embargoed:




Links:
  GitHub noobaa/noobaa-operator pull 405 (closed): keep requeueing backing-store for 5 minutes if getting INVALID_ENDPOINT (last updated 2020-12-18 08:45:47 UTC)
  GitHub noobaa/noobaa-operator pull 409 (closed): Backport to 5.6 (last updated 2020-12-18 08:45:47 UTC)
  Red Hat Product Errata RHSA-2020:5605 (last updated 2020-12-17 06:24:26 UTC)

Comment 4 Danny 2020-08-31 13:02:22 UTC
noobaa-core tries to validate the endpoint details of the backingstore by performing listBuckets on the target. In this case it gets an UnknownEndpoint error.

Is it possible that this cluster has limited access to the internet and can't reach amazonaws.com?
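For context, the check described above can be modeled with a short Go sketch (assumptions: AWS SDK for Go v2; the endpoint URL, region, and function name are illustrative and not noobaa-core's actual code, which is Node.js):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// validateEndpoint performs a ListBuckets call against the target,
// mirroring the connection check described in comment 4. A DNS or
// routing failure surfaces as a transport error ("UnknownEndpoint"
// style) rather than an S3 error code.
func validateEndpoint(ctx context.Context, endpoint, region string) error {
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion(region))
	if err != nil {
		return fmt.Errorf("loading AWS config: %w", err)
	}
	client := s3.NewFromConfig(cfg, func(o *s3.Options) {
		o.BaseEndpoint = aws.String(endpoint) // target endpoint under test
	})
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	if _, err := client.ListBuckets(ctx, &s3.ListBucketsInput{}); err != nil {
		return fmt.Errorf("ListBuckets against %s failed: %w", endpoint, err)
	}
	return nil
}

func main() {
	if err := validateEndpoint(context.Background(), "https://s3.us-east-1.amazonaws.com", "us-east-1"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("endpoint reachable")
}

A momentary networking blip during this call would produce the kind of transient failure described in the next comment.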

Comment 5 Danny 2020-08-31 13:51:48 UTC
Other AWS backing-stores are working properly, so it is not a general connectivity issue.
It looks like there was a momentary networking issue when trying to validate the connection.
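The fix tracked in the linked PR ("keep requeueing backing-store for 5 minutes if getting INVALID_ENDPOINT") retries instead of failing fast on such blips. A minimal controller-runtime sketch of that requeue pattern, with the five-minute window taken from the PR title (the function, error check, and intervals are illustrative, not noobaa-operator's actual code):

package main

import (
	"errors"
	"fmt"
	"strings"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

const retryWindow = 5 * time.Minute // per the PR title: keep requeueing for 5 minutes

// requeueOnInvalidEndpoint decides how a reconcile loop should react to a
// validation error. firstSeen is when INVALID_ENDPOINT was first observed
// for this backing store (hypothetical bookkeeping for the sketch).
func requeueOnInvalidEndpoint(err error, firstSeen time.Time) (ctrl.Result, error) {
	if err == nil {
		return ctrl.Result{}, nil
	}
	// Hypothetical string match; a real implementation would inspect the RPC error code.
	if strings.Contains(err.Error(), "INVALID_ENDPOINT") && time.Since(firstSeen) < retryWindow {
		// Treat it as transient: requeue and try again shortly instead of
		// leaving the backing store stuck as rejected.
		return ctrl.Result{RequeueAfter: 15 * time.Second}, nil
	}
	// Retry window exhausted, or a different error: surface it.
	return ctrl.Result{}, err
}

func main() {
	res, err := requeueOnInvalidEndpoint(errors.New("RPC: INVALID_ENDPOINT"), time.Now())
	fmt.Printf("result=%+v err=%v\n", res, err)
}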

Comment 10 Ben Eli 2020-09-03 14:31:23 UTC
I tried to reproduce the bug twice in a similar environment, but failed to do so.

I marked the bug as a regression because this is the first time we're seeing it; we did not run into it in the past.
It may simply be a rare bug, but I went with the dry definition of "it used to work, and now it doesn't".

Comment 11 swilson 2020-09-03 16:26:53 UTC
Having the same issue on RHOCS baremetal:

...
noobaa-core-0 Running
noobaa-db-0   Init:0/1
noobaa-operator-5ff7c8d94-d8fkv   Running
...
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-85f9c95dbwx5 Running
...

Other notes:
OpenShift Container Storage 4.4.2 shows as Installing.

oc logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-85f9c95dbwx5 produces:
debug 2020-09-03 15:52:21.961 7f95bbad4700  1 ====== starting new request req=0x7f96ca4a3890 =====
debug 2020-09-03 15:52:21.961 7f95bbad4700  1 ====== req done req=0x7f96ca4a3890 op status=0 http_status=200 latency=0s ======

oc logs noobaa-operator-5ff7c8d94-d8fkv produces:
time="2020-09-03T15:53:39Z" level=info msg="✈️  RPC: auth.read_auth() Request: <nil>"

oc logs noobaa-core-0 produces:
Sep-3 15:55:43.598 [/16] [ERROR] core.util.mongo_client:: _connect: initial connect failed, will retry failed to connect to server [noobaa-db-0.noobaa-db:27017] on first connect [Error: getaddrinfo ENOTFOUND noobaa-db-0.noobaa-db
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:66:26) {
  name: 'MongoNetworkError',
  errorLabels: [Array],
  [Symbol(mongoErrorContextSymbol)]: {}
}]

oc get noobaa noobaa -n openshift-storage -o yaml produces: 

 message: 'RPC: connection (0xc001b40730) already closed &{RPC:0xc00016f0e0 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/
      State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0}
      ReconnectDelay:3s}'
    reason: TemporaryError

Comment 12 Nimrod Becker 2020-09-03 16:31:24 UTC
This doesn't seem to be the same issue; please file a new BZ.

Comment 13 swilson 2020-09-03 17:47:32 UTC
Thanks. Rebooting the entire cluster resolved the problem. It must have been an IP assignment issue with the pods.
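The getaddrinfo ENOTFOUND in comment 11 means the noobaa-db headless-service name was not resolving inside the noobaa-core pod, which is consistent with a DNS/IP assignment problem. A quick Go sketch of the equivalent check, to be run from a pod in the same namespace (the host name is taken from the log above; everything else is illustrative):

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Pod DNS name behind the noobaa-db headless service, as seen in
	// the mongo_client error in comment 11.
	host := "noobaa-db-0.noobaa-db"
	addrs, err := net.LookupHost(host)
	if err != nil {
		// This is the failure mode matching getaddrinfo ENOTFOUND.
		log.Fatalf("lookup %s failed: %v", host, err)
	}
	fmt.Printf("%s resolves to %v\n", host, addrs)
}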

Comment 18 Ben Eli 2020-09-30 09:38:52 UTC
Since we have no direct way of reproducing the issue, we have to rely on regression testing to verify it has not recurred.
In a run with 4.6.0-98.ci using the same setup parameters, the test no longer fails:
http://post-office.corp.redhat.com/archives/ocs-ci/2020-September/msg00505.html

Verified.

Comment 22 errata-xmlrpc 2020-12-17 06:24:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

