Bug 1873864 - Noobaa: On a baremetal RHCOS cluster, some backingstores are stuck in PROGRESSING state with INVALID_ENDPOINT TemporaryError
Summary: Noobaa: On a baremetal RHCOS cluster, some backingstores are stuck in PROGRESSING state with INVALID_ENDPOINT TemporaryError
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Danny
QA Contact: Ben Eli
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-30 15:45 UTC by Ben Eli
Modified: 2020-12-17 06:24 UTC
CC: 9 users

Fixed In Version: v4.6.0-75.ci
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-17 06:24:01 UTC
Embargoed:




Links:
  GitHub noobaa/noobaa-operator pull 405 (closed): keep requeueing backing-store for 5 minutes if getting INVALID_ENDPOINT (last updated 2020-12-18 08:45:47 UTC)
  GitHub noobaa/noobaa-operator pull 409 (closed): Backport to 5.6 (last updated 2020-12-18 08:45:47 UTC)
  Red Hat Product Errata RHSA-2020:5605 (last updated 2020-12-17 06:24:26 UTC)

Comment 4 Danny 2020-08-31 13:02:22 UTC
noobaa-core tries to validate the endpoint details of the backingstore by performing listBuckets on the target. In this case it gets an UnknownEndpoint error.

Is it possible that this cluster has limited access to the internet and can't reach amazonaws.com?
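For context, the check described above can be modeled with a short Go sketch (assumptions: AWS SDK for Go v2; the endpoint URL, region, and function name are illustrative and not noobaa-core's actual code, which is Node.js):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// validateEndpoint performs a ListBuckets call against the target,
// mirroring the connection check described in comment 4. A DNS or
// routing failure surfaces as a transport error ("UnknownEndpoint"
// style) rather than an S3 error code.
func validateEndpoint(ctx context.Context, endpoint, region string) error {
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion(region))
	if err != nil {
		return fmt.Errorf("loading AWS config: %w", err)
	}
	client := s3.NewFromConfig(cfg, func(o *s3.Options) {
		o.BaseEndpoint = aws.String(endpoint) // target endpoint under test
	})
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	if _, err := client.ListBuckets(ctx, &s3.ListBucketsInput{}); err != nil {
		return fmt.Errorf("ListBuckets against %s failed: %w", endpoint, err)
	}
	return nil
}

func main() {
	if err := validateEndpoint(context.Background(), "https://s3.us-east-1.amazonaws.com", "us-east-1"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("endpoint reachable")
}

A momentary networking blip during this call would produce the kind of transient failure described in the next comment.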

Comment 5 Danny 2020-08-31 13:51:48 UTC
Other AWS backing-stores are working properly, so it is not a general connectivity issue.
It looks like there was a momentary networking issue when trying to validate the connection.
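The fix tracked in the linked PR ("keep requeueing backing-store for 5 minutes if getting INVALID_ENDPOINT") retries instead of failing fast on such blips. A minimal controller-runtime sketch of that requeue pattern, with the five-minute window taken from the PR title (the function, error check, and intervals are illustrative, not noobaa-operator's actual code):

package main

import (
	"errors"
	"fmt"
	"strings"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

const retryWindow = 5 * time.Minute // per the PR title: keep requeueing for 5 minutes

// requeueOnInvalidEndpoint decides how a reconcile loop should react to a
// validation error. firstSeen is when INVALID_ENDPOINT was first observed
// for this backing store (hypothetical bookkeeping for the sketch).
func requeueOnInvalidEndpoint(err error, firstSeen time.Time) (ctrl.Result, error) {
	if err == nil {
		return ctrl.Result{}, nil
	}
	// Hypothetical string match; a real implementation would inspect the RPC error code.
	if strings.Contains(err.Error(), "INVALID_ENDPOINT") && time.Since(firstSeen) < retryWindow {
		// Treat it as transient: requeue and try again shortly instead of
		// leaving the backing store stuck as rejected.
		return ctrl.Result{RequeueAfter: 15 * time.Second}, nil
	}
	// Retry window exhausted, or a different error: surface it.
	return ctrl.Result{}, err
}

func main() {
	res, err := requeueOnInvalidEndpoint(errors.New("RPC: INVALID_ENDPOINT"), time.Now())
	fmt.Printf("result=%+v err=%v\n", res, err)
}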

Comment 10 Ben Eli 2020-09-03 14:31:23 UTC
I tried to reproduce the bug twice in a similar environment, but failed to do so.

I marked the bug as a regression because this is the first time we're seeing it; we did not run into it in the past.
It may simply be a rare bug, but I went with the dry definition of "it used to work, and now it doesn't".

Comment 11 swilson 2020-09-03 16:26:53 UTC
Having the same issue on RHOCS baremetal:

...
noobaa-core-0 Running
noobaa-db-0   Init:0/1
noobaa-operator-5ff7c8d94-d8fkv   Running
...
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-85f9c95dbwx5 Running
...

Other notes:
OpenShift Container Storage 4.4.2 shows as Installing.

oc logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-85f9c95dbwx5 produces:
debug 2020-09-03 15:52:21.961 7f95bbad4700  1 ====== starting new request req=0x7f96ca4a3890 =====
debug 2020-09-03 15:52:21.961 7f95bbad4700  1 ====== req done req=0x7f96ca4a3890 op status=0 http_status=200 latency=0s ======

oc logs noobaa-operator-5ff7c8d94-d8fkv produces:
time="2020-09-03T15:53:39Z" level=info msg="✈️  RPC: auth.read_auth() Request: <nil>"

oc logs noobaa-core-0 produces:
Sep-3 15:55:43.598 [/16] [ERROR] core.util.mongo_client:: _connect: initial connect failed, will retry failed to connect to server [noobaa-db-0.noobaa-db:27017] on first connect [Error: getaddrinfo ENOTFOUND noobaa-db-0.noobaa-db
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:66:26) {
  name: 'MongoNetworkError',
  errorLabels: [Array],
  [Symbol(mongoErrorContextSymbol)]: {}
}]

oc get noobaa noobaa -n openshift-storage -o yaml produces: 

 message: 'RPC: connection (0xc001b40730) already closed &{RPC:0xc00016f0e0 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/
      State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0}
      ReconnectDelay:3s}'
    reason: TemporaryError

Comment 12 Nimrod Becker 2020-09-03 16:31:24 UTC
This doesn't seem to be the same issue; please file a new BZ.

Comment 13 swilson 2020-09-03 17:47:32 UTC
Thanks. Rebooting the entire cluster resolved the problem. It must have been an IP assignment issue with the pods.
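The getaddrinfo ENOTFOUND in comment 11 means the noobaa-db headless-service name was not resolving inside the noobaa-core pod, which is consistent with a DNS/IP assignment problem. A quick Go sketch of the equivalent check, to be run from a pod in the same namespace (the host name is taken from the log above; everything else is illustrative):

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Pod DNS name behind the noobaa-db headless service, as seen in
	// the mongo_client error in comment 11.
	host := "noobaa-db-0.noobaa-db"
	addrs, err := net.LookupHost(host)
	if err != nil {
		// This is the failure mode matching getaddrinfo ENOTFOUND.
		log.Fatalf("lookup %s failed: %v", host, err)
	}
	fmt.Printf("%s resolves to %v\n", host, addrs)
}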

Comment 18 Ben Eli 2020-09-30 09:38:52 UTC
Since we have no direct way of reproducing the issue, we have to rely on regression testing to verify it has not recurred.
In a run with 4.6.0-98.ci using the same setup parameters, the test no longer fails:
http://post-office.corp.redhat.com/archives/ocs-ci/2020-September/msg00505.html

Verified.

Comment 22 errata-xmlrpc 2020-12-17 06:24:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

