Bug 2015657

Summary: [External Mode] Cluster Object Store remains in unhealthy state due to failed to build admin ops API connection
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Sidhant Agrawal <sagrawal>
Component: rook
Assignee: Jiffin <jthottan>
Status: CLOSED CURRENTRELEASE
QA Contact: Shrivaibavi Raghaventhiran <sraghave>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.9
CC: jthottan, madam, muagarwa, ocs-bugs, odf-bz-bot, rperiyas, shan, tnielsen
Target Milestone: ---
Keywords: AutomationBackLog, Regression
Target Release: ODF 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: v4.9.0-210.ci
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-01-07 17:46:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---

Description Sidhant Agrawal 2021-10-19 19:35:04 UTC
Description of problem (please be as detailed as possible and provide log snippets):
In an OCS external mode cluster, the Cluster Object Store is in an unhealthy state after a fresh deployment.
ocs-external-storagecluster-cephobjectstore remains in the Failure phase, and the rook-ceph-operator pod log contains the following error messages:

---

2021-10-19 18:57:41.157904 I | ceph-object-controller: reconciling external object store
2021-10-19 18:57:41.157936 I | ceph-object-controller: reconciling object store service
2021-10-19 18:57:41.157974 D | op-k8sutil: creating service rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore
2021-10-19 18:57:41.192083 D | op-k8sutil: updating service rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore
2021-10-19 18:57:41.204630 I | ceph-object-controller: ceph object store gateway service running at 172.30.231.51
2021-10-19 18:57:41.204657 I | ceph-object-controller: reconciling external object store endpoint
2021-10-19 18:57:41.204662 I | ceph-object-controller: reconciling external object store service
2021-10-19 18:57:41.204711 D | op-k8sutil: creating endpoint "rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore". [{[{10.70.39.7  <nil> nil}] [] [{http 8080 TCP <nil>}]}]
2021-10-19 18:57:41.226858 D | ceph-object-controller: object store "openshift-storage/ocs-external-storagecluster-cephobjectstore" status updated to "Failure"
2021-10-19 18:57:41.226902 E | ceph-object-controller: failed to reconcile CephObjectStore "openshift-storage/ocs-external-storagecluster-cephobjectstore". failed to create object store deployments: failed to start rgw health checker for CephObjectStore "openshift-storage/ocs-external-storagecluster-cephobjectstore", will re-reconcile: failed to create bucket checker for CephObjectStore "openshift-storage/ocs-external-storagecluster-cephobjectstore": failed to build admin ops API connection: endpoint not set
2021-10-19 18:57:41.226912 D | op-k8sutil: Not Reporting Event because event is same as the old one:openshift-storage:ocs-external-storagecluster-cephobjectstore Warning:ReconcileFailed:failed to create object store deployments: failed to start rgw health checker for CephObjectStore "openshift-storage/ocs-external-storagecluster-cephobjectstore", will re-reconcile: failed to create bucket checker for CephObjectStore "openshift-storage/ocs-external-storagecluster-cephobjectstore": failed to build admin ops API connection: endpoint not set

---
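
For triage (not part of the original report), a hedged sketch of how the external RGW endpoint wiring can be inspected. The field path spec.gateway.externalRgwEndpoints is an assumption based on the upstream Rook CephObjectStore spec for external mode; the resource names are taken from the log above. Adjust both to your environment.

# Assumed field path; the admin ops client needs an RGW endpoint to be set on the CR
$ oc -n openshift-storage get cephobjectstore ocs-external-storagecluster-cephobjectstore -o jsonpath='{.spec.gateway.externalRgwEndpoints}{"\n"}'

# Endpoints object the operator creates for the RGW service (10.70.39.7:8080 in the log above)
$ oc -n openshift-storage get endpoints rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore -o yaml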


Version of all relevant components (if applicable):
OCP: 4.9.0-0.nightly-2021-10-19-063835
ODF: 4.9.0-193.ci
External RHCS: ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
"Object Service" will be reported as degraded and the following alert can be seen in the dashboard:
"Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health or RGW connection."

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes, 2/2

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
Yes, this issue was not observed in 4.8

Steps to Reproduce:
1. Install ODF in external mode
2. Verify Object Service status in the dashboard (a CLI check is sketched below)
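
As a CLI alternative to the dashboard check in step 2, a minimal sketch (not from the original report; the CR name is assumed to match this cluster):

$ oc -n openshift-storage get cephobjectstore ocs-external-storagecluster-cephobjectstore -o jsonpath='{.status.phase}{"\n"}'

On an affected cluster this should print Failure, matching the output shown under Additional info.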

Actual results:
Object Service is degraded and ocs-external-storagecluster-cephobjectstore is in the Failure phase

Expected results:
Object Service status and cephobjectstore should be healthy

Additional info:

$ oc -n openshift-storage get cephobjectstore -o yaml | grep phase
    phase: Failure
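
The ReconcileFailed warning from the operator log is also recorded as an event on the CR; a hedged way to view it (events may have aged out by the time you look):

$ oc -n openshift-storage describe cephobjectstore ocs-external-storagecluster-cephobjectstore
$ oc -n openshift-storage get events --field-selector involvedObject.kind=CephObjectStore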

Status of noobaa, backingstore, and bucketclass:

NAME                      MGMT-ENDPOINTS                  S3-ENDPOINTS                    IMAGE                                                                                                 PHASE   AGE
noobaa.noobaa.io/noobaa   ["https://10.1.161.61:32573"]   ["https://10.1.161.63:30956"]   quay.io/rhceph-dev/mcg-core@sha256:2f641e8a0b9183e72be800920dd531466717807225b293a37dcfa14465952273   Ready   3h2m

NAME                                                  TYPE            PHASE   AGE
backingstore.noobaa.io/noobaa-default-backing-store   s3-compatible   Ready   176m

NAME                                                PLACEMENT                                                        NAMESPACEPOLICY   PHASE   AGE
bucketclass.noobaa.io/noobaa-default-bucket-class   {"tiers":[{"backingStores":["noobaa-default-backing-store"]}]}                     Ready   176m

Comment 6 Travis Nielsen 2021-10-25 15:14:57 UTC
Jiffin this is merged upstream now, please open a downstream backport PR, thanks

Comment 7 Jiffin 2021-10-26 09:56:17 UTC
(In reply to Travis Nielsen from comment #6)
> Jiffin this is merged upstream now, please open a downstream backport PR,
> thanks

Backport posted

Comment 8 Shrivaibavi Raghaventhiran 2021-11-18 07:40:58 UTC
Tested Version:
----------------
OCP - 4.9.0-0.nightly-2021-11-17-044439
ODF - odf-operator.v4.9.0
OCS - ocs-operator.v4.9.0

Installed ODF in external mode via Jenkins.

I see that the ceph object store is in the Connected state:

$ oc -n openshift-storage get cephobjectstore
NAME                                          AGE
ocs-external-storagecluster-cephobjectstore   168m

$ oc -n openshift-storage get cephobjectstore -o yaml | grep phase
          f:phase: {}
    phase: Connected

From rook-ceph-operator logs
-------------------------------
2021-11-17 09:10:29.576422 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found
2021-11-17 09:10:29.576470 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found
2021-11-17 09:10:29.576504 I | ceph-object-store-user-controller: creating ceph object user "noobaa-ceph-objectstore-user" in namespace "openshift-storage"
2021-11-17 09:10:29.585695 E | ceph-object-store-user-controller: failed to reconcile failed to create/update object store user "noobaa-ceph-objectstore-user": failed to get details from ceph object user "noobaa-ceph-objectstore-user": Get "http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc:8080/admin/user?display-name=my%20display%20name&format=json&max-buckets=1000&uid=noobaa-ceph-objectstore-user": dial tcp: lookup rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc on 172.30.0.10:53: no such host
2021-11-17 09:10:29.991214 I | ceph-object-controller: reconciling external object store
2021-11-17 09:10:29.991243 I | ceph-object-controller: reconciling object store service
2021-11-17 09:10:30.009636 I | ceph-object-controller: ceph object store gateway service running at 172.30.116.24
2021-11-17 09:10:30.009655 I | ceph-object-controller: reconciling external object store endpoint
2021-11-17 09:10:30.009659 I | ceph-object-controller: reconciling external object store service
2021-11-17 09:10:30.017582 I | ceph-object-controller: ceph rgw status check interval for object store "ocs-external-storagecluster-cephobjectstore" is "1m0s"
2021-11-17 09:10:30.017605 I | ceph-object-controller: starting rgw health checker for CephObjectStore "openshift-storage/ocs-external-storagecluster-cephobjectstore"
2021-11-17 09:10:30.375359 I | op-mon: parsing mon endpoints: catalina=10.70.39.4:6789
2021-11-17 09:10:30.375395 I | op-mon: updating obsolete maxMonID 0 to actual value 24642648160
2021-11-17 09:10:31.393900 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found
2021-11-17 09:10:31.393933 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found
2021-11-17 09:10:31.393969 I | ceph-object-store-user-controller: creating ceph object user "noobaa-ceph-objectstore-user" in namespace "openshift-storage"
2021-11-17 09:10:32.567012 I | ceph-object-store-user-controller: retrieved existing ceph object user "noobaa-ceph-objectstore-user"
2021-11-17 09:10:32.573892 I | ceph-spec: created ceph *v1.Secret object "rook-ceph-object-user-ocs-external-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user"
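
For reference, a hedged sketch of how the operator log and the resulting object-user secret above can be checked; the rook-ceph-operator deployment name is an assumption, while the secret name is taken from the log line above:

$ oc -n openshift-storage logs deploy/rook-ceph-operator | grep -E 'ceph-object-controller|ceph-object-store-user-controller'
$ oc -n openshift-storage get secret rook-ceph-object-user-ocs-external-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user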


Initially there were some "NOT FOUND" errors, but I see that in the later stages "ocs-external-storagecluster-cephobjectstore" is found.

No errors found in the UI.

Based on the above observations, moving this BZ to the Verified state.

Attaching a clean UI screenshot for reference.