Description of problem (please be detailed as possible and provide log snippets):

In an OCS external mode cluster, the Cluster Object Store is in an unhealthy state after a fresh deployment. ocs-external-storagecluster-cephobjectstore remains in the Failure phase, and the rook-ceph-operator pod logs contain the following error messages:

---
2021-10-19 18:57:41.157904 I | ceph-object-controller: reconciling external object store
2021-10-19 18:57:41.157936 I | ceph-object-controller: reconciling object store service
2021-10-19 18:57:41.157974 D | op-k8sutil: creating service rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore
2021-10-19 18:57:41.192083 D | op-k8sutil: updating service rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore
2021-10-19 18:57:41.204630 I | ceph-object-controller: ceph object store gateway service running at 172.30.231.51
2021-10-19 18:57:41.204657 I | ceph-object-controller: reconciling external object store endpoint
2021-10-19 18:57:41.204662 I | ceph-object-controller: reconciling external object store service
2021-10-19 18:57:41.204711 D | op-k8sutil: creating endpoint "rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore". [{[{10.70.39.7 <nil> nil}] [] [{http 8080 TCP <nil>}]}]
2021-10-19 18:57:41.226858 D | ceph-object-controller: object store "openshift-storage/ocs-external-storagecluster-cephobjectstore" status updated to "Failure"
2021-10-19 18:57:41.226902 E | ceph-object-controller: failed to reconcile CephObjectStore "openshift-storage/ocs-external-storagecluster-cephobjectstore". failed to create object store deployments: failed to start rgw health checker for CephObjectStore "openshift-storage/ocs-external-storagecluster-cephobjectstore", will re-reconcile: failed to create bucket checker for CephObjectStore "openshift-storage/ocs-external-storagecluster-cephobjectstore": failed to build admin ops API connection: endpoint not set
2021-10-19 18:57:41.226912 D | op-k8sutil: Not Reporting Event because event is same as the old one:openshift-storage:ocs-external-storagecluster-cephobjectstore Warning:ReconcileFailed:failed to create object store deployments: failed to start rgw health checker for CephObjectStore "openshift-storage/ocs-external-storagecluster-cephobjectstore", will re-reconcile: failed to create bucket checker for CephObjectStore "openshift-storage/ocs-external-storagecluster-cephobjectstore": failed to build admin ops API connection: endpoint not set
---

Version of all relevant components (if applicable):
OCP: 4.9.0-0.nightly-2021-10-19-063835
ODF: 4.9.0-193.ci
External RHCS: ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
"Object Service" is reported as degraded, and the following alert can be seen in the dashboard:
"Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health or RGW connection."

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes, 2/2

Can this issue reproduce from the UI?
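A minimal diagnostic sketch for the "endpoint not set" error above, assuming the CephObjectStore CRD in this Rook version exposes the external endpoint under spec.gateway.externalRgwEndpoints (that field path is an assumption and may differ by version); namespace and resource name are taken from the log:

# Should print the configured external RGW endpoint list (e.g. an entry for 10.70.39.7);
# an empty result would be consistent with the "endpoint not set" failure.
$ oc -n openshift-storage get cephobjectstore ocs-external-storagecluster-cephobjectstore \
    -o jsonpath='{.spec.gateway.externalRgwEndpoints}{"\n"}'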
If this is a regression, please provide more details to justify this:
Yes, this issue was not observed in 4.8

Steps to Reproduce:
1. Install ODF in external mode
2. Verify Object Service status in the dashboard

Actual results:
Object Service is degraded and ocs-external-storagecluster-cephobjectstore is in the Failure phase

Expected results:
Object Service status and the cephobjectstore should be healthy

Additional info:

$ oc -n openshift-storage get cephobjectstore -o yaml | grep phase
    phase: Failure

Status of noobaa, backingstore, bucketclass:

NAME                      MGMT-ENDPOINTS                  S3-ENDPOINTS                    IMAGE                                                                                                 PHASE   AGE
noobaa.noobaa.io/noobaa   ["https://10.1.161.61:32573"]   ["https://10.1.161.63:30956"]   quay.io/rhceph-dev/mcg-core@sha256:2f641e8a0b9183e72be800920dd531466717807225b293a37dcfa14465952273   Ready   3h2m

NAME                                                  TYPE            PHASE   AGE
backingstore.noobaa.io/noobaa-default-backing-store   s3-compatible   Ready   176m

NAME                                                PLACEMENT                                                         NAMESPACEPOLICY   PHASE   AGE
bucketclass.noobaa.io/noobaa-default-bucket-class   {"tiers":[{"backingStores":["noobaa-default-backing-store"]}]}                      Ready   176m
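To narrow down where the reconcile stops, a short sketch using only standard oc/kubectl calls (resource names taken from the report) that checks the service and endpoints the operator says it created, plus the ReconcileFailed events:

# Service and endpoints the operator reports creating (172.30.231.51 / 10.70.39.7 in the log)
$ oc -n openshift-storage get service,endpoints rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore -o wide

# Reconcile events for the object store, including the ReconcileFailed warning
$ oc -n openshift-storage get events \
    --field-selector involvedObject.name=ocs-external-storagecluster-cephobjectstore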
Jiffin this is merged upstream now, please open a downstream backport PR, thanks
(In reply to Travis Nielsen from comment #6)
> Jiffin this is merged upstream now, please open a downstream backport PR,
> thanks

Backport posted
Tested Version:
----------------
OCP - 4.9.0-0.nightly-2021-11-17-044439
ODF - odf-operator.v4.9.0
OCS - ocs-operator.v4.9.0

Installed ODF in external mode via Jenkins; I see that the ceph object store is in the Connected state.

$ oc -n openshift-storage get cephobjectstore
NAME                                          AGE
ocs-external-storagecluster-cephobjectstore   168m

$ oc -n openshift-storage get cephobjectstore -o yaml | grep phase
          f:phase: {}
    phase: Connected

From rook-ceph-operator logs:
-------------------------------
2021-11-17 09:10:29.576422 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found
2021-11-17 09:10:29.576470 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found
2021-11-17 09:10:29.576504 I | ceph-object-store-user-controller: creating ceph object user "noobaa-ceph-objectstore-user" in namespace "openshift-storage"
2021-11-17 09:10:29.585695 E | ceph-object-store-user-controller: failed to reconcile failed to create/update object store user "noobaa-ceph-objectstore-user": failed to get details from ceph object user "noobaa-ceph-objectstore-user": Get "http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc:8080/admin/user?display-name=my%20display%20name&format=json&max-buckets=1000&uid=noobaa-ceph-objectstore-user": dial tcp: lookup rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc on 172.30.0.10:53: no such host
2021-11-17 09:10:29.991214 I | ceph-object-controller: reconciling external object store
2021-11-17 09:10:29.991243 I | ceph-object-controller: reconciling object store service
2021-11-17 09:10:30.009636 I | ceph-object-controller: ceph object store gateway service running at 172.30.116.24
2021-11-17 09:10:30.009655 I | ceph-object-controller: reconciling external object store endpoint
2021-11-17 09:10:30.009659 I | ceph-object-controller: reconciling external object store service
2021-11-17 09:10:30.017582 I | ceph-object-controller: ceph rgw status check interval for object store "ocs-external-storagecluster-cephobjectstore" is "1m0s"
2021-11-17 09:10:30.017605 I | ceph-object-controller: starting rgw health checker for CephObjectStore "openshift-storage/ocs-external-storagecluster-cephobjectstore"
2021-11-17 09:10:30.375359 I | op-mon: parsing mon endpoints: catalina=10.70.39.4:6789
2021-11-17 09:10:30.375395 I | op-mon: updating obsolete maxMonID 0 to actual value 24642648160
2021-11-17 09:10:31.393900 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found
2021-11-17 09:10:31.393933 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found
2021-11-17 09:10:31.393969 I | ceph-object-store-user-controller: creating ceph object user "noobaa-ceph-objectstore-user" in namespace "openshift-storage"
2021-11-17 09:10:32.567012 I | ceph-object-store-user-controller: retrieved existing ceph object user "noobaa-ceph-objectstore-user"
2021-11-17 09:10:32.573892 I | ceph-spec: created ceph *v1.Secret object "rook-ceph-object-user-ocs-external-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user"

Initially there were some "not found" errors, but in the later stages "ocs-external-storagecluster-cephobjectstore" is found. No errors are seen in the UI.

Based on the above observations, moving this BZ to the Verified state.

Attaching a clean UI screenshot for reference.
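For anyone re-running this verification, a short sketch (same resources as above; the secret name is the one the operator log reports creating) that reads the phase directly and confirms the noobaa object store user secret exists:

# Phase straight from status; expected "Connected" on a healthy external cluster
$ oc -n openshift-storage get cephobjectstore ocs-external-storagecluster-cephobjectstore \
    -o jsonpath='{.status.phase}{"\n"}'

# Object user secret created at the end of the operator log above
$ oc -n openshift-storage get secret \
    rook-ceph-object-user-ocs-external-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user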