Description of problem (please be as detailed as possible and provide log snippets):
External cluster deployment (vSphere) is failing with "noobaa-default-backing-store" not found.

Version of all relevant components (if applicable):
openshift installer (4.8.0-0.nightly-2021-06-13-101614)
ocs-registry:4.8.0-416.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes (2/2)

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install an external mode deployment using ocs-ci
2.
3.

Actual results:
Deployment fails with the error below:
E  ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n openshift-storage get backingstore noobaa-default-backing-store -n openshift-storage -o yaml.
E  Error is Error from server (NotFound): backingstores.noobaa.io "noobaa-default-backing-store" not found

Expected results:
Deployment should be successful without any errors.

Additional info:
Job link: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1102//console
Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j017vu1ce33-t4an/j017vu1ce33-t4an_20210614T080457/logs/failed_testcase_ocs_logs_1623662552/deployment_ocs_logs/
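For manual triage, the check that ocs-ci performs can be approximated with the commands below (a rough sketch; noobaa-default-backing-store, the NooBaa CR named "noobaa", and noobaa-ceph-objectstore-user are the defaults the operators are expected to create, so their absence is the symptom rather than a misconfiguration):

$ oc -n openshift-storage get backingstore noobaa-default-backing-store -o yaml
$ oc -n openshift-storage get noobaa noobaa -o jsonpath='{.status.phase}{"\n"}'
$ oc -n openshift-storage get cephobjectstoreuser noobaa-ceph-objectstore-user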
> operator log ( noobaa-operator-65fdb68cd9-s9m49.log )

time="2021-06-14T10:18:08Z" level=info msg="❌ Not Found: BackingStore \"noobaa-default-backing-store\"\n"
time="2021-06-14T10:18:08Z" level=info msg="CephObjectStoreUser \"noobaa-ceph-objectstore-user\" created. Creating default backing store on ceph objectstore" func=ReconcileDefaultBackingStore sys=openshift-storage/noobaa
time="2021-06-14T10:18:08Z" level=info msg="✅ Exists: \"noobaa-ceph-objectstore-user\"\n"
time="2021-06-14T10:18:08Z" level=info msg="Ceph objectstore user \"noobaa-ceph-objectstore-user\" is not ready. retry on next reconcile.." sys=openshift-storage/noobaa
time="2021-06-14T10:18:08Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
time="2021-06-14T10:18:08Z" level=warning msg="⏳ Temporary Error: Ceph objectstore user \"noobaa-ceph-objectstore-user\" is not ready" sys=openshift-storage/noobaa
time="2021-06-14T10:18:08Z" level=info msg="UpdateStatus: Done generation 1" sys=openshift-storage/noobaa
Logs from noobaa operator:
------------------------
time="2021-06-14T10:18:07Z" level=info msg="CephObjectStoreUser \"noobaa-ceph-objectstore-user\" created. Creating default backing store on ceph objectstore" func=ReconcileDefaultBackingStore sys=openshift-storage/noobaa
time="2021-06-14T10:18:07Z" level=info msg="✅ Exists: \"noobaa-ceph-objectstore-user\"\n"
time="2021-06-14T10:18:07Z" level=info msg="Ceph objectstore user \"noobaa-ceph-objectstore-user\" is not ready. retry on next reconcile.." sys=openshift-storage/noobaa
time="2021-06-14T10:18:07Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
time="2021-06-14T10:18:07Z" level=warning msg="⏳ Temporary Error: Ceph objectstore user \"noobaa-ceph-objectstore-user\" is not ready" sys=openshift-storage/noobaa
time="2021-06-14T10:18:07Z" level=info msg="RPC Handle: {Op: req, API: server_inter_process_api, Method: update_master_change, Error: <nil>, Params: map[is_master:true]}"
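The NooBaa operator only creates the default backing store once Rook has reconciled the CephObjectStoreUser and published its credentials secret. A quick way to see where that is stuck (a sketch; the rook-ceph-object-user-<store>-<user> secret name follows Rook's usual naming convention and is assumed here):

$ oc -n openshift-storage get cephobjectstoreuser noobaa-ceph-objectstore-user -o jsonpath='{.status.phase}{"\n"}'
$ oc -n openshift-storage get secret rook-ceph-object-user-ocs-external-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user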
Requesting the Rook team to take an initial look; we have seen similar issues with external mode in the recent past.
The object store connection to the external cluster is available according to the Rook operator log [1]:

2021-06-14T10:06:06.220608608Z 2021-06-14 10:06:06.220563 I | op-mon: parsing mon endpoints: dell-r730-018=10.1.8.28:6789,dell-r730-015=10.1.8.25:6789,dell-r730-017=10.1.8.27:6789
2021-06-14T10:06:06.227989039Z 2021-06-14 10:06:06.227966 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found

However, I don't see the CephObjectStore CR in the must-gather, which should have the details about the bucket health check in the CR status.

@Vijay Can you get the CephObjectStore CR status? If the external connection is not there, the ocs-ci test owner could take a look. If the external object connection is valid, the Noobaa team could look next.

[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j017vu1ce33-t4an/j017vu1ce33-t4an_20210614T080457/logs/failed_testcase_ocs_logs_1623662552/deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1f20a7053eb9d3e1ba9772f7da0b635f3777a9a7201b7fe9a3d186820590dbad/namespaces/openshift-storage/pods/rook-ceph-operator-78485bb655-ntztk/rook-ceph-operator/rook-ceph-operator/logs/current.log
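Something like the following should capture the requested status (the CR name is taken from the Rook log above):

$ oc -n openshift-storage get cephobjectstore ocs-external-storagecluster-cephobjectstore -o yaml
$ oc -n openshift-storage get cephobjectstore ocs-external-storagecluster-cephobjectstore -o jsonpath='{.status}{"\n"}'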
(In reply to Travis Nielsen from comment #5)
> The object store connection to the external cluster is available according to the Rook operator log [1]
>
> 2021-06-14T10:06:06.220608608Z 2021-06-14 10:06:06.220563 I | op-mon: parsing mon endpoints: dell-r730-018=10.1.8.28:6789,dell-r730-015=10.1.8.25:6789,dell-r730-017=10.1.8.27:6789
> 2021-06-14T10:06:06.227989039Z 2021-06-14 10:06:06.227966 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found
>
> However, I don't see the CephObjectStore CR in the must-gather, which should have the details about the bucket health check in the CR status.
> @Vijay Can you get the CephObjectStore CR status? If the external connection is not there, the ocs-ci test owner could take a look. If the external object connection is valid, the Noobaa team could look next.
>
> [1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j017vu1ce33-t4an/j017vu1ce33-t4an_20210614T080457/logs/failed_testcase_ocs_logs_1623662552/deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1f20a7053eb9d3e1ba9772f7da0b635f3777a9a7201b7fe9a3d186820590dbad/namespaces/openshift-storage/pods/rook-ceph-operator-78485bb655-ntztk/rook-ceph-operator/rook-ceph-operator/logs/current.log

I have reproduced the issue again ( https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1132//console ).

$ oc get CephObjectStore
NAME                                          AGE
ocs-external-storagecluster-cephobjectstore   45m
$
$ oc get cephcluster
NAME                                       DATADIRHOSTPATH   MONCOUNT   AGE   PHASE       MESSAGE                          HEALTH      EXTERNAL
ocs-external-storagecluster-cephcluster                      0          46m   Connected   Cluster connected successfully   HEALTH_OK   true
$
$ oc describe CephObjectStore
Name:         ocs-external-storagecluster-cephobjectstore
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephObjectStore
Metadata:
  Creation Timestamp:  2021-06-16T12:24:32Z
  Finalizers:
    cephobjectstore.ceph.rook.io
  Generation:  1
  Managed Fields:
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:dataPool:
          .:
          f:compressionMode:
          f:erasureCoded:
            .:
            f:codingChunks:
            f:dataChunks:
          f:mirroring:
          f:quotas:
          f:replicated:
            .:
            f:size:
          f:statusCheck:
            .:
            f:mirror:
        f:gateway:
          .:
          f:externalRgwEndpoints:
          f:instances:
          f:placement:
          f:port:
          f:priorityClassName:
          f:resources:
        f:healthCheck:
          .:
          f:bucket:
            .:
            f:interval:
        f:metadataPool:
          .:
          f:compressionMode:
          f:erasureCoded:
            .:
            f:codingChunks:
            f:dataChunks:
          f:mirroring:
          f:quotas:
          f:replicated:
            .:
            f:size:
          f:statusCheck:
            .:
            f:mirror:
        f:zone:
          .:
          f:name:
    Manager:      ocs-operator
    Operation:    Update
    Time:         2021-06-16T12:24:32Z
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"cephobjectstore.ceph.rook.io":
      f:status:
        .:
        f:info:
          .:
          f:endpoint:
        f:phase:
    Manager:         rook
    Operation:       Update
    Time:            2021-06-16T12:24:53Z
  Resource Version:  30857
  UID:               dff5ade5-c22c-455f-a703-e8d7503a7c6c
Spec:
  Data Pool:
    Compression Mode:  none
    Erasure Coded:
      Coding Chunks:  0
      Data Chunks:    0
    Mirroring:
    Quotas:
    Replicated:
      Size:  0
    Status Check:
      Mirror:
  Gateway:
    External Rgw Endpoints:
      Ip:                 10.1.8.28
    Instances:            1
    Placement:
    Port:                 8080
    Priority Class Name:  openshift-user-critical
    Resources:
  Health Check:
    Bucket:
      Interval:  1m0s
  Metadata Pool:
    Compression Mode:  none
    Erasure Coded:
      Coding Chunks:  0
      Data Chunks:    0
    Mirroring:
    Quotas:
    Replicated:
      Size:  0
    Status Check:
      Mirror:
  Zone:
    Name:
Status:
  Info:
    Endpoint:  http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc:8080
  Phase:       Progressing
Events:        <none>
The status doesn't show that the object store has connected yet:

Status:
  Info:
    Endpoint:  http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc:8080
  Phase:       Progressing

Is there an object store in the external cluster? Can you connect to it? The owner of this test in ocs-ci should take a look to see whether the test is configured as expected. OCS-CI issues are preferably opened in https://github.com/red-hat-storage/ocs-ci unless a product issue is identified.
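A basic connectivity check against the external RGW endpoint from the CR above (10.1.8.28:8080) can rule out a plain networking problem — for example, from any host that can reach the external cluster (any HTTP response, typically RGW's anonymous XML bucket listing, means the endpoint is up):

$ curl -sv http://10.1.8.28:8080

Running the same curl from a pod inside the OCP cluster would additionally confirm that the OCP nodes themselves can reach the endpoint.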
After enabling debug logging in the rook operator, we see the following error with the rgw secret not found:

2021-06-18 20:47:20.205270 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found
2021-06-18 20:47:20.205287 D | ceph-object-store-user-controller: CephObjectStore exists
2021-06-18 20:47:20.205325 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found
2021-06-18 20:47:20.205375 D | ceph-object-store-user-controller: ObjectStore resource not ready in namespace "openshift-storage", retrying in "10s". failed to fetch rgw admin ops api user credentials: Secret "rgw-admin-ops-user" not found
2021-06-18 20:47:20.210006 D | ceph-object-store-user-controller: object store user "openshift-storage/noobaa-ceph-objectstore-user" status updated to "ReconcileFailed"

@Seb This secret should be exported/imported by the external cluster scripts, right? What might the test script be missing here?
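This can be confirmed on the OCP side: in external mode Rook does not create rgw-admin-ops-user itself, it expects it to be part of the imported external-cluster resources (a sketch; the rook-ceph-external-cluster-details secret and its external_cluster_details key are what OCS/ocs-ci normally use to store the script output, assumed here):

$ oc -n openshift-storage get secret rgw-admin-ops-user
$ oc -n openshift-storage get secret rook-ceph-external-cluster-details -o jsonpath='{.data.external_cluster_details}' | base64 -d | python3 -m json.tool | grep -A3 -i rgw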
@Travis, yes this is handled by the external script. Vijay, where can I see the output of the create-external-cluster-resources.py script?
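For context, the rgw admin ops user entry should only show up in the script output when the script was invoked with an RGW endpoint, roughly like this on the external cluster (illustrative values; only the --rbd-data-pool-name and --rgw-endpoint flags are shown, other flags omitted):

# on the external RHCS cluster
$ python3 create-external-cluster-resources.py --rbd-data-pool-name <rbd-pool> --rgw-endpoint 10.1.8.28:8080

If --rgw-endpoint was not passed in this run, that would explain the missing rgw-admin-ops-user secret.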
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days