Bug 1971593 - [4.8] [External Mode] [vSphere] Deployment failed due to "noobaa-default-backing-store" not found
Summary: [4.8] [External Mode] [vSphere] Deployment failed due to "noobaa-default-backing-store" not found
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Sébastien Han
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-14 12:02 UTC by Vijay Avuthu
Modified: 2023-09-15 01:09 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-21 15:57:24 UTC
Embargoed:



Description Vijay Avuthu 2021-06-14 12:02:09 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

External cluster deployment (vSphere) is failing with "noobaa-default-backing-store" not found.

Version of all relevant components (if applicable):

openshift installer (4.8.0-0.nightly-2021-06-13-101614)
ocs-registry:4.8.0-416.ci


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes (2/2)

Can this issue be reproduced from the UI?
Not Tried

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install an external mode deployment using ocs-ci


Actual results:

Deployment fails with the error below:

E           ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n openshift-storage get backingstore noobaa-default-backing-store -n openshift-storage -o yaml.
E           Error is Error from server (NotFound): backingstores.noobaa.io "noobaa-default-backing-store" not found
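
For reference, the failing check can be run manually with a single command
(one -n flag is sufficient; the doubled flag in the harness command is
redundant but harmless):

$ oc -n openshift-storage get backingstore noobaa-default-backing-store -o yaml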


Expected results:

Deployment should succeed without any errors.


Additional info:

Job link: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1102//console

Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j017vu1ce33-t4an/j017vu1ce33-t4an_20210614T080457/logs/failed_testcase_ocs_logs_1623662552/deployment_ocs_logs/

Comment 2 Vijay Avuthu 2021-06-14 12:14:08 UTC
> operator log (noobaa-operator-65fdb68cd9-s9m49.log)

time="2021-06-14T10:18:08Z" level=info msg="❌ Not Found: BackingStore \"noobaa-default-backing-store\"\n"
time="2021-06-14T10:18:08Z" level=info msg="CephObjectStoreUser \"noobaa-ceph-objectstore-user\" created. Creating default backing store on ceph objectstore" func=ReconcileDefaultBackingStore sys=openshift-storage/noobaa
time="2021-06-14T10:18:08Z" level=info msg="✅ Exists:  \"noobaa-ceph-objectstore-user\"\n"
time="2021-06-14T10:18:08Z" level=info msg="Ceph objectstore user \"noobaa-ceph-objectstore-user\" is not ready. retry on next reconcile.." sys=openshift-storage/noobaa
time="2021-06-14T10:18:08Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
time="2021-06-14T10:18:08Z" level=warning msg="⏳ Temporary Error: Ceph objectstore user \"noobaa-ceph-objectstore-user\" is not ready" sys=openshift-storage/noobaa
time="2021-06-14T10:18:08Z" level=info msg="UpdateStatus: Done generation 1" sys=openshift-storage/noobaa

Comment 3 Mudit Agarwal 2021-06-15 12:13:21 UTC
Logs from noobaa operator:
------------------------

time="2021-06-14T10:18:07Z" level=info msg="CephObjectStoreUser \"noobaa-ceph-objectstore-user\" created. Creating default backing store on ceph objectstore" func=ReconcileDefaultBackingStore sys=openshift-storage/noobaa
time="2021-06-14T10:18:07Z" level=info msg="✅ Exists:  \"noobaa-ceph-objectstore-user\"\n"
time="2021-06-14T10:18:07Z" level=info msg="Ceph objectstore user \"noobaa-ceph-objectstore-user\" is not ready. retry on next reconcile.." sys=openshift-storage/noobaa
time="2021-06-14T10:18:07Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
time="2021-06-14T10:18:07Z" level=warning msg="â³ Temporary Error: Ceph objectstore user \"noobaa-ceph-objectstore-user\" is not ready" sys=openshift-storage/noobaa
time="2021-06-14T10:18:07Z" level=info msg="RPC Handle: {Op: req, API: server_inter_process_api, Method: update_master_change, Error: <nil>, Params: map[is_master:true]}"

Comment 4 Mudit Agarwal 2021-06-15 13:21:19 UTC
Requesting the Rook team to take an initial look; we have seen similar issues with external mode in the recent past.

Comment 5 Travis Nielsen 2021-06-15 23:10:14 UTC
The object store connection to the external cluster is available according to the Rook operator log [1]:

2021-06-14T10:06:06.220608608Z 2021-06-14 10:06:06.220563 I | op-mon: parsing mon endpoints: dell-r730-018=10.1.8.28:6789,dell-r730-015=10.1.8.25:6789,dell-r730-017=10.1.8.27:6789
2021-06-14T10:06:06.227989039Z 2021-06-14 10:06:06.227966 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found

However, I don't see the CephObjectStore CR in the must-gather, which should have the details about the bucket health check in the CR status. 
@Vijay Can you get the CephObjectStore CR status? If the external connection is not there, the ocs-ci test owner could take a look. If the external object connection is valid, the Noobaa team could look next.

[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j017vu1ce33-t4an/j017vu1ce33-t4an_20210614T080457/logs/failed_testcase_ocs_logs_1623662552/deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1f20a7053eb9d3e1ba9772f7da0b635f3777a9a7201b7fe9a3d186820590dbad/namespaces/openshift-storage/pods/rook-ceph-operator-78485bb655-ntztk/rook-ceph-operator/rook-ceph-operator/logs/current.log
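
For reference, the requested CR status can be pulled with something along
these lines:

$ oc -n openshift-storage get cephobjectstore ocs-external-storagecluster-cephobjectstore -o yaml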

Comment 6 Vijay Avuthu 2021-06-16 13:15:19 UTC
(In reply to Travis Nielsen from comment #5)
> @Vijay Can you get the CephObjectStore CR status? If the external connection
> is not there, the ocs-ci test owner could take a look. If the external
> object connection is valid, the Noobaa team could look next.

I have reproduced the issue again (https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1132//console).

$ oc get CephObjectStore
NAME                                          AGE
ocs-external-storagecluster-cephobjectstore   45m
$

$ oc get cephcluster
NAME                                      DATADIRHOSTPATH   MONCOUNT   AGE   PHASE       MESSAGE                          HEALTH      EXTERNAL
ocs-external-storagecluster-cephcluster                     0          46m   Connected   Cluster connected successfully   HEALTH_OK   true
$

$ oc describe CephObjectStore 
Name:         ocs-external-storagecluster-cephobjectstore
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephObjectStore
Metadata:
  Creation Timestamp:  2021-06-16T12:24:32Z
  Finalizers:
    cephobjectstore.ceph.rook.io
  Generation:  1
  Managed Fields:
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:dataPool:
          .:
          f:compressionMode:
          f:erasureCoded:
            .:
            f:codingChunks:
            f:dataChunks:
          f:mirroring:
          f:quotas:
          f:replicated:
            .:
            f:size:
          f:statusCheck:
            .:
            f:mirror:
        f:gateway:
          .:
          f:externalRgwEndpoints:
          f:instances:
          f:placement:
          f:port:
          f:priorityClassName:
          f:resources:
        f:healthCheck:
          .:
          f:bucket:
            .:
            f:interval:
        f:metadataPool:
          .:
          f:compressionMode:
          f:erasureCoded:
            .:
            f:codingChunks:
            f:dataChunks:
          f:mirroring:
          f:quotas:
          f:replicated:
            .:
            f:size:
          f:statusCheck:
            .:
            f:mirror:
        f:zone:
          .:
          f:name:
    Manager:      ocs-operator
    Operation:    Update
    Time:         2021-06-16T12:24:32Z
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"cephobjectstore.ceph.rook.io":
      f:status:
        .:
        f:info:
          .:
          f:endpoint:
        f:phase:
    Manager:         rook
    Operation:       Update
    Time:            2021-06-16T12:24:53Z
  Resource Version:  30857
  UID:               dff5ade5-c22c-455f-a703-e8d7503a7c6c
Spec:
  Data Pool:
    Compression Mode:  none
    Erasure Coded:
      Coding Chunks:  0
      Data Chunks:    0
    Mirroring:
    Quotas:
    Replicated:
      Size:  0
    Status Check:
      Mirror:
  Gateway:
    External Rgw Endpoints:
      Ip:       10.1.8.28
    Instances:  1
    Placement:
    Port:                 8080
    Priority Class Name:  openshift-user-critical
    Resources:
  Health Check:
    Bucket:
      Interval:  1m0s
  Metadata Pool:
    Compression Mode:  none
    Erasure Coded:
      Coding Chunks:  0
      Data Chunks:    0
    Mirroring:
    Quotas:
    Replicated:
      Size:  0
    Status Check:
      Mirror:
  Zone:
    Name:  
Status:
  Info:
    Endpoint:  http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc:8080
  Phase:       Progressing
Events:        <none>

Comment 7 Travis Nielsen 2021-06-16 22:29:52 UTC
The status doesn't show that the object store has connected yet:

Status:
  Info:
    Endpoint:  http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc:8080
  Phase:       Progressing

Is there an object store in the external cluster? Can you connect to it?
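
One quick way to sanity-check the external RGW endpoint from the CR spec in
comment #6 (10.1.8.28:8080), assuming anonymous access is reachable from the
cluster, is a plain HTTP request; RGW normally answers GET / with an S3 XML
bucket listing:

$ curl -s http://10.1.8.28:8080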

The owner of this test in ocs-ci should really take a look to see whether the test is configured as expected. OCS-CI issues should preferably be opened in https://github.com/red-hat-storage/ocs-ci unless a product issue is identified.

Comment 8 Travis Nielsen 2021-06-18 20:53:45 UTC
After enabling debug logging in the Rook operator, we see the following error showing that the rgw secret was not found:

2021-06-18 20:47:20.205270 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found
2021-06-18 20:47:20.205287 D | ceph-object-store-user-controller: CephObjectStore exists
2021-06-18 20:47:20.205325 I | ceph-object-store-user-controller: CephObjectStore "ocs-external-storagecluster-cephobjectstore" found
2021-06-18 20:47:20.205375 D | ceph-object-store-user-controller: ObjectStore resource not ready in namespace "openshift-storage", retrying in "10s". failed to fetch rgw admin ops api user credentials: Secret "rgw-admin-ops-user" not found
2021-06-18 20:47:20.210006 D | ceph-object-store-user-controller: object store user "openshift-storage/noobaa-ceph-objectstore-user" status updated to "ReconcileFailed"

@Seb This secret should be exported/imported by the external cluster scripts, right? What might the test script be missing here?
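
A minimal way to confirm the secret is absent (name taken from the error
above):

$ oc -n openshift-storage get secret rgw-admin-ops-user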

Comment 9 Sébastien Han 2021-06-21 09:56:55 UTC
@Travis, yes this is handled by the external script.

Vijay, where can I see the output of the create-external-cluster-resources.py script?
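
For context, a typical invocation of the script on the external Ceph cluster
looks something like the following (exact flags vary by version; <pool-name>
is a placeholder, and the endpoint matches the CR spec in comment #6):

$ python3 create-external-cluster-resources.py \
      --rbd-data-pool-name <pool-name> \
      --rgw-endpoint 10.1.8.28:8080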

Comment 12 Red Hat Bugzilla 2023-09-15 01:09:46 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

