Bug 1981849

Summary: [OCS with Multus] OBCs stuck in Pending due to failure in creating object user "rgw-admin-ops-user"
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Sidhant Agrawal <sagrawal>
Component: rook
Assignee: Sébastien Han <shan>
Status: VERIFIED
QA Contact: Sidhant Agrawal <sagrawal>
Severity: urgent
Priority: unspecified
Version: 4.8
CC: muagarwa, nberry, rperiyas, shan
Keywords: AutomationBackLog
Target Release: OCS 4.8.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: v4.8.0-455.ci
Doc Type: No Doc Update
Type: Bug

Description Sidhant Agrawal 2021-07-13 14:37:18 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
On an OCS cluster with Multus enabled, OBC provisioning does not succeed when using the "ocs-storagecluster-ceph-rgw" StorageClass; the OBCs remain stuck in the Pending state.
This issue is observed in both VMware and Baremetal clusters.

$ oc -n openshift-storage get obc
NAME   STORAGE-CLASS                 PHASE     AGE
obc1   openshift-storage.noobaa.io   Bound     3m52s
obc2   ocs-storagecluster-ceph-rgw   Pending   3m38s

The rook-ceph-operator pod logs the following errors:
```
2021-07-13 14:33:13.358266 I | exec: timeout waiting for process radosgw-admin to return. Sending interrupt signal to the process
E0713 14:33:13.360489       8 controller.go:199] error syncing 'openshift-storage/obc2': error provisioning bucket: failed to set admin ops api client: failed to retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": failed to create s3 user. . : signal: interrupt, requeuing
```
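
The failing call can be exercised by hand to confirm that radosgw-admin itself times out on the Multus cluster. A minimal diagnostic sketch, assuming the rook-ceph-tools toolbox deployment has been enabled (the toolbox is not referenced in this report):
```
# Run radosgw-admin from the toolbox pod; on an affected cluster this is
# expected to hang and be interrupted, matching the operator log above.
oc -n openshift-storage exec deploy/rook-ceph-tools -- \
  radosgw-admin user list

# If the admin ops user had been created successfully, it would show up here:
oc -n openshift-storage exec deploy/rook-ceph-tools -- \
  radosgw-admin user info --uid=rgw-admin-ops-user
```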

>> Status of other underlying resources:

$ oc -n openshift-storage get csv,storagecluster,noobaa,backingstore,bucketclass
NAME                                                                    DISPLAY                       VERSION        REPLACES   PHASE
clusterserviceversion.operators.coreos.com/ocs-operator.v4.8.0-452.ci   OpenShift Container Storage   4.8.0-452.ci              Succeeded

NAME                                                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
storagecluster.ocs.openshift.io/ocs-storagecluster   14m   Ready              2021-07-13T14:20:30Z   4.8.0

NAME                      MGMT-ENDPOINTS                   S3-ENDPOINTS                     IMAGE                                                                                                 PHASE   AGE
noobaa.noobaa.io/noobaa   ["https://10.1.161.201:31601"]   ["https://10.1.161.201:31272"]   quay.io/rhceph-dev/mcg-core@sha256:1569f6f70c235385a86a9fdadf632a3a188648de948404d352ed4679685ee8f0   Ready   11m

NAME                                                  TYPE            PHASE   AGE
backingstore.noobaa.io/noobaa-default-backing-store   s3-compatible   Ready   9m40s

NAME                                                PLACEMENT                                                        NAMESPACEPOLICY   PHASE   AGE
bucketclass.noobaa.io/noobaa-default-bucket-class   {"tiers":[{"backingStores":["noobaa-default-backing-store"]}]}                     Ready   9m40s


Version of all relevant components (if applicable):
VMware:
    OCP: 4.8.0-0.nightly-2021-07-09-181248
    OCS: ocs-operator.v4.8.0-452.ci
Bare Metal: 
    OCP: 4.8.0-0.nightly-2021-06-13-101614
    OCS: ocs-operator.v4.8.0-452.ci


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, new OBCs cannot be used because they are stuck in Pending.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes (2/2)

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Deployment itself was blocked until now by Bug 1978722.

Steps to Reproduce:
1. Install OCS operator
2. Create a Storage Cluster with Multus enabled
3. Create an OBC using the "ocs-storagecluster-ceph-rgw" StorageClass (a sample manifest is sketched below)
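
A minimal OBC manifest for step 3 might look like the following; the claim name is hypothetical, the rest follows the report:
```
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: obc2                      # hypothetical claim name
  namespace: openshift-storage
spec:
  generateBucketName: obc2        # prefix for the generated bucket name
  storageClassName: ocs-storagecluster-ceph-rgw
```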


Actual results:
OBC is stuck in a Pending state

Expected results:
OBC is Bound

Comment 6 Sidhant Agrawal 2021-07-19 09:30:06 UTC
Verified using ocs-operator.v4.8.0-455.ci on VMware and BareMetal. OBCs reach the Bound phase when using the "ocs-storagecluster-ceph-rgw" storage class.

VMware
======

$ oc -n openshift-storage get storagecluster -o yaml
...
    network:
      provider: multus
      selectors:
        public: openshift-storage/ocs-public
...
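
The openshift-storage/ocs-public selector above refers to a NetworkAttachmentDefinition. A sketch of what such a definition typically looks like (the CNI type, interface name, and IP range are assumptions; the actual definition is not included in this report):
```
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public
  namespace: openshift-storage
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens192",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.20.0/24"
      }
    }
```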


OBC status:
---
NAME   STORAGE-CLASS                 PHASE   AGE
obc1   ocs-storagecluster-ceph-rgw   Bound   2m55s
obc2   ocs-storagecluster-ceph-rgw   Bound   2m48s
obc3   ocs-storagecluster-ceph-rgw   Bound   2m38s
obc4   openshift-storage.noobaa.io   Bound   2m30s
---


BareMetal
=========

$ oc -n openshift-storage get storagecluster -o yaml
...
    network:
      provider: multus
      selectors:
        cluster: openshift-storage/ocs-cluster
        public: openshift-storage/ocs-public
...

OBC status:
NAMESPACE           NAME      STORAGE-CLASS                 PHASE   AGE
openshift-storage   obc3-os   ocs-storagecluster-ceph-rgw   Bound   63s
openshift-storage   obc4-os   ocs-storagecluster-ceph-rgw   Bound   50s
openshift-storage   obc5      openshift-storage.noobaa.io   Bound   38s
test                obc1      ocs-storagecluster-ceph-rgw   Bound   2m7s
test                obc2      openshift-storage.noobaa.io   Bound   117s