Bug 1963134 - [4.8] [External Mode] [vSphere] Deployment failure due to storagecluster stuck in Progressing state
Summary: [4.8] [External Mode] [vSphere] Deployment failure due to storagecluster stuck in Progressing state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.8.0
Assignee: arun kumar mohan
QA Contact: Sidhant Agrawal
URL:
Whiteboard:
Depends On:
Blocks: 1908238 1925217
 
Reported: 2021-05-21 14:20 UTC by Sidhant Agrawal
Modified: 2021-08-03 18:17 UTC
CC: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-03 18:16:39 UTC
Embargoed:




Links
Github openshift ocs-operator pull 1209 (open): GateWay spec's 'Instances' field should be minimum 1 - last updated 2021-06-04 16:16:12 UTC
Github openshift ocs-operator pull 1210 (open): Bug 1963134: [release-4.8] GateWay spec's 'Instances' field should be minimum 1 - last updated 2021-06-04 17:45:49 UTC
Red Hat Product Errata RHBA-2021:3003 - last updated 2021-08-03 18:17:03 UTC

Description Sidhant Agrawal 2021-05-21 14:20:06 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
In OCS 4.8 External mode, deployment fails because the storagecluster is stuck in the Progressing state.

Following are the observations from the cluster:

1) storagecluster stuck in Progressing state
 
$ oc get storagecluster
NAME                          AGE    PHASE         EXTERNAL   CREATED AT             VERSION
ocs-external-storagecluster   131m   Progressing   true       2021-05-21T11:30:10Z   4.8.0
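
To see which reconcile condition is holding the cluster in Progressing, the status conditions can be dumped with something like the following (the jsonpath expression is illustrative):

$ oc get storagecluster ocs-external-storagecluster \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'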

2) Even though the storagecluster is in the Progressing state, the CSV is in the Succeeded phase and the operator pods are 1/1 Running

$ oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.8.0-399.ci   OpenShift Container Storage   4.8.0-399.ci              Succeeded

$ oc get pod | grep operator
noobaa-operator-5d46769fdc-htzlk                   1/1     Running   0          140m
ocs-operator-7f64b96dd5-9794h                      1/1     Running   0          140m
rook-ceph-operator-77bd5678b9-h7kz6                1/1     Running   0          140m

3) RGW is present in the external RHCS cluster and its details were passed when creating the JSON output from the exporter script.
   OCS creates a backingstore of `pv-pool` type instead of `s3-compatible`

$ oc get backingstore
NAME                           TYPE      PHASE   AGE
noobaa-default-backing-store   pv-pool   Ready   141m
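
The type can also be read straight from the BackingStore spec (assuming the usual noobaa.io/v1alpha1 field layout):

$ oc get backingstore noobaa-default-backing-store -o jsonpath='{.spec.type}'
pv-pool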

The RGW StorageClass is also present in the cluster:
$ oc get sc
NAME                                   PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ocs-external-storagecluster-ceph-rbd   openshift-storage.rbd.csi.ceph.com      Delete          Immediate           true                   145m
ocs-external-storagecluster-ceph-rgw   openshift-storage.ceph.rook.io/bucket   Delete          Immediate           false                  145m
ocs-external-storagecluster-cephfs     openshift-storage.cephfs.csi.ceph.com   Delete          Immediate           true                   145m
openshift-storage.noobaa.io            openshift-storage.noobaa.io/obc         Delete          Immediate           false                  143m
thin                                   kubernetes.io/vsphere-volume            Delete          Immediate           false                  168m

4) cephobjectstore and cephobjectstoreuser are absent

$ oc get cephobjectstore -A
No resources found

$ oc get cephobjectstoreuser -A
No resources found

5) An OBC created using the `ocs-external-storagecluster-ceph-rgw` StorageClass is stuck in Pending
$ oc get obc
NAME   STORAGE-CLASS                          PHASE     AGE
obc1   openshift-storage.noobaa.io            Bound     52m
obc2   ocs-external-storagecluster-ceph-rgw   Pending   52m
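
For reference, obc2 above was created against the RGW class with a manifest along these lines (a minimal sketch; the exact OBC used may differ):

$ oc apply -f - <<EOF
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: obc2
  namespace: openshift-storage
spec:
  generateBucketName: obc2
  storageClassName: ocs-external-storagecluster-ceph-rgw
EOF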

Error from rook-ceph-operator logs:

I0521 13:47:15.800923       7 controller.go:212]  "msg"="reconciling claim" "key"="openshift-storage/obc2" 
I0521 13:47:15.800955       7 helpers.go:107]  "msg"="getting claim for key" "key"="openshift-storage/obc2" 
I0521 13:47:15.802999       7 helpers.go:213]  "msg"="getting ObjectBucketClaim's StorageClass" "key"="openshift-storage/obc2" 
I0521 13:47:15.804495       7 helpers.go:218]  "msg"="got StorageClass" "key"="openshift-storage/obc2" "name"="ocs-external-storagecluster-ceph-rgw"
I0521 13:47:15.804516       7 helpers.go:90]  "msg"="checking OBC for OB name, this indicates provisioning is complete" "key"="openshift-storage/obc2" "obc2"=null
I0521 13:47:15.804531       7 resourcehandlers.go:446]  "msg"="updating status:" "key"="openshift-storage/obc2" "new status"="Pending" "obc"="openshift-storage/obc2" "old status"="Pending"
I0521 13:47:15.807810       7 controller.go:273]  "msg"="syncing obc creation" "key"="openshift-storage/obc2" 
I0521 13:47:15.807831       7 controller.go:620]  "msg"="updating OBC metadata" "key"="openshift-storage/obc2" 
I0521 13:47:15.807841       7 resourcehandlers.go:436]  "msg"="updating" "key"="openshift-storage/obc2" "obc"="openshift-storage/obc2"
I0521 13:47:15.811577       7 resourcehandlers.go:148]  "msg"="seeing if OB for OBC exists" "key"="openshift-storage/obc2" "checking for OB name"="obc-openshift-storage-obc2"
I0521 13:47:15.813455       7 controller.go:396]  "msg"="provisioning" "key"="openshift-storage/obc2" "bucket"="obc2-1bf185cb-3795-40b3-87e5-5eea52985f9a"
2021-05-21 13:47:15.813465 I | op-bucket-prov: initializing and setting CreateOrGrant services
2021-05-21 13:47:15.813474 I | op-bucket-prov: getting storage class "ocs-external-storagecluster-ceph-rgw"
E0521 13:47:15.816402       7 controller.go:199] error syncing 'openshift-storage/obc2': error provisioning bucket: failed to get cephObjectStore: cephObjectStore not found: cephobjectstores.ceph.rook.io "ocs-external-storagecluster-cephobjectstore" not found, requeuing
W0521 13:50:53.491520       7 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0521 13:56:26.493072       7 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
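
The snippet above is from the rook-ceph-operator pod log; on a live cluster it can be pulled with, e.g.:

$ oc logs deploy/rook-ceph-operator | grep obc2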


Version of all relevant components (if applicable):
OCP: 4.8.0-0.nightly-2021-05-19-123944
OCS: ocs-operator.v4.8.0-399.ci
External RHCS: ceph version 14.2.11-146.el8cp (c5c2c77b05b124fcbbe81df2cd4b3739215f88ad) nautilus (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, deployment failure

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Yes. Deployment passes on OCS 4.7 with the same external RHCS cluster.

Steps to Reproduce:
1. Deploy an OCP cluster
2. Install OCS 4.8 External mode
3. Check the storagecluster status


Actual results:
Deployment failure due to storagecluster stuck in Progressing state

Expected results:
Deployment should succeed and the storagecluster should not get stuck in Progressing

Comment 5 Sébastien Han 2021-05-25 12:17:30 UTC
Somehow the CephObjectStore CR was not injected by the OCS operator, so moving to ocs-operator for further investigation.
For external mode, it is expected (and has been this way since 4.7) that a CephObjectStore is created.

Comment 6 Sébastien Han 2021-05-25 12:18:36 UTC
Arun, PTAL since you were the last one working on this code :) Thanks

Comment 7 Petr Balogh 2021-06-01 08:36:40 UTC
Need to delete the cluster now because of 4.6.5 and 4.7.1 testing. If you need to reproduce this and need a running cluster, please let us know.

Comment 8 Jose A. Rivera 2021-06-02 15:15:42 UTC
We should definitely at least look into this, even if it turns out it's not a problem that requires code changes. Giving devel_ack+.

Comment 11 arun kumar mohan 2021-06-04 13:18:01 UTC
The `CephObjectStore` spec has a 'GatewaySpec' object, and the gateway spec expects its `Instances` field to be at least ONE.

PR raised: https://github.com/openshift/ocs-operator/pull/1209

Jose, please take a look...
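
Once that lands, the operator-generated CephObjectStore should carry at least one gateway instance. A quick sanity check (using the CR name from the error log above):

$ oc get cephobjectstore ocs-external-storagecluster-cephobjectstore \
    -o jsonpath='{.spec.gateway.instances}'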

Comment 18 errata-xmlrpc 2021-08-03 18:16:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003

