Bug 1964574

Summary: OCS 4.8 Fresh deployment: Storagecluster in ready state even when Cephcluster is stuck in Progressing (Configuring MONs) for prolonged time
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Neha Berry <nberry>
Component: ocs-operator
Assignee: Nobody <nobody>
Status: VERIFIED ---
QA Contact: Avi Liani <alayani>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.8
CC: mbukatov, muagarwa, owasserm, rperiyas, sostapov
Target Milestone: ---
Keywords: AutomationBackLog, Regression
Target Release: OCS 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.8.0-432.ci
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1951021

Comment 5 Jose A. Rivera 2021-05-27 16:35:10 UTC
Just to be certain, can you provide a screenshot of the problem in the UI? Also provide the full StorageCluster YAML and ocs-operator logs. An ocs-must-gather would also suffice.

For the ocs-operator Pod, the readiness phase should not be impacted by the state of the StorageCluster to begin with. Installation of the operators is independent from the creation and management of its operands. I don't clearly remember how the previous behavior made it into the product, but really it's been a long-standing bug that (I believe) was recently cleared up as part of the SDK updates.

Comment 6 Jose A. Rivera 2021-06-02 15:17:52 UTC
Oops, forgot to set the needinfo. Nonetheless, also giving devel_ack+ since I believe this problem is reproducible.

Comment 7 Mudit Agarwal 2021-06-02 15:18:41 UTC
https://chat.google.com/room/AAAAREGEba8/2JSkNKg3_hY

Comment 8 Neha Berry 2021-06-04 10:33:30 UTC
(In reply to Jose A. Rivera from comment #5)
> Just to be certain, can you provide a screenshot of the problem in the UI?
> Also provide the full StorageCluster YAML and ocs-operator logs. An
> ocs-must-gather would also suffice.
> 
Hi Jose, apologies, I do not have a screenshot of the UI, but the logs are provided here:
https://bugzilla.redhat.com/show_bug.cgi?id=1964574#c2

> For the ocs-operator Pod, the readiness phase should not be impacted by the
> state of the StorageCluster to begin with. Installation of the operators is
> independent from the creation and management of its operands. I don't
> clearly remember how the previous behavior made it into the product, but
> really it's been a long-standing bug that (I believe) was recently cleared
> up as part of the SDK updates.

Comment 10 Martin Bukatovic 2021-06-04 22:14:24 UTC
Description of problem
======================

When deployment of StorageCluster/ocs-storagecluster begins, its phase
immediately reaches "Ready", even though the deployment has only just started
at that point. The phase stays "Ready" during deployment of the Ceph components.

Version-Release number of selected component
============================================

OCP 4.8.0-0.nightly-2021-06-03-055145
LSO 4.8.0-202106021817
OCS 4.8.0-407.ci

How reproducible
================

100%

Steps to Reproduce
==================

1. Install OCP cluster.
2. Install LSO and OCS operators.
3. Use "Create Storage Cluster" wizard in OCP Console to initiate deployment of
   ocs-storagecluster.
4. Observe Phase of StorageCluster/ocs-storagecluster during installation
   (either via cli or via OCP Console).
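
For step 4, one way to observe the phase from the CLI is to sample the StorageCluster and CephCluster phases together. This is a minimal sketch, not part of the original reproduction: it assumes the resource names seen in this report (`ocs-storagecluster`, `ocs-storagecluster-cephcluster`), that both resources expose `.status.phase`, and that `oc` is already logged in. The helper name `sample_phases` is made up for illustration.

```shell
# Hypothetical helper: print both phases on one line.
# Assumes `oc` is logged in and the resources use the names from this report.
NAMESPACE="${NAMESPACE:-openshift-storage}"

sample_phases() {
    # Read the phase of each resource from its status.
    sc_phase=$(oc get storagecluster ocs-storagecluster \
        -n "$NAMESPACE" -o jsonpath='{.status.phase}')
    ceph_phase=$(oc get cephcluster ocs-storagecluster-cephcluster \
        -n "$NAMESPACE" -o jsonpath='{.status.phase}')
    echo "storagecluster=${sc_phase} cephcluster=${ceph_phase}"
}

# Usage during installation (runs until interrupted):
# while true; do sample_phases; sleep 5; done
```

Sampling both resources in the same loop makes it easier to spot a moment where the two phases disagree.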

Actual results
==============

When deployment of ocs-storagecluster starts, its phase is "Ready":

```
$ oc get storagecluster -n openshift-storage
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   35s   Ready              2021-06-04T19:12:04Z   4.8.0
```

Even though ceph components are being installed at that moment.

Only later, when the Ceph deployment finishes and the NooBaa installation is
under way, do we see the status of ocs-storagecluster as Progressing:

```
$ oc get storagecluster -n openshift-storage
NAME                 AGE     PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   2m56s   Progressing              2021-06-04T19:12:04Z   4.8.0
```

And when this installation finishes, the phase is back at "Ready".

Expected results
================

During deployment of ceph components, phase/state of ocs-storagecluster is
reported as "Progressing", in the same way as done during NooBaa deployment.
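
The expectation above can be written as a small check that fails when StorageCluster reports "Ready" while CephCluster is in any other phase. This is only a sketch of the condition described in this bug, not the operator's logic; the function name `phases_consistent` is hypothetical, and the two phases are passed in as plain strings.

```shell
# Hypothetical check of the expected behaviour.
# $1 = StorageCluster phase, $2 = CephCluster phase
# Returns 1 when StorageCluster is "Ready" but CephCluster is not.
phases_consistent() {
    if [ "$1" = "Ready" ] && [ "$2" != "Ready" ]; then
        echo "inconsistent: storagecluster=$1 cephcluster=$2" >&2
        return 1
    fi
    return 0
}

# Example call against a live cluster (assumes the resource names from this report):
# phases_consistent \
#   "$(oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.status.phase}')" \
#   "$(oc get cephcluster ocs-storagecluster-cephcluster -n openshift-storage -o jsonpath='{.status.phase}')"
```

The check deliberately accepts any StorageCluster phase other than "Ready", since the report only objects to "Ready" being shown too early.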

Comment 12 Jose A. Rivera 2021-06-07 15:12:52 UTC
Apologies if it was not clear, but giving devel_ack+ meant it's a valid bug that we should fix. Since it is marked as blocker?, we need qa_ack+ to confirm it for 4.8.

I had an initial look through the must-gather information and confirmed the problem, but the logs were not sufficient to do a full RCA. This needs further investigation to determine a proper resolution; it just hasn't been assigned yet.

Comment 13 Martin Bukatovic 2021-06-07 20:58:19 UTC
Providing QA ack based on today's bug triage.

Comment 16 Martin Bukatovic 2021-06-15 12:36:50 UTC
Rechecked with OCP/OCS 4.7:

- OCP 4.7.0-0.nightly-2021-06-12-151209
- LSO 4.7.0-202105210300.p0
- OCS 4.7.1-410.ci

And I don't see the problem I originally observed with 4.8 (as noted in comment 10).

Comment 17 Avi Liani 2021-07-13 08:02:52 UTC
Just deployed a cluster on a bare-metal environment with OCP & OCS 4.8, and it passed without any problems.

LSO version: 4.8.0-202106291913 

ceph version 14.2.11-184.el8cp

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0     True        False         11h     Cluster version is 4.8.0


$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.8.0-450.ci   OpenShift Container Storage   4.8.0-450.ci              Succeeded


$ oc get storagecluster -n openshift-storage
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m43s   Ready              2021-07-13T07:47:21Z   4.8.0


$ oc get cephcluster -n openshift-storage
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE     PHASE   MESSAGE                        HEALTH      EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          8m12s   Ready   Cluster created successfully   HEALTH_OK   


IMO, this can be verified, unless more tests need to be done.

Comment 18 Avi Liani 2021-07-13 10:55:03 UTC
Trying to deploy a cluster (on the same OCP as in comment 17) where the MON deployment is stuck shows:

$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.8.0-450.ci   OpenShift Container Storage   4.8.0-450.ci              Installing


$ oc get storagecluster -n openshift-storage
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   12m   Progressing              2021-07-13T10:37:28Z   4.8.0


$ oc get cephcluster -n openshift-storage
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE         MESSAGE                 HEALTH   EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          12m   Progressing   Configuring Ceph Mons            


While the Ceph cluster is installing (Progressing phase), the StorageCluster is also in the Progressing phase and the OCS operator CSV is in the Installing phase.

I think that this BZ is verified.