Bug 1848387 - Independent mode OCS 4.5(v4.5.0-449): CSV is stuck in Installing state and never reached Succeeded state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.5.0
Assignee: Michael Adam
QA Contact: Sidhant Agrawal
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-06-18 09:32 UTC by Neha Berry
Modified: 2021-08-23 14:51 UTC (History)
CC: 12 users

Fixed In Version: v4.5.0-484.ci
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-15 10:17:44 UTC
Embargoed:




Links:
  GitHub openshift/ocs-operator pull 627 (closed): external cluster: Fix propagation of connected status. Last updated: 2021-02-12 05:37:43 UTC
  Red Hat Product Errata RHBA-2020:3754. Last updated: 2020-09-15 10:18:06 UTC

Description Neha Berry 2020-06-18 09:32:47 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
----------------------------------------------------------------------
OCS ocs-operator.v4.5.0-449.ci + Independent Mode install: The OCS operator CSV is stuck in Installing state. Following are some of the observations:


1. The storagecluster is in ready state and the cephcluster is also in HEALTH_OK state.

2. The CSV could be stuck in Installing state due to the known problem of a missing default backingstore (Bug 1847875), but since the ocs-operator pod logs do not explicitly point to this, a separate BZ was raised to ascertain the root cause.

Reason for a separate BZ: 
********************************
With older OCS builds, even in the absence of the backingstore (and with the bucketclass in Rejected state), the CSV for the OCS operator reached Succeeded state. Hence we want to reconfirm whether the issue is indeed due to the absence of the noobaa backingstore or caused by some other problem as well.


3. Some outputs for reference are added in the Additional info section.


$ oc get csv
NAME                         DISPLAY                       VERSION        REPLACES              PHASE
awss3operator.1.0.1          AWS S3 Operator               1.0.1          awss3operator.1.0.0   Succeeded
ocs-operator.v4.5.0-449.ci   OpenShift Container Storage   4.5.0-449.ci                         Installing
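
For reference, the phase alone can be queried directly (a minimal check, assuming the CSV lives in the openshift-storage namespace); here it returns Installing:

$ oc get csv ocs-operator.v4.5.0-449.ci -n openshift-storage -o jsonpath='{.status.phase}'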



$ oc describe csv ocs-operator.v4.5.0-449.ci -n openshift-storage


  Type     Reason               Age                  From                        Message
  ----     ------               ----                 ----                        -------
  Normal   RequirementsUnknown  109m (x2 over 109m)  operator-lifecycle-manager  requirements not yet checked
  Normal   RequirementsNotMet   109m (x2 over 109m)  operator-lifecycle-manager  one or more requirements couldn't be found
  Normal   InstallWaiting       108m (x2 over 108m)  operator-lifecycle-manager  installing: waiting for deployment rook-ceph-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Normal   InstallSucceeded     107m                 operator-lifecycle-manager  install strategy completed with no errors
  Warning  ComponentUnhealthy   20m (x2 over 20m)    operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Normal   InstallWaiting       15m (x5 over 108m)   operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Normal   NeedsReinstall       10m (x6 over 20m)    operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Normal   AllRequirementsMet   10m (x8 over 108m)   operator-lifecycle-manager  all requirements found, attempting install
  Normal   InstallSucceeded     10m (x6 over 108m)   operator-lifecycle-manager  waiting for install components to report healthy
  Warning  InstallCheckFailed   6s (x7 over 15m)     operator-lifecycle-manager  install timeout
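
The events above point at the ocs-operator deployment rollout, so the rollout state can be confirmed directly (a quick sketch, assuming the default deployment name and the name=ocs-operator pod label):

$ oc rollout status deployment/ocs-operator -n openshift-storage
$ oc get pods -n openshift-storage -l name=ocs-operator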




Version of all relevant components (if applicable):
----------------------------------------------------------------------
OCS = ocs-operator.v4.5.0-449.ci

OCP  = 4.5.0-0.nightly-2020-06-17-001505

External Cluster (RHCS) = RHCS 4.1 = 14.2.8-59.el8cp


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
----------------------------------------------------------------------
Yes


Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------
Not that I am aware of

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
----------------------------------------------------------------------
3

Is this issue reproducible?
----------------------------------------------------------------------
CSV stuck in Installing phase: Tried once with v4.5.0-449

No default backingstore: yes, reproduced on all independent mode clusters.

Can this issue be reproduced from the UI?
----------------------------------------------------------------------
The OCS operator was installed via UI

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------
No. Independent mode is a new feature of OCS 4.5.


Steps to Reproduce:
----------------------------------------------------------------------
1. Create an OCP 4.5 cluster on BM
2. Create an RHCS cluster with the latest 4.1 release, with at least 3 nodes
3. From the UI, install OCS in Independent mode. Official docs are not ready yet, but steps can be found here [1]
4. Check the status of the CSV for the ocs operator
5. Check the noobaa backingstore and bucketclass; commands are added in Additional info (see the example commands below)
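
Example commands for step 5 (a minimal sketch; the default backingstore name noobaa-default-backing-store is an assumption):

$ oc get backingstore -n openshift-storage
$ oc get bucketclass -n openshift-storage
$ oc describe backingstore noobaa-default-backing-store -n openshift-storage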

Actual results:
----------------------------------------------------------------------
The CSV is stuck in Installing phase with no clear indication of what the error is (at least none that I could find). It could be due to noobaa, but it is better to have confirmation.

Expected results:
----------------------------------------------------------------------

The CSV should be in Succeeded state if the install completes.

Comment 4 Travis Nielsen 2020-06-23 22:50:04 UTC
The status on the CephCluster CR shows that everything is healthy.
@Jose, what determines the CSV state? 

status:
  ceph:
    health: HEALTH_OK
    lastChanged: "2020-06-17T15:19:30Z"
    lastChecked: "2020-06-17T15:36:41Z"
    previousHealth: HEALTH_WARN
  conditions:
  - lastHeartbeatTime: "2020-06-17T15:07:18Z"
    lastTransitionTime: "2020-06-17T15:07:18Z"
    status: "False"
    type: Failure
  - lastHeartbeatTime: "2020-06-17T15:07:18Z"
    lastTransitionTime: "2020-06-17T15:07:18Z"
    status: "False"
    type: Ignored
  - lastHeartbeatTime: "2020-06-17T15:07:18Z"
    lastTransitionTime: "2020-06-17T15:07:18Z"
    status: "False"
    type: Upgrading
  - lastHeartbeatTime: "2020-06-17T15:07:18Z"
    lastTransitionTime: "2020-06-17T15:07:18Z"
    message: Cluster is connecting
    reason: ClusterConnecting
    status: "True"
    type: Connecting
  - lastHeartbeatTime: "2020-06-17T15:07:21Z"
    lastTransitionTime: "2020-06-17T15:07:21Z"
    message: Cluster connected successfully
    reason: ClusterConnected
    status: "True"
    type: Connected
  message: Cluster connected successfully
  phase: Connected
  state: Connected
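
For reference, the CephCluster phase shown above can also be pulled directly (a minimal check; the jsonpath assumes a single CephCluster CR in the namespace):

$ oc get cephcluster -n openshift-storage
$ oc get cephcluster -n openshift-storage -o jsonpath='{.items[0].status.phase}'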

Comment 5 Jose A. Rivera 2020-07-01 13:50:21 UTC
Among other things, the StorageCluster conditions must all be healthy:

    conditions:
    - lastHeartbeatTime: "2020-06-17T15:37:28Z"
      lastTransitionTime: "2020-06-17T15:07:19Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: ReconcileComplete
    - lastHeartbeatTime: "2020-06-17T15:07:19Z"
      lastTransitionTime: "2020-06-17T15:07:19Z"
      message: CephCluster resource is not reporting status
      reason: CephClusterStatus
      status: "False"
      type: Available
    - lastHeartbeatTime: "2020-06-17T15:14:36Z"
      lastTransitionTime: "2020-06-17T15:07:19Z"
      message: Waiting on Nooba instance to finish initialization
      reason: NoobaaInitializing
      status: "True"
      type: Progressing
    - lastHeartbeatTime: "2020-06-17T15:37:28Z"
      lastTransitionTime: "2020-06-17T15:07:22Z"
      message: 'External CephCluster Unknown Condition: Cluster connected successfully'
      reason: ExternalClusterStateUnknownCondition
      status: "True"
      type: Degraded
    - lastHeartbeatTime: "2020-06-17T15:07:19Z"
      lastTransitionTime: "2020-06-17T15:07:19Z"
      message: CephCluster resource is not reporting status
      reason: CephClusterStatus
      status: "False"
      type: Upgradeable
    - lastHeartbeatTime: "2020-06-17T15:07:21Z"
      lastTransitionTime: "2020-06-17T15:07:20Z"
      message: 'External CephCluster is trying to connect: Cluster is connecting'
      reason: ExternalClusterStateConnecting
      status: "True"
      type: ExternalClusterConnecting
    - lastHeartbeatTime: "2020-06-17T15:07:21Z"
      lastTransitionTime: "2020-06-17T15:07:20Z"
      message: 'External CephCluster is trying to connect: Cluster is connecting'
      reason: ExternalClusterStateConnecting
      status: "False"
      type: ExternalClusterConnected

For some reason, it is still in ExternalClusterStateConnecting. Indeed, that's what the ocs-operator logs show:

2020-06-17T15:07:21.149289186Z {"level":"info","ts":"2020-06-17T15:07:21.149Z","logger":"controller_storagecluster","msg":"Waiting for the external ceph cluster to be connected before starting noobaa","Request.Namespace":"openshift-storage","Request.Name":"ocs-independent-storagecluster"}

So I'm not sure what's going on here.
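
A quick way to watch the stuck condition while debugging (a minimal sketch; the CR name ocs-independent-storagecluster is taken from the log line above):

$ oc get storagecluster ocs-independent-storagecluster -n openshift-storage \
    -o jsonpath='{.status.conditions[?(@.type=="ExternalClusterConnected")].status}'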

Comment 6 Michael Adam 2020-07-02 22:13:32 UTC
@Seb, could you tell what's going on here?

Comment 7 Michael Adam 2020-07-02 22:17:17 UTC
Acking for 4.5. I think we need to address this, one way or another.

Changing component to unclassified, since we don't know yet where the problem lies.

Assigning to Seb for better visibility, for doing more analysis.

Comment 10 Sébastien Han 2020-07-06 10:46:16 UTC
As far as I can see, the CephCluster has the correct status since it's "Connected" but the operator shows otherwise...
Digging into the logs, I can also verify that the rook-ceph is happy:

op-config: CephCluster "openshift-storage" status: "Connected". "Cluster connected successfully"

Could it be that noobaa has been deployed but is failing?

Things are clear on the Rook-Ceph side, so someone with a better knowledge on ocs-op and noobaa should look into it.
Thanks.
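
The quoted status line comes from the rook-ceph-operator logs; it can be retrieved with something like this (a minimal sketch, assuming the default deployment name):

$ oc logs deployment/rook-ceph-operator -n openshift-storage | grep 'CephCluster "openshift-storage" status'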

Comment 12 Michael Adam 2020-07-08 12:11:48 UTC
(In reply to leseb from comment #10)
> As far as I can see, the CephCluster has the correct status since it's
> "Connected" but the operator shows otherwise...
> Digging into the logs, I can also verify that the rook-ceph is happy:
> 
> op-config: CephCluster "openshift-storage" status: "Connected". "Cluster
> connected successfully"
> 
> Could it be that noobaa has been deployed but is failing?
> 
> Things are clear on the Rook-Ceph side, so someone with a better knowledge
> on ocs-op and noobaa should look into it.
> Thanks.

@Nimrod, could you (or your team) check if there's something going on on the noobaa side?
The last comment from QE indicates that there's a problem with the bucketclass and backingstore.

Comment 13 Nimrod Becker 2020-07-08 13:25:25 UTC
Default BackingStore has a problem, see https://bugzilla.redhat.com/show_bug.cgi?id=1854768

Comment 14 Michael Adam 2020-07-09 07:16:43 UTC
(In reply to Nimrod Becker from comment #13)
> Default BackingStore has a problem, see
> https://bugzilla.redhat.com/show_bug.cgi?id=1854768

What does this mean for this BZ? Is it a duplicate?

Comment 15 Nimrod Becker 2020-07-09 07:28:28 UTC
I think it is; we can wait a couple of hours to verify that deployment is passing.

Comment 21 Michael Adam 2020-07-09 18:02:53 UTC
https://github.com/openshift/ocs-operator/pull/627

RFC PR

Comment 22 Jose A. Rivera 2020-07-09 19:16:36 UTC
Backport PR: https://github.com/openshift/ocs-operator/pull/628

Comment 23 Jose A. Rivera 2020-07-09 19:47:48 UTC
Backport PR has merged.

Comment 28 errata-xmlrpc 2020-09-15 10:17:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

Comment 29 Jilju Joy 2021-08-23 14:51:51 UTC
Removing AutomationBackLog keyword. This will be covered in installation phase of all automated tier runs.

