Bug 2004027 - StorageCluster and storageSystem ocs-storagecluster are in an error state for a few minutes when installing a storageSystem
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Malay Kumar Parida
QA Contact: Vijay Avuthu
Blocks: 2011326 2107226
Reported: 2021-09-14 11:40 UTC by Shay Rozen
Modified: 2023-08-09 17:00 UTC

Doc Type: Bug Fix
Doc Text:
.`StorageCluster` no longer goes into `Error` state while waiting for `StorageClass` creation
When an {product-name} `StorageCluster` is created, it waits for the underlying pools to be created before the `StorageClass` is created. During this time, the cluster returns an error for the reconcile request until the pools are ready. Because of this error, the `Phase` of the `StorageCluster` is set to `Error`. With this update, this error is caught during pool creation, and the `Phase` of the `StorageCluster` is `Progressing`.
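Below is a minimal Go sketch of the behavior described in this doc text; it is not the actual ocs-operator code from the linked pull request. The sentinel error errSkippedStorageClasses and the phaseFor helper are hypothetical names used only to illustrate how an expected "pools not ready yet" error can be recognized and reported as Progressing instead of Error.

package example

import (
	"errors"
	"fmt"
)

// errSkippedStorageClasses marks the expected "pre-requisites not met yet" case
// (hypothetical name, for illustration only).
var errSkippedStorageClasses = errors.New("storageclasses skipped while waiting for pre-requisites")

// ensureStorageClasses returns the sentinel error while the backing Ceph pools are not Ready.
func ensureStorageClasses(poolsReady bool) error {
	if !poolsReady {
		return fmt.Errorf("some StorageClasses were skipped: %w", errSkippedStorageClasses)
	}
	return nil
}

// phaseFor maps a reconcile error to the Phase that should be reported on the StorageCluster.
func phaseFor(reconcileErr error) string {
	switch {
	case reconcileErr == nil:
		return "Ready"
	case errors.Is(reconcileErr, errSkippedStorageClasses):
		// Expected while pools are still being created: keep Progressing and let
		// controller-runtime requeue the request instead of surfacing Error.
		return "Progressing"
	default:
		return "Error"
	}
}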
Last Closed: 2023-02-08 14:06:28 UTC




Links
GitHub red-hat-storage/ocs-operator pull 1786 (open): "Fix StorageCluster is in error state for few minutes after installing" (last updated 2022-08-29 04:25:25 UTC)

Description Shay Rozen 2021-09-14 11:40:41 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
When installing a storageSystem, the storagecluster is in the Error state for 5 minutes. After that it moves to Progressing and is installed in the end. The storageSystem ocs-storagecluster will also stay in Error for a few more minutes until it succeeds.


Version of all relevant components (if applicable):
odf operator 4.9.132-ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
wait

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install the latest OCP 4.9 and ODF operator 4.9.132-ci.
2. Create a storageSystem.
3. Run "oc get storagecluster" and "oc get storagesystem".


Actual results:
The storagecluster is in the Error phase for 5 minutes; after that it moves to Progressing and is fine. The storageSystem continues to be in error for a few more minutes and succeeds in the end.

Expected results:
No Error phase should be seen while installing either resource.

Additional info:

Comment 2 Shay Rozen 2021-09-14 11:51:26 UTC
Must gather: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz2004027

Comment 3 Nitin Goyal 2021-09-14 12:46:02 UTC
@srozen Can you please show me the output of `oc get storageSystem -o yaml`?

Comment 5 Shay Rozen 2021-09-14 14:07:25 UTC
Although I didn't see the error in the UI for storageSystem ocs-storagecluster-storagesystem (ODF 4.9.138-ci), there wasn't any status shown in the UI until it was ready. The storagecluster was still in Error for 3 minutes until it reached the Progressing phase.
You can look at the attachment above. It loops over the storagesystem and the storagecluster through the installation.
After creation and the Ready state, this is the output:

apiVersion: odf.openshift.io/v1alpha1
kind: StorageSystem
metadata:
  creationTimestamp: "2021-09-14T13:54:50Z"
  finalizers:
  - storagesystem.odf.openshift.io
  generation: 1
  name: ocs-storagecluster-storagesystem
  namespace: openshift-storage
  resourceVersion: "39308"
  uid: 3d0cbbcc-42e9-48a3-936d-1665a4ec4439
spec:
  kind: storagecluster.ocs.openshift.io/v1
  name: ocs-storagecluster
  namespace: openshift-storage
status:
  conditions:
  - lastHeartbeatTime: "2021-09-14T13:54:51Z"
    lastTransitionTime: "2021-09-14T13:54:51Z"
    message: Reconcile is completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: Available
  - lastHeartbeatTime: "2021-09-14T13:54:51Z"
    lastTransitionTime: "2021-09-14T13:54:51Z"
    message: Reconcile is completed successfully
    reason: ReconcileCompleted
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2021-09-14T13:54:50Z"
    lastTransitionTime: "2021-09-14T13:54:50Z"
    message: StorageSystem CR is valid
    reason: Valid
    status: "False"
    type: StorageSystemInvalid
  - lastHeartbeatTime: "2021-09-14T13:54:51Z"
    lastTransitionTime: "2021-09-14T13:54:51Z"
    reason: Ready
    status: "True"
    type: VendorCsvReady
  - lastHeartbeatTime: "2021-09-14T13:54:51Z"
    lastTransitionTime: "2021-09-14T13:54:51Z"
    reason: Found
    status: "True"
    type: VendorSystemPresent

Comment 6 Nitin Goyal 2021-09-14 14:24:20 UTC
I see all conditions are good in the storageSystem and there is no problem with them. Moreover, there is no specific state field in the StorageSystem; all of them are conditions only.

So the UI is reporting the state of the StorageCluster in the ODF dashboard, and whenever the StorageCluster is marked Ready it should also be shown in the Ready state. Adding a needinfo on @afrahman to clarify.

Comment 7 Shay Rozen 2021-09-14 14:26:17 UTC
It probably got fixed in another PR. The problem was found on ODF 4.9.132-ci and this check was on ODF 4.9.138-ci.
What about the storagecluster Error state that is still seen?

Comment 8 Shay Rozen 2021-09-14 14:27:11 UTC
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   2m43s   Error              2021-09-14T13:54:50Z   4.9.0

Comment 9 Nitin Goyal 2021-09-14 15:05:03 UTC
@jrivera or @uchapaga can help us with the StorageCluster Phase.

Comment 11 Jose A. Rivera 2021-09-16 16:12:45 UTC
This seems to be a fairly minor thing, and I really don't have the bandwidth to look at such things right now. I don't see that anyone has even looked at any operator logs, so someone please do that before asking for further input.

Comment 13 Nitin Goyal 2021-09-27 05:44:00 UTC
(In reply to Shay Rozen from comment #8)
> NAME                 AGE     PHASE   EXTERNAL   CREATED AT            
> VERSION
> ocs-storagecluster   2m43s   Error              2021-09-14T13:54:50Z   4.9.0

@srozen Is it still the case? If yes, please let me know so I can move this to the ocs-operator.

Comment 14 Mudit Agarwal 2021-10-11 14:23:33 UTC
Shay, is this still reproducible with the latest build?

Comment 15 Martin Bukatovic 2021-10-13 15:23:56 UTC
I thought that this was taken care of. Maybe there is a similar BZ somewhere? Opening a needinfo on myself to provide new evidence this week (to demonstrate that it's still a problem).

Comment 20 Martin Bukatovic 2021-10-18 09:25:21 UTC
Dropping needinfo on me (from comment 15) as Aman provided the evidence instead in comment 16.

Comment 26 Nitin Goyal 2021-10-28 06:22:49 UTC
It could be; I really don't have an idea of everything that went into the ocs-operator.

Comment 27 Jose A. Rivera 2021-11-09 15:48:54 UTC
Looks like the error was the following:

2021-10-13T14:07:57.890288519Z {"level":"info","ts":1634134077.8902707,"logger":"controllers.StorageCluster","msg":"Waiting for CephFilesystem to be Ready. Skip reconciling StorageClass","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","CephFilesystem":"ocs-storagecluster-cephfilesystem/openshift-storage","StorageClass":"ocs-storagecluster-cephfs"}

2021-10-13T14:07:57.890320832Z {"level":"info","ts":1634134077.8902915,"logger":"controllers.StorageCluster","msg":"Waiting for CephBlockPool to be Ready. Skip reconciling StorageClass","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","CephBlockPool":"ocs-storagecluster-cephblockpool/openshift-storage","StorageClass":"ocs-storagecluster-ceph-rbd"}

2021-10-13T14:07:57.908952480Z {"level":"error","ts":1634134077.908902,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"ocs-storagecluster","namespace":"openshift-storage","error":"some StorageClasses [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd] were skipped while waiting for pre-requisites to be met","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
2021-10-13T14:07:57.908997302Z {"level":"info","ts":1634134077.908976,"logger":"controllers.StorageCluster","msg":"Reconciling StorageCluster.","Request.Namespace":"openshift-storage","Request.Name":"ocs-sto

I think the messages all make sense. Offhand, the naive solution would be to change the third message from Error to Warning here: https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/storageclasses.go#L156-L158

However, if we look here, the StorageClass creation (generally) needs to succeed before we try to create the NooBaa system, as it (typically) depends on the RBD StorageClass: https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/reconcile.go#L350

As such, I think the more complete fix would be to set some Conditions while still returning `nil`.

Still, this is a bit of an anomaly. If you look here, all these CRs require a working CephCluster, but we can go ahead with creating them anyway because rook-ceph-operator takes care of ensuring its own correct sequence of reconciliation: https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/reconcile.go#L344-L349

If we were to make the change I proposed above, we would need to make sure noobaa-operator can properly handle having an assigned StorageClass that is not yet capable of provisioning volumes. @nimro
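A rough Go sketch of this alternative (not the code that was eventually merged; the condition type "StorageClassesCreated", the reason "WaitingForCephResources", and the helper name are illustrative assumptions): record a condition explaining why the StorageClasses were skipped, return nil so the rest of the reconcile can proceed, and rely on a requeue to retry later.

package example

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// storageClusterStatus is a stand-in for the relevant part of the real status struct.
type storageClusterStatus struct {
	Conditions []metav1.Condition
}

// handleSkippedStorageClasses records why StorageClasses were skipped without failing the
// reconcile; returning a nil error means the Phase is not forced to Error, and RequeueAfter
// keeps retrying until the Ceph pools become Ready.
func handleSkippedStorageClasses(status *storageClusterStatus, skipped []string) (ctrl.Result, error) {
	meta.SetStatusCondition(&status.Conditions, metav1.Condition{
		Type:    "StorageClassesCreated", // illustrative condition type
		Status:  metav1.ConditionFalse,
		Reason:  "WaitingForCephResources", // illustrative reason
		Message: fmt.Sprintf("skipped StorageClasses %v while waiting for pre-requisites", skipped),
	})
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}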

Comment 28 Jose A. Rivera 2021-11-09 15:49:34 UTC
I hate Bugzilla.... Nimrod, could you weigh in on whether this would be a problem?

Comment 31 Danny 2021-11-11 09:49:40 UTC
Hi Jose, so if I understand correctly, noobaa will be created with a reference to a storageclass that is not ready to provision.
In noobaa reconciliation, we create a statefulset that uses a PVC with the given storageclass. AFAIK, a pod will not be created until its PV is created. So as long as the provisioner eventually creates the PV once it is ready, I think we're good. Does that make sense?
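For context, a minimal sketch of why this works; this is not noobaa-operator's actual code, and the StatefulSet name, image, and StorageClass name are illustrative assumptions. The PVC created from the volumeClaimTemplate simply stays Pending until the referenced StorageClass can provision a PV, and the pod is not scheduled until the claim is bound.

package example

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// noobaaDBStatefulSet builds a StatefulSet whose volumeClaimTemplate references a
// StorageClass by name; resource requests are omitted for brevity.
func noobaaDBStatefulSet() *appsv1.StatefulSet {
	storageClass := "ocs-storagecluster-ceph-rbd" // assigned even before the class can provision
	replicas := int32(1)
	labels := map[string]string{"app": "noobaa-db"}
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "noobaa-db", Namespace: "openshift-storage"},
		Spec: appsv1.StatefulSetSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "db", Image: "example/db:latest"}},
				},
			},
			VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
				ObjectMeta: metav1.ObjectMeta{Name: "db"},
				Spec: corev1.PersistentVolumeClaimSpec{
					// The claim only names the StorageClass; the RBD provisioner binds it
					// whenever the class and its Ceph pool become usable.
					StorageClassName: &storageClass,
					AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
				},
			}},
		},
	}
}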

Comment 32 Jose A. Rivera 2021-11-11 15:44:28 UTC
Yeah, that sounds fine. In this case it might make sense to change this to a warning and return `nil` so we can proceed with the rest of the reconciliation. I don't know how Snapshots and RBD mirrors would be affected, but I guess we'll find out. ;) Travis, Seb, or Umanga, if any of you could weigh in on what might happen there please do so.

For the time being, this is a valid BZ. I can't guarantee I will personally start working on it ASAP, but the fix shouldn't be too complicated.

Comment 33 Travis Nielsen 2021-11-11 18:58:51 UTC
What's the question exactly? Rook is designed to retry creating resources if the underlying cluster or related resources are not fully configured yet. If RBD mirroring CRs are created before the cluster is ready, the operator should also keep retrying there. It's not an error condition, just in a state of retrying until ready. Was an error reported unexpectedly from Rook somewhere? Or the errors referenced were just from the OCS operator?

Comment 34 umanga 2021-11-12 06:23:25 UTC
We went with the Stop The World approach due to https://bugzilla.redhat.com/show_bug.cgi?id=1910790#c28.
We could log it as a warning/info and move on, but we'd end up with other issues (so instead we chose to error out early).
Also, we can only return errors (not warnings) in Go, so making changes to this behavior may require extra effort.

The StorageCluster is in the Error phase because we waited for things to be ready but they weren't ready in time. That doesn't mean we stopped trying.
It is really difficult to explain this to users via status updates.

Maybe we should look into why it took CephBlockPool and CephFilesystem 15 minutes to become Ready?
I believe it didn't take that long before.

Comment 35 Sébastien Han 2021-11-15 10:31:12 UTC
I didn't see any errors in the Rook logs, not even a status update error, so I'm not sure where those 15 minutes come from. In any case, as Travis explained, the current behavior is expected.

Comment 36 Jose A. Rivera 2021-11-15 16:59:38 UTC
Again, as Umanga stated, this behavior was very specifically to solve a problem where PVC creation was happening *before* the actual storage was ready:

> By creating the StorageClass(es) after the CephBlockPool and CephFilesystem are created and ready, there should not be any problems/hangs during PVC creation.

See the linked BZ.

As such, if we *don't* return an error and restart the reconciliation, we will absolutely run into this again. Is that acceptable?

Comment 37 Jose A. Rivera 2021-11-15 17:02:15 UTC
Just to be pedantic, you can see the full upstream discussion for that particular BZ here: https://github.com/red-hat-storage/ocs-operator/pull/1224

Comment 38 Mudit Agarwal 2021-11-16 08:16:36 UTC
Looks like we don't have a good solution for this.

IMO, if it is in the Error state for a very long time (which shouldn't always be the case), then we should put in some effort and try to fix it.
If the Error state is just there for a couple of minutes, it is not worth fixing.

Bipin, wdyt?

Also, if I understand it correctly this should not happen for upgrades, only for fresh installations.
Umanga, please correct me if I am wrong.

Comment 39 Bipin Kunal 2021-11-16 13:34:41 UTC
(In reply to Mudit Agarwal from comment #38)
> Looks like we don't have a good solution for this.
> 
> IMO, if it is in error state for a very long time(which shouldn't always the
> case) then we should put some effort and try fix it.
> If the error state is just there for a couple of minutes, it not worth
> fixing it.
> 
> Bipin, wdyt?
> 
> Also, if I understand it correctly this should not happen for upgrades, only
> for fresh installations.
> Umanga, please correct me if I am wrong.

In my opinion, it is good to fix this, but it may not be a blocker. We can have this documented as a known issue for now.

Comment 40 Jose A. Rivera 2021-11-16 15:57:55 UTC
Can we move this specific BZ out to ODF 4.10? What needs to happen to get some known issue text in the release notes?

Comment 41 Mudit Agarwal 2021-11-17 09:19:29 UTC
Moving it to 4.10 and adding it as a known issue for 4.9 as suggested by Bipin/Jose.

Comment 47 Mudit Agarwal 2022-02-22 17:38:57 UTC
I never said that it is not a valid bug; I closed it as WONTFIX because there is no easy way to fix it.
It can't be fixed by dev freeze and is not a blocker, so moving it out of 4.10.

Comment 52 Mudit Agarwal 2022-10-11 12:18:06 UTC
Fixed in the latest 4.12 builds

Comment 55 Nitin Goyal 2022-11-11 05:56:07 UTC
*** Bug 2141915 has been marked as a duplicate of this bug. ***

Comment 56 Vijay Avuthu 2022-11-30 07:05:00 UTC
Update:
=============

Verified with 4.12.0-0.nightly-2022-11-29-131548 and ocs-registry:4.12.0-120

job: https://url.corp.redhat.com/8d04007

must gather: https://url.corp.redhat.com/6f2ed90

> Did not see the Error state for the storagecluster

2022-11-30 11:55:22  06:25:21 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc apply -f /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/templates/ocs-deployment/storagesystem_odf.yaml
$ oc get storagesystem
NAME                               STORAGE-SYSTEM-KIND                  STORAGE-SYSTEM-NAME
ocs-storagecluster-storagesystem   storagecluster.ocs.openshift.io/v1   ocs-storagecluster
$ while true; do date;oc get storagecluster -n openshift-storage; done
Wed Nov 30 11:55:31 IST 2022
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   8s    Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 11:55:32 IST 2022
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   9s    Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 11:55:33 IST 2022
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   10s   Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 11:55:34 IST 2022
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   11s   Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 11:55:35 IST 2022
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   12s   Progressing              2022-11-30T06:25:24Z   4.12.0
.
.
.
NAME                 AGE    PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m1s   Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 12:02:25 IST 2022
NAME                 AGE    PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m2s   Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 12:02:26 IST 2022
NAME                 AGE    PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m3s   Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 12:02:27 IST 2022
NAME                 AGE    PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m5s   Ready              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 12:02:30 IST 2022
NAME                 AGE    PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m7s   Ready              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 12:02:31 IST 2022
NAME                 AGE    PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m7s   Ready              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 12:02:34 IST 2022
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m11s   Ready              2022-11-30T06:25:24Z   4.12.0

