Description of problem (please be as detailed as possible and provide log snippets):
When installing a StorageSystem, the StorageCluster is in the Error phase for 5 minutes. After that it moves to Progressing and is installed in the end. The StorageSystem ocs-storagecluster stays in Error for a few more minutes until it succeeds as well.

Version of all relevant components (if applicable):
odf operator 4.9.132-ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
Wait.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible? Yes

Can this issue be reproduced from the UI? Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install the latest OCP 4.9 and odf operator 4.9.132-ci.
2. Create a StorageSystem.
3. Run "oc get storagecluster" and "oc get storagesystem".

Actual results:
The StorageCluster is in the Error phase for 5 minutes, then moves to Progressing and is fine. The StorageSystem continues to be in Error for a few minutes and succeeds at the end.

Expected results:
No Error phase should be seen during installation for either resource.

Additional info:
Must gather: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz2004027
@srozen Can you please show me the output of `oc get storageSystem -o yaml`?
Although I didn't see the error in the UI for the StorageSystem ocs-storagecluster-storagesystem (odf 4.9.138-ci), no status was shown in the UI until it was Ready. The StorageCluster was still in Error for 3 minutes before reaching the Progressing phase. You can look at the attachment above; it loops over the StorageSystem and the StorageCluster during installation. After creation and reaching the Ready state, this is the output:

apiVersion: odf.openshift.io/v1alpha1
kind: StorageSystem
metadata:
  creationTimestamp: "2021-09-14T13:54:50Z"
  finalizers:
  - storagesystem.odf.openshift.io
  generation: 1
  name: ocs-storagecluster-storagesystem
  namespace: openshift-storage
  resourceVersion: "39308"
  uid: 3d0cbbcc-42e9-48a3-936d-1665a4ec4439
spec:
  kind: storagecluster.ocs.openshift.io/v1
  name: ocs-storagecluster
  namespace: openshift-storage
status:
  conditions:
  - lastHeartbeatTime: "2021-09-14T13:54:51Z"
    lastTransitionTime: "2021-09-14T13:54:51Z"
    message: Reconcile is completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: Available
  - lastHeartbeatTime: "2021-09-14T13:54:51Z"
    lastTransitionTime: "2021-09-14T13:54:51Z"
    message: Reconcile is completed successfully
    reason: ReconcileCompleted
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2021-09-14T13:54:50Z"
    lastTransitionTime: "2021-09-14T13:54:50Z"
    message: StorageSystem CR is valid
    reason: Valid
    status: "False"
    type: StorageSystemInvalid
  - lastHeartbeatTime: "2021-09-14T13:54:51Z"
    lastTransitionTime: "2021-09-14T13:54:51Z"
    reason: Ready
    status: "True"
    type: VendorCsvReady
  - lastHeartbeatTime: "2021-09-14T13:54:51Z"
    lastTransitionTime: "2021-09-14T13:54:51Z"
    reason: Found
    status: "True"
    type: VendorSystemPresent
I see that all conditions on the StorageSystem are good and there is no problem with them. Moreover, there is no specific state on the StorageSystem; it has conditions only. So the UI is reporting the state of the StorageCluster in the ODF dashboard, and whenever the StorageCluster is marked Ready, the StorageSystem should also be in the Ready state. Adding needinfo on @afrahman to clarify.
It probably got fixed in another PR. The problem was found on odf 4.9.132-ci and this check was done on odf 4.9.138-ci. What about the StorageCluster error state that is still seen?
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   2m43s   Error              2021-09-14T13:54:50Z   4.9.0
@jrivera or @uchapaga can help us with the StorageCluster phase.
This seems to be a fairly minor thing, and I really don't have the bandwidth to look at such things right now. I don't see that anyone has even looked at any operator logs, so someone please do that before asking for further input.
(In reply to Shay Rozen from comment #8)
> NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
> ocs-storagecluster   2m43s   Error              2021-09-14T13:54:50Z   4.9.0

@srozen Is this still the case? If yes, please let me know so I can move this to the ocs-operator.
Shay, is this still reproducible with the latest build?
I thought this was taken care of. Maybe there is a similar BZ somewhere? Opening needinfo on myself to provide new evidence this week (to demonstrate that it's still a problem).
Dropping needinfo on me (from comment 15) as Aman provided the evidence instead in comment 16.
It could be; I really don't have an idea of everything that went into the ocs-operator.
Looks like the error was the following:

2021-10-13T14:07:57.890288519Z {"level":"info","ts":1634134077.8902707,"logger":"controllers.StorageCluster","msg":"Waiting for CephFilesystem to be Ready. Skip reconciling StorageClass","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","CephFilesystem":"ocs-storagecluster-cephfilesystem/openshift-storage","StorageClass":"ocs-storagecluster-cephfs"}
2021-10-13T14:07:57.890320832Z {"level":"info","ts":1634134077.8902915,"logger":"controllers.StorageCluster","msg":"Waiting for CephBlockPool to be Ready. Skip reconciling StorageClass","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","CephBlockPool":"ocs-storagecluster-cephblockpool/openshift-storage","StorageClass":"ocs-storagecluster-ceph-rbd"}
2021-10-13T14:07:57.908952480Z {"level":"error","ts":1634134077.908902,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"ocs-storagecluster","namespace":"openshift-storage","error":"some StorageClasses [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd] were skipped while waiting for pre-requisites to be met","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
2021-10-13T14:07:57.908997302Z {"level":"info","ts":1634134077.908976,"logger":"controllers.StorageCluster","msg":"Reconciling StorageCluster.","Request.Namespace":"openshift-storage","Request.Name":"ocs-sto

I think the messages all make sense. Offhand, the naive solution would be to change the third message from Error to Warning here:
https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/storageclasses.go#L156-L158

However, if we look here, the StorageClass creation (generally) needs to succeed before we try to create the NooBaa system, as it (typically) depends on the RBD StorageClass:
https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/reconcile.go#L350

As such, I think the more complete fix would be to set some set of Conditions while still returning `nil`.

Still, this is a bit of an anomaly. If you look here, all these CRs require a working CephCluster, but we can go ahead with creating them anyway because rook-ceph-operator takes care of ensuring its own correct sequence of reconciliation:
https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/reconcile.go#L344-L349

If we were to do the change I proposed above, we would need to make sure noobaa-operator can properly handle having an assigned StorageClass that is not yet capable of provisioning volumes.

@nimro
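For illustration only, here is a rough Go sketch of the "set Conditions and return `nil`" idea. The function name, the condition type, and the plain metav1.Condition slice are hypothetical stand-ins, not the actual ocs-operator helpers or status types:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// handleSkippedStorageClasses sketches the alternative described above: when
// StorageClasses are skipped because CephBlockPool/CephFilesystem are not yet
// Ready, record that as a status condition and return nil so the rest of the
// reconcile can proceed, instead of aborting with an error as today.
func handleSkippedStorageClasses(conditions *[]metav1.Condition, skipped []string) error {
	if len(skipped) == 0 {
		return nil
	}
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "StorageClassesCreated", // hypothetical condition type
		Status:  metav1.ConditionFalse,
		Reason:  "WaitingForPrerequisites",
		Message: fmt.Sprintf("StorageClasses %v skipped while waiting for pre-requisites to be met", skipped),
	})
	// Returning nil keeps the reconcile going; the skipped StorageClasses are
	// retried on the next reconcile once the Ceph resources are Ready.
	return nil
}

func main() {
	var conditions []metav1.Condition
	_ = handleSkippedStorageClasses(&conditions, []string{"ocs-storagecluster-cephfs", "ocs-storagecluster-ceph-rbd"})
	fmt.Printf("%+v\n", conditions)
}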
I hate Bugzilla.... Nimrod, could you weigh in on whether this would be a problem?
Hi Jose, so if I understand correctly, NooBaa will be created with a reference to a StorageClass that is not ready to provision. In NooBaa reconciliation, we create a StatefulSet that uses a PVC with the given StorageClass. AFAIK, a pod will not be created until its PV is created. So as long as the provisioner eventually creates the PV once it's ready, I think we're good. Does that make sense?
Yeah, that sounds fine. In this case it might make sense to change this to a warning and return `nil` so we can proceed with the rest of the reconciliation. I don't know how Snapshots and RBD mirrors would be affected, but I guess we'll find out. ;)

Travis, Seb, or Umanga, if any of you could weigh in on what might happen there, please do so.

For the time being, this is a valid BZ. I can't guarantee I will personally start working on it ASAP, but the fix shouldn't be too complicated.
What's the question exactly? Rook is designed to retry creating resources if the underlying cluster or related resources are not fully configured yet. If RBD mirroring CRs are created before the cluster is ready, the operator should also keep retrying there. It's not an error condition, just a state of retrying until ready. Was an error reported unexpectedly from Rook somewhere? Or were the errors referenced just from the OCS operator?
We went with the Stop The World approach due to https://bugzilla.redhat.com/show_bug.cgi?id=1910790#c28. We could log it as a warning/info and move on, but we'd end up with other issues (so instead we chose to error out early). Also, we can only return errors (not warnings) in Go, so making changes to this behavior may require extra effort.

The StorageCluster is in the Error phase because we waited for things to be ready and they weren't ready in time. But that doesn't mean we stopped trying. It is really difficult to explain this to users via status updates.

Maybe we should look into why it took CephBlockPool and CephFilesystem 15 minutes to be Ready? I believe it didn't take that long before?
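To illustrate the "only errors, no warnings in Go" point, one common pattern (a sketch under assumed names, not what ocs-operator does today) is a sentinel error that the phase-setting code can translate into Progressing instead of Error:

package main

import (
	"errors"
	"fmt"
)

// errPrerequisitesNotReady marks "still waiting" situations so callers can
// distinguish them from genuine failures.
var errPrerequisitesNotReady = errors.New("prerequisites not ready")

// ensureStorageClasses stands in for the real StorageClass reconcile step.
func ensureStorageClasses(cephReady bool) error {
	if !cephReady {
		return fmt.Errorf("CephBlockPool/CephFilesystem not Ready yet: %w", errPrerequisitesNotReady)
	}
	return nil
}

// phaseFor maps a reconcile result onto a user-visible phase, reporting
// Progressing (not Error) while prerequisites are still coming up.
func phaseFor(err error) string {
	switch {
	case err == nil:
		return "Ready"
	case errors.Is(err, errPrerequisitesNotReady):
		return "Progressing"
	default:
		return "Error"
	}
}

func main() {
	fmt.Println(phaseFor(ensureStorageClasses(false))) // Progressing
	fmt.Println(phaseFor(ensureStorageClasses(true)))  // Ready
}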
I didn't see any errors in the Rook logs, not even a status update error. So I'm not sure where those 15 minutes come from. In any case, as Travis explained, the current behavior is expected.
Again, as Umanga stated, this behavior was added very specifically to solve a problem where PVC creation was happening *before* the actual storage was ready:

> By creating the StorageClass(es) after the CephBlockPool and CephFilesystem are created and ready, there should not be any problems/hangs during PVC creation.

See the linked BZ. As such, if we *don't* return an error and restart the reconciliation, we will absolutely run into this again. Is that acceptable?
Just to be pedantic, you can see the full upstream discussion for that particular BZ here: https://github.com/red-hat-storage/ocs-operator/pull/1224
Looks like we don't have a good solution for this.

IMO, if it is in the Error state for a very long time (which shouldn't always be the case) then we should put in some effort and try to fix it. If the error state is just there for a couple of minutes, it's not worth fixing.

Bipin, wdyt?

Also, if I understand it correctly, this should not happen for upgrades, only for fresh installations. Umanga, please correct me if I am wrong.
(In reply to Mudit Agarwal from comment #38)
> Looks like we don't have a good solution for this.
>
> IMO, if it is in the Error state for a very long time (which shouldn't always
> be the case) then we should put in some effort and try to fix it.
> If the error state is just there for a couple of minutes, it's not worth
> fixing.
>
> Bipin, wdyt?
>
> Also, if I understand it correctly, this should not happen for upgrades, only
> for fresh installations.
> Umanga, please correct me if I am wrong.

In my opinion, it would be good to fix this, but it may not be a blocker. We can have it documented as a known issue for now.
Can we move this specific BZ out to ODF 4.10? What needs to happen to get some known issue text in the release notes?
Moving it to 4.10 and adding it as a known issue for 4.9 as suggested by Bipin/Jose.
I never said that it is not a valid bug; I closed it as WONTFIX because there is no easy way to fix it. It can't be fixed by dev freeze and it is not a blocker, so I am moving it out of 4.10.
Fixed in the latest 4.12 builds
*** Bug 2141915 has been marked as a duplicate of this bug. ***
Update:
=============
Verified with 4.12.0-0.nightly-2022-11-29-131548 and ocs-registry:4.12.0-120

job: https://url.corp.redhat.com/8d04007
must gather: https://url.corp.redhat.com/6f2ed90

> Did not see the Error state for the StorageCluster.

2022-11-30 11:55:22  06:25:21 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc apply -f /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/templates/ocs-deployment/storagesystem_odf.yaml

$ oc get storagesystem
NAME                               STORAGE-SYSTEM-KIND                  STORAGE-SYSTEM-NAME
ocs-storagecluster-storagesystem   storagecluster.ocs.openshift.io/v1   ocs-storagecluster

$ while true; do date; oc get storagecluster -n openshift-storage; done
Wed Nov 30 11:55:31 IST 2022
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   8s    Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 11:55:32 IST 2022
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   9s    Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 11:55:33 IST 2022
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   10s   Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 11:55:34 IST 2022
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   11s   Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 11:55:35 IST 2022
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   12s   Progressing              2022-11-30T06:25:24Z   4.12.0
.
.
.
NAME                 AGE    PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m1s   Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 12:02:25 IST 2022
NAME                 AGE    PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m2s   Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 12:02:26 IST 2022
NAME                 AGE    PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m3s   Progressing              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 12:02:27 IST 2022
NAME                 AGE    PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m5s   Ready              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 12:02:30 IST 2022
NAME                 AGE    PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m7s   Ready              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 12:02:31 IST 2022
NAME                 AGE    PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m7s   Ready              2022-11-30T06:25:24Z   4.12.0
Wed Nov 30 12:02:34 IST 2022
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   7m11s   Ready              2022-11-30T06:25:24Z   4.12.0