Description of problem (please be as detailed as possible and provide log snippets):

After the first onboarding attempt, the operator reconciler tries to update the StorageCluster status with the StorageConsumer UID. When this update fails, the reconciler tries to onboard again because the consumer UID was never recorded in the StorageCluster status. The second attempt then fails because the onboarding ticket was already used to create a StorageConsumer. (See the conflict-retry sketch at the end of this comment for one possible mitigation.)

Error:

"msg":"Uninstall: Default uninstall annotations has been set on StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster3","StorageCluster":"openshift-storage/ocs-storagecluster3"}

{"level":"error","ts":1645440552.705047,"logger":"controllers.StorageCluster","msg":"External-OCS:GetStorageConfig:StorageConsumer is not ready yet. Will requeue after 5 second","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster3","error":"rpc error: code = Unavailable desc = storage consumer status is not set","stacktrace":"github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).getExternalConfigFromProvider\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/external_ocs.go:138\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*ocsExternalResources).ensureCreated\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/external_resources.go:256\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/reconcile.go:398\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/reconcile.go:161\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}

{"level":"info","ts":1645440552.7127352,"logger":"controllers.StorageCluster","msg":"Could not update StorageCluster status.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster3","StorageCluster":"openshift-storage/ocs-storagecluster3"}

{"level":"error","ts":1645440552.712778,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"ocs-storagecluster3","namespace":"openshift-storage","error":"Operation cannot be fulfilled on storageclusters.ocs.openshift.io \"ocs-storagecluster3\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}

{"level":"info","ts":1645440552.7128482,"logger":"controllers.StorageCluster","msg":"Reconciling external StorageCluster.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster3","StorageCluster":"openshift-storage/ocs-storagecluster3"}

{"level":"error","ts":1645440552.724739,"logger":"controllers.StorageCluster","msg":"External-OCS:OnboardConsumer:Token is already used. Contact provider admin for a new token","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster3","error":"rpc error: code = AlreadyExists desc = failed to create storageConsumer \"storageconsumer-88a03266-93d7-4a5e-85f4-f97e78a6c042\". onboarding ticket already used by another storageConsumer","stacktrace":"github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).onboardConsumer\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/external_ocs.go:68\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*ocsExternalResources).ensureCreated\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/external_resources.go:244\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/reconcile.go:398\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/reconcile.go:161\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Yesterday I saw the error again:

=================================== Consumer storagecluster ================

Multi Cloud Gateway:
  Reconcile Strategy: ignore
Version: 4.10.0
Status:
  Conditions:
    Last Heartbeat Time:   2022-02-28T14:25:36Z
    Last Transition Time:  2022-02-28T13:13:45Z
    Message:               Error while reconciling: rpc error: code = AlreadyExists desc = failed to create storageConsumer "storageconsumer-d9f173d8-71dc-4d77-a0f0-763694bd8814". storageconsumers.ocs.openshift.io "storageconsumer-d9f173d8-71dc-4d77-a0f0-763694bd8814" already exists
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2022-02-28T13:13:45Z
    Last Transition Time:  2022-02-28T13:13:45Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2022-02-28T13:13:45Z
    Last Transition Time:  2022-02-28T13:13:45Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2022-02-28T13:13:45Z
    Last Transition Time:  2022-02-28T13:13:45Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2022-02-28T13:13:45Z
    Last Transition Time:  2022-02-28T13:13:45Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
  External Storage:
    Granted Capacity: 0
  Images:
    Ceph:
      Desired Image: quay.io/rhceph-dev/rhceph@sha256:e38fab78a061bda5cc73bbedd9b25f6b18ca6ec9f328feb14e8999abf5936b61
    Noobaa Core:
      Desired Image: quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:8b35df89d348710b3664cd065ac94965a81b2d608b63898f37b187f1c7f0eba2
    Noobaa DB:
      Desired Image: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:da0b8d525b173ef472ff4c71fae60b396f518860d6313c4f3287b844aab6d622
  Phase: Error
Events:
  Type     Reason             Age                From                        Message
  ----     ------             ----               ----                        -------
  Normal   CreationSucceeded  86m                StorageCluster controller   StorageSystem ocs-storagecluster-storagesystem created for the StorageCluster ocs-storagecluster.
  Normal   NotReady           86m                controller_storagecluster  StorageConsumer is not ready yet. Will requeue after 5 second
  Warning  TokenAlreadyUsed   14m (x2 over 86m)  controller_storagecluster  Token is already used. Contact provider admin for a new token

==================================================================================================================================

However, with the below workaround it got resolved.

On the provider cluster:
1. Delete the StorageConsumer:
   oc delete storageconsumers.ocs.openshift.io <storageconsumer-*>
2. Restart the ocs-provider-server pod by deleting it, so that it loses the reference to the ticket:
   oc delete pod <ocs-provider-server-*>

On the consumer cluster:
1. Restart the ocs-operator pod by deleting it:
   oc delete pod <ocs-operator-*>
reason: failedqa
We had a discussion regarding the bug and came up with a solution that requires work on all three components: ocs-operator, the provider API server, and the StorageConsumer controller.

To fix the bug we are planning to introduce a new API call, similar to the onboarding API call, that takes one additional argument beyond the onboarding arguments: a string ("ack"/"nack"). The provider API server will keep returning the consumer UID via this new call until the consumer sends an ack. To achieve this we need the changes below (a rough sketch of the consumer-side flow follows the list):

- Create the new API call in the API server, plus the client code to call it.
- Change ocs-operator to call this new API and implement the ack/nack logic.
- Change the StorageConsumer controller to skip reconciliation until it gets an ack from the consumer.
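A minimal sketch of what the consumer-side flow could look like under this proposal. The providerClient interface, the method names OnboardConsumer and AcknowledgeOnboarding, and the persistUID callback are assumptions for illustration only, not the actual provider API.

package example

import (
	"context"
	"fmt"
)

// providerClient is a stand-in for the gRPC client of the provider API server.
// The provider is assumed to keep returning the same consumer UID for a ticket
// until AcknowledgeOnboarding is called, instead of failing with AlreadyExists.
type providerClient interface {
	OnboardConsumer(ctx context.Context, ticket string) (uid string, err error)
	AcknowledgeOnboarding(ctx context.Context, uid string) error
}

// onboardWithAck persists the consumer UID in the StorageCluster status before
// acknowledging, so a failed status update only means the same UID is returned
// again on the next reconcile rather than "ticket already used".
func onboardWithAck(ctx context.Context, c providerClient, ticket string, persistUID func(string) error) error {
	uid, err := c.OnboardConsumer(ctx, ticket)
	if err != nil {
		return fmt.Errorf("onboarding failed: %w", err)
	}
	if err := persistUID(uid); err != nil {
		return fmt.Errorf("could not record consumer UID, will retry: %w", err)
	}
	// Only after the UID is safely stored is the ack sent; per the proposal,
	// the StorageConsumer controller skips reconciling until it sees this ack.
	return c.AcknowledgeOnboarding(ctx, uid)
}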
Is this available in the Mar 21 build (4.10.0-201)?
Yes, it is available. Moving it to ON_QA.