Bug 2056634 - Onboarding of the storageConsumer fails due to failure in status update on the consumer side. Issue happens intermittently
Summary: Onboarding of the storageConsumer fails due to failure in status update on the consumer side. Issue happens intermittently
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.10.0
Assignee: Nitin Goyal
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-21 16:50 UTC by Santosh Pillai
Modified: 2023-08-09 17:00 UTC
CC: 12 users

Fixed In Version: 4.10.0-201
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-21 09:12:47 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-operator pull 1556 0 None open controllers: exit reconcile loop immediately after updating the storagecluster annotations 2022-02-24 17:33:29 UTC
Github red-hat-storage ocs-operator pull 1561 0 None open Bug 2056634:[release-4.10] controllers: exit reconcile loop immediately after updating the storagecluster annotations 2022-02-25 04:37:02 UTC
Github red-hat-storage ocs-operator pull 1584 0 None open ocs to ocs: add AcknowledgeOnboarding service 2022-03-15 06:35:28 UTC
Github red-hat-storage ocs-operator pull 1589 0 None open Bug 2056634:[release-4.10] ocs to ocs: add AcknowledgeOnboarding service 2022-03-17 03:50:00 UTC
Github red-hat-storage ocs-operator pull 1590 0 None open ocs to ocs: fix onboarding acknowledegement 2022-03-17 14:35:28 UTC
Github red-hat-storage ocs-operator pull 1591 0 None open Bug 2056634:[release-4.10] ocs to ocs: fix onboarding acknowledegement 2022-03-18 05:53:21 UTC
Github red-hat-storage odf-operator pull 189 0 None Merged controllers: stop adding annotation to the storagecluster 2022-03-01 09:22:21 UTC
Github red-hat-storage odf-operator pull 190 0 None open Bug 2056634:[release-4.10] controllers: stop adding annotation to the storagecluster 2022-03-01 09:22:56 UTC

Description Santosh Pillai 2022-02-21 16:50:33 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

- After the first onboarding attempt, the operator reconciler tries to update the StorageCluster status with the StorageConsumer UID. When this update fails (for example, because the object was modified in the meantime), the reconciler attempts to onboard again, since the consumer ID was never recorded in the StorageCluster status.

The second attempt fails because the onboarding ticket was already consumed when the first attempt created the storage consumer. A minimal sketch of this sequence is shown below, followed by the actual error output.
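
For illustration, here is a minimal self-contained sketch of the failure sequence. All names and types are hypothetical simplifications, not the actual ocs-operator code; statusUpdateOK stands in for the apiserver accepting or rejecting the status update:

    package main

    import (
        "errors"
        "fmt"
    )

    var errTicketUsed = errors.New("onboarding ticket already used by another storageConsumer")

    type provider struct{ ticketUsed bool }

    // onboard consumes the one-time ticket on the first call; any retry fails.
    func (p *provider) onboard() (string, error) {
        if p.ticketUsed {
            return "", errTicketUsed
        }
        p.ticketUsed = true
        return "consumer-uid", nil
    }

    // reconcile models one pass of the StorageCluster reconciler.
    func reconcile(p *provider, status map[string]string, statusUpdateOK bool) error {
        if status["consumerID"] == "" {
            uid, err := p.onboard() // ticket is consumed here
            if err != nil {
                return err
            }
            if !statusUpdateOK {
                // "the object has been modified" conflict: the UID is lost, and
                // the next reconcile retries onboarding with the used ticket.
                return errors.New("Operation cannot be fulfilled: the object has been modified")
            }
            status["consumerID"] = uid
        }
        return nil
    }

    func main() {
        p := &provider{}
        status := map[string]string{}
        fmt.Println(reconcile(p, status, false)) // pass 1: ticket consumed, status update conflicts
        fmt.Println(reconcile(p, status, true))  // pass 2: onboarding rejected, ticket already used
    }

The fix discussed later in this bug (comment 9) closes this window by letting the provider re-deliver the UID until the consumer acknowledges it.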

Error:

"msg":"Uninstall: Default uninstall annotations has been set on StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster3","StorageCluster":"openshift-storage/ocs-storagecluster3"}
{"level":"error","ts":1645440552.705047,"logger":"controllers.StorageCluster","msg":"External-OCS:GetStorageConfig:StorageConsumer is not ready yet. Will requeue after 5 second","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster3","error":"rpc error: code = Unavailable desc = storage consumer status is not set","stacktrace":"github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).getExternalConfigFromProvider\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/external_ocs.go:138\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*ocsExternalResources).ensureCreated\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/external_resources.go:256\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/reconcile.go:398\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/reconcile.go:161\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
{"level":"info","ts":1645440552.7127352,"logger":"controllers.StorageCluster","msg":"Could not update StorageCluster status.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster3","StorageCluster":"openshift-storage/ocs-storagecluster3"}
{"level":"error","ts":1645440552.712778,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"ocs-storagecluster3","namespace":"openshift-storage","error":"Operation cannot be fulfilled on storageclusters.ocs.openshift.io \"ocs-storagecluster3\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
{"level":"info","ts":1645440552.7128482,"logger":"controllers.StorageCluster","msg":"Reconciling external StorageCluster.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster3","StorageCluster":"openshift-storage/ocs-storagecluster3"}
{"level":"error","ts":1645440552.724739,"logger":"controllers.StorageCluster","msg":"External-OCS:OnboardConsumer:Token is already used. Contact provider admin for a new token","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster3","error":"rpc error: code = AlreadyExists desc = failed to create storageConsumer \"storageconsumer-88a03266-93d7-4a5e-85f4-f97e78a6c042\". onboarding ticket already used by another storageConsumer","stacktrace":"github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).onboardConsumer\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/external_ocs.go:68\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*ocsExternalResources).ensureCreated\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/external_resources.go:244\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/reconcile.go:398\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/controllers/storagecluster/reconcile.go:161\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/sapillai/go/src/github.com/red-hat-storage/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}


Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Can this issue reproducible?


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 4 suchita 2022-03-01 09:15:38 UTC
Yesterday I saw the error again: 

===================================Consumer storagecluster ================
 Multi Cloud Gateway:
    Reconcile Strategy:  ignore
  Version:               4.10.0
Status:
  Conditions:
    Last Heartbeat Time:   2022-02-28T14:25:36Z
    Last Transition Time:  2022-02-28T13:13:45Z
    Message:               Error while reconciling: rpc error: code = AlreadyExists desc = failed to create storageConsumer "storageconsumer-d9f173d8-71dc-4d77-a0f0-763694bd8814". storageconsumers.ocs.openshift.io "storageconsumer-d9f173d8-71dc-4d77-a0f0-763694bd8814" already exists
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2022-02-28T13:13:45Z
    Last Transition Time:  2022-02-28T13:13:45Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2022-02-28T13:13:45Z
    Last Transition Time:  2022-02-28T13:13:45Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2022-02-28T13:13:45Z
    Last Transition Time:  2022-02-28T13:13:45Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2022-02-28T13:13:45Z
    Last Transition Time:  2022-02-28T13:13:45Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
  External Storage:
    Granted Capacity:  0
  Images:
    Ceph:
      Desired Image:  quay.io/rhceph-dev/rhceph@sha256:e38fab78a061bda5cc73bbedd9b25f6b18ca6ec9f328feb14e8999abf5936b61
    Noobaa Core:
      Desired Image:  quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:8b35df89d348710b3664cd065ac94965a81b2d608b63898f37b187f1c7f0eba2
    Noobaa DB:
      Desired Image:  quay.io/rhceph-dev/rhel8-postgresql-12@sha256:da0b8d525b173ef472ff4c71fae60b396f518860d6313c4f3287b844aab6d622
  Phase:              Error
Events:
  Type     Reason             Age                From                       Message
  ----     ------             ----               ----                       -------
  Normal   CreationSucceeded  86m                StorageCluster controller  StorageSystem ocs-storagecluster-storagesystem created for the StorageCluster ocs-storagecluster.
  Normal   NotReady           86m                controller_storagecluster  StorageConsumer is not ready yet. Will requeue after 5 second
  Warning  TokenAlreadyUsed   14m (x2 over 86m)  controller_storagecluster  Token is already used. Contact provider admin for a new token

==================================================================================================================================

However, the following workaround resolved it (consolidated commands below).
On the provider cluster:
    1. Delete the StorageConsumer: oc delete storageconsumers.ocs.openshift.io <storageconsumer-*>
    2. Restart the ocs provider server pod by deleting it, so that it loses its reference to the used ticket: oc delete pod <ocs-provider-server-*>

On the consumer cluster:
     Restart the ocs-operator pod by deleting it: oc delete pod <ocs-operator-*>
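
Consolidated, the workaround is the following command sequence. The openshift-storage namespace is taken from the logs above, and the bracketed names are placeholders to fill in from oc get:

    # Provider cluster: remove the stale StorageConsumer and restart the
    # provider server pod so it drops its reference to the used ticket.
    oc -n openshift-storage delete storageconsumers.ocs.openshift.io <storageconsumer-name>
    oc -n openshift-storage delete pod <ocs-provider-server-pod>

    # Consumer cluster: restart the ocs-operator pod so it retries onboarding.
    oc -n openshift-storage delete pod <ocs-operator-pod>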

Comment 8 Renan Campos 2022-03-07 17:58:37 UTC
reason: failedqa

Comment 9 Nitin Goyal 2022-03-08 12:43:45 UTC
We discussed the bug and came up with a solution that requires work on all three components: the ocs-operator, the provider API server, and the storageConsumer controller.

To fix the bug, we are thinking of introducing a new API call similar to the onboarding call. It takes one additional argument beyond the onboarding arguments, a string (ack/nack), and the provider API server keeps returning the UID through this new call until the consumer sends an ack.

To achieve this, we need to make the changes below (see the sketch after the list):
 - Create a new API call in the API server, and client code to call the API
 - Change the ocs-operator to call this new API and implement the ack/nack logic
 - Change the storageConsumer controller to skip reconciling until an ack is received from the consumer
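
A minimal self-contained sketch of this acknowledgement idea, modelled as a separate AcknowledgeOnboarding call as in the linked PRs. All names and types are illustrative, not the actual API server code:

    package main

    import "fmt"

    type consumer struct {
        uid          string
        acknowledged bool
    }

    type providerServer struct {
        byTicket map[string]*consumer // onboarding ticket -> consumer
        nextID   int
    }

    // OnboardConsumer is effectively idempotent until the ticket is
    // acknowledged: a retrying consumer that lost the UID (e.g. because its
    // status update failed) gets the same UID back instead of an error.
    func (s *providerServer) OnboardConsumer(ticket string) (string, error) {
        if c, ok := s.byTicket[ticket]; ok {
            if !c.acknowledged {
                return c.uid, nil // re-deliver the UID instead of failing
            }
            return "", fmt.Errorf("onboarding ticket already used")
        }
        s.nextID++
        c := &consumer{uid: fmt.Sprintf("storageconsumer-%d", s.nextID)}
        s.byTicket[ticket] = c
        // The storageConsumer controller skips reconciling c until it is acknowledged.
        return c.uid, nil
    }

    // AcknowledgeOnboarding marks the ticket as consumed; the consumer calls
    // it only after the UID has been persisted in its StorageCluster status.
    func (s *providerServer) AcknowledgeOnboarding(ticket string) {
        if c, ok := s.byTicket[ticket]; ok {
            c.acknowledged = true
        }
    }

    func main() {
        s := &providerServer{byTicket: map[string]*consumer{}}
        uid1, _ := s.OnboardConsumer("ticket-A")
        uid2, _ := s.OnboardConsumer("ticket-A") // retry before ack: same UID back
        fmt.Println(uid1 == uid2)                // true
        s.AcknowledgeOnboarding("ticket-A")
        _, err := s.OnboardConsumer("ticket-A") // after ack the ticket is consumed
        fmt.Println(err)
    }

Because the acknowledgement happens only after the UID is safely written to the StorageCluster status, the failed-status-update window that caused this bug no longer leads to a permanently rejected onboarding.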

Comment 10 Sahina Bose 2022-03-22 09:41:38 UTC
Is this available in the Mar 21 build (4.10.0-201)?

Comment 11 Nitin Goyal 2022-03-22 09:55:24 UTC
Yes, it is available. Moving it to ON_QA

