Bug 1859317

Summary: Volume expansion should time out after a certain number of retries
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Jilju Joy <jijoy>
Component: csi-driver
Assignee: Humble Chirammal <hchiramm>
Status: CLOSED NOTABUG
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.5
CC: hchiramm, madam, ocs-bugs, rcyriac
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-08-11 11:39:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
Screenshot showing 74 tries in 4 hours (flags: none)

Description Jilju Joy 2020-07-21 17:30:07 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

While reproducing https://bugzilla.redhat.com/show_bug.cgi?id=1859183 , it was observed that volume expansion goes into endless retries and does not recover from the failure. As I understand it, the issue with failure recovery is not limited to the scenario mentioned in bug 1859183. This bug will be used to track the fix for recovery from failure.


---------------------------------------------------------------
Version of all relevant components (if applicable):
ocs-operator.v4.5.0-494.ci
Cluster version is 4.5.0-0.nightly-2020-07-20-152128

--------------------------------------------------------------------


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Multiple retry attempts for expansion and a lot of error events are shown in the UI. This failure state is not recovered automatically.



Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
(Sharing the same steps as bug 1859183, but this bug is valid for volume expansion failures due to any other cause.)
1. On an OCS 4.4 cluster, create a ceph-rbd or cephfs based PVC and attach it to a pod.
2. Upgrade to OCS 4.5.
3. Manually re-create the SCs ocs-storagecluster-ceph-rbd and ocs-storagecluster-cephfs to support volume expansion. This is done manually because of https://bugzilla.redhat.com/show_bug.cgi?id=1846085 . The SCs will be updated/re-created automatically once that bug is fixed.
4. Try to expand the PVC created in step 1 (a sketch of steps 3 and 4 is given below).
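
A minimal sketch of steps 3 and 4, assuming the default OCS StorageClass names and a hypothetical PVC named rbd-pvc; the parameters of the re-created StorageClass must be copied from the existing one:

# Step 3: dump the existing StorageClass so its parameters can be reused.
oc get sc ocs-storagecluster-ceph-rbd -o yaml > sc-rbd.yaml

# Edit sc-rbd.yaml: drop uid/resourceVersion/creationTimestamp and add
#   allowVolumeExpansion: true
# then delete and re-create the StorageClass under the same name.
oc delete sc ocs-storagecluster-ceph-rbd
oc create -f sc-rbd.yaml

# Step 4: request a larger size on the PVC created in step 1
# (PVC name and the 10Gi target size are hypothetical).
oc patch pvc rbd-pvc -p '{"spec":{"resources":{"requests":{"storage":"10Gi"}}}}'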



Actual results:

Expansion goes into endless retries.

Expected results:
Expansion should not go into endless retries but should recover from the failure after a certain number of attempts.


Additional info:
Logs collected for bz 1859183 are valid in this case as well:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1859183

Comment 2 Jilju Joy 2020-07-21 17:31:19 UTC
Created attachment 1701952 [details]
Screenshot showing 74 tries in 4 hours

Comment 3 Humble Chirammal 2020-07-29 11:37:06 UTC
The number of retries comes from OCP and is not specific to "resize"; it is the same as "create" or "delete", where, as long as the driver returns a failure for the operation, the controller retries. It is also the Kubernetes norm to keep trying to reach the desired state eventually. I would like to close this as "NOT A BUG", working as designed.
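
For reference, the repeated resize attempts and their backoff can be watched from the PVC's events (namespace and PVC name below are placeholders):

# Warning events recorded while the resize controller keeps retrying.
oc get events -n <namespace> \
  --field-selector involvedObject.kind=PersistentVolumeClaim,involvedObject.name=<pvc-name> \
  --sort-by=.lastTimestamp

# The same events also show up in:
oc describe pvc <pvc-name> -n <namespace>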

If we are not in agreement to close this, please open a bug against OCP and give it a try there.

Comment 4 Jilju Joy 2020-08-05 07:49:55 UTC
(In reply to Humble Chirammal from comment #3)
> The number of retries comes from OCP and is not specific to "resize"; it is
> the same as "create" or "delete", where, as long as the driver returns a
> failure for the operation, the controller retries.
Is there a manual recovery mechanism in these cases?

> It is also the Kubernetes norm to keep trying to reach the desired state
> eventually. I would like to close this as "NOT A BUG", working as designed.
> 
> If we are not in agreement to close this, please open a bug against OCP and
> give it a try there.

Comment 5 Humble Chirammal 2020-08-05 08:20:46 UTC
(In reply to Jilju Joy from comment #4)
> (In reply to Humble Chirammal from comment #3)
> > The number of retries comes from OCP and is not specific to "resize"; it
> > is the same as "create" or "delete", where, as long as the driver returns
> > a failure for the operation, the controller retries.
> Is there a manual recovery mechanism in these cases?

Yes: correct the mistake that caused the failure, or delete the object; that is the general approach.
Specifically for 'resize' there is a documented process too.
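
For reference, the manual recovery documented upstream ("Recovering from Failure when Expanding Volumes" in the Kubernetes docs) is roughly the following sketch; PV/PVC names are placeholders:

# 1. Protect the data: set the bound PV's reclaim policy to Retain.
oc patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# 2. Delete the PVC stuck in the failed resize (data stays on the retained PV).
oc delete pvc <pvc-name>

# 3. Clear the old claimRef so a new PVC can bind to the retained PV.
oc patch pv <pv-name> -p '{"spec":{"claimRef":null}}'

# 4. Re-create the PVC with a size no larger than the PV, set spec.volumeName
#    to <pv-name> so it binds to the same volume, and restore the reclaim
#    policy afterwards if desired.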

Comment 7 Humble Chirammal 2020-08-10 13:16:09 UTC
(In reply to Humble Chirammal from comment #5)
> (In reply to Jilju Joy from comment #4)
> > (In reply to Humble Chirammal from comment #3)
> > > The number of retries comes from OCP and is not specific to "resize";
> > > it is the same as "create" or "delete", where, as long as the driver
> > > returns a failure for the operation, the controller retries.
> > Is there a manual recovery mechanism in these cases?
> 
> Yes: correct the mistake that caused the failure, or delete the object; that
> is the general approach.
> Specifically for 'resize' there is a documented process too.

Are we in agreement to close this based on the above comments?

Comment 9 Humble Chirammal 2020-08-11 11:39:01 UTC
Closing this for now, as there is nothing we can do from the OCS side to stop the retries in case of a failure in general.

Comment 10 Humble Chirammal 2022-02-09 09:17:38 UTC
In upstream Kubernetes there is an enhancement in place which could help recover from failed expansion attempts by reducing the requested size of the PVC. Mentioning it here for future reference.
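
A minimal sketch of that recovery path, assuming the upstream RecoverVolumeExpansionFailure feature gate (alpha in Kubernetes 1.23) is enabled; the PVC name and sizes are hypothetical:

# A previous expansion to 100Gi failed; lower the request again to a size the
# backend can satisfy, as long as it is not below pvc.status.capacity.
oc patch pvc rbd-pvc -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'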