Description of problem (please be as detailed as possible and provide log snippets):
While reproducing https://bugzilla.redhat.com/show_bug.cgi?id=1859183 , it is observed that volume expansion goes into endless retries and does not recover from the failure. As I understand, the issue with failure recovery is not limited to the scenario mentioned in bug 1859183. This bug will be used to track the fix for recovery from failure.

---------------------------------------------------------------
Version of all relevant components (if applicable):
ocs-operator.v4.5.0-494.ci
Cluster version is 4.5.0-0.nightly-2020-07-20-152128

--------------------------------------------------------------------
Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Multiple retry attempts for expansion and a lot of error events will be shown in the UI. This failure state is not automatically recovered.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
(Sharing the same steps as bug 1859183, but this bug is valid for volume expansion failure due to any other cause. A command-line sketch of steps 3 and 4 follows the "Additional info" section below.)
1. On an OCS 4.4 cluster, create a ceph-rbd or cephfs based PVC and attach it to a pod.
2. Upgrade to OCS 4.5.
3. Manually re-create the SCs ocs-storagecluster-ceph-rbd and ocs-storagecluster-cephfs to support volume expansion. This is done manually due to the issue https://bugzilla.redhat.com/show_bug.cgi?id=1846085 . The SCs will be updated/re-created once that bug is fixed.
4. Try to expand the PVC created in step 1.

Actual results:
Expansion goes into endless retries.

Expected results:
Expansion should not go into endless retries but should recover from failure after a certain number of attempts.

Additional info:
The logs collected for bz 1859183 are valid in this case as well: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1859183
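For reference, a minimal command-line sketch of steps 3 and 4. The StorageClass names match the ones above; the PVC name, namespace and sizes are placeholders, and the real SCs carry the full Ceph-CSI parameter set, so this is only an illustration of the procedure, not the exact commands used.

# Step 3 (sketch): re-create the RBD StorageClass with expansion enabled.
oc get sc ocs-storagecluster-ceph-rbd -o yaml > rbd-sc.yaml
# Edit rbd-sc.yaml: strip cluster-generated metadata (uid, resourceVersion,
# creationTimestamp) and add the line:
#   allowVolumeExpansion: true
oc delete sc ocs-storagecluster-ceph-rbd
oc create -f rbd-sc.yaml
# Repeat for ocs-storagecluster-cephfs.

# Step 4 (sketch): request a larger size on the PVC created in step 1.
oc patch pvc my-pvc -n my-namespace -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'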
Created attachment 1701952 [details] Screenshot showing 74 tries in 4 hours
The retries come from OCP and are not specific to "resize"; it is the same for "create" or "delete": as long as the driver returns a failure for the operation, the controller retries. It is also the Kubernetes norm to keep trying until the desired state is eventually reached. I would like to close this as "NOT A BUG" / works as designed. If we are not in agreement to close this, please open a bug against OCP and give that a try.
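For anyone hitting this, the retries described above surface as events on the PVC. A quick way to see them (PVC name and namespace are placeholders):

# Each failed expansion attempt from the resize controller is recorded as an event on the PVC.
oc describe pvc my-pvc -n my-namespace
# Or list only the events that reference that PVC:
oc get events -n my-namespace --field-selector involvedObject.name=my-pvc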
(In reply to Humble Chirammal from comment #3)
> The retries come from OCP and are not specific to "resize"; it is the same
> for "create" or "delete": as long as the driver returns a failure for the
> operation, the controller retries.

Is there a manual recovery mechanism in these cases?

> It is also the Kubernetes norm to keep trying until the desired state is
> eventually reached. I would like to close this as "NOT A BUG" / works as
> designed.
>
> If we are not in agreement to close this, please open a bug against OCP and
> give that a try.
(In reply to Jilju Joy from comment #4)
> (In reply to Humble Chirammal from comment #3)
> > The retries come from OCP and are not specific to "resize"; it is the same
> > for "create" or "delete": as long as the driver returns a failure for the
> > operation, the controller retries.
> Is there a manual recovery mechanism in these cases?

Yes, in general: correct the cause of the failure, or delete the object. Specifically for "resize" there is a documented process too.
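For reference, the documented recovery for a failed expansion is roughly the following. This is only a sketch based on the upstream Kubernetes documentation; PV/PVC names are placeholders, and the official docs should be followed for the exact procedure.

# 1. Protect the underlying volume before removing the claim.
oc patch pv pvc-xxxxxxxx -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# 2. Delete the PVC that is stuck in the failed resize.
oc delete pvc my-pvc -n my-namespace

# 3. Clear the old claim reference so a new PVC can bind to the released PV.
oc patch pv pvc-xxxxxxxx --type json -p '[{"op":"remove","path":"/spec/claimRef"}]'

# 4. Re-create the PVC with a size the volume actually supports, binding it to
#    the existing PV explicitly via spec.volumeName.

# 5. Once it is bound, restore the original reclaim policy if needed.
oc patch pv pvc-xxxxxxxx -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'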
(In reply to Humble Chirammal from comment #5)
> (In reply to Jilju Joy from comment #4)
> > (In reply to Humble Chirammal from comment #3)
> > > The retries come from OCP and are not specific to "resize"; it is the same
> > > for "create" or "delete": as long as the driver returns a failure for the
> > > operation, the controller retries.
> > Is there a manual recovery mechanism in these cases?
>
> Yes, in general: correct the cause of the failure, or delete the object.
> Specifically for "resize" there is a documented process too.

Are we in agreement to close this based on the above comments?
Closing this for now, as we cannot do anything from the OCS side to stop the retries in case of a failure in general.
In upstream there is an enhancement in place which could help recover from failed attempts by reducing the size of the PVC. Mentioning it here for future reference.
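Purely for illustration, and only once that enhancement is available (today the API server rejects shrinking the request): recovery would presumably amount to lowering the requested size back to one the volume supports. PVC name, namespace and size below are placeholders.

# Hypothetical, with the upstream recovery enhancement enabled: cancel a failed
# expansion by reducing spec.resources.requests.storage back to a supported size.
oc patch pvc my-pvc -n my-namespace -p '{"spec":{"resources":{"requests":{"storage":"10Gi"}}}}'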