Bug 1859317

Summary: Volume expansion should time out after a certain number of retries
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Jilju Joy <jijoy>
Component: csi-driver
Assignee: Humble Chirammal <hchiramm>
Status: CLOSED NOTABUG
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.5
CC: hchiramm, madam, ocs-bugs, rcyriac
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-08-11 11:39:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
Screenshot showing 74 tries in 4 hours (flags: none)

Description Jilju Joy 2020-07-21 17:30:07 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

While reproducing https://bugzilla.redhat.com/show_bug.cgi?id=1859183 , it was observed that volume expansion goes into endless retries and does not recover from the failure. As I understand it, the issue with failure recovery is not limited to the scenario mentioned in bug 1859183. This bug will be used to track the fix for recovery from failure.


---------------------------------------------------------------
Version of all relevant components (if applicable):
ocs-operator.v4.5.0-494.ci
Cluster version is 4.5.0-0.nightly-2020-07-20-152128

--------------------------------------------------------------------


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Multiple retry attempts for expansion and a lot of error events are shown in the UI. This failure state is not recovered automatically.



Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
(Sharing the same steps as bug 1859183, but this bug is valid for volume expansion failures due to any other cause.)
1. On an OCS 4.4 cluster, create a ceph-rbd or cephfs based PVC and attach it to a pod.
2. Upgrade to OCS 4.5.
3. Manually re-create the SCs ocs-storagecluster-ceph-rbd and ocs-storagecluster-cephfs to support volume expansion. This is done manually because of https://bugzilla.redhat.com/show_bug.cgi?id=1846085 . The SCs will be updated/re-created automatically once that bug is fixed.
4. Try to expand the PVC created in step 1 (a sketch of steps 3 and 4 is given below).
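
A minimal sketch of steps 3 and 4, assuming the default OCS StorageClass names and a hypothetical PVC named rbd-pvc; the parameters of the re-created StorageClass must be copied from the existing one:

# Step 3: dump the existing StorageClass so its parameters can be reused.
oc get sc ocs-storagecluster-ceph-rbd -o yaml > sc-rbd.yaml

# Edit sc-rbd.yaml: drop uid/resourceVersion/creationTimestamp and add
#   allowVolumeExpansion: true
# then delete and re-create the StorageClass under the same name.
oc delete sc ocs-storagecluster-ceph-rbd
oc create -f sc-rbd.yaml

# Step 4: request a larger size on the PVC created in step 1
# (PVC name and the 10Gi target size are hypothetical).
oc patch pvc rbd-pvc -p '{"spec":{"resources":{"requests":{"storage":"10Gi"}}}}'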



Actual results:

Expansion goes into endless retries.

Expected results:
Expansion should not go into endless retries but should recover from the failure after a certain number of attempts.


Additional info:
Logs collected for bz 1859183 are valid in this case as well:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1859183

Comment 2 Jilju Joy 2020-07-21 17:31:19 UTC
Created attachment 1701952 [details]
Screenshot showing 74 tries in 4 hours

Comment 3 Humble Chirammal 2020-07-29 11:37:06 UTC
The number of retries comes from OCP and is not specific to "resize"; it is the same as "create" or "delete", where, as long as the driver returns a failure for the operation, the controller retries. It is also the Kubernetes norm to keep trying to reach the desired state eventually. I would like to close this as "NOT A BUG", working as designed.
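
For reference, the repeated resize attempts and their backoff can be watched from the PVC's events (namespace and PVC name below are placeholders):

# Warning events recorded while the resize controller keeps retrying.
oc get events -n <namespace> \
  --field-selector involvedObject.kind=PersistentVolumeClaim,involvedObject.name=<pvc-name> \
  --sort-by=.lastTimestamp

# The same events also show up in:
oc describe pvc <pvc-name> -n <namespace>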

If we are not in agreement to close this, please open a bug against OCP and give it a try there.

Comment 4 Jilju Joy 2020-08-05 07:49:55 UTC
(In reply to Humble Chirammal from comment #3)
> The number of retries comes from OCP and is not specific to "resize"; it is
> the same as "create" or "delete", where, as long as the driver returns a
> failure for the operation, the controller retries.
Is there a manual recovery mechanism in these cases?

> It is also the Kubernetes norm to keep trying to reach the desired state
> eventually. I would like to close this as "NOT A BUG", working as designed.
> 
> If we are not in agreement to close this, please open a bug against OCP and
> give it a try there.

Comment 5 Humble Chirammal 2020-08-05 08:20:46 UTC
(In reply to Jilju Joy from comment #4)
> (In reply to Humble Chirammal from comment #3)
> > The number of retries comes from OCP and is not specific to "resize"; it
> > is the same as "create" or "delete", where, as long as the driver returns
> > a failure for the operation, the controller retries.
> Is there a manual recovery mechanism in these cases?

Yes: correct the mistake that caused the failure, or delete the object; that is the general approach.
Specifically for 'resize' there is a documented process too.
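
For reference, the manual recovery documented upstream ("Recovering from Failure when Expanding Volumes" in the Kubernetes docs) is roughly the following sketch; PV/PVC names are placeholders:

# 1. Protect the data: set the bound PV's reclaim policy to Retain.
oc patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# 2. Delete the PVC stuck in the failed resize (data stays on the retained PV).
oc delete pvc <pvc-name>

# 3. Clear the old claimRef so a new PVC can bind to the retained PV.
oc patch pv <pv-name> -p '{"spec":{"claimRef":null}}'

# 4. Re-create the PVC with a size no larger than the PV, set spec.volumeName
#    to <pv-name> so it binds to the same volume, and restore the reclaim
#    policy afterwards if desired.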

Comment 7 Humble Chirammal 2020-08-10 13:16:09 UTC
(In reply to Humble Chirammal from comment #5)
> (In reply to Jilju Joy from comment #4)
> > (In reply to Humble Chirammal from comment #3)
> > > The number of retries comes from OCP and is not specific to "resize";
> > > it is the same as "create" or "delete", where, as long as the driver
> > > returns a failure for the operation, the controller retries.
> > Is there a manual recovery mechanism in these cases?
> 
> Yes: correct the mistake that caused the failure, or delete the object; that
> is the general approach.
> Specifically for 'resize' there is a documented process too.

Are we in agreement to close this based on the above comments?

Comment 9 Humble Chirammal 2020-08-11 11:39:01 UTC
Closing this for now, as there is nothing we can do from the OCS side to stop the retries in case of a failure in general.

Comment 10 Humble Chirammal 2022-02-09 09:17:38 UTC
In upstream Kubernetes there is an enhancement in place which could help recover from failed expansion attempts by reducing the requested size of the PVC. Mentioning it here for future reference.
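
A minimal sketch of that recovery path, assuming the upstream RecoverVolumeExpansionFailure feature gate (alpha in Kubernetes 1.23) is enabled; the PVC name and sizes are hypothetical:

# A previous expansion to 100Gi failed; lower the request again to a size the
# backend can satisfy, as long as it is not below pvc.status.capacity.
oc patch pvc rbd-pvc -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'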