Description of problem (please be as detailed as possible and provide log snippets):
While reproducing https://bugzilla.redhat.com/show_bug.cgi?id=1859183 , it is observed that volume expansion goes into endless retries and does not recover from the failure. As I understand, the issue with failure recovery is not limited to the scenario mentioned in bug 1859183. This bug will be used to track the fix for recovery from failure.

---------------------------------------------------------------
Version of all relevant components (if applicable):
ocs-operator.v4.5.0-494.ci
Cluster version is 4.5.0-0.nightly-2020-07-20-152128

--------------------------------------------------------------------
Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Multiple retry attempts for expansion and a lot of error events will be shown in the UI. This failure state is not automatically recovered.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
(Sharing the same steps as bug 1859183, but this bug is valid for volume expansion failure due to any other cause. A command-line sketch of steps 3 and 4 follows the "Additional info" section below.)
1. On an OCS 4.4 cluster, create a ceph-rbd or cephfs based PVC and attach it to a pod.
2. Upgrade to OCS 4.5.
3. Manually re-create the SCs ocs-storagecluster-ceph-rbd and ocs-storagecluster-cephfs to support volume expansion. This is done manually due to the issue https://bugzilla.redhat.com/show_bug.cgi?id=1846085 . The SCs will be updated/re-created once that bug is fixed.
4. Try to expand the PVC created in step 1.

Actual results:
Expansion goes into endless retries.

Expected results:
Expansion should not go into endless retries but should recover from failure after a certain number of attempts.

Additional info:
The logs collected for bz 1859183 are valid in this case as well: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1859183
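For reference, a minimal command-line sketch of steps 3 and 4. The StorageClass names match the ones above; the PVC name, namespace and sizes are placeholders, and the real SCs carry the full Ceph-CSI parameter set, so this is only an illustration of the procedure, not the exact commands used.

# Step 3 (sketch): re-create the RBD StorageClass with expansion enabled.
oc get sc ocs-storagecluster-ceph-rbd -o yaml > rbd-sc.yaml
# Edit rbd-sc.yaml: strip cluster-generated metadata (uid, resourceVersion,
# creationTimestamp) and add the line:
#   allowVolumeExpansion: true
oc delete sc ocs-storagecluster-ceph-rbd
oc create -f rbd-sc.yaml
# Repeat for ocs-storagecluster-cephfs.

# Step 4 (sketch): request a larger size on the PVC created in step 1.
oc patch pvc my-pvc -n my-namespace -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'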
Created attachment 1701952 [details] Screenshot showing 74 tries in 4 hours
The retries come from OCP and are not specific to "resize"; it is the same for "create" or "delete": as long as the driver returns a failure for the operation, the controller retries. It is also the Kubernetes norm to keep trying until the desired state is eventually reached. I would like to close this as "NOT A BUG" / works as designed. If we are not in agreement to close this, please open a bug against OCP and give that a try.
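For anyone hitting this, the retries described above surface as events on the PVC. A quick way to see them (PVC name and namespace are placeholders):

# Each failed expansion attempt from the resize controller is recorded as an event on the PVC.
oc describe pvc my-pvc -n my-namespace
# Or list only the events that reference that PVC:
oc get events -n my-namespace --field-selector involvedObject.name=my-pvc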
(In reply to Humble Chirammal from comment #3)
> The retries come from OCP and are not specific to "resize"; it is the same
> for "create" or "delete": as long as the driver returns a failure for the
> operation, the controller retries.

Is there a manual recovery mechanism in these cases?

> It is also the Kubernetes norm to keep trying until the desired state is
> eventually reached. I would like to close this as "NOT A BUG" / works as
> designed.
>
> If we are not in agreement to close this, please open a bug against OCP and
> give that a try.
(In reply to Jilju Joy from comment #4)
> (In reply to Humble Chirammal from comment #3)
> > The retries come from OCP and are not specific to "resize"; it is the same
> > for "create" or "delete": as long as the driver returns a failure for the
> > operation, the controller retries.
> Is there a manual recovery mechanism in these cases?

Yes, in general: correct the cause of the failure, or delete the object. Specifically for "resize" there is a documented process too.
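For reference, the documented recovery for a failed expansion is roughly the following. This is only a sketch based on the upstream Kubernetes documentation; PV/PVC names are placeholders, and the official docs should be followed for the exact procedure.

# 1. Protect the underlying volume before removing the claim.
oc patch pv pvc-xxxxxxxx -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# 2. Delete the PVC that is stuck in the failed resize.
oc delete pvc my-pvc -n my-namespace

# 3. Clear the old claim reference so a new PVC can bind to the released PV.
oc patch pv pvc-xxxxxxxx --type json -p '[{"op":"remove","path":"/spec/claimRef"}]'

# 4. Re-create the PVC with a size the volume actually supports, binding it to
#    the existing PV explicitly via spec.volumeName.

# 5. Once it is bound, restore the original reclaim policy if needed.
oc patch pv pvc-xxxxxxxx -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'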
(In reply to Humble Chirammal from comment #5)
> (In reply to Jilju Joy from comment #4)
> > (In reply to Humble Chirammal from comment #3)
> > > The retries come from OCP and are not specific to "resize"; it is the same
> > > for "create" or "delete": as long as the driver returns a failure for the
> > > operation, the controller retries.
> > Is there a manual recovery mechanism in these cases?
>
> Yes, in general: correct the cause of the failure, or delete the object.
> Specifically for "resize" there is a documented process too.

Are we in agreement to close this based on the above comments?
Closing this for now, as we cannot do anything from the OCS side to stop the retries in case of a failure in general.
In upstream there is an enhancement in place which could help recover from failed attempts by reducing the size of the PVC. Mentioning it here for future reference.
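Purely for illustration, and only once that enhancement is available (today the API server rejects shrinking the request): recovery would presumably amount to lowering the requested size back to one the volume supports. PVC name, namespace and size below are placeholders.

# Hypothetical, with the upstream recovery enhancement enabled: cancel a failed
# expansion by reducing spec.resources.requests.storage back to a supported size.
oc patch pvc my-pvc -n my-namespace -p '{"spec":{"resources":{"requests":{"storage":"10Gi"}}}}'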