Description of problem (please be as detailed as possible and provide log snippets):

PV expansion goes into continuous retries even if the PV spec does not contain the expansion parameters, such as the expand secret and its namespace. Moreover, this failure state does not recover by itself. This is observed while expanding PVCs that were created before upgrading to OCS 4.5. The expansion should fail instantly, without retries, if the PV does not support expansion.

Log snippet from the csi-rbdplugin-provisioner pod, csi-resizer container. These errors are repeated many times:

E0715 14:04:34.438152 1 controller.go:361] Resize volume "pvc-4635bd77-846b-47e0-9542-5b7d40aa7a7c" by resizer "openshift-storage.rbd.csi.ceph.com" failed: rpc error: code = Internal desc = provided secret is empty
I0715 14:04:34.438232 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-77568bca51874c368ac8fa5d633c0a0c", Name:"pvc-test-ee2409c820574708be26902b657bcbae", UID:"4635bd77-846b-47e0-9542-5b7d40aa7a7c", APIVersion:"v1", ResourceVersion:"318543", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-4635bd77-846b-47e0-9542-5b7d40aa7a7c failed: rpc error: code = Internal desc = provided secret is empty

--------------------------------------------------------------------------

Version of all relevant components (if applicable):
ocs-operator.v4.5.0-487.ci
cephcsi:4.5-20.d792a51c.release_4.5

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
This causes multiple retry attempts for expansion, and a lot of error events are shown in the UI. The failure state is not automatically recovered.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. On an OCS 4.4 cluster, create a ceph-rbd or cephfs based PVC and attach it to a pod.
2. Upgrade to OCS 4.5.
3. Manually re-create the SCs ocs-storagecluster-ceph-rbd and ocs-storagecluster-cephfs to support volume expansion. This is done manually due to https://bugzilla.redhat.com/show_bug.cgi?id=1846085; the SCs will be updated/re-created automatically once that bug is fixed.
4. Try to expand the PVC created in step 1.

Actual results:
Continuous retries for expansion, all of which fail. The retries keep going indefinitely.

Expected results:
PV expansion should fail without retries if the PV spec does not contain the expansion parameters.

Additional info:
I will share the logs shortly.
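For context, a minimal sketch of the PV spec fragment the csi-resizer relies on is shown below. PVs provisioned before the upgrade lack the controllerExpandSecretRef stanza, which is why the sidecar reports "provided secret is empty". This is illustrative only; the secret name and namespace are taken from the working CephFS example later in this bug (comment 29) and adapted for RBD, so treat them as assumptions.

```
# Sketch only: the csi section of a PV provisioned by an expansion-capable SC.
# Old PVs are missing controllerExpandSecretRef, so ControllerExpandVolume gets no credentials.
spec:
  csi:
    driver: openshift-storage.rbd.csi.ceph.com
    volumeHandle: <volume-handle>           # elided
    controllerExpandSecretRef:
      name: rook-csi-rbd-provisioner        # assumed RBD provisioner secret name
      namespace: openshift-storage
```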
Created a bug in management-console to disable expansion if the PV is not actually expandable. https://bugzilla.redhat.com/show_bug.cgi?id=1859165
My immediate response remains the same as mentioned in Jira and in other discussions: this is **NOT** expected to work at all!!

The new expansion feature is allowed only for PVs that were created from an SC which has the expansion secrets in it. Otherwise it is a case where the prerequisites for expansion are not met at all. It may even be a security issue, as you are operating on a volume that was not really OWNED by an SC instance capable of producing expandable PVs.

The resize controller is the one pushing and retrying, repeatedly attempting to make a successful request. The OCS plugin cannot control this!!

Also, think about how we could land in this situation. You are recreating an SC (with new params) which already has PVs under it. Trying an expansion on old PVs is a 'user error'. The most we could do here is document that old PVs will not be allowed to be expanded.

I would like to close this bug as NOT A BUG.
Hi Humble

(In reply to Humble Chirammal from comment #3)
> Trying an expansion on old PVs is a 'user error'. The most we could do here is
> document that old PVs will not be allowed to be expanded.
>
> I would like to close this bug as NOT A BUG.

I don't think it's a user error. If we are announcing resize as supported (even as tech preview), the user is not wrong to try to resize a PVC created prior to the upgrade to 4.5.
Moreover, are we expecting the user to distinguish between PVCs created before and after the upgrade? Do we expect the user to remember when the upgrade took place and do the correlation?
Finally, the fact that the resize attempt ends up in endless retries is very bad; it can have bad implications for the cluster as a whole, not just for the resize operation.
Created attachment 1701930 [details] repeated resize failure events
(In reply to Elad from comment #4)
> Hi Humble
>
> (In reply to Humble Chirammal from comment #3)
> > Trying an expansion on old PVs is a 'user error'. The most we could do here is
> > document that old PVs will not be allowed to be expanded.
> >
> > I would like to close this bug as NOT A BUG.
>
> I don't think it's a user error. If we are announcing resize as supported
> (even as tech preview), the user is not wrong to try to resize a PVC created
> prior to the upgrade to 4.5.
> Moreover, are we expecting the user to distinguish between PVCs created
> before and after the upgrade? Do we expect the user to remember when the
> upgrade took place and do the correlation?
> Finally, the fact that the resize attempt ends up in endless retries is very
> bad; it can have bad implications for the cluster as a whole, not just for
> the resize operation.

As the issue of recovering from an expansion failure is not limited to this particular scenario, I have created a new bug to track it:
https://bugzilla.redhat.com/show_bug.cgi?id=1859317
Hi Elad, comments inline.

(In reply to Elad from comment #4)
> Hi Humble
>
> (In reply to Humble Chirammal from comment #3)
> > Trying an expansion on old PVs is a 'user error'. The most we could do here is
> > document that old PVs will not be allowed to be expanded.
> >
> > I would like to close this bug as NOT A BUG.
>
> I don't think it's a user error. If we are announcing resize as supported
> (even as tech preview), the user is not wrong to try to resize a PVC created
> prior to the upgrade to 4.5.

Let me put it this way: new features are supported from the version or deployment that introduces them onwards. If we think a new feature will work on older deployments, or on objects created by an older version, that expectation is indeed WRONG.

For example, resize/expansion is a new feature in OCS 4.5. Ideally it would be applicable only to new deployments and new SCs; that's it. However, we gave it one more try, a best attempt, by allowing users to recreate the default SC with the same name so that the feature is enabled for new PVs created on the default SCs. Ideally we should not even have gone with this route; we should have restricted the new feature to new SCs instead of enabling it on an existing SC via the "recreate the SC" hack. The reason for this hack was that OCS product support was restricted to a single storage class, at least until OCS 4.5.

> Moreover, are we expecting the user to distinguish between PVCs created
> before and after the upgrade? Do we expect the user to remember when the
> upgrade took place and do the correlation?

The old-PVs limitation is *** well known *** and has been discussed in various forums. It was also pointed out in Jira:
https://issues.redhat.com/browse/KNIP-664?focusedCommentId=14149203&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14149203

This is not ***new***; unfortunately, I don't understand why this comes up as a 'requirement' now, at this late stage.

> Finally, the fact that the resize attempt ends up in endless retries is very
> bad; it can have bad implications for the cluster as a whole, not just for
> the resize operation.

Again, infinite retries in the controllers when a create/delete/resize operation fails are the upstream design. The question here is what is causing the failure. If it is an attempt to resize a volume that is not expected to be expandable, I still think it is a user error.

ps # We could open an issue in OCP and get upstream or OCP team input on allowing 'old PVs, or PVs which do not have the expand secret' to be expanded. But that is not going to happen in the near term, and it may not get accepted at all!
@Humble I just looked through the logs and saw that the `Data` field in the secret has keys with empty string values.

```
- apiVersion: v1
  data:
    userID: ""
    userKey: ""
  kind: Secret
  metadata:
    creationTimestamp: "2020-07-21T11:48:58Z"
    name: rook-csi-rbd-provisioner
    namespace: openshift-storage
    ...
  type: kubernetes.io/rook
```

The same was the case with the CephFS StorageClass secret.

Could this empty value be the reason why the resizer controller retries infinitely? If it didn't find any secret it would probably stop trying, but here it is finding an empty secret and expecting it to be non-empty (an unhandled corner case??).
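For anyone reproducing this, the secret contents can be dumped directly to confirm whether the keys are really empty; a minimal check, using the namespace and secret name from the snippet above:

```
# Inspect the provisioner secret and confirm whether userID/userKey are empty
oc get secret rook-csi-rbd-provisioner -n openshift-storage -o yaml
```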
Humble, a few questions:

1) Are we expecting users to re-create existing storage classes post OCS upgrade? Is there a requirement for that? The upgrade procedure says nothing about it.

2) Regarding:
> Let me put it this way: new features are supported from the version or deployment that introduces them onwards. If we think a new feature
> will work on older deployments, or on objects created by an older version, that expectation is indeed WRONG.

This is not a case of using a new feature on an unsupported OCS version. We are talking about clusters that were upgraded to 4.5, which is the correct OCS version in which resize should be supported. New features should be applicable to upgraded clusters; that's part of the reason we ask customers to upgrade to the latest release.

3) Regarding:
> The old-PVs limitation is *** well known *** and has been discussed in various forums. It was also pointed out in Jira.

Sorry for jumping in late here, but I think it's better late than never. Also, I don't think customers will be aware of the fact that it was discussed and agreed in our internal Jira. A user will try to resize a PV post upgrade to 4.5, in which this feature is supported, and encounter a failure with endless retries. This is bad, and if this is the case, we should block the resize altogether.
(In reply to Elad from comment #18)
> Humble, a few questions:
>
> 1) Are we expecting users to re-create existing storage classes post OCS
> upgrade? Is there a requirement for that? The upgrade procedure says nothing
> about it.

For expansion to work, the SC should have these secrets in it and the 'allowVolumeExpansion' field set to 'true'.
The upgrade triggered by the operator is supposed to take care of this; details are in https://bugzilla.redhat.com/show_bug.cgi?id=1846085. Umanga can add more details on it.

> 2) Regarding:
> > Let me put it this way: new features are supported from the version or deployment that introduces them onwards. If we think a new feature
> > will work on older deployments, or on objects created by an older version, that expectation is indeed WRONG.
> This is not a case of using a new feature on an unsupported OCS version.
> We are talking about clusters that were upgraded to 4.5, which is the
> correct OCS version in which resize should be supported. New features should
> be applicable to upgraded clusters; that's part of the reason we ask
> customers to upgrade to the latest release.

True. In that sense the new feature is available in the upgraded cluster, but only for new objects.

> 3) Regarding:
> > The old-PVs limitation is *** well known *** and has been discussed in various forums. It was also pointed out in Jira.
> Sorry for jumping in late here, but I think it's better late than never. Also, I
> don't think customers will be aware of the fact that it was discussed and
> agreed in our internal Jira. A user will try to resize a PV post upgrade to
> 4.5, in which this feature is supported, and encounter a failure with
> endless retries. This is bad, and if this is the case, we should block the
> resize altogether.

If it is about blocking resize on old PVs: disabling it from the UI would be the solution. https://bugzilla.redhat.com/show_bug.cgi?id=1859165 tracks that.
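To make the "SC should have these secrets" point concrete, here is a minimal, hedged sketch of the fields relevant to expansion in an RBD StorageClass. OCS-specific parameters such as clusterID and pool are omitted; the secret parameter keys are the standard CSI ones, and the exact values are assumptions based on this bug's comments.

```
# Sketch only: expansion-related fields of an expansion-capable RBD StorageClass.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-storagecluster-ceph-rbd
provisioner: openshift-storage.rbd.csi.ceph.com
allowVolumeExpansion: true
parameters:
  # The external-provisioner copies these into the PV's controllerExpandSecretRef
  # at provisioning time, which is why only newly created PVs carry the reference.
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: openshift-storage
```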
(In reply to Humble Chirammal from comment #19)
> (In reply to Elad from comment #18)
> > Humble, a few questions:
> >
> > 1) Are we expecting users to re-create existing storage classes post OCS
> > upgrade? Is there a requirement for that? The upgrade procedure says nothing
> > about it.
>
> For expansion to work, the SC should have these secrets in it and the
> 'allowVolumeExpansion' field set to 'true'.
> The upgrade triggered by the operator is supposed to take care of this; details
> are in https://bugzilla.redhat.com/show_bug.cgi?id=1846085.
> Umanga can add more details on it.

Just to make sure: per the current design, the upgrade of OCS does not automatically upgrade the pools or storage classes. This has to be manually triggered by deleting the StorageClusterInitialization CR. There are many misconceptions and discussions around this in the other bug and in multiple other areas.
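For reference, a hedged sketch of what that manual trigger looks like from the CLI; the CR name is assumed to match the StorageCluster name (typically ocs-storagecluster) and should be verified first:

```
# List the StorageClusterInitialization CR, then delete it so the operator re-runs
# initialization and re-creates the storage classes (CR name is an assumption; verify first).
oc get storageclusterinitialization -n openshift-storage
oc delete storageclusterinitialization ocs-storagecluster -n openshift-storage
```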
Just to summarize:

* Currently, in order to enable expansion, the re-creation of the storage classes has to be triggered by deleting the StorageClusterInitialization CR. This is being discussed elsewhere.
* PVs that were created before the upgrade that introduces expansion will not be expandable. This is known and was always known.
* The problem is that if an expansion is attempted on such an old PV, the expansion fails and the system gets into an endless retry loop. There are OCP instructions on how to resolve that situation.
* Blocking the request at the CSI level would not help, as kubernetes would still retry.
* The only place where we could reasonably block the expand request for old PVCs is in the GUI.
* We should in any case document this clearly.
* I don't think we should pursue the hack of patching the old PVs. If at all, IMHO it should be done in the csi-driver and not the ocs-operator.
As per convention, validations (if required) should be done at a place where the UI and the CLI get the same response. So if we handle this in the UI and the user initiates the expansion request from the CLI, the user gets into the same situation as described in this BZ. This is against the pattern, and it doesn't make sense to handle this only from the UI.
(In reply to Nishanth Thomas from comment #24)
> As per convention, validations (if required) should be done at a place
> where the UI and the CLI get the same response. So if we handle this in the UI, and the
> user initiates the expansion request from the CLI, the user gets into the same situation as
> described in this BZ. This is against the pattern, and it doesn't make sense
> to handle this only from the UI.

The problem here is that the UI is specific to this operation and we can do some things in it. But the CLI is the generic oc CLI, and the only place we could put such checks to change the CLI behavior is the CSI driver. As Humble explained, this won't help, because the generic kubernetes retry mechanism would still kick in.

So my understanding is:
- We can *not* fix this for the CLI.
- We *can* fix this for the GUI.

@Humble, is this correct?
> 2) Regarding:
> > Let me put it this way: new features are supported from the version or deployment that introduces them onwards. If we think a new feature
> > will work on older deployments, or on objects created by an older version, that expectation is indeed WRONG.
> This is not a case of using a new feature on an unsupported OCS version.
> We are talking about clusters that were upgraded to 4.5, which is the
> correct OCS version in which resize should be supported. New features should
> be applicable to upgraded clusters; that's part of the reason we ask
> customers to upgrade to the latest release.

This (backward compatibility) is really subjective. That said, not all functionality ***can*** work this way. One simple example: suppose we have a feature in ceph (e.g. decoupling of subvolume and snapshot). If the subvolume was created before the upgrade and has snapshots, then even after the cluster is upgraded to a version that supports the new feature, the decoupling won't work on the old subvolumes; the functionality applies only to new subvolumes. There may be many examples like this. So, let's keep backward compatibility aside.

Overall summary or thoughts from my side:
----------------------------------------

What is the use case, or what are we trying to solve here? Is it that:

1) We need to support 'expansion' on OLD PVs created before the upgrade?
2) The concern is to 'disallow' or 'mask' such operations from the admin?
3) Or, if (2) happened anyway, how to recover?

On question 1)

This has to be answered by the PM or stakeholders. Until now, I have not seen a requirement to support PV expansion on old PVs. If we want to support it, a few options I can think of:

a) Hack/patch all the PVs in one shot at upgrade time. Who does this? My immediate response is *not* the CSI driver, as it does not talk to the kube API server directly with a client, and the SC, secret information, etc. are unknown to the driver at this stage [some more concerns are discussed later in this comment].

b) Or make the secret available to the CSI pods with a design for https://github.com/ceph/ceph-csi/issues/1239; then, if the CSI driver does not see the secret in the request, it uses the secret it already knows about to connect to the backend cluster and expand (or do whatever operation it is supposed to do). But this is a really big design change in the CSI driver, and it is not specific to "EXPAND"; it applies to CREATE, DELETE, and all RPC calls where we want to connect to the backend cluster.

On question 2)

If we want to mask or disallow the operation from its triggering point:

*) The UI is the obvious choice. IIUC, the UI already filters SCs or displays PV objects based on the provisioner (by looking for OCS provisioner entries); a similar thing can be done for the PV spec and the presence of the 'expansion' secrets.

*) If the admin decides to do it from the CLI, well, in that case the admin is already checking/operating on the PVC and maybe the PV; they can notice that the 'expansion secrets' are not there and should not even proceed with an "attempt" to expand!

On question 3), or the follow-up where the admin triggered the expansion anyway, "completely ignoring the warnings" or "without checking whether it is capable of doing it":

How to recover? My simple solution here is to hack/patch that "specific" PV and add the expansion secret. That will allow the admin to recover from the situation and the expansion to succeed!

To summarize, here is what I propose:
------------------------------------------------------

Documentation which says OLD PVs are NOT supported.
If needed, a **KCS** article to cover the situation where the admin tried and failed, mentioning the single command to hack/patch that "particular" PV. That's it.

I say, or prefer, "that particular failed PVC" or a "case by case" approach, because blindly patching all the PVs at upgrade time, in one shot, ***may not*** be a good idea at all!!! The main concern with that approach is that the admin has no control knob on "which PVCs are expected to be resized"; otherwise we would have to get input from the admin to select PVCs before we do the hack/spec update automatically from our component or at upgrade time. Also, if we hack all the PVs in one shot, any user is then allowed to perform that operation at any stage and do whatever they want with the PVCs existing in the setup, which may not be preferred from a cluster-admin POV. [But, from OCS 4.5 onwards it will be possible for any new PVCs anyway, as we have only one SC in place.]

If we do not want to allow old-PV expansion at all: "mask it from the UI" as a first step!

Again, documentation + a KCS article is preferred from my end.
Meanwhile, on the recovery path:

While a PVC expansion is failing, can we edit the PV spec **from the UI**, or do # oc edit <PV>, and check the result? It should go through.
(In reply to Humble Chirammal from comment #28)
> Meanwhile, on the recovery path:
>
> While a PVC expansion is failing, can we edit the PV spec **from the UI**, or do
> # oc edit <PV>, and check the result? It should go through.

Hi Humble,

I tried this in my OCS 4.5 cluster (fresh installation). As discussed, I created a new SC with allowVolumeExpansion set to false to perform this test, then recreated it later to allow volume expansion.

Step 1: Create a new SC ocs-storagecluster-cephfs-test-non-expandable with allowVolumeExpansion set to false.

Step 2: Use SC ocs-storagecluster-cephfs-test-non-expandable to create a PVC pvc-test-non-expand1.

Step 3: Try to expand PVC pvc-test-non-expand1 from the UI:

persistentvolumeclaims "ppvc-test-non-expand1" is forbidden: only dynamically provisioned pvc can be resized and the storageclass that provisions the pvc must support resize

Step 4: Delete and recreate SC ocs-storagecluster-cephfs-test-non-expandable with allowVolumeExpansion true and the expansion secret.

Step 5: Try to resize PVC pvc-test-non-expand1.

# oc logs csi-cephfsplugin-provisioner-657697dd75-dcskj -c csi-resizer -f
I0724 06:38:37.250300 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"project-test-non-expandable", Name:"pvc-test-non-expand1", UID:"13b334f8-1139-4668-a929-2ea41d9dc1e6", APIVersion:"v1", ResourceVersion:"58308", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-13b334f8-1139-4668-a929-2ea41d9dc1e6
E0724 06:38:37.250822 1 controller.go:361] Resize volume "pvc-13b334f8-1139-4668-a929-2ea41d9dc1e6" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = InvalidArgument desc = provided secret is empty

Step 6: Edit PV pvc-13b334f8-1139-4668-a929-2ea41d9dc1e6 from the UI and add these parameters under spec:csi:

controllerExpandSecretRef:
  name: rook-csi-cephfs-provisioner
  namespace: openshift-storage

Step 7: Monitor the logs:

# oc logs csi-cephfsplugin-provisioner-657697dd75-dcskj -c csi-resizer -f
I0724 06:52:08.367311 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"project-test-non-expandable", Name:"pvc-test-non-expand1", UID:"13b334f8-1139-4668-a929-2ea41d9dc1e6", APIVersion:"v1", ResourceVersion:"68987", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-13b334f8-1139-4668-a929-2ea41d9dc1e6
I0724 06:52:10.767150 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"project-test-non-expandable", Name:"pvc-test-non-expand1", UID:"13b334f8-1139-4668-a929-2ea41d9dc1e6", APIVersion:"v1", ResourceVersion:"68987", FieldPath:""}): type: 'Normal' reason: 'VolumeResizeSuccessful' Resize volume succeeded

Step 8: Check the PVC.

Observation: Volume resize succeeded.
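The manual edit in step 6 can also be expressed as a single patch. A sketch of the equivalent command, using the PV and secret names from the test above (not an official procedure, just the CLI form of the same change):

```
# Merge-patch the PV to add the expansion secret reference (same effect as the UI edit in step 6)
oc patch pv pvc-13b334f8-1139-4668-a929-2ea41d9dc1e6 --type=merge \
  -p '{"spec":{"csi":{"controllerExpandSecretRef":{"name":"rook-csi-cephfs-provisioner","namespace":"openshift-storage"}}}}'
```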
Created attachment 1702321 [details]
add expand secret to PV and verify resize - test steps and result

The attachment contains the result and the YAML output of the SC, PVC and PV for each step performed, as described in comment 29.
> external-resizer will get the secret name from the PV spec, not from the storageclass or anywhere else. If the secret is present in
> the PV spec, it sends the secret data to CSI when it sends the ControllerExpand RPC call; if the secret is not present, the request won't have
> the creds to connect to the ceph cluster.

This is true, but before moving the feature to beta it did support reading secrets from the storageclass; it was deemed unnecessary to keep reading from the SC later on, and hence the support was dropped. I think it is a fair question for upstream/external-resizer too, and I will discuss it upstream. But reading secrets from the SC might not be an acceptable solution upstream, because there are CSI drivers that want to support per-PV secrets, and those cannot be expanded if they were provisioned before CSI expansion was introduced.
(In reply to Hemant Kumar from comment #33)
> > external-resizer will get the secret name from the PV spec, not from the storageclass or anywhere else. If the secret is present in
> > the PV spec, it sends the secret data to CSI when it sends the ControllerExpand RPC call; if the secret is not present, the request won't have
> > the creds to connect to the ceph cluster.
>
> This is true, but before moving the feature to beta it did support reading
> secrets from the storageclass; it was deemed unnecessary to keep reading from
> the SC later on, and hence the support was dropped.

True, it was dropped in beta. More or less, the other reason, or the general practice/consensus we are arriving at across the various sidecars, is that the SC should be a forgettable entity for an "existing" PV. That said, we should not be looking at the SC for existing-PV operations, for example delete. Isn't it? So I don't think going back to the SC and reading the secret from it is a good idea.

> I think it is a fair question
> for upstream/external-resizer too, and I will discuss it upstream. But reading
> secrets from the SC might not be an acceptable solution upstream, because
> there are CSI drivers that want to support per-PV secrets, and those cannot
> be expanded if they were provisioned before CSI expansion was introduced.

Yeah, true. Also for the reason of untangling the SC from the PV lifecycle, as mentioned above.
I think the proposed approach of patching the PV is the best option to allow expansion of older PVs.
If we are ***planning to support old PVs***, I see a few options:

1) Can the console/UI execute one command (equivalent to # oc edit/patch) when someone clicks "Expand PVC"? The advantage of this approach is that we don't have to operate on all the PVs in one shot and don't have to change any other component; it is a minimal operation on that particular PV. Nishanth/Ankush, is the above possible?

2) Give a command for the admin to do it manually.

3) Enable the ocs-operator to perform this as a one-time operation (a rough sketch of what such a pass could look like is below).
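For option 3, a rough, hedged sketch of what such a one-time pass over existing RBD PVs could look like. It assumes jq is available, that the driver and secret names below are correct for the cluster, and that CephFS PVs would need an analogous pass with the cephfs provisioner secret:

```
#!/bin/bash
# Sketch only: add the expansion secret reference to every RBD PV that lacks one.
# Driver and secret names are assumptions; adjust before use.
for pv in $(oc get pv -o json \
  | jq -r '.items[]
           | select(.spec.csi.driver == "openshift-storage.rbd.csi.ceph.com")
           | select(.spec.csi.controllerExpandSecretRef == null)
           | .metadata.name'); do
  echo "Patching ${pv}"
  oc patch pv "${pv}" --type=merge \
    -p '{"spec":{"csi":{"controllerExpandSecretRef":{"name":"rook-csi-rbd-provisioner","namespace":"openshift-storage"}}}}'
done
```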
https://bugzilla.redhat.com/show_bug.cgi?id=1859165#c15
Jose, IIUC this will be fixed by enhancing the ocs-operator, isn't it? Based on this assumption, I am moving this to ocs-operator for now. Please feel free to revert if I misinterpreted the plan.
Per conversations with QE, we will work around this issue with documentation. Moving this to OCS 4.6.
Going ahead and moving this to documentation, please feel free to move back if this is not appropriate.
@Jose, moving back to ocs-operator. I know you are working on the job to update the PVs, and if possible, we should get it into 4.6
Upstream PR: https://github.com/openshift/ocs-operator/pull/796
Jose, please add doc text
Backport PR: https://github.com/openshift/ocs-operator/pull/883 I've also provided some doctext for the BZ. I'm not sure if it's appropriate for this kind of BZ. Eran, please review.
Old PVC expansion is working in the upgraded cluster. Based on https://bugzilla.redhat.com/show_bug.cgi?id=1872119#c30 , moving the bug to verified state.
Hi @edonnell, there is a typo, "preexisitng", in the doc text. IMO we need to write it as "pre-existing".
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605
Adding AutomationTriaged keyword. Not in scope for regular regression test because the scenario is applicable only on clusters upgraded to OCS 4.6.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days