Description of problem:
The clone cannot continue after virtctl stops the VM if the clone DV has existed for more than 3 minutes.

Version-Release number of selected component (if applicable):
OCP 4.5, CNV 2.4
kubernetes v1.18.3+6025c28
virt-cdi-operator-container-v2.4.0-26

How reproducible:
Always

Steps to Reproduce:
1. Create a new project test1 and create the source DV:

---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv-source
spec:
  source:
    http:
      url: http://$URL/files/cnv-tests/cirros-images/cirros-0.4.0-x86_64-disk.qcow2
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
  contentType: kubevirt

2. Create the VM and make it run:

---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: cirros-vmi-src
spec:
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 1024M
        devices:
          rng: {}
          disks:
          - disk:
              bus: virtio
            name: dv-disk
      volumes:
      - name: dv-disk
        dataVolume:
          name: dv-source
    metadata:
      labels:
        kubevirt.io/vm: cirros-vm-src
        kubevirt.io/domain: cirros-vm-src
  running: true

3. Clone the source DV:

---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv-target2
spec:
  source:
    pvc:
      name: dv-source
      namespace: test1
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
  contentType: kubevirt

4. Wait until the clone DV has been running for more than 3 minutes - the clone does not proceed because the source PVC is in use by the VM:

$ oc get pod
NAME                                 READY   STATUS    RESTARTS   AGE
cdi-upload-dv-target2                1/1     Running   0          3m3s
virt-launcher-cirros-vmi-src-ztk5f   1/1     Running   0          4m13s

5. Stop the VM:

$ virtctl stop cirros-vmi-src
VM cirros-vmi-src was scheduled to stop

6. Monitor whether the clone continues:

$ oc get pod
NAME                                 READY   STATUS        RESTARTS   AGE
cdi-upload-dv-target2                1/1     Running       0          3m18s
virt-launcher-cirros-vmi-src-ztk5f   0/1     Terminating   0          4m28s

$ oc get vmi
No resources found.

$ oc get pod -n test1
NAME                    READY   STATUS    RESTARTS   AGE
cdi-upload-dv-target2   1/1     Running   0          46m

Actual results:
If the clone DV has been running for less than 3 minutes, stopping the VM makes the clone continue. If the clone DV has been running for more than 3 minutes, stopping the VM does not make the clone continue.

Expected results:
The clone should be checked periodically so that once the source PVC is no longer in use, the clone continues, even if the DV has been running for more than 3 minutes.

Additional info:
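While reproducing step 4, these commands can help confirm that the clone is stalled because the source PVC is still in use (phase names and the exact wording in describe output can vary between CDI versions, so treat the grep pattern as illustrative):

$ oc get dv dv-target2 -n test1 -o jsonpath='{.status.phase}{"\n"}'
$ oc describe dv dv-target2 -n test1
# The PVC describe output lists the pod(s) currently mounting the source PVC
$ oc describe pvc dv-source -n test1 | grep -i -A2 "mounted by"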
Sounds more like a bug and not an RFE.
The clone will eventually continue. The question is whether periodic checking (every x seconds) is better than exponential backoff (the current strategy and the most common way to retry in k8s).
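To illustrate the difference between the two strategies being discussed (this is only a sketch for illustration, not CDI's actual controller code; it just re-checks whether the virt-launcher pod from the reproducer is still around):

# Periodic checking: fixed interval
while oc get pod | grep -q virt-launcher-cirros-vmi-src; do
  sleep 30
done

# Exponential backoff: the delay doubles after each attempt (5s, 10s, 20s, ...)
delay=5
while oc get pod | grep -q virt-launcher-cirros-vmi-src; do
  sleep "$delay"
  delay=$((delay * 2))
done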
Yan, can you please run this scenario again and see if and when it resumes?
If the clone DV has been running for more than 3 minutes, stopping the VM does not make the clone continue. I waited for more than 30 minutes and did not see the clone process continue.

$ oc get pod
NAME                                 READY   STATUS    RESTARTS   AGE
cdi-upload-dv-target2                1/1     Running   0          3m30s
virt-launcher-cirros-vmi-src-xvj6l   1/1     Running   0          5m41s

$ virtctl stop cirros-vmi-src
VM cirros-vmi-src was scheduled to stop

$ oc get vmi
No resources found in default namespace.

$ oc get pod
NAME                    READY   STATUS    RESTARTS   AGE
cdi-upload-dv-target2   1/1     Running   0          42m

$ oc logs cdi-upload-dv-target2
I0831 13:20:10.331516       1 uploadserver.go:63] Upload destination: /data/disk.img
I0831 13:20:10.331783       1 uploadserver.go:65] Running server on 0.0.0.0:8443
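For completeness, the state of the clone token can also be inspected on the target DV/PVC. A hedged sketch (the exact annotation key is version-dependent, so grep rather than hard-coding it):

# The clone token is stored as an annotation on the target DV/PVC;
# if it was issued more than ~5 minutes ago it will no longer validate.
$ oc get dv dv-target2 -o yaml | grep -i token
$ oc get pvc dv-target2 -o yaml | grep -i token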
Looks like this may be an issue with clone token expiry. The token expires after 5 minutes, so if the source PVC is still in use during that window, the clone will never complete. To address this, we should somehow make the user better aware that the token has expired, and maybe make the expiry longer.
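Until a fix lands, a possible workaround sketch (assumes restarting the clone from scratch is acceptable; dv-target2.yaml is just a placeholder for the manifest from step 3 of the description):

# Stop the source VM first so the source PVC is free
$ virtctl stop cirros-vmi-src

# Recreate the target DV; deleting it should also clean up the stalled
# cdi-upload pod and the partially written target PVC, and the new DV
# gets a freshly issued clone token.
$ oc delete dv dv-target2
$ oc create -f dv-target2.yaml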
We missed the blockers-only cutoff and this should not block 2.5. Pushing out.
Michael, did we miss the opportunity to fix this for 2.6? What is the technical plan to resolve this?
Shouldn't be an issue when we have a namespace transfer API. The NamespaceTransfer resource can be created up front and early in the token lifecycle.
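For reference, once the namespace transfer API is available its presence can be checked on the cluster like this (the exact kind name exposed by CDI may differ, so this just lists whatever the API group provides):

$ oc api-resources --api-group=cdi.kubevirt.io
# If a transfer kind is present, its spec can be inspected with oc explain, e.g.:
$ oc explain objecttransfer.spec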
What about clones occurring within the same namespace?
> What about clones occurring within the same namespace?

The token can be handled differently for within-namespace clones, OR namespace transfer within the same namespace could be supported as basically a "rename" operation. I think it would be wise to have all CDI operations do this eventually.
Michael, I have the following PR for the known issue associated with this bug: https://github.com/openshift/openshift-docs/pull/29662 Can you please review it? It will also need QE review from Yan Du before I can merge it. Thank you both!
Looks good!
LGTM
Thank you, Michael and Yan! The known issue PR has been merged.
Michael, since we are not doing all clones via namespace transfer, it seems we still have this issue to resolve. What shall we do about it?
Good question! This is no longer an issue for immediate binding + smart clone PVCs. It could easily be updated to support ALL immediate binding PVCs by using namespace transfer. We should do that. But the real issue is what to do about WaitForFirstConsumer PVCs. There is no solution I can think of that doesn't involve a very long-term or forever token, and I am hesitant to do that for security reasons. When volume populators go beta, there will be proper WaitForFirstConsumer handling, and everything we do for WaitForFirstConsumer now should be revisited then. Given all that, do we want to implement long-term/forever tokens now or wait for populators? Regardless of the above, we can add support for all immediate binding PVCs now.
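For context, whether a given clone falls into the "immediate binding" bucket discussed above can be checked from the storage class (hostpath-provisioner in the reproducer is typically WaitForFirstConsumer):

$ oc get sc hostpath-provisioner -o jsonpath='{.volumeBindingMode}{"\n"}'
# Or list the binding mode of every storage class
$ oc get sc -o custom-columns=NAME:.metadata.name,BINDING:.volumeBindingMode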
I think we should support all immediate binding PVCs now and not address WFFC PVCs until the populators integration stage. This is medium severity because the flow is not common (cloning a DV attached to a running VM). Given this, I'll defer this bug to 4.9 where we can fix it for all immediate binding PVCs.
This missed 4.9.0 and I think it is too involved to qualify for a 4.9.z fix. Pushing to 4.10.0
Tested on the latest CNV 4.10 cluster; the issue has been fixed.
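For anyone re-verifying on 4.10, the same reproducer from the description can be used; after stopping the VM the target DV should now progress to Succeeded:

$ virtctl stop cirros-vmi-src
$ oc get dv dv-target2 -w     # watch until PHASE shows Succeeded
$ oc get pod                  # the cdi clone/upload pods should complete and be cleaned up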
@mgarrell, for the v4.10 release notes, please remove BZ1855185 from the known issues. This issue is now fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0947