Description of problem:
Create a template with a source URL, create a VM from the template, and start the VM. The VM is stuck in Pending status and the DV phase stays blank.

On CNV 2.5 the DV phase is always blank:

$ oc get dv
NAME                                                          PHASE       PROGRESS   RESTARTS   AGE
vm-test-adjmo-desktop-tiny-hsq3j-clone-yf8ah-rootdisk-v57k6                                     3m4s
vm-test-adjmo-desktop-tiny-hsq3j-rootdisk-5npys               Succeeded   100.0%                3m12s

On CNV 2.4 the DV goes to CloneInProgress immediately:

$ oc get dv
NAME                   PHASE             PROGRESS   RESTARTS   AGE
urlvm-rootdisk         CloneInProgress   N/A        0          66s
urltemplate-rootdisk   Succeeded         100.0%     0          78s

Version-Release number of selected component (if applicable):
CNV 2.5

How reproducible:
100%

Steps to Reproduce:
1. Create a template from a URL
2. Create a VM from the template
3. Start the VM

Actual results:
The VM is stuck in Pending.

Expected results:
The VM starts.

Additional info:
Created attachment 1710586 [details] vm template
Created attachment 1710587 [details] vm yaml
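For readers without access to the attachments, a minimal sketch of a DataVolume with a URL source, similar to what such a template clones from. The name, URL, and size are illustrative placeholders, not values from the attached YAML:

$ cat <<EOF | oc apply -f -
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: urltemplate-rootdisk               # illustrative name
  namespace: default
spec:
  source:
    http:
      url: http://example.com/rhel8.qcow2  # placeholder image URL
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 15Gi
EOF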
@Michael, could this be related to the safety checks that prevent cloning an in-use source?
Looks like a smart clone is hanging. It would not be "in progress" if it were hitting the safety check. What is the storage provisioner for the PVCs? OCS? If so, smart cloning does not work and will hang forever. This is what I saw on Kevin's system.
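A quick way to answer the provisioner question, assuming the source PVC is named urltemplate-rootdisk (substitute the real PVC and StorageClass names):

$ oc get pvc urltemplate-rootdisk -o jsonpath='{.spec.storageClassName}{"\n"}'
$ oc get sc <storage-class-name> -o jsonpath='{.provisioner}{"\n"}'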
Sorry, looks like I misread this earlier. CloneInProgress is happening in 2.4 but not in 2.5. Got it. Yes, this may in fact be related to the safety check. Check the event log for a CloneSourceInUse event. It should tell you what pods are using the source PVC.
On the latest CNV 2.5, CDI does not work at all.
(In reply to Guohua Ouyang from comment #6)
> On the latest CNV 2.5, CDI does not work at all.

It's issue https://bugzilla.redhat.com/show_bug.cgi?id=1880950
Per comment #5 can you please provide the event log for the DataVolume?
The issue is gone on the latest CNV 2.5 env; closing it.
Seeing this issue on a CNV 2.5 cluster again. The DV is in Pending state and describing it does not show many events. Could you tell me how to "Check the event log for a CloneSourceInUse event"?
The issue can be easily reproduced with the steps below (a command sketch follows the list):
1. Create a VM template with provision source URL
2. Create 3 VMs from the template (don't start them)
3. Start the 3 VMs at the same time
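A sketch of step 3 using virtctl, assuming the three VMs are named vm1, vm2, and vm3 (placeholder names):

$ for vm in vm1 vm2 vm3; do virtctl start "$vm" & done; wait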
> Could you tell me how to "Check the event log for a CloneSourceInUse event"?

kubectl get events -n <target namespace>

Look for CloneSourceInUse and SmartCloneSourceInUse events
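Events can also be filtered by reason instead of scanning the whole list (the namespace is a placeholder):

$ kubectl get events -n <target namespace> --field-selector reason=CloneSourceInUse
$ kubectl get events -n <target namespace> --field-selector reason=SmartCloneSourceInUse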
There are no events related to CloneSourceInUse or SmartCloneSourceInUse when the issue happens.
> There are no events related to CloneSourceInUse or SmartCloneSourceInUse when the issue happens.

Okay, then something else must be going on.

Are the source/target PVCs in the same namespace?

What storage provisioner are you using? Does it support snapshots?

Does the target PVC exist? If so, what's the "describe pvc" output?

If using snapshots, does the volumesnapshot resource exist? What is the "describe" output?

If not using snapshots, do the source/target pods exist? What is their "describe" output? Anything in the logs?
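A sketch of commands that would answer the questions above (resource names are placeholders; volumesnapshotclass only exists if the snapshot CRDs are installed):

$ oc get volumesnapshotclass
$ oc describe pvc <target-pvc> -n <target namespace>
$ oc get volumesnapshot -n <target namespace>
$ oc get pods -n <target namespace> | grep -e cdi-upload -e source-pod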
(In reply to Michael Henriksen from comment #14)
> > There are no events related to CloneSourceInUse or SmartCloneSourceInUse when the issue happens.
>
> Okay, then something else must be going on.
>
> Are the source/target PVCs in the same namespace?

yes

> What storage provisioner are you using? Does it support snapshots?

I have tried rook-ceph and ocs-storagecluster-ceph-rbd; not sure whether they support snapshots.

> Does the target PVC exist? If so, what's the "describe pvc" output?

$ oc describe pvc ghvm3-rootdisk-sqk4t
Name:          ghvm3-rootdisk-sqk4t
Namespace:     default
StorageClass:  rook-ceph
Status:        Bound
Volume:        pvc-0d3f80a3-863d-4f0d-b082-29f073077912
Labels:        app=containerized-data-importer
Annotations:   cdi.kubevirt.io/storage.clone.token: eyJhbGciOiJQUzI1NiIsImtpZCI6IiJ9.eyJleHAiOjE2MDI0OTM3MzEsImlhdCI6MTYwMjQ5MzQzMSwiaXNzIjoiY2RpLWFwaXNlcnZlciIsIm5hbWUiOiJ1cmwtcm9vdGRpc2stb...
               cdi.kubevirt.io/storage.condition.bound: true
               cdi.kubevirt.io/storage.condition.bound.message:
               cdi.kubevirt.io/storage.condition.bound.reason:
               cdi.kubevirt.io/storage.condition.running: true
               cdi.kubevirt.io/storage.condition.running.message:
               cdi.kubevirt.io/storage.condition.running.reason: Pod is running
               cdi.kubevirt.io/storage.condition.source.running.message:
               cdi.kubevirt.io/storage.condition.source.running.reason: ContainerCreating
               cdi.kubevirt.io/storage.pod.phase: Running
               cdi.kubevirt.io/storage.pod.ready: true
               cdi.kubevirt.io/storage.pod.restarts: 0
               cdi.kubevirt.io/storage.sourceClonePodName: 0d3f80a3-863d-4f0d-b082-29f073077912-source-pod
               cdi.kubevirt.io/storage.uploadPodName: cdi-upload-ghvm3-rootdisk-sqk4t
               cdi.kubevirt.io/uploadClientName: default/url-rootdisk-opfak-default/ghvm3-rootdisk-sqk4t
               k8s.io/CloneRequest: default/url-rootdisk-opfak
               pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection cdi.kubevirt.io/cloneSource]
Capacity:      15Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    cdi-upload-ghvm3-rootdisk-sqk4t
Events:        <none>

> If using snapshots, does the volumesnapshot resource exist? What is the
> "describe" output?
>
> If not using snapshots, do the source/target pods exist? What is their
> "describe" output? Anything in the logs?

Could you reproduce the issue with the steps in c#11? I think it is not related to the StorageClass type or whether it supports snapshots.
Start 3 VMs (created from a template with URL source) at the same time:
1. Sometimes 2 VMs come up and 1 VM is stuck in the Pending state.
2. Sometimes all 3 VMs are stuck in the Pending state.

I believe this issue has existed since CNV 2.4.
> Could you reproduce the issue with the steps in c#11?

I am unable to reproduce.

> I think it is not related to the StorageClass type or whether it supports snapshots.

This info actually is very critical for debugging the issue. Based on the info you provided, it appears that snapshots are not supported.

Take a look at the clone source pod. It is named <target-pvc-uid>-source-pod. What is in the log/describe output?
There are 3 pods like this:

$ oc get pod | grep source-pod
0ce647fd-5061-4c50-a5cb-35fe109cc92e-source-pod   0/1   CrashLoopBackOff    79   6h22m
0d3f80a3-863d-4f0d-b082-29f073077912-source-pod   0/1   ContainerCreating   0    4h27m
f8e723cd-6e54-415f-8898-39b941d47dd1-source-pod   0/1   CrashLoopBackOff    79   6h22m

$ oc logs 0d3f80a3-863d-4f0d-b082-29f073077912-source-pod
Error from server (BadRequest): container "cdi-clone-source" in pod "0d3f80a3-863d-4f0d-b082-29f073077912-source-pod" is waiting to start: ContainerCreating

$ oc logs 0ce647fd-5061-4c50-a5cb-35fe109cc92e-source-pod
VOLUME_MODE=filesystem
MOUNT_POINT=/var/run/cdi/clone/source
/var/run/cdi/clone/source /
UPLOAD_BYTES=16045219862
./
./disk.img
I1012 13:28:31.297524      11 clone-source.go:108] content_type is "filesystem-clone"
I1012 13:28:31.297627      11 clone-source.go:109] upload_bytes is 16045219862
I1012 13:28:31.297648      11 clone-source.go:119] Starting cloner target
I1012 13:28:31.864243      11 clone-source.go:131] Set header to filesystem-clone
F1012 13:28:31.882779      11 clone-source.go:136] Error Post https://cdi-upload-gh-vm3-rootdisk-jojn2.default.svc/v1beta1/upload: dial tcp: lookup cdi-upload-gh-vm3-rootdisk-jojn2.default.svc on 172.30.0.10:53: no such host
POSTing to https://cdi-upload-gh-vm3-rootdisk-jojn2.default.svc/v1beta1/upload

$ oc logs f8e723cd-6e54-415f-8898-39b941d47dd1-source-pod
VOLUME_MODE=filesystem
MOUNT_POINT=/var/run/cdi/clone/source
/var/run/cdi/clone/source /
UPLOAD_BYTES=16045219862
I1012 13:28:16.377299      11 clone-source.go:108] content_type is "filesystem-clone"
./
I1012 13:28:16.377379      11 clone-source.go:109] upload_bytes is 16045219862
I1012 13:28:16.377398      11 clone-source.go:119] Starting cloner target
./disk.img
I1012 13:28:16.894508      11 clone-source.go:131] Set header to filesystem-clone
F1012 13:28:16.917515      11 clone-source.go:136] Error Post https://cdi-upload-gh-vm2-rootdisk-xc5ys.default.svc/v1beta1/upload: dial tcp: lookup cdi-upload-gh-vm2-rootdisk-xc5ys.default.svc on 172.30.0.10:53: no such host
POSTing to https://cdi-upload-gh-vm2-rootdisk-xc5ys.default.svc/v1beta1/upload
> F1012 13:28:16.917515      11 clone-source.go:136] Error Post https://cdi-upload-gh-vm2-rootdisk-xc5ys.default.svc/v1beta1/upload: dial tcp: lookup cdi-upload-gh-vm2-rootdisk-xc5ys.default.svc on 172.30.0.10:53: no such host

Looks like the clone source pod cannot communicate with the target.

Verify that the target pod is running and ready.

Verify that the service cdi-upload-gh-vm2-rootdisk-xc5ys exists.

If all that looks good, you're going to have to dig into the cluster DNS to figure out why the service (cdi-upload-gh-vm2-rootdisk-xc5ys.default.svc) is not resolving.
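A sketch of those checks, using the upload pod/service name from the log above (on OCP 4.x the cluster DNS pods live in the openshift-dns namespace):

$ oc get pod cdi-upload-gh-vm2-rootdisk-xc5ys -n default
$ oc get svc cdi-upload-gh-vm2-rootdisk-xc5ys -n default
$ oc get endpoints cdi-upload-gh-vm2-rootdisk-xc5ys -n default
$ oc get pods -n openshift-dns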
After getting access to the system, the problem seems to have been caused by two clone source pods stuck in CrashLoopBackOff. Not sure how they got there. The target PVCs were deleted, so the clone source pods should have been deleted as well, but that appears to be the bug. After those pods were deleted, I was able to successfully create/start 3 VMs from a template with a URL source.
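For anyone hitting the same state, a sketch of the cleanup that unblocked things here (the pod names are whatever the grep returns on your cluster):

$ oc get pods -n default | grep source-pod
$ oc delete pod <stuck-source-pod-name> -n default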
Based on comment 20 this is not severe enough to block the 2.5 release. Pushing to 2.6. We need to attempt to reproduce this in order to be able to identify a root cause and produce a fix.
Oops, reverting previous because the request is for a fix in 2.5.0.
Sorry for the noise. I see that it did get pushed to 2.6.0.
Could not see this issue on the latest CNV 2.5 environment; feel free to close it as fixed.
verified on CNV: 2.5, CDI: Containerized Data Importer v1.23.7
============================================================
- Create VM template in UI
- Create 3 VMs from the template

$ oc get dv
NAME               PHASE       PROGRESS   RESTARTS   AGE
source-dv          Succeeded   100.0%                4h16m
vm1-disk-0-kptig   Succeeded   N/A                   43s
vm2-disk-0-blmp8   Succeeded   N/A                   21s
vm3-disk-0-gzx0z   Succeeded   N/A                   7s

$ virtctl start vm1
VM vm1 was scheduled to start
$ virtctl start vm2
VM vm2 was scheduled to start
$ virtctl start vm3
VM vm3 was scheduled to start

$ oc get vmi
NAME   AGE   PHASE     IP            NODENAME
vm1    35s   Running   10.128.2.55   dafrank25-fxgg5-worker-0-769g9
vm2    32s   Running   10.128.2.54   dafrank25-fxgg5-worker-0-769g9
vm3    30s   Running   10.131.0.43   dafrank25-fxgg5-worker-0-fb9t4
============================================================
All 3 DVs are cloned successfully and the VMs are running.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 2.6.0 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0799