Bug 1866593 - CDI is not handling vm disk clone
Summary: CDI is not handling vm disk clone
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 2.6.0
Assignee: Adam Litke
QA Contact: dalia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-06 02:02 UTC by Guohua Ouyang
Modified: 2021-03-10 11:19 UTC (History)
CC List: 6 users

Fixed In Version: virt-cdi-cloner 2.6.0-7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-10 11:18:00 UTC
Target Upstream Version:
Embargoed:


Attachments
vm template (3.20 KB, application/octet-stream) - 2020-08-06 02:51 UTC, Guohua Ouyang
vm yaml (4.96 KB, application/octet-stream) - 2020-08-06 02:51 UTC, Guohua Ouyang


Links
GitHub kubevirt/containerized-data-importer pull 1429 (closed): add finalizer to target PVC before creating clone source pod (last updated 2021-01-26 00:45:29 UTC)
Red Hat Product Errata RHSA-2021:0799 (last updated 2021-03-10 11:19:14 UTC)

Description Guohua Ouyang 2020-08-06 02:02:29 UTC
Description of problem:
Create a template with a URL source, then create a VM from the template and start it. The VM is stuck in Pending status and the DV phase stays blank.

On CNV 2.5, the DV phase is always blank:
$ oc get dv
NAME                                                          PHASE       PROGRESS   RESTARTS   AGE
vm-test-adjmo-desktop-tiny-hsq3j-clone-yf8ah-rootdisk-v57k6                                     3m4s
vm-test-adjmo-desktop-tiny-hsq3j-rootdisk-5npys               Succeeded   100.0%                3m12s

On CNV 2.4, the DV goes to `CloneInProgress` immediately:
$ oc get dv
NAME           PHASE             PROGRESS   RESTARTS   AGE
urlvm-rootdisk   CloneInProgress   N/A        0          66s
urltemplate-rootdisk   Succeeded         100.0%     0          78s


Version-Release number of selected component (if applicable):
cnv 2.5

How reproducible:
100%

Steps to Reproduce:
1. Create a template with a URL source
2. Create a VM from the template
3. Start the VM

Actual results:
The VM is stuck in Pending.

Expected results:
VM is started.

Additional info:
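For reference, the clone that starting the VM triggers corresponds to a CDI DataVolume with a PVC source, cloning the template's boot disk PVC into the VM's disk PVC. A minimal sketch (name, source PVC, namespace and size below are illustrative, not taken from this report):

$ cat <<'EOF' | oc create -f -
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: urlvm-rootdisk
spec:
  source:
    pvc:
      namespace: default
      name: urltemplate-rootdisk
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 15Gi
EOF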

Comment 1 Guohua Ouyang 2020-08-06 02:51:21 UTC
Created attachment 1710586 [details]
vm template

Comment 2 Guohua Ouyang 2020-08-06 02:51:50 UTC
Created attachment 1710587 [details]
vm yaml

Comment 3 Adam Litke 2020-08-19 18:11:02 UTC
@Michael, could this be related to the safety checks that prevent cloning an in-use source?

Comment 4 Michael Henriksen 2020-08-19 18:26:41 UTC
Looks like a smart clone is hanging. It would not be "in progress" if it were hitting the safety check.

What is the storage provisioner for the PVCs?  OCS?  If so, smart cloning does not work and will hang forever.  This is what I saw on Kevin's system.

Comment 5 Michael Henriksen 2020-08-20 00:39:23 UTC
Sorry, looks like I misread this earlier.  CloneInProgress is happening in 2.4 but not in 2.5.  Got it.

Yes, this may in fact be related to the safety check.  Check the event log for a CloneSourceInUse event.  It should tell you what pods are using the source PVC.

Comment 6 Guohua Ouyang 2020-09-23 01:53:33 UTC
On the latest CNV 2.5, CDI does not start working at all.

Comment 7 Guohua Ouyang 2020-09-23 02:09:05 UTC
(In reply to Guohua Ouyang from comment #6)
> On the latest CNV 2.5, CDI does not start working at all.

It's issue https://bugzilla.redhat.com/show_bug.cgi?id=1880950

Comment 8 Adam Litke 2020-09-28 14:56:34 UTC
Per comment #5, can you please provide the event log for the DataVolume?

Comment 9 Guohua Ouyang 2020-09-29 00:38:49 UTC
The issue is gone on the latest CNV 2.5 environment; closing it.

Comment 10 Guohua Ouyang 2020-10-12 02:16:31 UTC
Seeing this issue on a CNV 2.5 cluster again.
The DV is in Pending state and describing it shows hardly any events. Could you tell me how to "Check the event log for a CloneSourceInUse event"?

Comment 11 Guohua Ouyang 2020-10-12 09:12:20 UTC
The issue can be easily reproduced with the steps below:
1. Create a VM template with a URL provision source
2. Create 3 VMs from the template (don't start them)
3. Start the 3 VMs at the same time
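
As a CLI shortcut for step 3 (VM names here are placeholders), something like:

$ for vm in ghvm1 ghvm2 ghvm3; do virtctl start "$vm" & done; wait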

Comment 12 Michael Henriksen 2020-10-12 11:41:02 UTC
> Could you tell me how to "Check the event log for a CloneSourceInUse event"?

kubectl get events -n <target namespace>

Look for CloneSourceInUse and SmartCloneSourceInUse events
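
For example, assuming the target namespace is "default":

$ kubectl get events -n default | grep -iE 'CloneSourceInUse|SmartCloneSourceInUse'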

Comment 13 Guohua Ouyang 2020-10-12 11:54:00 UTC
There are no events related to CloneSourceInUse or SmartCloneSourceInUse when the issue happens.

Comment 14 Michael Henriksen 2020-10-12 12:09:58 UTC
> There are no events related to CloneSourceInUse or SmartCloneSourceInUse when the issue happens.

Okay, then something else must be going on.

Are the source/target PVC in the same namespace?

What storage provisioner are you using?  Does it support snapshots?

Does the target PVC exist?  If so, what's the "describe pvc" output?

If using snapshots, does the volumesnapshot resource exist?  What is the "describe" output?

If not using snapshots, do the source/target pods exist?  What is their "describe" output?  Anything in the logs?
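
For example, the kind of commands that would gather this (resource names and namespace are placeholders):

$ oc get sc                                        # which provisioner backs the StorageClass
$ oc get volumesnapshotclass                       # only present if the CSI driver supports snapshots
$ oc describe pvc <target-pvc> -n <namespace>
$ oc get volumesnapshot -n <namespace>             # smart (snapshot) clone path
$ oc get pods -n <namespace> | grep -E 'source-pod|cdi-upload'    # host-assisted clone path
$ oc logs <pod-name> -n <namespace>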

Comment 15 Guohua Ouyang 2020-10-12 12:24:06 UTC
(In reply to Michael Henriksen from comment #14)
> > There are no events related to CloneSourceInUse or SmartCloneSourceInUse when the issue happens.
> 
> Okay, then something else must be going on.
> 
> Are the source/target PVC in the same namespace?
yes

> 
> What storage provisioner are you using?  Does it support snapshots?
I have tried rook-ceph and ocs-storagecluster-ceph-rbd; I'm not sure whether they support snapshots.

> 
> Does the target PVC exist?  If so, what's the "describe pvc" output?
> 

$ oc describe pvc ghvm3-rootdisk-sqk4t
Name:          ghvm3-rootdisk-sqk4t
Namespace:     default
StorageClass:  rook-ceph
Status:        Bound
Volume:        pvc-0d3f80a3-863d-4f0d-b082-29f073077912
Labels:        app=containerized-data-importer
Annotations:   cdi.kubevirt.io/storage.clone.token:
                 eyJhbGciOiJQUzI1NiIsImtpZCI6IiJ9.eyJleHAiOjE2MDI0OTM3MzEsImlhdCI6MTYwMjQ5MzQzMSwiaXNzIjoiY2RpLWFwaXNlcnZlciIsIm5hbWUiOiJ1cmwtcm9vdGRpc2stb...
               cdi.kubevirt.io/storage.condition.bound: true
               cdi.kubevirt.io/storage.condition.bound.message: 
               cdi.kubevirt.io/storage.condition.bound.reason: 
               cdi.kubevirt.io/storage.condition.running: true
               cdi.kubevirt.io/storage.condition.running.message: 
               cdi.kubevirt.io/storage.condition.running.reason: Pod is running
               cdi.kubevirt.io/storage.condition.source.running.message: 
               cdi.kubevirt.io/storage.condition.source.running.reason: ContainerCreating
               cdi.kubevirt.io/storage.pod.phase: Running
               cdi.kubevirt.io/storage.pod.ready: true
               cdi.kubevirt.io/storage.pod.restarts: 0
               cdi.kubevirt.io/storage.sourceClonePodName: 0d3f80a3-863d-4f0d-b082-29f073077912-source-pod
               cdi.kubevirt.io/storage.uploadPodName: cdi-upload-ghvm3-rootdisk-sqk4t
               cdi.kubevirt.io/uploadClientName: default/url-rootdisk-opfak-default/ghvm3-rootdisk-sqk4t
               k8s.io/CloneRequest: default/url-rootdisk-opfak
               pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection cdi.kubevirt.io/cloneSource]
Capacity:      15Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    cdi-upload-ghvm3-rootdisk-sqk4t
Events:        <none>

> If using snapshots, does the volumesnapshot resource exist?  What is the
> "describe" output?
> 
> If not using snapshots, do the source/target pods exist?  What is their
> "describe" output?  Anything in the logs?

Are you able to reproduce the issue with the steps in comment #11? I don't think it's related to the StorageClass type or whether it supports snapshots.

Comment 16 Guohua Ouyang 2020-10-12 12:28:04 UTC
Start 3 VMs (created from a URL-sourced template) at the same time:
1. Sometimes 2 VMs come up and 1 VM is stuck in Pending state.
2. Sometimes all 3 VMs are stuck in Pending state.

I believe this issue has existed since CNV 2.4.

Comment 17 Michael Henriksen 2020-10-12 13:21:47 UTC
> Are you able to reproduce the issue with the steps in comment #11?

I am unable to reproduce it.

> I don't think it's related to the StorageClass type or whether it supports snapshots.

This info is actually critical for debugging the issue. And based on the info you provided, it appears that snapshots are not supported.

Take a look at the clone source pod.  It is named <target-pvc-uid>-source-pod.  What is in the log/describe output?

Comment 18 Guohua Ouyang 2020-10-12 13:32:06 UTC
There are 3 pods like this:
$ oc get pod | grep source-pod
0ce647fd-5061-4c50-a5cb-35fe109cc92e-source-pod                   0/1     CrashLoopBackOff    79         6h22m
0d3f80a3-863d-4f0d-b082-29f073077912-source-pod                   0/1     ContainerCreating   0          4h27m
f8e723cd-6e54-415f-8898-39b941d47dd1-source-pod                   0/1     CrashLoopBackOff    79         6h22m


$ oc logs 0d3f80a3-863d-4f0d-b082-29f073077912-source-pod
Error from server (BadRequest): container "cdi-clone-source" in pod "0d3f80a3-863d-4f0d-b082-29f073077912-source-pod" is waiting to start: ContainerCreating

$ oc logs 0ce647fd-5061-4c50-a5cb-35fe109cc92e-source-pod
VOLUME_MODE=filesystem
MOUNT_POINT=/var/run/cdi/clone/source
/var/run/cdi/clone/source /
UPLOAD_BYTES=16045219862
./
./disk.img
I1012 13:28:31.297524      11 clone-source.go:108] content_type is "filesystem-clone"
I1012 13:28:31.297627      11 clone-source.go:109] upload_bytes is 16045219862
I1012 13:28:31.297648      11 clone-source.go:119] Starting cloner target
I1012 13:28:31.864243      11 clone-source.go:131] Set header to filesystem-clone
F1012 13:28:31.882779      11 clone-source.go:136] Error Post https://cdi-upload-gh-vm3-rootdisk-jojn2.default.svc/v1beta1/upload: dial tcp: lookup cdi-upload-gh-vm3-rootdisk-jojn2.default.svc on 172.30.0.10:53: no such host POSTing to https://cdi-upload-gh-vm3-rootdisk-jojn2.default.svc/v1beta1/upload

$ oc logs f8e723cd-6e54-415f-8898-39b941d47dd1-source-pod
VOLUME_MODE=filesystem
MOUNT_POINT=/var/run/cdi/clone/source
/var/run/cdi/clone/source /
UPLOAD_BYTES=16045219862
I1012 13:28:16.377299      11 clone-source.go:108] content_type is "filesystem-clone"
./I1012 13:28:16.377379      11 clone-source.go:109] upload_bytes is 16045219862

I1012 13:28:16.377398      11 clone-source.go:119] Starting cloner target
./disk.img
I1012 13:28:16.894508      11 clone-source.go:131] Set header to filesystem-clone
F1012 13:28:16.917515      11 clone-source.go:136] Error Post https://cdi-upload-gh-vm2-rootdisk-xc5ys.default.svc/v1beta1/upload: dial tcp: lookup cdi-upload-gh-vm2-rootdisk-xc5ys.default.svc on 172.30.0.10:53: no such host POSTing to https://cdi-upload-gh-vm2-rootdisk-xc5ys.default.svc/v1beta1/upload

Comment 19 Michael Henriksen 2020-10-12 13:53:42 UTC
> F1012 13:28:16.917515      11 clone-source.go:136] Error Post https://cdi-upload-gh-vm2-rootdisk-xc5ys.default.svc/v1beta1/upload: dial tcp: lookup cdi-upload-gh-vm2-rootdisk-xc5ys.default.svc on 172.30.0.10:53: no such host POSTing to https://cdi-upload-gh-vm2-rootdisk-xc5ys.default.svc/v1beta1/upload

Looks like the clone source pod cannot communicate with the target.  Verify that the target pod is running and ready.  Verify that the service cdi-upload-gh-vm2-rootdisk-xc5ys exists.  If all that looks good, you're going to have to dig into the cluster DNS to figure out why the service (cdi-upload-gh-vm2-rootdisk-xc5ys.default.svc) is not resolving.
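
For example, using the service name from the log above and the default namespace:

$ oc get pod cdi-upload-gh-vm2-rootdisk-xc5ys -n default -o wide
$ oc get svc cdi-upload-gh-vm2-rootdisk-xc5ys -n default
$ oc get endpoints cdi-upload-gh-vm2-rootdisk-xc5ys -n default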

Comment 20 Michael Henriksen 2020-10-12 16:38:21 UTC
After getting access to the system, it looks like the problem was caused by two clone source pods stuck in CrashLoopBackOff.  Not sure how they got there.  The target PVCs were deleted, so the clone source pods should have been deleted as well; that appears to be the bug.  After those pods were deleted, I was able to successfully create/start 3 VMs from a template with a URL source.
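
For anyone hitting the same state, the workaround amounts to deleting the orphaned clone source pods (pod names follow the <target-pvc-uid>-source-pod pattern shown earlier), e.g.:

$ oc get pods -n default | grep source-pod
$ oc delete pod <target-pvc-uid>-source-pod -n default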

Comment 21 Adam Litke 2020-10-12 19:02:46 UTC
Based on comment 20 this is not severe enough to block the 2.5 release.  Pushing to 2.6.  We need to attempt to reproduce this in order to be able to identify a root cause and produce a fix.

Comment 22 Maya Rashish 2020-11-17 13:40:46 UTC
Oops, reverting previous because the request is for a fix in 2.5.0.

Comment 23 Maya Rashish 2020-11-17 13:44:33 UTC
Sorry for the noise. I see that it did get pushed to 2.6.0.

Comment 24 Guohua Ouyang 2020-11-19 02:50:07 UTC
Could not see this issue on the latest CNV 2.5 environment; feel free to close it as fixed.

Comment 25 dalia 2020-11-22 09:51:31 UTC
Verified on CNV 2.5, CDI: Containerized Data Importer v1.23.7

============================================================

- Create a VM template in the UI

- Create 3 VMs from the template

- $ oc get dv
NAME               PHASE       PROGRESS   RESTARTS   AGE
source-dv          Succeeded   100.0%                4h16m
vm1-disk-0-kptig   Succeeded   N/A                   43s
vm2-disk-0-blmp8   Succeeded   N/A                   21s
vm3-disk-0-gzx0z   Succeeded   N/A                   7s

$ virtctl start vm1
VM vm1 was scheduled to start
$ virtctl start vm2
VM vm2 was scheduled to start
$ virtctl start vm3
VM vm3 was scheduled to start

$ oc get vmi
NAME   AGE   PHASE     IP            NODENAME
vm1    35s   Running   10.128.2.55   dafrank25-fxgg5-worker-0-769g9
vm2    32s   Running   10.128.2.54   dafrank25-fxgg5-worker-0-769g9
vm3    30s   Running   10.131.0.43   dafrank25-fxgg5-worker-0-fb9t4

============================================================

All 3 DVs are cloned successfully and the VMs are running.

Comment 28 errata-xmlrpc 2021-03-10 11:18:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0799

