Description of problem:
Always get a timeout or EOF error when uploading a qcow2 image.

Version-Release number of selected component (if applicable):
CNV 2.1

How reproducible:
Always

Steps to Reproduce:
1. Upload an image with virtctl image-upload

Actual results:

$ virtctl image-upload --pvc-name=winpv --pvc-size=10Gi --image-path=/mnt/cnv-netapp/WindowsImages/qe-images/WINDOWS_2019_X64/disk.qcow2 --insecure --block-volume
PVC default/winpv created
Waiting for PVC winpv upload pod to be running...
Pod now running
Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.working.oc4
5.73 GiB / 5.73 GiB [==========================================] 100.00% 3m28s
Post https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload: EOF

or

$ virtctl image-upload --pvc-name=winpv1 --pvc-size=10Gi --image-path=/mnt/cnv-netapp/WindowsImages/qe-images/WINDOWS_2019_X64/disk.qcow2 --insecure --block-volume
PVC default/winpv1 created
timed out waiting for the condition

Expected results:
The upload succeeds.

Additional info:
Could this be a timeout during conversion to raw?
Can you provide the uploadserver logs?
The EOF happened while validating the image:

$ oc logs cdi-upload-winpv --follow
I0926 02:27:57.144845 1 uploadserver.go:61] Upload destination: /dev/cdi-block-volume
I0926 02:27:57.145301 1 uploadserver.go:63] Running server on 0.0.0.0:8443
2019/09/26 02:28:06 http: TLS handshake error from 10.129.1.172:43478: EOF
I0926 02:28:06.449258 1 uploadserver.go:244] Content type header is ""
I0926 02:28:06.449278 1 data-processor.go:252] Calculating available size
I0926 02:28:06.471530 1 data-processor.go:256] Checking out block volume size.
I0926 02:28:06.473668 1 data-processor.go:264] Request image size not empty.
I0926 02:28:06.473690 1 data-processor.go:269] Target size 10Gi.
I0926 02:28:06.473873 1 data-processor.go:182] New phase: TransferScratch
I0926 02:28:06.476689 1 util.go:143] Writing data...
I0926 02:30:31.957120 1 data-processor.go:182] New phase: Process
I0926 02:30:31.957174 1 data-processor.go:182] New phase: Convert
I0926 02:30:31.957179 1 data-processor.go:188] Validating image
error: unexpected EOF

$ oc describe pvc
Name:          winpv
Namespace:     default
StorageClass:  rook-ceph-block
Status:        Bound
Volume:        pvc-27b5186e-e005-11e9-b28e-fa163e6997df
Labels:        <none>
Annotations:   cdi.kubevirt.io/storage.pod.phase: Succeeded
               cdi.kubevirt.io/storage.pod.ready: false
               cdi.kubevirt.io/storage.upload.target:
               pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      10Gi
Access Modes:  RWO
VolumeMode:    Block
Mounted By:    <none>
Events:
  Type    Reason                Age                    From                                                                                                      Message
  ----    ------                ----                   ----                                                                                                      -------
  Normal  Provisioning          6m51s                  rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-fd98c466-28944_29df07b3-dfc2-11e9-9fe5-0a580a81000e  External provisioner is provisioning volume for claim "default/winpv"
  Normal  ExternalProvisioning  6m48s (x2 over 6m50s)  persistentvolume-controller                                                                               waiting for a volume to be created, either by external provisioner "rook-ceph.rbd.csi.ceph.com" or manually created by system administrator
Info for the qcow2 image above:

$ qemu-img info disk.qcow2
image: disk.qcow2
file format: qcow2
virtual size: 6.6G (7140155392 bytes)
disk size: 5.7G
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false

If I first convert the same qcow2 image to raw with qemu-img, the upload succeeds.

$ qemu-img info disk_raw.img
image: disk_raw.img
file format: raw
virtual size: 6.6G (7140155392 bytes)
disk size: 5.7G

$ virtctl image-upload --pvc-name=winpv --pvc-size=10Gi --image-path=/mnt/cnv-netapp/WindowsImages/qe-images/WINDOWS_2019_X64/disk_raw.img --insecure --block-volume
PVC default/winpv created
Waiting for PVC winpv upload pod to be running...
Pod now running
Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.working.oc4
6.65 GiB / 6.65 GiB [==========================================] 100.00% 1m47s
Uploading /mnt/cnv-netapp/WindowsImages/qe-images/WINDOWS_2019_X64/disk_raw.img completed successfully

$ oc logs cdi-upload-winpv --follow
I0926 05:50:53.540103 1 uploadserver.go:61] Upload destination: /dev/cdi-block-volume
I0926 05:50:53.540435 1 uploadserver.go:63] Running server on 0.0.0.0:8443
2019/09/26 05:51:02 http: TLS handshake error from 10.129.3.227:44558: EOF
I0926 05:51:02.534025 1 uploadserver.go:244] Content type header is ""
I0926 05:51:02.534043 1 data-processor.go:252] Calculating available size
I0926 05:51:02.556587 1 data-processor.go:256] Checking out block volume size.
I0926 05:51:02.559810 1 data-processor.go:264] Request image size not empty.
I0926 05:51:02.559846 1 data-processor.go:269] Target size 10Gi.
I0926 05:51:02.560289 1 data-processor.go:182] New phase: TransferDataFile
I0926 05:51:02.564028 1 util.go:143] Writing data...
I0926 05:52:50.305765 1 data-processor.go:182] New phase: Resize
I0926 05:52:50.424873 1 data-processor.go:182] New phase: Complete
I0926 05:52:50.424953 1 uploadserver.go:263] Wrote data to /dev/cdi-block-volume
I0926 05:52:50.425208 1 uploadserver.go:151] Shutting down http server after successful upload
I0926 05:52:50.425399 1 uploadserver.go:71] UploadServer successfully exited
I also hit it when using curl with a token:

$ cat dv-upload.yaml
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: upload-datavolume
spec:
  source:
    upload: {}
  pvc:
    volumeMode: Block
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 500Mi

$ cat token.yaml
apiVersion: upload.cdi.kubevirt.io/v1alpha1
kind: UploadTokenRequest
metadata:
  name: upload-datavolume-token
  namespace: src
spec:
  pvcName: upload-datavolume

1. oc create -f dv-upload.yaml
2. TOKEN=$(kubectl apply -f token.yaml -o="jsonpath={.status.token}")
3. curl -v --insecure -H "Authorization: Bearer $TOKEN" --data-binary @cirros-0.4.0-x86_64-disk.qcow2 https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload

$ oc get pods
NAME                           READY   STATUS    RESTARTS   AGE
cdi-upload-upload-datavolume   0/1     Pending   0          11m

$ oc logs -f cdi-uploadproxy-5f7c458c79-jlmkh -n openshift-cnv
I0923 11:28:13.801613 1 uploadproxy.go:46] Note: increase the -v level in the api deployment for more detailed logging, eg. -v=2 or -v=3
W0923 11:28:13.802283 1 client_config.go:549] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2019/09/26 03:01:25 http: TLS handshake error from 10.128.0.1:56988: EOF
I0926 12:32:12.003289 1 uploadproxy.go:179] Received valid token: pvc: upload-datavolume, namespace: src
E0926 12:32:22.114270 1 uploadproxy.go:211] Error connecting to https://cdi-upload-upload-datavolume.src.svc/v1alpha1/upload: timed out 5 times connecting to https://cdi-upload-upload-datavolume.src.svc/v1alpha1/upload
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).testConnect
    /go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:274
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).proxyUploadRequest
    /go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:210
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest
    /go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:188
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest-fm
    /go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:140
net/http.HandlerFunc.ServeHTTP
    /usr/lib/golang/src/net/http/server.go:1964
net/http.(*ServeMux).ServeHTTP
    /usr/lib/golang/src/net/http/server.go:2361
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).ServeHTTP
    /go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:144
net/http.serverHandler.ServeHTTP
    /usr/lib/golang/src/net/http/server.go:2741
net/http.(*conn).serve
    /usr/lib/golang/src/net/http/server.go:1847
runtime.goexit
    /usr/lib/golang/src/runtime/asm_amd64.s:1333
I0926 12:33:54.625050 1 uploadproxy.go:179] Received valid token: pvc: upload-datavolume-file, namespace: src
E0926 12:34:03.695972 1 uploadproxy.go:211] Error connecting to https://cdi-upload-upload-datavolume-file.src.svc/v1alpha1/upload: timed out 5 times connecting to https://cdi-upload-upload-datavolume-file.src.svc/v1alpha1/upload
[same stack trace as above]
I0926 13:09:49.192674 1 uploadproxy.go:179] Received valid token: pvc: upload-datavolume-file, namespace: src
E0926 13:09:58.248112 1 uploadproxy.go:211] Error connecting to https://cdi-upload-upload-datavolume-file.src.svc/v1alpha1/upload: timed out 5 times connecting to https://cdi-upload-upload-datavolume-file.src.svc/v1alpha1/upload
[same stack trace as above]
I0926 13:11:57.457918 1 uploadproxy.go:179] Received valid token: pvc: upload-datavolume, namespace: src
E0926 13:12:06.504110 1 uploadproxy.go:211] Error connecting to https://cdi-upload-upload-datavolume.src.svc/v1alpha1/upload: timed out 5 times connecting to https://cdi-upload-upload-datavolume.src.svc/v1alpha1/upload
[same stack trace as above]
error: unexpected EOF
Trying to reproduce it on a bare-metal (BM) environment.
1. I wasn't able to get a real BM, so I used a libvirt env; the upload there failed with "timed out waiting for the condition" and the upload pod was stuck in ContainerCreating (I suspect rook isn't stable on this env; will investigate).
2. I tried testing again on PSI, and with small qcow2 files (hundreds of MB) the upload works fine (no errors).
3. With a larger image (6.24 GiB), it uploaded 100% of the file and then reported "Post https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload: EOF". In the pod logs this seems to happen during "Validating image"; after waiting patiently for another 10 minutes the upload completed and the scratch PVC was removed:

$ virtctl image-upload --pvc-name=test-pvc --pvc-size=36Gi --image-path=win_16.qcow2 --insecure --block-volume
PVC default/test-pvc created
Waiting for PVC test-pvc upload pod to be running...
Pod now running
Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.working.oc4
6.24 GiB / 6.24 GiB [==========================================] 100.00% 12m18s
Post https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload: EOF

Upload pod log:

$ oc logs -f cdi-upload-test-pvc
I1128 15:13:35.995401 1 uploadserver.go:61] Upload destination: /dev/cdi-block-volume
I1128 15:13:35.996194 1 uploadserver.go:63] Running server on 0.0.0.0:8443
2019/11/28 15:13:43 http: TLS handshake error from 10.131.0.32:54908: EOF
I1128 15:13:43.304401 1 uploadserver.go:244] Content type header is ""
I1128 15:13:43.304451 1 data-processor.go:252] Calculating available size
I1128 15:13:43.324150 1 data-processor.go:256] Checking out block volume size.
I1128 15:13:43.327674 1 data-processor.go:264] Request image size not empty.
I1128 15:13:43.327718 1 data-processor.go:269] Target size 36Gi.
I1128 15:13:43.327962 1 util.go:37] deleting file: /scratch/lost+found
I1128 15:13:43.330062 1 data-processor.go:182] New phase: TransferScratch
I1128 15:13:43.333064 1 util.go:169] Writing data...
I1128 15:25:30.028058 1 data-processor.go:182] New phase: Process
I1128 15:25:30.034149 1 data-processor.go:182] New phase: Convert
I1128 15:25:30.034166 1 data-processor.go:188] Validating image
I1128 15:35:08.057641 1 data-processor.go:182] New phase: Resize
I1128 15:35:08.117081 1 data-processor.go:182] New phase: Complete
I1128 15:35:08.197618 1 util.go:37] deleting file: /scratch/tmpimage
I1128 15:35:08.481357 1 uploadserver.go:263] Wrote data to /dev/cdi-block-volume
I1128 15:35:08.481851 1 uploadserver.go:151] Shutting down http server after successful upload
I1128 15:35:08.982421 1 uploadserver.go:71] UploadServer successfully exited

@Michael, is it possible for virtctl image-upload to wait until the UploadServer exits?
In normal cases (no EOF), virtctl image-upload completes right before the UploadServer pod exits. Although possible, I don't think it would be wise to ignore errors like this and wait for the UploadServer pod to exit instead. This particular error was apparently benign, but most probably are not, and the program would then wait forever.
Approximately how long was "virtctl image-upload" running before the EOF?
Michael, below the progress bar it shows 12m18s; adding a few more seconds for creating the PVC, in total it was around 12.5-13 minutes.
@michael, any updates?
The issue from https://bugzilla.redhat.com/show_bug.cgi?id=1754920#c5 was addressed in the following PR: https://github.com/kubevirt/containerized-data-importer/pull/996

What's described in https://bugzilla.redhat.com/show_bug.cgi?id=1754920#c11 is different, and I'm not sure what is going on. Should we open a separate issue for it?
Adam: In response to c16, let's split the issue reported in c11 into a new bug and resolve this one according to the status of PR#996.
I used the latest CDI 2.2 build and tried to upload a large image (10G).

The first try resulted in:

$ virtctl image-upload --pvc-name=test-pvc --pvc-size=36Gi --image-path=win_16.qcow2 --insecure --block-volume
PVC default/test-pvc created
Waiting for PVC test-pvc upload pod to be running...
timed out waiting for the condition

$ oc get pvc
NAME               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
test-pvc           Bound    pvc-51daa9e8-29d1-4657-b261-307692667513   36Gi       RWO            rook-ceph-block   27m
test-pvc-scratch   Bound    pvc-773b3a70-74d9-4b00-9cf9-45d746a67fe3   36Gi       RWO            rook-ceph-block   27m

The second attempt (I didn't change a thing!) resulted in:

$ virtctl image-upload --pvc-name=test-pvc --pvc-size=36Gi --image-path=win_16.qcow2 --insecure --block-volume
PVC default/test-pvc created
Waiting for PVC test-pvc upload pod to be running...
Pod now running
Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.working.oc4
10.33 GiB [==========================================] 100.00% 57m6s
Post https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload: EOF

So both issues are still relevant.
Natalie, are you sure the first issue is still relevant? Did you see the same stack trace in the uploadproxy log? The output in comment 18 looks like the second issue.
There are a few different issues making this bug confusing, so I am going to summarize our plan to resolve it. There were two distinct cases mentioned in the original bug description:

* "Timed out waiting for the condition":
  - Natalie will split this into its own bug, and it will be handled with lower priority in 2.3 because it does not reproduce regularly.

* EOF during conversion to raw:
  - The upload eventually succeeds despite the confusing user experience (EOF error).
  - I'm happy to hear that adjusting the haproxy configuration works around the problem, but I think it's too heavy-handed to ask CNV users to update their k8s configuration to work around this.
  - For 2.2 we will address this with a release note.
  - For 2.3 we will change the upload state machine so that the conversion happens asynchronously after the client finishes uploading data.
re: "I think it's too heavy-handed to ask CNV users to update their k8s configuration to work around this" This is not a k8s configuration, it is a CNV QE test environment configuration.
(In reply to Michael Henriksen from comment #22)
> re: "I think it's too heavy-handed to ask CNV users to update their k8s
> configuration to work around this"
>
> This is not a k8s configuration, it is a CNV QE test environment
> configuration.

How would a regular user avoid hitting this issue with very large images in a default OCP installation?
Opened bug #1789259 that addresses the "timed out waiting for the condition" issue
Docs team: Could we add a release note to 2.2 with text similar to the following:

```
When using 'virtctl image-upload' to upload certain large qcow2 VM disk
images, an EOF error may be reported after all data has been transmitted,
even though the upload is progressing normally or has completed. The
following command can be used to check the status of the upload on a given
PVC:

oc describe pvc <pvc-name> | grep cdi.kubevirt.io/storage.pod.phase
```

The underlying issue will be fixed in 2.3 with an enhancement to the upload process.
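The check in the proposed note can also be wrapped in a small polling loop. This is a hypothetical helper, not part of virtctl; it assumes `oc` is logged in to the cluster, and the PVC name and retry counts are illustrative:

```shell
# Hypothetical helper around the release-note check: after virtctl reports
# EOF, poll the CDI pod-phase annotation on the PVC until the server-side
# processing finishes (or give up after the given number of tries).
wait_for_upload() {
  pvc=$1; tries=${2:-60}
  while [ "$tries" -gt 0 ]; do
    # Dots inside the annotation key must be escaped in JSONPath.
    phase=$(oc get pvc "$pvc" \
      -o jsonpath='{.metadata.annotations.cdi\.kubevirt\.io/storage\.pod\.phase}')
    [ "$phase" = "Succeeded" ] && { echo "upload complete"; return 0; }
    tries=$((tries - 1))
    sleep 10
  done
  echo "upload did not complete" >&2
  return 1
}

# Example: wait_for_upload test-pvc
```

This mirrors what the comments above observed: the annotation flips to Succeeded once the Convert/Resize phases finish, even if the client saw an EOF.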
Moving this to ON_QA. See https://github.com/openshift/openshift-docs/pull/18680 for links to current test builds.
Removing this completely from the 2.2 Release Notes. It wasn't reported in the 2.1 RN, so there is no mention of the problem now in 2.2. Correct me if that's not right. Thanks, Bob
If QE agrees with comment 30, can QE change state to VERIFIED? Thanks. Bob
@nelly I think this is me confusing the "fixed in 2.2" as the real bug and not this bug. Sorry, adding this content back for 2.2:

* When running `virtctl image-upload` to upload large VM disk images in `qcow2` format, an end-of-file (EOF) error may be reported after the data is transmitted, even though the upload is either progressing normally or completed. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1754920[*BZ#1754920*])
+
Run the following command to check the status of an upload on a given PVC:
+
----
$ oc describe pvc <pvc-name> | grep cdi.kubevirt.io/storage.pod.phase
----
Verified the section referring to this issue in https://github.com/openshift/openshift-docs/pull/18680/files
Closing this bug manually; it should have been included in https://access.redhat.com/errata/RHEA-2020:0307.